我想用 petastorm
以另一种方式,我需要通过以下方式之一告诉它我的Parquet地板文件存储在哪里: hdfs://some_hdfs_cluster/user/yevgeni/parquet8
,或 file:///tmp/mydataset
,或 s3://bucket/mydataset
,或 gs://bucket/mydataset
. 因为我在databricks上,并且给定了其他约束,所以我的选择是使用 file:///
选项。
但是,我不知道如何指定我的Parquet文件的位置。我总是被拒绝说那句话 Path does not exist:
####以下是我正在做的:
# save spark df to parquet
dbutils.fs.rm('dbfs:/mnt/team01/assembled_train.parquet', recurse=True)
assembled_train.write.parquet('dbfs:/mnt/team01/assembled_train')
# look at files
display(dbutils.fs.ls('mnt/team01/assembled_train/'))
# results
path name size
dbfs:/mnt/team01/assembled_train/_SUCCESS _SUCCESS 0
dbfs:/mnt/team01/assembled_train/_committed_2150262571233317067 _committed_2150262571233317067 856
dbfs:/mnt/team01/assembled_train/_started_2150262571233317067 _started_2150262571233317067 0
dbfs:/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet 578991
dbfs:/mnt/team01/assembled_train/part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet 579640
dbfs:/mnt/team01/assembled_train/part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet 580675
dbfs:/mnt/team01/assembled_train/part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet 579483
dbfs:/mnt/team01/assembled_train/part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet 578807
dbfs:/mnt/team01/assembled_train/part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet 580942
dbfs:/mnt/team01/assembled_train/part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet 579202
dbfs:/mnt/team01/assembled_train/part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet 579810
使用文件结构中的基本Dataframe加载进行测试时,如下所示:
df1 = spark.read.option("header", "true").parquet('file:///mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')```
我得到的文件不存在。
1条答案
按热度按时间o4tp2gmn1#
只需按原样指定路径,无需使用“file://”:
如果这不起作用,请尝试中的方法https://docs.databricks.com/applications/machine-learning/load-data/petastorm.html#configure-缓存目录