PySpark: reading multiple XML files (a list of S3 paths) into a Spark DataFrame

qhhrdooz  asked on 2021-05-27  in  Spark

As the title says, I have a list of S3 paths:

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

I am using PySpark and would like to know how to load all of these XML files into a single DataFrame, something like the example below.

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(s3_paths)

I can read a single file, but I'm looking for the best way to load all of them at once.

nkoocmlb1#

You can check the following GitHub repo for documentation and examples:
https://github.com/databricks/spark-xml
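For reference, the `com.databricks.spark.xml` reader used in the question only works once the spark-xml package is on the classpath. A typical way to pull it in is via `--packages` at launch; the version and Scala suffix below are assumptions, so check the repo's README for the current coordinates:

```shell
# Launch with the spark-xml package; the coordinates shown
# (Scala 2.12, version 0.14.0) are an assumption -- see the
# repo's README for the release matching your Spark version.
spark-submit --packages com.databricks:spark-xml_2.12:0.14.0 your_app.py
```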

fhity93d2#

The answer from @jxc in the question comments is the best solution:

df = spark.read.format("com.databricks.spark.xml")\
               .option("rowTag", "head")\
               .load(','.join(s3_paths))
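What the join does is simple: `load` receives one string containing every path separated by commas, which Spark splits back into the individual files. You can verify the joined string in plain Python, no Spark needed:

```python
s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

# join the list into the single comma-separated string that .load() receives
joined = ",".join(s3_paths)
print(joined)
# s3a://somebucket/1/file1.xml,s3a://somebucket/3/file2.xml
```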

Below is an example using a toy dataset:

fnames = ['books_part1.xml','books_part2.xml'] # part1 -> ids bk101-bk106, part2 -> ids bk107-bk112

df = spark.read.format('xml') \
              .option('rowTag','book')\
              .load(','.join(fnames))

df.show()

# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |  _id|              author|         description|          genre|price|publish_date|               title|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |bk101|Gambardella, Matthew|An in-depth look ...|       Computer|44.95|  2000-10-01|XML Developer's G...|
# |bk102|          Ralls, Kim|A former architec...|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
# |bk103|         Corets, Eva|After the collaps...|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
# |bk104|         Corets, Eva|In post-apocalyps...|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
# |bk105|         Corets, Eva|The two daughters...|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
# |bk106|    Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|         Lover Birds|
# |bk107|      Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|       Splish Splash|
# |bk108|       Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
# |bk109|        Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
# |bk110|        O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
# |bk111|        O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
# |bk112|         Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
34gzjxbg3#

Pass the list directly. In PySpark, `DataFrameReader.load` accepts a list of paths as its first argument. (Note that unpacking with `*s3_paths` would be a bug: the second path would be bound to the `format` parameter of `load`.)

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(s3_paths)
