Using spark.read.format("xml").option("recursiveFileLookup", "true") for XML files in subdirectories

xyhw6mcr posted on 2023-03-30 in Apache

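I want to recursively load all XML files into a DataFrame from a directory that contains additional subdirectories. The same code seems to work with other file formats (txt, parquet, ...).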

df = (
    spark.read
    .format("xml")
    .option("rowTag", "library")
    .option("wholetext", "true")
    .option("recursiveFileLookup","true")
    .option("pathGlobFilter", "*.xml")
    .load("path/to/dir")
)

I have tested this code with different file formats, but with the XML format no files are found.

apache-spark

Source: https://stackoverflow.com/questions/75887789/using-spark-read-fromxml-optionrecursivefilelookup-true-for-xml-files

1 Answer

rm5edbpk1#

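It looks like I found the answer right away, although it is not entirely satisfactory. Basically, I found two options:
1. Change the format from "xml" to "text".
This allows recursive reading, but unfortunately the content of the XML files is not parsed as nicely as before.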

df = (
    spark.read
    .format("text")
    .option("rowTag", "library")
    .option("wholetext", "true")
    .option("recursiveFileLookup","true")
    .option("pathGlobFilter", "*.xml")
    .load("path/to/dir")
)

2. Append a glob pattern to the path passed to load.

df = (
    spark.read
    .format("xml")
    .option("rowTag", "library")
    .option("wholetext", "true")
    .load("path/to/dir/**/*.xml")
)

This makes the two options "recursiveFileLookup" and "pathGlobFilter" unnecessary.

In the glob pattern, `**` searches all directories recursively, and `*.xml` matches files ending in .xml.
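
If the glob does not reach every nesting level in a given setup, another possibility is to collect the file paths yourself and hand them to the reader. The following is only a minimal sketch: `find_xml_files` is a hypothetical helper that is not part of Spark, it assumes the files live on a local file system visible to the driver, and the commented Spark call assumes the XML data source accepts a comma-separated list of paths, which has not been verified here.

```python
from pathlib import Path
from typing import List


def find_xml_files(root: str) -> List[str]:
    """Return every *.xml file under root, at any depth, as sorted path strings."""
    # Path.rglob walks the whole directory tree, including root itself.
    return sorted(str(p) for p in Path(root).rglob("*.xml") if p.is_file())


# Hypothetical usage with Spark (assumption: the "xml" data source is
# installed and accepts comma-separated paths):
#
# paths = find_xml_files("path/to/dir")
# df = (
#     spark.read
#     .format("xml")
#     .option("rowTag", "library")
#     .load(",".join(paths))
# )
```

This keeps the "xml" format, so the rows are still parsed by rowTag, but the file discovery happens on the driver and will not work as written for remote storage such as HDFS or S3.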

Answered 2023-03-30
