How do I read in all files, including subfolders, when streaming from a folder using Spark Streaming in Scala?

Posted by 2hh7jdfx on 2021-07-13 in Spark


I have some files that I want to stream using Spark Structured Streaming. The folder structure is as follows:

myFolder
├── subFolderOne
│   ├── fileOne.gz
│   ├── fileTwo.gz
│   └── fileThree.gz
└── subFolderTwo
    ├── fileFour.gz
    ├── fileFive.gz
    └── fileSix.gz

It works when I point the stream at a single subfolder:

val df = spark
  .readStream
  .format("json")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .json("/myFolder/subFolderOne/")     // <--- reads one subfolder only

But I want to read from the root level, /myFolder/, so that it picks up all files in any number of subfolders. Is this possible?
I am using Spark 2.4.5 and Scala 2.11.6.

scala apache-spark spark-streaming

Source: https://stackoverflow.com/questions/66374300/how-do-i-read-in-all-files-including-subfolders-when-streaming-from-folder-using

1 Answer


9udxz4iz1#

So, it turned out to be this simple.
Before:

.json("/myFolder/")

After:

.json("/myFolder/*")
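For context, here is a minimal sketch of how the glob path might fit into a complete streaming job. The object name, the single-field schema, and the console sink are assumptions made for illustration; they are not part of the original question, whose real schema is not shown. Note also that `*` in a path glob matches a single directory level, so this covers files sitting directly inside each subfolder of /myFolder/; deeper nesting would presumably need a longer pattern such as /myFolder/*/*.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object StreamSubfolders {
  // Hypothetical schema: file-based streams need a schema up front,
  // and the asker's real schema is not shown in the question.
  val schema: StructType =
    StructType(Seq(StructField("id", StringType, nullable = true)))

  // Build a streaming DataFrame over every path matched by the glob.
  def readAll(spark: SparkSession, globPath: String): DataFrame =
    spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", 1L) // process one new file per micro-batch
      .json(globPath)                   // e.g. "/myFolder/*" matches each subfolder

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamSubfolders") // assumed name
      .master("local[*]")          // assumed: local run for experimentation
      .getOrCreate()

    val df = readAll(spark, "/myFolder/*")

    // Print each micro-batch to the console to confirm that files
    // from both subfolders are being picked up.
    val query = df.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Running this against the folder layout from the question should show rows from both subFolderOne and subFolderTwo arriving one file at a time, since the glob expands to each subfolder and the file source lists the files inside it.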

2021-07-13
