How to save file names in a tuple in Scala

zu0ti5jz posted on 2022-11-09 in Scala

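I have a folder containing many text files. I have to read these files into a single RDD and keep, for each word, the name of the file it appears in.
Example: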

doc1.txt :
" hello my name sam "

doc2.txt :

"hello world"

I need to pass in the folder path, and the result should be:
(hello,doc1),(my,doc1),(world,doc2),... etc.
I tried this:

val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(f => {
  println(f._1 + "=>" + f._2)
})

But it treats the whole text of each file as a single string. Does anyone know how to solve this?

a8jjtwal #1

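If I understand correctly, you want to extract every word in each file and pair it with the name of the file that contains it. As you mentioned, Spark gives you the entire content of a file as a single string. For example, if this is the file content: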

hello
my name    is
John Doe

the value you get will be:

val fileString = "hello\nmy name    is\nJohn Doe"

Right? So you need to split that string on any run of spaces or newlines, like this:

// In a regex, \\s matches any whitespace character (space, tab, newline),
// so "\\s+" alone already covers runs of spaces and newlines.
val wordsSeparated = fileString.split("\\s+")
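To check how the split behaves, here is a minimal sketch in plain Scala, with no Spark involved. The names `WordSplit` and `tokenize` are hypothetical helpers made up for this illustration. The input is trimmed first because `String.split` keeps a leading empty string when the text starts with a delimiter, which is the case for the question's sample " hello my name sam ".

```scala
object WordSplit {
  // Hypothetical helper: split text on any run of whitespace.
  // Trimming first avoids a leading "" token when the text starts with whitespace.
  def tokenize(text: String): Array[String] = {
    val trimmed = text.trim
    // "".split("\\s+") would return Array(""), so handle blank input explicitly.
    if (trimmed.isEmpty) Array.empty[String] else trimmed.split("\\s+")
  }

  def main(args: Array[String]): Unit = {
    val fileString = "hello\nmy name    is\nJohn Doe"
    println(tokenize(fileString).mkString(","))           // prints hello,my,name,is,John,Doe
    println(tokenize(" hello my name sam ").mkString(",")) // prints hello,my,name,sam
  }
}
```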

So in the end you will need something like this:

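// f._1 is the file path, f._2 is the whole file content as one string.
rddWhole.foreach { f =>
  // Split the content on runs of whitespace and print each word with its file.
  f._2.split("\\s+").foreach(word => println(f._1 + " => " + word))
}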

This gives a result like:

file:/tmp/spark-test/two.txt => and
file:/tmp/spark-test/two.txt => this
file:/tmp/spark-test/two.txt => would
file:/tmp/spark-test/one.txt => so
file:/tmp/spark-test/one.txt => hello
file:/tmp/spark-test/one.txt => my
file:/tmp/spark-test/one.txt => name
file:/tmp/spark-test/one.txt => is
file:/tmp/spark-test/one.txt => John
file:/tmp/spark-test/one.txt => Doe
file:/tmp/spark-test/two.txt => be
file:/tmp/spark-test/two.txt => the
file:/tmp/spark-test/two.txt => second
file:/tmp/spark-test/two.txt => text
file:/tmp/spark-test/two.txt => file
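The output above prints "path => word" lines, while the question asks for (word, doc) tuples. A minimal sketch of that last step follows, written as plain Scala so it can be tried without a cluster. `WordFilePairs`, `docName` and `pairs` are hypothetical names, and the sample paths are only an assumption about what `wholeTextFiles` reports for the asker's folder.

```scala
object WordFilePairs {
  // Hypothetical helper: "file:/C:/tmp/files/doc1.txt" -> "doc1".
  def docName(path: String): String = {
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    val dot = fileName.lastIndexOf('.')
    if (dot > 0) fileName.substring(0, dot) else fileName
  }

  // Pair every word of one file with that file's short name.
  def pairs(path: String, content: String): Seq[(String, String)] = {
    val name = docName(path)
    content.trim.split("\\s+").toSeq
      .filter(_.nonEmpty) // drops the single "" produced by blank content
      .map(word => (word, name))
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the (path, content) pairs that wholeTextFiles would return.
    val files = Seq(
      "file:/C:/tmp/files/doc1.txt" -> " hello my name sam ",
      "file:/C:/tmp/files/doc2.txt" -> "hello world"
    )
    files.flatMap { case (path, content) => pairs(path, content) }.foreach(println)
  }
}
```

With Spark, the same function could presumably be plugged in as rddWhole.flatMap { case (path, content) => WordFilePairs.pairs(path, content) }, which should yield an RDD[(String, String)] of (word, doc) tuples. That line has not been run against the asker's folder.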
