Multi-line input in Apache Spark using Java

Posted by wsewodh2 on 2021-06-02 in Hadoop

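I have looked at other similar questions already asked on this site, but did not get a satisfactory answer.
I am completely new to Apache Spark and Hadoop. My problem is that I have a 35 GB input file containing multi-line reviews of products from an online shopping site. The information in the file looks like this: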

productId: C58500585F
product:  Nun Toy
product/price: 5.99
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Bought it for a relative. Was not impressive.

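This is one review block. There are thousands of such blocks, separated by blank lines. What I need from each block is the productId, userId and score, so I filtered the JavaRDD to keep only the lines I need. The result looks like this: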

productId: C58500585F
userId: A3NM6WTIAE
score: 2.0

Code:

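import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

SparkConf conf = new SparkConf().setAppName("org.spark.program").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);

JavaRDD<String> input = context.textFile("path");

// Keep only the productId, userId and score lines, plus the blank separator lines.
JavaRDD<String> requiredLines = input.filter(new Function<String, Boolean>() {
    public Boolean call(String s) throws Exception {
        return s.contains("productId") || s.contains("userId")
                || s.contains("score") || s.isEmpty();
    }
});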

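Now I need to read these three lines as parts of a single (key, value) pair, and I don't know how to do that. There is only one blank line between two review blocks.
I have looked at several websites but did not find a solution to my problem. Can anyone help? Thanks! Let me know if you need more information.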

hadoop mapreduce apache-spark multiline

Source: https://stackoverflow.com/questions/40037883/multi-line-input-in-apache-spark-using-java

1 answer

vwhgwdsa1#

Following up on my earlier comment, textinputformat.record.delimiter can be used here. If the only delimiter is a blank line, the value should be set to "\n\n".
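As a rough illustration of what that delimiter setting achieves, the plain-Java sketch below splits a small in-memory string at each "\n\n", so every review block becomes one record. The class and method names (RecordSplitDemo, splitRecords) are made up for this example, and String.split is only an analogy: Hadoop's reader works on file splits rather than on a string held in memory.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: mimics, on an in-memory string, how a "\n\n"
// record delimiter groups the lines of each review block into one record.
class RecordSplitDemo {

    /** Splits the text at every blank line ("\n\n"). */
    static List<String> splitRecords(String text) {
        return Arrays.asList(text.split("\n\n"));
    }

    public static void main(String[] args) {
        String text = "productId: A1\nuserId: U1\nscore: 2.0\n\n"
                    + "productId: B2\nuserId: U2\nscore: 4.0\n";
        List<String> records = splitRecords(text);
        System.out.println(records.size()); // prints 2
    }
}
```

One caveat under the same assumption: if the file uses Windows line endings, the blank line is "\r\n\r\n" and a "\n\n" delimiter would not match it.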
Consider the following test data:

productId: C58500585F
product:  Nun Toy
product/price: 5.99
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Bought it for a relative. Was not impressive.

productId: ABCDEDFG
product:  Teddy Bear
product/price: 6.50
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Second comment.

productId: 12345689
product:  Hot Wheels
product/price: 12.00
userId: JJ
profileName: JJ
helpfulness: 1/1
score: 4.0
time: 1624609
summary: Summarized
text: Some text

The code (in Scala) then looks like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Use a blank line as the record delimiter so each review block is one record.
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "\n\n")
val raw = sc.newAPIHadoopFile("test.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)

val data = raw.map(e => {
  // e._2 holds the text of one block; turn its "key: value" lines into a Map.
  val m = e._2.toString
    .split("\n")
    .map(_.split(":", 2))      // limit 2 keeps colons inside the value
    .filter(_.size == 2)       // skip lines without a colon
    .map(kv => (kv(0), kv(1).trim))
    .toMap

  (m("productId"), m("userId"), m("score").toDouble)
})

The output is:

data.foreach(println)
(C58500585F,A3NM6WTIAE,2.0)
(ABCDEDFG,A3NM6WTIAE,2.0)
(12345689,JJ,4.0)

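I wasn't sure what output you wanted, so I made it a three-element tuple. Also, the parsing logic could certainly be made more efficient if needed, but this should give you something to work with.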
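Since the question asks for Java, here is a minimal plain-Java sketch of the same per-record parsing step, with no Spark dependency so it can be run on its own. ReviewParser, Review, parseFields and parse are hypothetical names introduced for this illustration. Unlike the Scala version, which throws if a key is missing, this sketch returns null for an incomplete block.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses one review block (the text between two blank
// lines) and extracts productId, userId and score.
class ReviewParser {

    /** Hypothetical holder for the three fields the question asks for. */
    static class Review {
        final String productId;
        final String userId;
        final double score;

        Review(String productId, String userId, double score) {
            this.productId = productId;
            this.userId = userId;
            this.score = score;
        }

        @Override
        public String toString() {
            return "(" + productId + "," + userId + "," + score + ")";
        }
    }

    /** Turns the "key: value" lines of one block into a map. */
    static Map<String, String> parseFields(String record) {
        Map<String, String> fields = new HashMap<>();
        for (String line : record.split("\n")) {
            // Limit 2 keeps any further colons inside the value.
            String[] kv = line.split(":", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        return fields;
    }

    /** Returns the three fields, or null if any of them is missing. */
    static Review parse(String record) {
        Map<String, String> f = parseFields(record);
        String productId = f.get("productId");
        String userId = f.get("userId");
        String score = f.get("score");
        if (productId == null || userId == null || score == null) {
            return null;
        }
        return new Review(productId, userId, Double.parseDouble(score));
    }

    public static void main(String[] args) {
        String record = "productId: C58500585F\n"
                      + "product:  Nun Toy\n"
                      + "userId: A3NM6WTIAE\n"
                      + "score: 2.0\n"
                      + "summary: not very much fun";
        System.out.println(parse(record)); // prints (C58500585F,A3NM6WTIAE,2.0)
    }
}
```

In a Spark job written in Java, a function like this would presumably be applied to the Text value of each record obtained from JavaSparkContext.newAPIHadoopFile with the same Configuration as above, for example inside a map over the resulting JavaPairRDD<LongWritable, Text>. In that setting Review would also need to implement java.io.Serializable.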

Answered 2021-06-03
