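I'm working with Spark/Scala and need to filter an RDD by a specific field in one of its columns, in this case `user`.
I want to get back only the RDD rows whose user is one of ["Joe","Plank","Willy"], but I can't figure out how to do it.
This is my RDD: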
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Roger"}
Expected output:
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
I load the RDD with Spark like this (pseudocode):
val sparkConf = new SparkConf().setAppName("MyApp")
master.foreach(sparkConf.setMaster)
val sc = new SparkContext(sparkConf)
val rdd = sc.textFile(inputDir)
rdd.filter(_.contains("\"user\":\"THE_ARRAY_OF_NAMES_"))
1 Answer
inkz8wg91#
This is easier with a DataFrame.
With the `from_json` function you can convert that JSON column into multiple columns.
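A minimal sketch of the DataFrame approach, assuming each line has the shape `<timestamp> <host> <json>` as in the question's sample, and that the JSON object carries the three fields shown there. The names `FilterByUserDataFrame`, `payloadSchema`, `filterByUser`, the `"input"` default path, and the `local[*]` master are hypothetical and only for illustration. The idea is to cut the trailing JSON object out of each line with `regexp_extract`, parse it with `from_json`, and keep the rows whose `user` is in the wanted list.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, regexp_extract}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object FilterByUserDataFrame {
  // Schema of the JSON payload, taken from the sample rows in the question (assumption).
  val payloadSchema: StructType = new StructType()
    .add("request_method", StringType)
    .add("request_length", LongType)
    .add("user", StringType)

  // Keeps the original lines whose JSON "user" field is one of `users`.
  def filterByUser(lines: Dataset[String], users: Seq[String]): Dataset[String] =
    lines
      // A Dataset[String] exposes its content as a single column named "value".
      .withColumn(
        "payload",
        from_json(regexp_extract(col("value"), "(\\{.*\\})\\s*$", 1), payloadSchema))
      // Lines without parsable JSON yield a null payload and are dropped here.
      .filter(col("payload.user").isin(users: _*))
      .select("value")
      .as[String](Encoders.STRING)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MyApp")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    try {
      val inputDir = args.headOption.getOrElse("input") // hypothetical path
      val lines = spark.read.textFile(inputDir)
      filterByUser(lines, Seq("Joe", "Plank", "Willy")).collect().foreach(println)
    } finally spark.stop()
  }
}
```

Selecting `payload.*` instead of `value` would give one column per JSON field, if the parsed columns are more useful than the raw line.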
But if you want an RDD solution, let me know.
Edit with the RDD version: keep in mind this is just one of many ways to do it.
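One possible RDD-only sketch, assuming the JSON is flat and user names contain no escaped quotes, so a regular expression is enough to pull out the `user` value. A real JSON parser would be more robust. The names `FilterByUserRdd`, `extractUser`, `filterByUser`, and the `"input"` default path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FilterByUserRdd {
  // Matches "user":"<name>" and captures <name>.
  // Assumption: flat JSON, no escaped quotes inside the name.
  private val UserField = "\"user\"\\s*:\\s*\"([^\"]*)\"".r

  // Returns the user name found in a log line, if any.
  def extractUser(line: String): Option[String] =
    UserField.findFirstMatchIn(line).map(_.group(1))

  // Keeps only the lines whose user is in `users`; lines without a user are dropped.
  def filterByUser(rdd: RDD[String], users: Set[String]): RDD[String] =
    rdd.filter(line => extractUser(line).exists(users.contains))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setIfMissing("spark.master", "local[*]") // assumption: local run unless a master is supplied
    val sc = new SparkContext(conf)
    try {
      val inputDir = args.headOption.getOrElse("input") // hypothetical path
      val filtered = filterByUser(sc.textFile(inputDir), Set("Joe", "Plank", "Willy"))
      filtered.collect().foreach(println)
    } finally sc.stop()
  }
}
```

Comparing the extracted name against a `Set` matches exact names only, whereas a plain `contains` on the raw line could also match the same text elsewhere in the record.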
To check that the result is correct, you can use:
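The original snippet for this step is not available here, so the following is only a sketch of one common way to inspect a filtered RDD on the driver. The names `InspectResult` and `preview` are hypothetical. It uses `take` rather than `collect` so that a large result is not pulled into driver memory all at once.

```scala
import org.apache.spark.rdd.RDD

object InspectResult {
  // Prints up to `n` rows on the driver and returns them for further checks.
  def preview(rows: RDD[String], n: Int = 20): Array[String] = {
    val sample = rows.take(n)
    sample.foreach(println)
    sample
  }
}
```

For the sample data in the question, the printed rows should contain only the users Joe, Plank, and Willy.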