Filter an RDD's CSV with a JSON field using Spark/Scala

Posted by xfb7svmp on 2021-05-27 in Spark

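I'm learning Spark/Scala and need to filter an RDD by a specific field inside one of its columns, in this case user.
I want to get back an RDD containing only the users ["Joe","Plank","Willy"], but I can't figure out how to do it.
Here is my RDD: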

2020-03-01T00:00:05Z    my.local5.url   {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z    my.local6.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Tracy"}
2020-03-01T00:00:05Z    my.local6.url   {"request_method":"GET","request_length":281,"user":"Roger"}

Expected output:

2020-03-01T00:00:05Z    my.local5.url   {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z    my.local6.url   {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z    my.local2.url   {"request_method":"GET","request_length":281,"user":"Plank"}

I loaded the RDD with Spark as follows (pseudocode):

val sparkConf = new SparkConf().setAppName("MyApp")
master.foreach(sparkConf.setMaster)
val sc = new SparkContext(sparkConf)

val rdd = sc.textFile(inputDir)
rdd.filter(_.contains("\"user\":\"THE_ARRAY_OF_NAMES_"))

scala apache-spark pyspark apache-spark-sql spark-streaming

Source: https://stackoverflow.com/questions/62088886/filter-rdds-csv-with-json-field-using-spark-scala

1 answer

inkz8wg91#

This is easier with a DataFrame.
The from_json function can turn that JSON column into multiple columns:

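import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._ // enables the $"..." column syntax; assumes a SparkSession named spark

// Schema of the JSON document stored in the last column
val jsonSchema = StructType(Array(
    StructField("request_method",StringType,true),
    StructField("request_length",IntegerType,true),
    StructField("user",StringType,true)
  ))

val myDf = spark.read.option("header", "true").csv(path)
// "column_name" is a placeholder for the column that holds the JSON string
val formatedDf = myDf.withColumn("formated_json", from_json($"column_name", jsonSchema))
  .select($"formated_json.*")
  .where($"user".isin("Joe","Plank","Willy"))

formatedDf.show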
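
The sample rows in the question look tab-separated with no header line, and the expected output keeps each whole line rather than only the JSON fields. A minimal sketch of a DataFrame variant under those assumptions follows. The object name `FilterByUserDf`, the helper `filterUsers`, and the local master are hypothetical choices for illustration, and `get_json_object` is swapped in for `from_json` so that no schema is needed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object, split}

object FilterByUserDf {
  // Hypothetical helper: keeps the raw lines whose JSON "user" field is in `users`.
  // Assumes each line is <timestamp>\t<host>\t<json>, as the question's sample suggests.
  def filterUsers(spark: SparkSession, lines: Seq[String], users: Seq[String]): Array[String] = {
    import spark.implicits._
    lines.toDF("value")
      // The third tab-separated field holds the JSON document
      .withColumn("json", split(col("value"), "\t").getItem(2))
      // Extract $.user as a string and keep only the wanted names
      .where(get_json_object(col("json"), "$.user").isin(users: _*))
      // Return the untouched original line
      .select("value")
      .as[String]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilterByUserDf")
      .master("local[*]") // assumption: local run for a quick check
      .getOrCreate()
    val sample = Seq(
      "2020-03-01T00:00:05Z\tmy.local5.url\t{\"request_method\":\"GET\",\"request_length\":281,\"user\":\"Joe\"}",
      "2020-03-01T00:00:05Z\tmy.local2.url\t{\"request_method\":\"GET\",\"request_length\":281,\"user\":\"Tracy\"}"
    )
    filterUsers(spark, sample, Seq("Joe", "Plank", "Willy")).foreach(println)
    spark.stop()
  }
}
```

On the real input, `spark.read.text(inputDir)` produces the same single `value` column, so the chain from `withColumn` onward should apply unchanged. Lines with fewer than three fields yield a null JSON value and are dropped by the filter.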

If you would rather have an RDD solution, let me know.
Edit, with the RDD version: keep in mind this is only one of many ways to do it.

//Define a regex pattern
val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
//Define a Set with your filtered values
val userSet = Set("Joe","Plank","Willy")
//Filter only the values you want
val filteredRdd = rdd.filter( x => {
    //Extract the user using the pattern we just declared
    val user = for(m <- Pattern.findFirstMatchIn(x)) yield m.group(1)
    //If the user variable is equal with one of your set values then this statement will return true and based on that the row will be kept
    userSet(user.getOrElse(""))
})

To check that the result is correct, you can use:

filteredRdd.collect().foreach(println)
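
The predicate in the RDD version does not depend on Spark, so it can be tried on a plain Scala collection before running it on a cluster. The following is a minimal sketch of that idea; the object name `UserFilterCheck` and the helper `keep` are hypothetical, while the pattern and the set are the ones from the RDD version above.

```scala
object UserFilterCheck {
  // Same regex and user set as in the RDD version
  private val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
  private val userSet = Set("Joe", "Plank", "Willy")

  // Hypothetical helper: true when the line's "user" value is one of the wanted names.
  // A line with no match yields None, so exists(...) returns false and the line is dropped.
  def keep(line: String): Boolean =
    Pattern.findFirstMatchIn(line).map(_.group(1)).exists(userSet)

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "2020-03-01T00:00:05Z\tmy.local5.url\t{\"request_method\":\"GET\",\"request_length\":281,\"user\":\"Joe\"}",
      "2020-03-01T00:00:05Z\tmy.local2.url\t{\"request_method\":\"GET\",\"request_length\":281,\"user\":\"Tracy\"}",
      "2020-03-01T00:00:05Z\tmy.local6.url\t{\"request_method\":\"GET\",\"request_length\":281,\"user\":\"Plank\"}"
    )
    // Prints only the lines that the filter would keep
    sample.filter(keep).foreach(println)
  }
}
```

The same function could then be passed to the RDD as `rdd.filter(UserFilterCheck.keep)`. Note that the pattern only matches purely alphabetic user names, and that the set lookup is case-sensitive even though the regex is not.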

Answered on 2021-05-27
