DataFrame: how to extract only part of a string from a string column of a DataFrame in Spark (Scala)

rt4zxlrg  posted on 2021-07-13 in Spark

In a DataFrame, I have a column that contains the following data:

('Rated 3.0', "RATED\n \nWent there for a quick bite with friends.\nThe ambience had more of corporate feel. I would say it was unique.\nTried nachos, pasta churros and lasagne.\n\nNachos were pathetic.( Seriously don't order)\nPasta was okayish.\nLasagne was good.\nNutella churros were the best.\nOverall an okayish experience!\nPeace ??"), ('Rated 4.0', "RATED\n  First of all, a big thanks to the staff of this Cafe. Very polite and courteous.\n\nI was there 15mins before their closing time. Without any discomfort or hesitation, the staff welcomed me with a warm smile and said they're still open, though they were preparing to close the cafe for the day.\n\nQuickly ordered the Thai green curry, which is served with rice. They got it for me within 10mins, hot and freshly made.\n\nIt was tasty with the taste of coconut milk. Not very spicy, it was mild spicy.\n\nI saw they had yummy looking dessert menu, should go there to try them out!\n\nA good spacious place to hang out for coffee, pastas, pizza or Thai food.")

I need to pull out the Rated 3.0 part of each record. The column is of StringType. How can I strip the extra data and extract just that part?


wrrgggsh1#

If every row has the format Rated x.x, you can simply use the substring function.

scala> df.select(substring('value,3,9)).show
+----------------------+
|substring(value, 3, 9)|
+----------------------+
|             Rated 3.0|
+----------------------+
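The fixed offsets passed to substring only hold while every value starts with exactly (' followed by a nine-character rating. As a hedged alternative (a different technique from the one in this answer), regexp_extract can locate the first rating by pattern instead of by position. The following is a minimal sketch: the column name value follows the answer, while the shortened sample row, the pattern, and the local SparkSession are assumptions made for illustration.

```scala
// Written as spark-shell / Scala script statements.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

// Assumption: a local session for the sketch; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("first-rating").getOrCreate()
import spark.implicits._

// Hypothetical, shortened version of the question's sample data.
val df = Seq(
  """('Rated 3.0', "RATED\n Nachos were pathetic."), ('Rated 4.0', "RATED\n Very polite staff.")"""
).toDF("value")

// Group index 0 returns the whole match; only the first match in each row is returned.
val firstRating = df.select(
  regexp_extract(col("value"), "Rated \\d+\\.\\d+", 0).as("rating")
)

firstRating.show(false)
```

Rows without a match yield an empty string rather than null, so they may need filtering afterwards.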

If a single row contains several ratings, you can try regexp_replace and replace the following values:

(' to "
', to ":
") to "

In addition, add { at the start of the string and } at the end, so that the format looks like this:

{
    "a": "b",
    "c": "d"
}

This gives you a JSON string, and in the next step you can use the from_json function to build an array/struct and read the values from it.
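The replacement steps above can be sketched as follows. This is only an illustration under stated assumptions: the sample row is a simplified, hypothetical version of the question's data, the column name value is taken from the first snippet, and parsing into a MapType(StringType, StringType) and reading its keys with map_keys (Spark 2.4 or later) is one possible way to "get those values", not necessarily what the answer's author had in mind.

```scala
// Written as spark-shell / Scala script statements.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, from_json, lit, map_keys, regexp_replace}
import org.apache.spark.sql.types.{MapType, StringType}

// Assumption: a local session for the sketch; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("ratings-json").getOrCreate()
import spark.implicits._

// Hypothetical, simplified sample row in the question's tuple-like format.
val df = Seq(
  """('Rated 3.0', "Nice place"), ('Rated 4.0', "Great staff")"""
).toDF("value")

// The three replacements listed in the answer.
val step1 = regexp_replace(col("value"), "\\('", "\"") // ('  ->  "
val step2 = regexp_replace(step1, "',", "\":")         // ',  ->  ":
val step3 = regexp_replace(step2, "\"\\)", "\"")       // ")  ->  "

// Wrap the result in { } so it becomes a JSON object string.
val json = concat(lit("{"), step3, lit("}"))

// Parse the JSON object as a map and keep only its keys, i.e. the ratings.
val ratings = df.select(
  map_keys(from_json(json, MapType(StringType, StringType))).as("ratings")
)

ratings.show(false)
```

Two caveats: if two reviews in one row share the same rating, the JSON object has duplicate keys, and review text that itself contains one of the replaced character sequences would break the conversion. Neither case is handled in this sketch.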


ojsjcaue2#

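My solution is below, assuming the two records from the question.
// Create the list (outside spark-shell, first import spark.implicits._ and org.apache.spark.sql.functions._)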

val mytestList=List(("""Rated 3.0, RATED Went there for a quick bite with friends.The ambience had more of corporate feel. I would say it was unique.Tried nachos, pasta churros and lasagne.Nachos were pathetic.( Seriously don't order)Pasta was okayish.Lasagne was good.Nutella churros were the best.Overall an okayish experience!Peace ??"""), 
("""Rated 4.0, RATED  First of all, a big thanks to the staff of this Cafe. Very polite and courteous.I was there 15mins before their closing time. Without any discomfort or hesitation, the staff welcomed me with a warm smile and said they're still open, though they were preparing to close the cafe for the day.Quickly ordered the Thai green curry, which is served with rice. They got it for me within 10mins, hot and freshly made.It was tasty with the taste of coconut milk. Not very spicy, it was mild spicy.I saw they had yummy looking dessert menu, should go there to try them out!A good spacious place to hang out for coffee, pastas, pizza or Thai food."""))

// Load the list into an RDD

val rdd = spark.sparkContext.parallelize(mytestList)

// Set the column name for the schema

val DF1 = rdd.toDF("Rating")

// Solution 1: split on the comma and select the first item

DF1.withColumn("tmp", split($"Rating", ",")).select($"tmp".getItem(0).as("col1")).show()
+---------+
|     col1|
+---------+
|Rated 3.0|
|Rated 4.0|
+---------+

// Solution 2: keep the first item in a new column and drop the original column

DF1.withColumn("tmp", split(col("Rating"), ",").getItem(0)).drop("Rating").show()

+---------+
|      tmp|
+---------+
|Rated 3.0|
|Rated 4.0|
+---------+
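Both solutions above split on the first comma, so they return one rating per row and assume each row starts with the rating. For rows that hold several tuples, as in the question's raw sample, a minimal sketch of a different technique is to collect every match with a plain Scala regex wrapped in a UDF. The names allRatings and value, the pattern, and the shortened sample row are assumptions made for illustration.

```scala
// Written as spark-shell / Scala script statements.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

// Assumption: a local session for the sketch; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("all-ratings").getOrCreate()
import spark.implicits._

// Hypothetical, shortened row holding two (rating, review) tuples.
val df = Seq(
  """('Rated 3.0', "RATED\n Nachos were pathetic."), ('Rated 4.0', "RATED\n Very polite staff.")"""
).toDF("value")

// Hypothetical helper: returns every "Rated x.x" occurrence in a string.
// The regex is built inside the function so the UDF does not capture outer state.
val allRatings = udf { (s: String) =>
  if (s == null) List.empty[String]
  else "Rated \\d+\\.\\d+".r.findAllIn(s).toList
}

// explode turns the list of matches into one output row per rating.
val ratings = df.select(explode(allRatings(col("value"))).as("rating"))

ratings.show(false)
```

The pattern would also match a review body that happens to contain text such as Rated 5.0, so it may need tightening for real data.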
