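I am trying to process text data (Twitter tweets) with PySpark. Emojis and special characters are read correctly, but "\n" and "&amp;" appear to be escaped. Spark does not seem to interpret them, and the same may be true of other sequences. An example tweet in my Spark DataFrame looks like this:
"Hello everyone\n\nHow is it going? ? Take care &amp; enjoy"
I would like Spark to read them correctly. The files are stored as Parquet, and I read them like this: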

tweets = spark.read.format('parquet')\
.option('header', 'True')\
.option('encoding', 'utf-8')\
.load(path)

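Here is some sample input data, taken from the original JSONL file (I later stored the data as Parquet).
"full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED"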

Reading directly from the JSONL file leads to the same problem.

tweets = spark.read\
.option('encoding', 'utf-8')\
.json(path)

How can I get Spark to interpret them correctly? Thanks in advance.

The code below may help solve your problem.
Input:

"Hello everyone\n\nHow is it going? ? Take care &amp; enjoy"

"full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &amp;"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED @theNCI @NCIprevention @AmericanCancer @cancereu @uicc @IARCWHO @EuropeanCancer @KanserSavascisi @AUTF_DEKANLIK @OncoAlert"
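A note on the sample lines above: `\n`, `\u2b55` and `\ud83d\udc47` are JSON string escapes, so a JSON parser turns them into real characters, whereas `&amp;` is an HTML entity that a JSON parser leaves alone. The sketch below uses only Python's standard `json` module, independent of Spark, to illustrate the difference. The record is abbreviated from the sample, and the variable names are made up for illustration.

```python
import json

# One JSONL record, abbreviated from the sample above.
# The raw string keeps the backslashes exactly as they appear in the file.
line = r'{"full_text": "(see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED &amp;"}'

record = json.loads(line)
text = record["full_text"]

# After parsing, \n is a real newline character and the surrogate pair
# \ud83d\udc47 has been combined into a single code point (U+1F447).
has_real_newline = "\n" in text
has_literal_backslash = "\\" in text
pointing_emoji = text[len("(see the figure")]

# The HTML entity is untouched by JSON parsing.
still_has_entity = "&amp;" in text

print(repr(text))
```

If the same holds for the data in your DataFrame, the `\n` you see may be a real newline that is only displayed in escaped form, while `&amp;` still needs explicit decoding.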

Code that solves the problem:

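from pyspark.sql.functions import regexp_replace

# Read each line of the sample file into a single string column (_c0).
df = spark.read.csv("file:///home/sathya/Desktop/stackoverflo/raw-data/input.tweet")

# Decode the HTML entity &amp; to & and drop the literal two-character sequence \n.
df1 = df.withColumn(
    "cleandata",
    regexp_replace(regexp_replace("_c0", "&amp;", "&"), r"\\n", ""),
)
df1.select("cleandata").show(truncate=False)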

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cleandata                                                                                                                                                                                                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Hello everyoneHow is it going? ? Take care & enjoy                                                                                                                                                                                                                                                                          |
|"full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &"                                                                                                                                                           |
|"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \u2b55\ufe0fThat means they can be PREVENTED @theNCI @NCIprevention @AmericanCancer @cancereu @uicc @IARCWHO @EuropeanCancer @KanserSavascisi @AUTF_DEKANLIK @OncoAlert"|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
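As a more general variant of the regex above, Python's standard `html.unescape` decodes every HTML entity rather than only `&amp;`. The sketch below is plain Python and not part of the original answer. `clean_tweet` and its `newline_replacement` parameter are hypothetical names, and replacing newlines with a space (rather than removing them) is an assumption about what you want.

```python
import html
import re


def clean_tweet(text, newline_replacement=" "):
    """Decode HTML entities and replace newlines in a tweet.

    Handles both a real newline character and the literal two-character
    sequence backslash + n, since either may appear depending on how the
    data was read.
    """
    if text is None:
        return None
    decoded = html.unescape(text)
    return re.sub(r"\\n|\n", newline_replacement, decoded)


print(clean_tweet("Take care &amp; enjoy"))
```

One way to apply it to a DataFrame column would be to wrap it with `pyspark.sql.functions.udf` and a `StringType` return type. That adds Python UDF overhead compared with the built-in `regexp_replace`, so it is mainly worth it when many different entities occur.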
