PySpark: how to decode a URL-encoded column?

Asked by flvtvl50 12 months ago in Spark

Do you know how to decode the "campaign" column below in PySpark? The records in this column are URL-encoded strings:

+--------------------+------------------------+
|user_id             |campaign                |
+--------------------+------------------------+
|alskd9239as23093    |MM+%7C+Cons%C3%B3rcios+%|
|lfifsf093039388     |Aquisi%C3%A7%C3%A3o+%7C |
|kasd877191kdsd999   |Aquisi%C3%A7%C3%A3o+%7C |
+--------------------+------------------------+

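I know this can be done with Python's urllib library. However, my dataset is large, and converting it to a pandas DataFrame takes a long time. How can I do this with a Spark DataFrame?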

Answer #1 (6yt4nkrj)

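There is no need to convert to an intermediate pandas DataFrame. You can use a PySpark user-defined function (udf) to unquote the percent-encoded strings:

from urllib.parse import unquote
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df.withColumn('campaign', F.udf(unquote, StringType())('campaign'))

If the campaign column contains null values, you must do a null check before unquoting the string:

# Pass nulls (and empty strings) through unchanged; decode everything else.
f = lambda s: unquote(s) if s else s
df.withColumn('campaign', F.udf(f, StringType())('campaign')).show()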
+-----------------+-----------------+
|          user_id|         campaign|
+-----------------+-----------------+
| alskd9239as23093|MM+|+Consórcios+%|
|  lfifsf093039388|      Aquisição+||
|kasd877191kdsd999|      Aquisição+||
+-----------------+-----------------+
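
The output above still contains + characters, because unquote only decodes %XX escapes. If the + signs in this data stand for spaces (an assumption, based on the form-encoding convention), urllib.parse.unquote_plus decodes them as well. A minimal sketch in plain Python, where decode_campaign is a hypothetical helper name:

```python
from urllib.parse import unquote, unquote_plus


def decode_campaign(s):
    """Null-safe decoder (hypothetical helper): '+' becomes a space, %XX is decoded."""
    return unquote_plus(s) if s is not None else None


sample = "Aquisi%C3%A7%C3%A3o+%7C"

# unquote leaves '+' untouched; unquote_plus turns it into a space first.
plain = unquote(sample)           # 'Aquisição+|'
spaced = decode_campaign(sample)  # 'Aquisição |'

# A trailing '%' with no hex digits after it is kept as-is by both functions.
partial = decode_campaign("MM+%7C+Cons%C3%B3rcios+%")  # 'MM | Consórcios %'
```

The helper could be wrapped in F.udf in the same way as the function above.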

Answer #2 (zengzsys)

These should work:

Spark 3.5+

F.url_decode('campaign')

Spark 3.4+

F.expr("url_decode(campaign)")

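However, Spark reports that your value "MM+%7C+Cons%C3%B3rcios+%" is not properly formatted:
[CANNOT_DECODE_URL] The provided URL cannot be decoded: MM+%7C+Cons%C3%B3rcios+%. Please ensure that the URL is properly formatted and try again.
When I tried without that row, it worked: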

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('lfifsf093039388', 'Aquisi%C3%A7%C3%A3o+%7C'),
     ('kasd877191kdsd999', 'Aquisi%C3%A7%C3%A3o+%7C')],
    ['user_id', 'campaign'])

df.withColumn('campaign', F.url_decode('campaign')).show()
# +-----------------+-----------+
# |          user_id|   campaign|
# +-----------------+-----------+
# |  lfifsf093039388|Aquisição ||
# |kasd877191kdsd999|Aquisição ||
# +-----------------+-----------+
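
The rejected value ends in a % that is not followed by two hex digits, which makes it an incomplete percent-escape. One way to spot such rows before decoding is a regular-expression check. A minimal sketch in plain Python: MALFORMED_ESCAPE and is_malformed are hypothetical names, and the pattern only covers incomplete or non-hex escapes, which is an assumption about what url_decode rejects.

```python
import re

# A '%' that is NOT followed by exactly two hex digits (negative lookahead).
MALFORMED_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def is_malformed(s):
    """Return True if s contains an incomplete or non-hex percent-escape."""
    return s is not None and MALFORMED_ESCAPE.search(s) is not None


bad = is_malformed("MM+%7C+Cons%C3%B3rcios+%")  # trailing '%' has no hex digits
good = is_malformed("Aquisi%C3%A7%C3%A3o+%7C")  # every '%' starts a valid escape
```

Java regular expressions also support negative lookahead, so the same pattern should be usable on the Spark side, for example df.filter(~F.col('campaign').rlike('%(?![0-9A-Fa-f]{2})')), to exclude or inspect the offending rows before calling url_decode.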
