Parsing URL strings in a Spark DataFrame with PySpark

t1rydlwq  posted on 2021-05-19  in  Spark

I need to parse the URL strings in the refererurl column of a Spark DataFrame. The data looks like this:

refererurl
https://www.delish.com/cooking/recipes/t678
https://www.delish.com/food/recipes/a463/
https://www.delish.com/cooking/recipes/g877

I am only interested in the path segment that comes right after delish.com. The expected output is:

content
cooking
food
cooking

I tried:

data.withColumn("content", fn.regexp_extract('refererurl', 'param1=(\d)', 2))

This returns null for every row.


ttvkxqim #1

Another way to solve this is to use the split and element_at functions, provided the segment you want always sits at the same position in the string.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,"https://www.delish.com/cooking/recipes/t678"), (2,"https://www.delish.com/food/recipes/a463/"),(3,"https://www.delish.com/cooking/recipes/g877")],[ "col1","col2"])
df.show(truncate=False)
# Splitting on "/" gives: 1 = "https:", 2 = "", 3 = host, 4 = first path segment (element_at is 1-based)
df = df.withColumn("splited_col", F.split("col2", "/"))
df = df.withColumn("content", F.element_at(F.col("splited_col"), 4))
df.show(truncate=False)

Input

+----+-------------------------------------------+
|col1|col2                                       |
+----+-------------------------------------------+
|1   |https://www.delish.com/cooking/recipes/t678|
|2   |https://www.delish.com/food/recipes/a463/  |
|3   |https://www.delish.com/cooking/recipes/g877|
+----+-------------------------------------------+

Output

+----+-------------------------------------------+--------------------------------------------------+-------+
|col1|col2                                       |splited_col                                       |content|
+----+-------------------------------------------+--------------------------------------------------+-------+
|1   |https://www.delish.com/cooking/recipes/t678|[https:, , www.delish.com, cooking, recipes, t678]|cooking|
|2   |https://www.delish.com/food/recipes/a463/  |[https:, , www.delish.com, food, recipes, a463, ] |food   |
|3   |https://www.delish.com/cooking/recipes/g877|[https:, , www.delish.com, cooking, recipes, g877]|cooking|
+----+-------------------------------------------+--------------------------------------------------+-------+
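
The index arithmetic in this answer can be checked without a Spark session. What follows is a minimal plain-Python sketch, not part of the original answer: `segment_at` is a hypothetical helper that imitates 1-based positional lookup on the pieces of a string split on "/", and it returns None when the position does not exist.

```python
# Plain-Python sketch of the split + positional lookup logic (no Spark needed).
# `segment_at` is a hypothetical helper for illustration, not a PySpark function.
urls = [
    "https://www.delish.com/cooking/recipes/t678",
    "https://www.delish.com/food/recipes/a463/",
    "https://www.delish.com/cooking/recipes/g877",
]

def segment_at(url, position):
    """Return the 1-based `position`-th piece of `url` split on "/", or None if absent."""
    parts = url.split("/")  # e.g. ['https:', '', 'www.delish.com', 'cooking', ...]
    if 1 <= position <= len(parts):
        return parts[position - 1]
    return None

content = [segment_at(u, 4) for u in urls]
print(content)  # ['cooking', 'food', 'cooking']
```

Position 4 only works because every URL starts with a scheme followed by "//", which contributes the first two pieces. A URL stored without the scheme would shift every position by two, so this approach is best suited to columns whose URLs are known to share one shape.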

gcuhipw9 #2

You can use parse_url to get the path of the URL, then regexp_extract to get the first level of that path:

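from pyspark.sql import functions as fn
# parse_url(..., 'PATH') returns the path; the regex captures the text between its first two slashes
df.withColumn("content", fn.expr("regexp_extract(parse_url(refererurl, 'PATH'),'/(.*?)/')")) \
    .show(truncate=False)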

Output:

+-------------------------------------------+-------+
|refererurl                                 |content|
+-------------------------------------------+-------+
|https://www.delish.com/cooking/recipes/t678|cooking|
|https://www.delish.com/food/recipes/a463/  |food   |
|https://www.delish.com/cooking/recipes/g877|cooking|
+-------------------------------------------+-------+
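
One thing worth knowing about the pattern '/(.*?)/' is that it needs a second slash after the first path level. The plain-Python sketch below illustrates this under stated assumptions: `first_path_level` is a hypothetical helper, `urllib.parse.urlparse` stands in for Spark's parse_url (the two are not guaranteed to agree on every malformed URL), and the alternative pattern `^/([^/]+)` is a suggestion rather than part of the original answer. The helper returns an empty string when nothing matches, which is what Spark's regexp_extract is documented to do.

```python
import re
from urllib.parse import urlparse

def first_path_level(url, pattern=r"/(.*?)/"):
    """Return group 1 of `pattern` applied to the URL path, or "" when there is no match."""
    path = urlparse(url).path          # e.g. "/cooking/recipes/t678"
    match = re.search(pattern, path)
    return match.group(1) if match else ""

# The pattern from the answer works when the path has at least two slashes.
print(first_path_level("https://www.delish.com/cooking/recipes/t678"))  # cooking

# With a single-level path and no trailing slash, the same pattern finds nothing.
print(repr(first_path_level("https://www.delish.com/cooking")))  # ''

# An anchored pattern that stops at the next slash covers both shapes.
print(first_path_level("https://www.delish.com/cooking", r"^/([^/]+)"))  # cooking
```

For the three sample URLs in the question both patterns give the same result, so the stricter pattern only matters if the column may contain shorter paths.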
