在pyspark中将字符串列拆分为多列

h79rfbju  于 2023-01-25  发布在  Spark
关注(0)|答案(2)|浏览(226)

我有一个表如下-

+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm                           |pur_region     |purchase_dt  |pur_details                                                                                                                                                                                                                            |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121  |Johnny                           |USA            |2023-01-12   |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0137  |Kevin J Brown                    |USA            |2022-05-31   |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0168  |Patrikson                        |UK             |2022-11-08   |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0365  |Ryan Ray                         |USA            |2021-10-12   |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|2600  |Jay                              |AUS            |2022-11-11   |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}]                                         |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

此表的架构为-

root
 |-- cus_id: string (nullable = true)
 |-- cus_nm: string (nullable = true)
 |-- pur_region: string (nullable = true)
 |-- purchase_dt: string (nullable = true)
 |-- pur_details: string (nullable = true)

我想拆分列pur_details,并将checksale_price_gap作为单独的列提取出来。注意,pur_details可能有也可能没有checksale_price_gap,因此如果它在pur_details中不存在,那么新列的值应该为空。
预期输出示例-

+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm                           |pur_region     |purchase_dt  |check     |sale_price_gap       |pur_details                                                                                                                                                                                                                            |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121  |Johnny                           |USA            |2023-01-12   |sale_item |upto 30% on_sale     |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0137  |Kevin J Brown                    |USA            |2022-05-31   |sale_item |upto 30% on_sale     |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0168  |Patrikson                        |UK             |2022-11-08   |sale_item |upto 30% on_sale     |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0365  |Ryan Ray                         |USA            |2021-10-12   |sale_item |upto 30% on_sale     |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|2600  |Jay                              |AUS            |2022-11-11   |sale_item |BOGO 20% off         |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}]                                         |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

当我有数百万行时,有人能用最好和有效的方法帮助我吗?
先谢了。

mzmfm0qo

mzmfm0qo1#

假设除了那些单独的字段之外,字符串上没有,,我希望这段代码能帮助您获得实现所需的想法:

df.withColumn("pur_details_split", split(col("pur_details"), ","))
    .withColumn("check", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("check")), 1), "="), 2))
    .withColumn("sale_price_gap", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("sale_price_gap")), 1), "="), 2))
    .select("check", "sale_price_gap")
    .show(false)

+---------+----------------+
|check    |sale_price_gap  |
+---------+----------------+
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|null            |
+---------+----------------+

我不记得Spark API的python等价物了,但我敢肯定它们与Scala的等价物相似。
Ps 2.我从原始 Dataframe 的最后一条记录中删除了sale_price_gap,以支持不存在值的情况。

w3nuxt5m

w3nuxt5m2#

我尝试使用regexp_extract,它减少了代码,而且可能更快。
df.withColumn("check",F.regexp_extract("pur_details", "check=([^,}]*)(.*)" ,1)).withColumn("sale_price_gap",F.regexp_extract("pur_details", "sale_price_gap=([^,}]*)(.*)" ,1)).show()
您可以在此处测试正则表达式check regexp

+--------------------+-------+---------+----------------+
|         pur_details|cust_id|    check|  sale_price_gap|
+--------------------+-------+---------+----------------+
|[{product_id=XA80...|   0121|sale_item|upto 30% on_sale|
|[{product_id=XA14...|   0137|sale_item|upto 30% on_sale|
|[{product_id=XAHJ...|   0168|sale_item|upto 30% on_sale|
|[{product_id=XAOP...|   0365|sale_item|upto 30% on_sale|
|[{product_id=XA09...|   2600|sale_item|    BOGO 20% off|
+--------------------+-------+---------+----------------+

相关问题