我有一个表如下-

+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm                           |pur_region     |purchase_dt  |pur_details                                                                                                                                                                                                                            |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121  |Johnny                           |USA            |2023-01-12   |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0137  |Kevin J Brown                    |USA            |2022-05-31   |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0168  |Patrikson                        |UK             |2022-11-08   |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0365  |Ryan Ray                         |USA            |2021-10-12   |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|2600  |Jay                              |AUS            |2022-11-11   |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}]                                         |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

此表的架构为-

root
 |-- cus_id: string (nullable = true)
 |-- cus_nm: string (nullable = true)
 |-- pur_region: string (nullable = true)
 |-- purchase_dt: string (nullable = true)
 |-- pur_details: string (nullable = true)

我想拆分列pur_details，并将check和sale_price_gap作为单独的列提取出来。注意，pur_details可能有也可能没有check和sale_price_gap，因此如果它在pur_details中不存在，那么新列的值应该为空。
预期输出示例-

+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm                           |pur_region     |purchase_dt  |check     |sale_price_gap       |pur_details                                                                                                                                                                                                                            |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121  |Johnny                           |USA            |2023-01-12   |sale_item |upto 30% on_sale     |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0137  |Kevin J Brown                    |USA            |2022-05-31   |sale_item |upto 30% on_sale     |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0168  |Patrikson                        |UK             |2022-11-08   |sale_item |upto 30% on_sale     |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0365  |Ryan Ray                         |USA            |2021-10-12   |sale_item |upto 30% on_sale     |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|2600  |Jay                              |AUS            |2022-11-11   |sale_item |BOGO 20% off         |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}]                                         |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

当我有数百万行时，有人能用最好和有效的方法帮助我吗？
先谢了。

假设除了那些单独的字段之外，字符串上没有,，我希望这段代码能帮助您获得实现所需的想法：

df.withColumn("pur_details_split", split(col("pur_details"), ","))
    .withColumn("check", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("check")), 1), "="), 2))
    .withColumn("sale_price_gap", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("sale_price_gap")), 1), "="), 2))
    .select("check", "sale_price_gap")
    .show(false)

+---------+----------------+
|check    |sale_price_gap  |
+---------+----------------+
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|null            |
+---------+----------------+

我不记得Spark API的python等价物了，但我敢肯定它们与Scala的等价物相似。
Ps 2.我从原始 Dataframe 的最后一条记录中删除了sale_price_gap，以支持不存在值的情况。

2条答案

按热度按时间

mzmfm0qo1#

赞(0）回复(0）举报 2023-01-25

w3nuxt5m2#

我尝试使用regexp_extract，它减少了代码，而且可能更快。
df.withColumn("check",F.regexp_extract("pur_details", "check=([^,}]*)(.*)" ,1)).withColumn("sale_price_gap",F.regexp_extract("pur_details", "sale_price_gap=([^,}]*)(.*)" ,1)).show()
您可以在此处测试正则表达式check regexp

+--------------------+-------+---------+----------------+
|         pur_details|cust_id|    check|  sale_price_gap|
+--------------------+-------+---------+----------------+
|[{product_id=XA80...|   0121|sale_item|upto 30% on_sale|
|[{product_id=XA14...|   0137|sale_item|upto 30% on_sale|
|[{product_id=XAHJ...|   0168|sale_item|upto 30% on_sale|
|[{product_id=XAOP...|   0365|sale_item|upto 30% on_sale|
|[{product_id=XA09...|   2600|sale_item|    BOGO 20% off|
+--------------------+-------+---------+----------------+

在pyspark中将字符串列拆分为多列

2条答案

相关问题

热门标签

最新问答