我有一个表如下-
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm |pur_region |purchase_dt |pur_details |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121 |Johnny |USA |2023-01-12 |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0137 |Kevin J Brown |USA |2022-05-31 |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0168 |Patrikson |UK |2022-11-08 |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0365 |Ryan Ray |USA |2021-10-12 |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|2600 |Jay |AUS |2022-11-11 |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}] |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
此表的架构为-
root
|-- cus_id: string (nullable = true)
|-- cus_nm: string (nullable = true)
|-- pur_region: string (nullable = true)
|-- purchase_dt: string (nullable = true)
|-- pur_details: string (nullable = true)
我想拆分列pur_details
,并将check
和sale_price_gap
作为单独的列提取出来。注意,pur_details
可能有也可能没有check
和sale_price_gap
,因此如果它在pur_details
中不存在,那么新列的值应该为空。
预期输出示例-
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm |pur_region |purchase_dt |check |sale_price_gap |pur_details |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121 |Johnny |USA |2023-01-12 |sale_item |upto 30% on_sale |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0137 |Kevin J Brown |USA |2022-05-31 |sale_item |upto 30% on_sale |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0168 |Patrikson |UK |2022-11-08 |sale_item |upto 30% on_sale |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0365 |Ryan Ray |USA |2021-10-12 |sale_item |upto 30% on_sale |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|2600 |Jay |AUS |2022-11-11 |sale_item |BOGO 20% off |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}] |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
当我有数百万行时,有人能用最好和有效的方法帮助我吗?
先谢了。
2条答案
按热度按时间mzmfm0qo1#
假设除了那些单独的字段之外,字符串上没有
,
,我希望这段代码能帮助您获得实现所需的想法:我不记得Spark API的python等价物了,但我敢肯定它们与Scala的等价物相似。
Ps 2.我从原始 Dataframe 的最后一条记录中删除了
sale_price_gap
,以支持不存在值的情况。w3nuxt5m2#
我尝试使用regexp_extract,它减少了代码,而且可能更快。
df.withColumn("check",F.regexp_extract("pur_details", "check=([^,}]*)(.*)" ,1)).withColumn("sale_price_gap",F.regexp_extract("pur_details", "sale_price_gap=([^,}]*)(.*)" ,1)).show()
您可以在此处测试正则表达式check regexp