PySpark: rename multiple columns based on a list of regex patterns

Posted by wmvff8tz on 2022-12-22 in Spark

I have a DataFrame like the one shown below, and I want to rename its columns based on a list of regular expression patterns.

patterns = ["price-usd-([0-9]+)", "list_price_([0-9]+)", "price_per_([0-9]+)_units", "pricefor([0-9]+)", "([0-9]+)_plus_price", "break_price_([0-9]+)", "price_break_pricing_([a-z]+)"]

Using these patterns, I want to rename the columns of the following DataFrame:

------------------------------------------------------------------------------------------------------------------------------------------
| item_name | price-usd-1 | break_price_7  |    pricefor5  |  price_per_9_units | price_break_pricing_a |  2_plus_price  | list_price_8  |
------------------------------------------------------------------------------------------------------------------------------------------
| Samsung Z |   10000     |         5      |    9000       |         10         |          7000         |      4         |       21      |
| Moto G4   |   12000     |         10     |    10000      |         20         |          6000         |      3         |       43      |
| Mi 4i     |   15000     |         8      |    12000      |         20         |         10000         |      5         |       25      |
| Moto G3   |   20000     |         5      |    18000      |         12         |         15000         |      10        |       15      |
------------------------------------------------------------------------------------------------------------------------------------------

Expected output:

----------------------------------------------------------------------------------------------------------------------
| item_name |    price_1  |    price_7     |     price_5   |       price_9      |  price_a   |   price_2  |  price_8 |   
----------------------------------------------------------------------------------------------------------------------
| Samsung Z |   10000     |         5      |    9000       |         10         |  7000      |    4       |    21    |
| Moto G4   |   12000     |         10     |    10000      |         20         |  6000      |    3       |    43    |
| Mi 4i     |   15000     |         8      |    12000      |         20         |  10000     |    5       |    25    |
| Moto G3   |   20000     |         5      |    18000      |         12         |  15000     |    10      |    15    |
----------------------------------------------------------------------------------------------------------------------

pyspark

Source: https://stackoverflow.com/questions/74868256/pyspark-rename-multiple-columns-based-on-regex-pattern-list

1 answer

icnyk63a1#

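I would take a similar approach to yours: use a regular expression to extract the relevant parts of each column name, then rename the columns.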
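As a rough illustration of the extraction step on plain Python strings (no Spark session needed), the sketch below applies the `price|\d` pattern used in the solution further down to a few column names. The helper names `extract_tokens` and `to_new_name` are made up for this example, and `price_per_12_units` is a hypothetical column name added to show a limitation.

```python
import re

def extract_tokens(column_name):
    """Return the literal 'price' and every single digit found in the name, in order of appearance."""
    return re.findall(r'price|\d', column_name)

def to_new_name(column_name):
    # Reverse sorting puts 'price' ahead of the digits because 'p' sorts after '0'-'9'.
    return '_'.join(sorted(extract_tokens(column_name), reverse=True))

print(extract_tokens('2_plus_price'))     # ['2', 'price']
print(to_new_name('2_plus_price'))        # price_2
print(to_new_name(' price-usd-1'))        # price_1
# Limitation: \d matches one digit at a time, so multi-digit numbers are split and reordered.
print(to_new_name('price_per_12_units'))  # price_2_1
```

The last line shows why this trick only suits names with a single digit; a pattern such as `\d+` plus an explicit ordering would be needed for larger numbers.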
Sample data

# Assumes an active SparkSession named `spark`.
# Note that some column names and values contain leading or trailing spaces.
df = spark.createDataFrame(
    [('Samsung Z ', 10000, 5, 9000, 10, 7000, 20, 'amazon.com'),
     ('Moto G4', 12000, 10, 10000, 20, 6000, 50, 'ebay.com'),
     ('Mi 4i ', 15000, 8, 12000, 20, 10000, 25, ' deals.com'),
     ('Moto G3', 20000, 5, 18000, 12, 15000, 30, 'ebay.com')],
    ('item_name', ' price-usd-1', 'break_price_7 ', 'pricefor5  ', 'price_per_9_units',
     'price_3', 'price_break_pricing_a6', '2_plus_price'))
df.show()

+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
| item_name| price-usd-1|break_price_7 |pricefor5  |price_per_9_units|price_3|price_break_pricing_a6|2_plus_price|
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+
|Samsung Z |       10000|             5|       9000|               10|   7000|                    20|  amazon.com|
|   Moto G4|       12000|            10|      10000|               20|   6000|                    50|    ebay.com|
|    Mi 4i |       15000|             8|      12000|               20|  10000|                    25|   deals.com|
|   Moto G3|       20000|             5|      18000|               12|  15000|                    30|    ebay.com|
+----------+------------+--------------+-----------+-----------------+-------+----------------------+------------+

Solution

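import re

# Extract "price" and every digit from each column name; reverse sorting puts "price" first.
new_names = ['_'.join(sorted(re.findall(r'price|\d', c), reverse=True)) for c in df.columns if c != 'item_name']

# Pass the new names to toDF; item_name stays as the first column.
df.toDF('item_name', *new_names).show()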

+----------+-------+-------+-------+-------+-------+-------+----------+
| item_name|price_1|price_7|price_5|price_9|price_3|price_6|   price_2|
+----------+-------+-------+-------+-------+-------+-------+----------+
|Samsung Z |  10000|      5|   9000|     10|   7000|     20|amazon.com|
|   Moto G4|  12000|     10|  10000|     20|   6000|     50|  ebay.com|
|    Mi 4i |  15000|      8|  12000|     20|  10000|     25| deals.com|
|   Moto G3|  20000|      5|  18000|     12|  15000|     30|  ebay.com|
+----------+-------+-------+-------+-------+-------+-------+----------+
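The solution above relies on every name containing the literal `price` and a single digit, so it ignores the letter in `price_break_pricing_a6` and does not use the pattern list from the question. As an alternative sketch that applies that list directly, the helper below tries each pattern with `re.fullmatch` and builds the new name from the first capture group. The function name `rename_by_patterns` and its `prefix` parameter are assumptions for illustration, not part of the original answer. Names are stripped of surrounding whitespace before matching, and names that match no pattern are returned unchanged.

```python
import re

# Pattern list copied from the question; each pattern has exactly one capture group.
patterns = [
    "price-usd-([0-9]+)", "list_price_([0-9]+)", "price_per_([0-9]+)_units",
    "pricefor([0-9]+)", "([0-9]+)_plus_price", "break_price_([0-9]+)",
    "price_break_pricing_([a-z]+)",
]

def rename_by_patterns(columns, patterns, prefix="price_"):
    """Map each column name to prefix + captured group of the first matching pattern."""
    compiled = [re.compile(p) for p in patterns]
    new_names = []
    for name in columns:
        stripped = name.strip()
        for regex in compiled:
            match = regex.fullmatch(stripped)
            if match:
                new_names.append(prefix + match.group(1))
                break
        else:
            # No pattern matched: keep the original name untouched.
            new_names.append(name)
    return new_names

# Column names from the question's input table.
columns = ["item_name", "price-usd-1", "break_price_7", "pricefor5",
           "price_per_9_units", "price_break_pricing_a", "2_plus_price", "list_price_8"]
print(rename_by_patterns(columns, patterns))
# ['item_name', 'price_1', 'price_7', 'price_5', 'price_9', 'price_a', 'price_2', 'price_8']
```

With a DataFrame, the result could be applied as `df.toDF(*rename_by_patterns(df.columns, patterns))`, since `toDF` takes one new name per column in order. This sketch does not check for collisions, so two source columns that map to the same new name would need separate handling.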

Answered 2022-12-22
