我可以使用regexp\u replace或其他等效工具用一行代码替换pysparkDataframe列中的多个值吗?
以下是创建我的Dataframe的代码:
from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime
sc = SparkContext().getOrCreate()
sqlContext = SQLContext(sc)
data1 = [
('George', datetime(2010, 3, 24, 3, 19, 58), 13),
('George', datetime(2020, 9, 24, 3, 19, 6), 8),
('George', datetime(2009, 12, 12, 17, 21, 30), 5),
('Micheal', datetime(2010, 11, 22, 13, 29, 40), 12),
('Maggie', datetime(2010, 2, 8, 3, 31, 23), 8),
('Ravi', datetime(2009, 1, 1, 4, 19, 47), 2),
('Xien', datetime(2010, 3, 2, 4, 33, 51), 3),
]
df1 = sqlContext.createDataFrame(data1, ['name', 'trial_start_time', 'purchase_time'])
df1.show(truncate=False)
以下是Dataframe:
+-------+-------------------+-------------+
|name |trial_start_time |purchase_time|
+-------+-------------------+-------------+
|George |2010-03-24 07:19:58|13 |
|George |2020-09-24 07:19:06|8 |
|George |2009-12-12 22:21:30|5 |
|Micheal|2010-11-22 18:29:40|12 |
|Maggie |2010-02-08 08:31:23|8 |
|Ravi |2009-01-01 09:19:47|2 |
|Xien |2010-03-02 09:33:51|3 |
+-------+-------------------+-------------+
下面是替换一个字符串的工作示例:
from pyspark.sql.functions import regexp_replace, regexp_extract, col
df1.withColumn("name", regexp_replace('name', "Ravi", "Ravi_renamed")).show()
以下是输出:
+------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+------------+-------------------+-------------+
| George|2010-03-24 07:19:58| 13|
| George|2020-09-24 07:19:06| 8|
| George|2009-12-12 22:21:30| 5|
| Micheal|2010-11-22 18:29:40| 12|
| Maggie|2010-02-08 08:31:23| 8|
|Ravi_renamed|2009-01-01 09:19:47| 2|
| Xien|2010-03-02 09:33:51| 3|
+------------+-------------------+-------------+
在pandas中,我可以用lambda表达式替换一行代码中的多个字符串:
df1[name].apply(lambda x: x.replace('George','George_renamed1').replace('Ravi', 'Ravi_renamed2')
我不确定这是否可以在pyspark中用regexp\u替换。也许是另一种选择?当我读到在pyspark中使用lambda表达式时,似乎我必须创建udf函数(看起来有点长)。但我很好奇,我是否可以简单地在一行代码中对多个字符串运行某种类型的regex表达式。
1条答案
按热度按时间xjreopfe1#
这就是你要找的:
使用when()(最可读)
使用mapping expr(不太明确,但如果有很多值要替换,就很方便)
或者如果你已经有一个列表可以使用。
name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']
```str()[1:-1] to convert list to string and remove [ ]
df1 = df1.withColumn('name', expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
mapping_expr = create_map([lit(x) for x in name_changes])
df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))
df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
| name| trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58| 13|
|George_renamed1|2020-09-24 03:19:06| 8|
|George_renamed1|2009-12-12 17:21:30| 5|
| Micheal|2010-11-22 13:29:40| 12|
| Maggie|2010-02-08 03:31:23| 8|
| Ravi_renamed2|2009-01-01 04:19:47| 2|
| Xien|2010-03-02 04:33:51| 3|
+---------------+-------------------+-------------+