Replace characters in column names of a PySpark DataFrame

64jmpszr  posted on 2021-05-27  in Spark

I have a DataFrame in PySpark like the following:

df = spark.createDataFrame([(2,'john',1,1),
                            (2,'john',1,2),
                            (3,'pete',8,3),
                            (3,'pete',8,4),
                            (5,'steve',9,5)],
                           ['id','/na/me','val/ue', 'rank/'])

df.show()

+---+------+------+-----+
| id|/na/me|val/ue|rank/|
+---+------+------+-----+
|  2|  john|     1|    1|
|  2|  john|     1|    2|
|  3|  pete|     8|    3|
|  3|  pete|     8|    4|
|  5| steve|     9|    5|
+---+------+------+-----+

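In this DataFrame, I want to replace every / in the column names with an underscore _ . However, if a / is at the start or end of a column name, it should be removed, not replaced with _ .
This is what I tried: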

for name in df.schema.names:
  df = df.withColumnRenamed(name, name.replace('/', '_'))

>>> df
DataFrame[id: bigint, _na_me: string, val_ue: bigint, rank_: bigint]

>>> df.show()
+---+------+------+-----+
| id|_na_me|val_ue|rank_|
+---+------+------+-----+
|  2|  john|     1|    1|
|  2|  john|     1|    2|
|  3|  pete|     8|    3|
|  3|  pete|     8|    4|
|  5| steve|     9|    5|
+---+------+------+-----+

How can I get my desired result, shown below?

+---+------+------+-----+
| id| na_me|val_ue| rank|
+---+------+------+-----+
|  2|  john|     1|    1|
|  2|  john|     1|    2|
|  3|  pete|     8|    3|
|  3|  pete|     8|    4|
|  5| steve|     9|    5|
+---+------+------+-----+

mbzjlibv1#

Try doing the replacement in Python with a regular expression (re.sub).

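import re

df = spark.createDataFrame([(2,'john',1,1),
                            (2,'john',1,2),
                            (3,'pete',8,3),
                            (3,'pete',8,4),
                            (5,'steve',9,5)],
                           ['id','/na/me','val/ue', 'rank/'])

# Replace every '/' with '_', then drop a single '_' left at the start and/or end
cols = [re.sub(r'(^_|_$)', '', f.replace("/", "_")) for f in df.columns]

# toDF returns a new DataFrame with the cleaned names; df itself is unchanged
df.toDF(*cols).show()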

# +---+-----+------+----+
# | id|na_me|val_ue|rank|
# +---+-----+------+----+
# |  2| john|     1|   1|
# |  2| john|     1|   2|
# |  3| pete|     8|   3|
# |  3| pete|     8|   4|
# |  5|steve|     9|   5|
# +---+-----+------+----+

# or using for loop on schema.names

for name in df.schema.names:
  df = df.withColumnRenamed(name, re.sub(r'(^_|_$)','',name.replace('/', '_')))

df.show()

# +---+-----+------+----+
# | id|na_me|val_ue|rank|
# +---+-----+------+----+
# |  2| john|     1|   1|
# |  2| john|     1|   2|
# |  3| pete|     8|   3|
# |  3| pete|     8|   4|
# |  5|steve|     9|   5|
# +---+-----+------+----+
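
One caveat about the regex above: it strips a leading or trailing _ even when that underscore was part of the original column name and did not come from a / . If that matters for your data, a possible variation is to trim the slashes first with str.strip and only then replace the ones that remain. This is a minimal sketch rather than part of the original answer; the helper name clean_column_name and the extra sample name '_tmp/x_' are made up for illustration.

```python
def clean_column_name(name: str) -> str:
    """Remove '/' from both ends of the name, then turn each remaining '/' into '_'."""
    # str.strip('/') removes every leading and trailing '/', not just one
    trimmed = name.strip("/")
    return trimmed.replace("/", "_")

# Column names from the question, plus one hypothetical name that
# already starts and ends with an underscore
names = ["id", "/na/me", "val/ue", "rank/", "_tmp/x_"]
cleaned = [clean_column_name(n) for n in names]
print(cleaned)
```

With a DataFrame, the same helper could be applied as df.toDF(*[clean_column_name(c) for c in df.columns]). Note that '_tmp/x_' keeps its own underscores here, whereas the regex version would drop them.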
