How do I split a PySpark DataFrame column into two columns (example below)?

Posted by d6kp6zgx on 2021-05-29 in Spark

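The column contains the delimiter several times per row, so a plain split is not enough.
When splitting, only the first delimiter should be considered.
This is what I am doing so far.
Is there a better solution?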

testdf= spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")],["Animal", "Food"])

testdf.show()

+------+---------------+
|Animal|           Food|
+------+---------------+
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|
+------+---------------+

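from pyspark.sql.functions import col, expr, split

# Food1 is the first item; Food2 is Food with Food1 and the leading comma removed.
testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
        .withColumn("Food2",expr("regexp_replace(Food, Food1, '')"))\
        .withColumn("Food2",expr("substring(Food2, 2)")).show()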

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+
szqfcxe2 #1

I would just use string functions; there is no reason to use a regex.

from pyspark.sql import functions as F

testdf\
      .withColumn("Food1", F.expr("""substring(Food,1,instr(Food,',')-1)"""))\
      .withColumn("Food2", F.expr("""substring(Food,instr(Food,',')+1,length(Food))""")).show()

# +------+---------------+-----+----------+
# |Animal|           Food|Food1|     Food2|
# +------+---------------+-----+----------+
# |   Dog|meat,bread,milk| meat|bread,milk|
# |   Cat|     mouse,fish|mouse|      fish|
# +------+---------------+-----+----------+
ktca8awb #2

A slightly different approach uses slice and trim:

from pyspark.sql.functions import expr, split

# Split into an array, take the head with [0], and rebuild the tail from a slice.
testdf.withColumn("food_ar", split("Food", ",")) \
  .select(
         "Animal",
         "Food",
         expr("food_ar[0]").alias("Food1"),
         expr("trim('[]', string(slice(food_ar, 2, size(food_ar) - 1)))").alias("Food2")) \
  .show()

# +------+---------------+-----+----------+
# |Animal|           Food|Food1|     Food2|
# +------+---------------+-----+----------+
# |   Dog|meat,bread,milk| meat|bread,milk|
# |   Cat|     mouse,fish|mouse|      fish|
# +------+---------------+-----+----------+

First, split produces an array. Next, the Spark SQL accessor a[0] picks out the head, and slice together with trim builds the tail of the array.

zkure5ic #3

A way to use a regular expression that splits only at the first match:

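from pyspark.sql import functions as f

# The lookbehind only matches a comma preceded by no other comma, i.e. the first one.
testdf.withColumn('Food1',f.split('Food',"(?<=^[^,]*)\\,")[0]).\
       withColumn('Food2',f.split('Food',"(?<=^[^,]*)\\,")[1]).show()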

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+
