PySpark: how to create columns from a list and assign them the values from an array column in a table?

djp7away · posted 2023-03-28 in Spark

I have a table like the one below, where the last column contains array values:

| job_id | timestamp                    | item_values      |
|:-------|:----------------------------:|:----------------:|
| job1   | 2022-02-15T23:10:00.000+0000 | [0.2, 3.4, 13.2] |
| Three  | 2022-02-15T23:20:00.000+0000 | [0.1, 2.9, 11.2] |
| job2   | 2022-02-15T23:30:00.000+0000 | [1.2, 3.1, 16.0] |
| job3   | 2022-02-15T23:40:00.000+0000 | [0.4, 0.4, 16.2] |
| job4   | 2022-02-15T23:50:00.000+0000 | [0.7, 8.4, 11.2] |
| job5   | 2022-02-15T24:00:00.000+0000 | [0.3, 1.5, 19.1] |
| job6   | 2022-02-15T24:10:00.000+0000 | [0.7, 7.4, 13.2] |

There is also a list of item_names, like this:

[item1,item2,item3]

I want output like this:

| job_id | timestamp                    | item1 | item2 | item3 |
|:-------|:----------------------------:|:-----:|:-----:|:-----:|
| job1   | 2022-02-15T23:10:00.000+0000 | 0.2   | 3.4   | 13.2  |
| Three  | 2022-02-15T23:20:00.000+0000 | 0.1   | 2.9   | 11.2  |
| job2   | 2022-02-15T23:30:00.000+0000 | 1.2   | 3.1   | 16.0  |
| job3   | 2022-02-15T23:40:00.000+0000 | 0.4   | 0.4   | 16.2  |
| job4   | 2022-02-15T23:50:00.000+0000 | 0.7   | 8.4   | 11.2  |
| job5   | 2022-02-15T24:00:00.000+0000 | 0.3   | 1.5   | 19.1  |
| job6   | 2022-02-15T24:10:00.000+0000 | 0.7   | 7.4   | 13.2  |

How can I do this with PySpark?

rdrgkggo1#

I reproduced the same scenario in my environment and got the expected output.

Define the list of item names, then use enumerate to create a new column for each name:

from pyspark.sql.functions import col

# Create sample data
data = [("job1", "2022-02-15T23:10:00.000+0000", [0.2, 3.4, 13.2]),
        ("Three", "2022-02-15T23:20:00.000+0000", [0.1, 2.9, 11.2]),
        ("job2", "2022-02-15T23:30:00.000+0000", [1.2, 3.1, 16.0]),
        ("job3", "2022-02-15T23:40:00.000+0000", [0.4, 0.4, 16.2]),
        ("job4", "2022-02-15T23:50:00.000+0000", [0.7, 8.4, 11.2]),
        ("job5", "2022-02-15T24:00:00.000+0000", [0.3, 1.5, 19.1]),
        ("job6", "2022-02-15T24:10:00.000+0000", [0.7, 7.4, 13.2])]

df = spark.createDataFrame(data, ["job_id", "timestamp", "item_values"])

# List of item_names
item_names = ["item1", "item2", "item3"]

# Add one column per item name, indexing into the array column
for i, item_name in enumerate(item_names):
    df = df.withColumn(item_name, col("item_values")[i])

# Drop the original array column now that its values are split out
df = df.drop("item_values")
df.show()

Output: df.show() prints the table in the desired format shown above.
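
As a variant (not part of the original answer): when there are many item names, building all the columns in a single select avoids chaining one withColumn projection per column. A minimal sketch, assuming the same df and item_names list as above:

from pyspark.sql.functions import col

# Build every item column in one projection; alias each array element
# with its item name (assumes item_names matches the array length)
df = df.select(
    "job_id",
    "timestamp",
    *[col("item_values")[i].alias(name) for i, name in enumerate(item_names)]
)
df.show(truncate=False)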

ujv3wf0j2#

Use getItem() to extract values from the array column by index:

from pyspark.sql import functions as F

# Extract the i-th array element into columns named item1..item3
for i in range(3):
    df = df.withColumn("item" + str(i + 1), F.col("item_values").getItem(i))
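
One property worth noting (an observation added here, not from the original answer): under Spark's default non-ANSI SQL mode, getItem returns null rather than raising an error when the index is out of range, so rows whose arrays are shorter than expected simply get null values. A quick check, assuming the df from the question:

from pyspark.sql import functions as F

# Index 5 does not exist in any row's 3-element array, so the result is null
df.select(F.col("item_values").getItem(5).alias("missing")).show(1)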
