PySpark DataFrame: how to add multiple columns to a DataFrame based on data in two columns

zyfwsgd6 · posted 2022-12-21 in Python

I have a requirement on a PySpark DataFrame. Here is the scenario:

df1 schema:

root
  |-- applianceName: string (nullable = true)
  |-- customer: string (nullable = true)
  |-- daysAgo: integer (nullable = true)
  |-- countAnomaliesByDay: long (nullable = true)

Sample Data:
applianceName | customer | daysAgo | countAnomaliesByDay
app1            cust1      0         100
app1            cust1      1         200
app1            cust1      2         300
app1            cust1      3         400
app1            cust1      4         500
app1            cust1      5         600
app1            cust1      6         700

Starting from df1, I need to add the columns day0, day1, day2, day3, day4, day5, day6, as shown below:

applianceName | customer | day0 | day1| day2 | day3 | day4 | day5| day6
app1            cust1      100     200  300    400    500    600   700  

i.e. column day0 - will have countAnomaliesByDay when daysAgo =0, column day1 - will have countAnomaliesByDay when daysAgo =1 and so on.

How can I achieve this?
TIA!

rt4zxlrg1#

I hope this is useful for your solution. I used PySpark's pivot function to do this:

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, LongType, StringType, StructType, StructField

# Create a Spark session
spark = SparkSession.builder.appName('StackOverflowMultiple').getOrCreate()

# Schema matching df1: daysAgo is an integer, countAnomaliesByDay a long
schema = StructType([
    StructField('applianceName', StringType(), True),
    StructField('customer', StringType(), True),
    StructField('daysAgo', IntegerType(), True),
    StructField('countAnomaliesByDay', LongType(), True),
])
df = spark.read.csv('./pyspark_add_multiple_cols.csv', schema=schema, header=True)

# Pivot daysAgo into one column per day; listing the pivot values up front
# avoids an extra pass over the data and fixes the column order.
df_pivot = (df.groupBy('applianceName', 'customer')
              .pivot('daysAgo', list(range(7)))
              .sum('countAnomaliesByDay'))

# The pivoted columns are named '0'..'6'; rename them to day0..day6
for d in range(7):
    df_pivot = df_pivot.withColumnRenamed(str(d), f'day{d}')

df_pivot.show(truncate=False)
