I have a PySpark DataFrame requirement and need some input.
Here is the scenario:
df1 schema:
root
|-- applianceName: string (nullable = true)
|-- customer: string (nullable = true)
|-- daysAgo: integer (nullable = true)
|-- countAnomaliesByDay: long (nullable = true)
Sample Data:
applianceName | customer | daysAgo | countAnomaliesByDay
app1          | cust1    | 0       | 100
app1          | cust1    | 1       | 200
app1          | cust1    | 2       | 300
app1          | cust1    | 3       | 400
app1          | cust1    | 4       | 500
app1          | cust1    | 5       | 600
app1          | cust1    | 6       | 700
To df1, I need to add the columns day0, day1, day2, day3, day4, day5, day6, as shown below:
applianceName | customer | day0 | day1 | day2 | day3 | day4 | day5 | day6
app1          | cust1    | 100  | 200  | 300  | 400  | 500  | 600  | 700
i.e. column day0 will hold countAnomaliesByDay where daysAgo = 0, column day1 will hold countAnomaliesByDay where daysAgo = 1, and so on.
How can I achieve this?
TIA!
1 Answer
I hope this is useful for your solution. I used PySpark's pivot function to do this.