How to use a round function with groupBy in PySpark?

iswrvxsc asked on 2022-12-22 in Spark

How can we use the round function together with group by in PySpark? I have a Spark DataFrame and need to produce a result from it using group by and the round function.

data1 = [{'Name':'Jhon','ID':21.528,'Add':'USA','ID_2':'30.90'},
{'Name':'Joe','ID':3.69,'Add':'USA','ID_2':'12.80'},
{'Name':'Tina','ID':2.48,'Add':'IND','ID_2':'11.07'},
{'Name':'Jhon','ID':22.22, 'Add':'USA','ID_2':'34.87'},
{'Name':'Joe','ID':5.33,'Add':'INA','ID_2':'56.89'}]

a = sc.parallelize(data1)
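(The snippet above assumes sc is an existing SparkContext, as in a notebook or PySpark shell. A minimal setup sketch in case it is run as a standalone script; the app name is an arbitrary choice for this example:)

from pyspark.sql import SparkSession

# Minimal setup sketch: build a SparkSession and take its SparkContext,
# so that sc.parallelize(data1) above also works outside an interactive shell.
spark = SparkSession.builder.appName('groupby-round-example').getOrCreate()
sc = spark.sparkContext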

In a SQL query it would look something like:

SELECT COUNT(ID) AS newid, COUNT(ID_2) AS secondaryid,
       ROUND(([newid] + [secondaryid]) / [newid] * 200, 1) AS [NEW_PERCENTAGE]
FROM DATA1
GROUP BY Name

Answer 1 (cetgtptt):

You can't apply round inside the groupBy itself; you need to create a new column afterwards:

import pyspark.sql.functions as F

df = spark.createDataFrame(a)

(df.groupby('Name')
   .agg(
     F.count('ID').alias('newid'),
     F.count('ID_2').alias('secondaryid')
   )
   .withColumn('NEW_PERCENTAGE', F.round(200 * (F.col('newid') + F.col('secondaryid')) / F.col('newid'), 1))
).show()

+----+-----+-----------+--------------+
|Name|newid|secondaryid|NEW_PERCENTAGE|
+----+-----+-----------+--------------+
| Joe|    2|          2|         400.0|
|Tina|    1|          1|         400.0|
|Jhon|    2|          2|         400.0|
+----+-----+-----------+--------------+
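For reference, the same aggregation can also be written in Spark SQL, closer to the question's original query. This is only a sketch assuming the df built above; the temporary view name DATA1 is an arbitrary choice, and the counts are repeated in the ROUND expression because column aliases cannot be referenced within the same SELECT:

# Register the DataFrame as a temporary view so it can be queried with Spark SQL.
df.createOrReplaceTempView('DATA1')

spark.sql("""
    SELECT Name,
           COUNT(ID)   AS newid,
           COUNT(ID_2) AS secondaryid,
           ROUND((COUNT(ID) + COUNT(ID_2)) / COUNT(ID) * 200, 1) AS NEW_PERCENTAGE
    FROM DATA1
    GROUP BY Name
""").show()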
