My goal is to map each salary range to its midpoint (15000-25000 -> 20000). I have already cleaned the data and condensed it into this single column. How do I apply the required transformation to the column and map the result to a new column? I couldn't find anything understandable about this for PySpark online.
2 Answers

kadbb4591#
Spark<2.4
from pyspark.sql.functions import monotonically_increasing_id, split, explode, avg

df = spark.createDataFrame([('15000-25000',)], ['jobsalary'])

df.withColumn('id', monotonically_increasing_id()).\
    withColumn('val', explode(split('jobsalary', '-').cast('array<int>'))).\
    groupBy('id', 'jobsalary').agg(avg('val').cast('int').alias('mid')).\
    drop('id').show()
# +-----------+-----+
# |  jobsalary|  mid|
# +-----------+-----+
# |15000-25000|20000|
# +-----------+-----+
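If explode plus groupBy feels heavyweight, a Python UDF is a version-independent alternative. This is only a minimal sketch, not part of the original answer; it assumes an existing SparkSession-bound df and that every value is a "low-high" string with exactly two integer endpoints:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def midpoint(salary_range):
    # Hypothetical helper: split "low-high" and return the integer midpoint.
    low, high = salary_range.split('-')
    return (int(low) + int(high)) // 2

df.withColumn('mid', midpoint('jobsalary')).show()

Note that UDFs are generally slower than native column functions because each row crosses the JVM/Python boundary, which is why both answers here prefer built-ins.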
qoefvg9y2#
Use the higher-order function aggregate, available since Spark 2.4. Example:
from pyspark.sql.functions import expr

df = spark.createDataFrame([('15000-25000',)], ['jobsalary'])

df.withColumn('mid', expr('''
    cast(
      aggregate(cast(split(jobsalary, "-") as array<int>), 0, (acc, x) -> acc + x)
      / size(cast(split(jobsalary, "-") as array<int>))
    as int)''')).show()
# +-----------+-----+
# |  jobsalary|  mid|
# +-----------+-----+
# |15000-25000|20000|
# +-----------+-----+
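When every range is guaranteed to have exactly two endpoints, the same midpoint can be computed without explode or aggregate by indexing the split array directly. A minimal sketch under that assumption, not from the original answers:

from pyspark.sql.functions import split, col

# Split "low-high" into an array of ints, then average the two endpoints.
parts = split(col('jobsalary'), '-').cast('array<int>')
df.withColumn('mid', ((parts[0] + parts[1]) / 2).cast('int')).show()

This keeps everything in native column expressions, avoiding both the row explosion of the first answer and the SQL expression string of the second.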