I have a DataFrame with the fields parent_id, service_id, product_relation_id, and product_name, as shown in the table below. I want to populate an id field. Note that one parent_id has multiple service_ids, and one service_id has multiple product_names. The id should be generated following this pattern: Parent -- 1.n, Child1 -- 1.n.1, Child2 -- 1.n.2, Child3 -- 1.n.3, Child4 -- 1.n.4. How can we implement this logic in a way that performs well on big data?
Scala implementation
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank each distinct parent_id (1, 2, ...) to produce the "n" in "1.n".
val parentWindowSpec = Window.orderBy("parent_id")

// Number the rows within each (parent, service) group, ordered by
// product_relation_id; this yields the child suffix in "1.n.x".
val childWindowSpec = Window
  .partitionBy("parent_version", "service_id")
  .orderBy("product_relation_id")

val df = spark.read
  .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
  .csv("product.csv")

val df2 = df
  .withColumn("parent_version", dense_rank().over(parentWindowSpec))
  // Subtract 1 so the "Parent" row gets 0 and Child1..Child4 get 1..4.
  .withColumn("child_version", row_number().over(childWindowSpec) - 1)

val df3 = df2
  .withColumn("id",
    when(col("product_name") === lit("Parent"),
      concat(lit("1."), col("parent_version")))
    .otherwise(
      concat(lit("1."), col("parent_version"), lit("."), col("child_version"))))
  .drop("parent_version", "child_version")
```
Output:
```
scala> df3.show
21/03/26 11:55:01 WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
+---------+----------+-------------------+------------+-----+
|parent_id|service_id|product_relation_id|product_name|   id|
+---------+----------+-------------------+------------+-----+
|      100|         1|                1-A|      Parent|  1.1|
|      100|         1|                1-A|      Child1|1.1.1|
|      100|         1|                1-A|      Child2|1.1.2|
|      100|         1|                1-A|      Child3|1.1.3|
|      100|         1|                1-A|      Child4|1.1.4|
|      100|         2|                1-B|      Parent|  1.1|
|      100|         2|                1-B|      Child1|1.1.1|
|      100|         2|                1-B|      Child2|1.1.2|
|      100|         2|                1-B|      Child3|1.1.3|
|      100|         2|                1-B|      Child4|1.1.4|
|      100|         3|                1-C|      Parent|  1.1|
|      100|         3|                1-C|      Child1|1.1.1|
|      100|         3|                1-C|      Child2|1.1.2|
|      100|         3|                1-C|      Child3|1.1.3|
|      100|         3|                1-C|      Child4|1.1.4|
|      200|         5|                1-D|      Parent|  1.2|
|      200|         5|                1-D|      Child1|1.2.1|
|      200|         5|                1-D|      Child2|1.2.2|
|      200|         5|                1-D|      Child3|1.2.3|
|      200|         5|                1-D|      Child4|1.2.4|
+---------+----------+-------------------+------------+-----+
only showing top 20 rows
```
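The `WindowExec` warning above points at the scalability weak spot: `Window.orderBy("parent_id")` has no `partitionBy`, so Spark shuffles every row into a single partition to compute the global `dense_rank`. A sketch of one way around this (an assumption, not part of the answer above): rank only the *distinct* parent_ids, which is typically a tiny dataset, and join the result back, leaving the per-group child numbering, which is already partitioned, untouched.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank only the distinct parent_ids. The unpartitioned window still runs on a
// single partition, but over a dataset with one row per parent, not per product.
val parentRanks = df
  .select("parent_id").distinct()
  .withColumn("parent_version", dense_rank().over(Window.orderBy("parent_id")))

// Broadcast-join the small rank table back onto the full DataFrame, then
// number children per (parent_version, service_id) as before.
val df2 = df
  .join(broadcast(parentRanks), Seq("parent_id"))
  .withColumn("child_version",
    row_number().over(
      Window.partitionBy("parent_version", "service_id")
            .orderBy("product_relation_id")) - 1)
```

The `id` column is then built from `parent_version` and `child_version` exactly as in the answer. The child window needs no change: `partitionBy` already spreads that work across the cluster.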