PySpark: map unique values to an index

deyfvvtc · posted 2021-05-17 in Spark

I need to assign an index to each unique value. StringIndexer is not suitable because it orders labels by frequency. How can I factorize like this:

activity_start activity_end       activity_start_code  activity_end_code
0           Stage_0      Stage_3                    0                  0
1           Stage_3      Stage_5                    1                  1
2           Stage_5      Stage_2                    2                  2
3           Stage_2      Stage_7                    3                  3
4           Stage_7          end                    4                  4
5           Stage_0      Stage_2                    0                  2
6           Stage_2      Stage_4                    3                  5
7           Stage_4      Stage_3                    5                  0
8           Stage_3      Stage_8                    1                  6
9           Stage_8          end                    6                  4
43          Stage_0      Stage_2                    0                  2
44          Stage_2      Stage_5                    3                  1
45          Stage_5      Stage_7                    2                  3
46          Stage_7          end                    4                  4
457         Stage_2      Stage_3                    3                  0
458         Stage_3      Stage_8                    1                  6
459         Stage_8          end                    6                  4
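The behavior shown above is what pandas calls factorize: each value is coded with the index of its first appearance. As a pure-Python sketch of the target mapping (the helper name `factorize` here is illustrative, not a Spark API):

```python
def factorize(values):
    """Map each value to the index of its first appearance."""
    uniques = {}  # value -> code, in first-appearance order
    codes = []
    for v in values:
        if v not in uniques:
            uniques[v] = len(uniques)  # next unused code
        codes.append(uniques[v])
    return codes, list(uniques)

codes, uniques = factorize(["Stage_0", "Stage_3", "Stage_0", "Stage_5"])
# codes == [0, 1, 0, 2], uniques == ["Stage_0", "Stage_3", "Stage_5"]
```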

lvmkulzt #1

This can be done by dropping down to the RDD level. The process can be simplified further if the ids of the unique rows are known in advance. Here is sample code:

import random

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sample data: 20 random (start, end) stage pairs
rdd = sc.parallelize(
    [(f"Stage{random.randint(0, 5)}", f"Stage{random.randint(0, 5)}") for _ in range(20)]
)
schema = StructType([
    StructField("activity_start", StringType(), False),
    StructField("activity_end", StringType(), False),
])
df = spark.createDataFrame(rdd, schema)

df.show()

+--------------+------------+
|activity_start|activity_end|
+--------------+------------+
|        Stage1|      Stage0|
|        Stage1|      Stage5|
|        Stage2|      Stage5|
|        Stage5|      Stage2|
|        Stage2|      Stage0|
|        Stage0|      Stage3|
|        Stage1|      Stage5|
|        Stage1|      Stage0|
|        Stage4|      Stage0|
|        Stage4|      Stage0|
|        Stage5|      Stage5|
|        Stage1|      Stage3|
|        Stage3|      Stage3|
|        Stage5|      Stage1|
|        Stage2|      Stage5|
|        Stage2|      Stage5|
|        Stage3|      Stage3|
|        Stage5|      Stage5|
|        Stage3|      Stage3|
|        Stage3|      Stage3|
+--------------+------------+

activity_start = (
    df.select("activity_start")
    .distinct()
    .rdd
    # "StageN" -> (name, N): derive the code from the digit after "Stage"
    .map(lambda x: (x.activity_start, int(x.activity_start[5:])))
)

activity_end = (
    df.select("activity_end")
    .distinct()
    .rdd
    .map(lambda x: (x.activity_end, int(x.activity_end[5:]) + 1))
)

schema_start = StructType([
    StructField("activity_start", StringType(), False),
    StructField("start_code", IntegerType(), False),
])

schema_end = StructType([
    StructField("activity_end", StringType(), False),
    StructField("end_code", IntegerType(), False),
])

start_df = spark.createDataFrame(activity_start, schema_start)
end_df = spark.createDataFrame(activity_end, schema_end)

df.join(start_df, ["activity_start"], "left").join(end_df, ["activity_end"], "left").show()

+------------+--------------+----------+--------+
|activity_end|activity_start|start_code|end_code|
+------------+--------------+----------+--------+
|      Stage5|        Stage5|         5|       6|
|      Stage5|        Stage5|         5|       6|
|      Stage5|        Stage2|         2|       6|
|      Stage5|        Stage2|         2|       6|
|      Stage5|        Stage2|         2|       6|
|      Stage5|        Stage1|         1|       6|
|      Stage5|        Stage1|         1|       6|
|      Stage3|        Stage3|         3|       4|
|      Stage3|        Stage3|         3|       4|
|      Stage3|        Stage3|         3|       4|
|      Stage3|        Stage3|         3|       4|
|      Stage3|        Stage1|         1|       4|
|      Stage3|        Stage0|         0|       4|
|      Stage2|        Stage5|         5|       3|
|      Stage1|        Stage5|         5|       2|
|      Stage0|        Stage2|         2|       1|
|      Stage0|        Stage4|         4|       1|
|      Stage0|        Stage4|         4|       1|
|      Stage0|        Stage1|         1|       1|
|      Stage0|        Stage1|         1|       1|
+------------+--------------+----------+--------+
