在PySpark中不使用RDD.collectAsMap进行Map

gdx19jrr  于 2023-10-15  发布在  Spark
关注(0)|答案(1)|浏览(102)

我买了两个三明治。其中一个字符串由MaterialNumberHierarchyNumber(18位)以及其他近200列组成,第二个是由HierarchyNumberDescription组成的Map文件。我的任务是将第一个 Dataframe 中的HierarchyNumber按照特定的数字编号(Level 1-Level 5)分为5个级别,我通过以下方式实现

df = df_material.withColumn("Level1", substring(col("HierarchyNumber"), 1, 3)) \
.withColumn("Level2", substring(col("HierarchyNumber"), 1, 6)) \
.withColumn("Level3", substring(col("HierarchyNumber"), 1, 10)) \
.withColumn("Level4", substring(col("HierarchyNumber"), 1, 14)) \
.withColumn("Level5", col("HierarchyNumber"))

现在我想做的是用第二个字符串中的适当的Description来替换数字代码的子串,比如1级的3位代码,2级的6位代码。我一直使用RDD.collectAsMap,但它没有列入白名单,我需要另一个使用PySpark的解决方案。有什么建议吗?
第一张table是这样的
| 材料编号|层次数|col3|..col100|
| --|--|--|--|
| 1 | 001915762470792000 |||
| 2 | 003411831611043000 |||
| 3 | 002653955128061000 |||
在我最初的转变之后
| 材料编号|层次数|L1| L2| L3| L4| L5|..col100|
| --|--|--|--|--|--|--|--|
| 1 | 001915762470792000 | 001 | 001915 | 0019157624 | 00191576247079 | 001915762470792000 ||
| 2 | 003411831611043000 | 003 | 003411 | 0034118316 | 00341183161104 | 003411831611043000 ||
| 3 | 002653955128061000 | 002 | 002653 | 0026539551 | 00265395512806 | 002653955128061000 ||
Map框架(df 2)看起来像这样
| 层次数|描述|
| --|--|
| 001 |品牌|
| 002 |子品牌|
| 001915 |实体|
| | |
以此类推。然后我想把L1-L5中的数字替换成它的描述

kokeuurv

kokeuurv1#

您可以像这样将df2df连接到每个级别:

# we use coalesce and a left join to fetch the description if it exists,
# keep the number otherwise
# If all numbers have a description, you may omit coalesce and use an inner join
for i in range(1, 6):
    level = f'Level{i}'
    renamed_df2 = df2.withColumnRenamed('hierarchynumber', level)
    df = df.join(renamed_df2, f'Level{i}', "left")\
           .withColumn(level, coalesce(col('description'), col(level)))\
           .drop('description')

df.show()
+------------------+--------------+----------+------+--------+--------------+------------------+
|            Level5|        Level4|    Level3|Level2|  Level1|materialnumber|   hierarchynumber|
+------------------+--------------+----------+------+--------+--------------+------------------+
|001915762470792000|00191576247079|0019157624|entity|   brand|             1|001915762470792000|
|003411831611043000|00341183161104|0034118316|003411|     003|             2|003411831611043000|
|002653955128061000|00265395512806|0026539551|002653|subbrand|             3|002653955128061000|
+------------------+--------------+----------+------+--------+--------------+------------------+

相关问题