在特定排序逻辑上重新排序PySpark Dataframe 列

6yt4nkrj 于 2022-11-01 发布在 Spark

关注(0)|答案(2)|浏览(202)

我有一个PySpark Dataframe ，列顺序如下。我需要按照“分支”对它进行排序。我该怎么做？df.select(sorted(df.columns))似乎没有按照我想要的方式工作。

现有列顺序：

store_id,
store_name,
month_1_branch_A_profit,
month_1_branch_B_profit,
month_1_branch_C_profit,
month_1_branch_D_profit,
month_2_branch_A_profit,
month_2_branch_B_profit,
month_2_branch_C_profit,
month_2_branch_D_profit,
.
.
month_12_branch_A_profit,
month_12_branch_B_profit,
month_12_branch_C_profit,
month_12_branch_D_profit

所需的列顺序：

store_id,
store_name,
month_1_branch_A_profit,
month_2_branch_A_profit,
month_3_branch_A_profit,
month_4_branch_A_profit,
.
.
month_12_branch_A_profit,
month_1_branch_B_profit,
month_2_branch_B_profit,
month_3_branch_B_profit,
.
.
month_12_branch_B_profit,
..

pyspark

来源：https://stackoverflow.com/questions/74106880/reorder-pyspark-dataframe-columns-on-specific-sort-logic

2条答案

按热度按时间

hfsqlsce1#

您可以手动构建列列表。

col_fmt = 'month_{}_branch_{}_profit'
cols = ['store_id', 'store_name']
for branch in ['A', 'B', 'C', 'D']:
    for i in range(1, 13):
        cols.append(col_fmt.format(i, branch))
df.select(cols)

或者，我建议构建一个更好的数据框架，利用数组+结构/Map数据类型。

months - array (size 12)
  - branches: map<string, struct>
    - key: string  (branch name)
    - value: struct
      - profit: float

这样，数组就已经被“排序”了。Map顺序并不重要，它使得特定于某些月份和分支的SQL查询更容易阅读（使用 predicate 下推可能更快）

赞(0）回复(0）举报 2022-11-01

wbgh16ku2#

您可能需要使用一些python编码。在下面的脚本中，我根据下划线_拆分列名，然后根据元素[3]（分支名称）和[1]（月份值）进行排序。
输入df：

cols = ['store_id',
        'store_name',
        'month_1_branch_A_profit',
        'month_1_branch_B_profit',
        'month_1_branch_C_profit',
        'month_1_branch_D_profit',
        'month_2_branch_A_profit',
        'month_2_branch_B_profit',
        'month_2_branch_C_profit',
        'month_2_branch_D_profit',
        'month_12_branch_A_profit',
        'month_12_branch_B_profit',
        'month_12_branch_C_profit',
        'month_12_branch_D_profit']
df = spark.createDataFrame([], ','.join([f'{c} int' for c in cols]))

脚本：

branch_cols = [c for c in df.columns if c not in{'store_id', 'store_name'}]
d = {tuple(c.split('_')):c for c in branch_cols}
df = df.select(
    'store_id', 'store_name',
    *[d[c] for c in sorted(d, key=lambda x: f'{x[3]}_{int(x[1]):02}')]
)

df.printSchema()

# root

# |-- store_id: integer (nullable = true)

# |-- store_name: integer (nullable = true)

# |-- month_1_branch_A_profit: integer (nullable = true)

# |-- month_2_branch_A_profit: integer (nullable = true)

# |-- month_12_branch_A_profit: integer (nullable = true)

# |-- month_1_branch_B_profit: integer (nullable = true)

# |-- month_2_branch_B_profit: integer (nullable = true)

# |-- month_12_branch_B_profit: integer (nullable = true)

# |-- month_1_branch_C_profit: integer (nullable = true)

# |-- month_2_branch_C_profit: integer (nullable = true)

# |-- month_12_branch_C_profit: integer (nullable = true)

# |-- month_1_branch_D_profit: integer (nullable = true)

# |-- month_2_branch_D_profit: integer (nullable = true)

# |-- month_12_branch_D_profit: integer (nullable = true)

赞(0）回复(0）举报 2022-11-01

我来回答

在特定排序逻辑上重新排序PySpark Dataframe 列

2条答案

相关问题

热门标签

最新问答