在特定排序逻辑上重新排序PySpark Dataframe 列

6yt4nkrj  于 2022-11-01  发布在  Spark
关注(0)|答案(2)|浏览(202)

我有一个PySpark Dataframe ,列顺序如下。我需要按照“分支”对它进行排序。我该怎么做?df.select(sorted(df.columns))似乎没有按照我想要的方式工作。

现有列顺序:

store_id,
store_name,
month_1_branch_A_profit,
month_1_branch_B_profit,
month_1_branch_C_profit,
month_1_branch_D_profit,
month_2_branch_A_profit,
month_2_branch_B_profit,
month_2_branch_C_profit,
month_2_branch_D_profit,
.
.
month_12_branch_A_profit,
month_12_branch_B_profit,
month_12_branch_C_profit,
month_12_branch_D_profit

所需的列顺序:

store_id,
store_name,
month_1_branch_A_profit,
month_2_branch_A_profit,
month_3_branch_A_profit,
month_4_branch_A_profit,
.
.
month_12_branch_A_profit,
month_1_branch_B_profit,
month_2_branch_B_profit,
month_3_branch_B_profit,
.
.
month_12_branch_B_profit,
..
hfsqlsce

hfsqlsce1#

您可以手动构建列列表。

col_fmt = 'month_{}_branch_{}_profit'
cols = ['store_id', 'store_name']
for branch in ['A', 'B', 'C', 'D']:
    for i in range(1, 13):
        cols.append(col_fmt.format(i, branch))
df.select(cols)

或者,我建议构建一个更好的数据框架,利用数组+结构/Map数据类型。

months - array (size 12)
  - branches: map<string, struct>
    - key: string  (branch name)
    - value: struct
      - profit: float

这样,数组就已经被“排序”了。Map顺序并不重要,它使得特定于某些月份和分支的SQL查询更容易阅读(使用 predicate 下推可能更快)

wbgh16ku

wbgh16ku2#

您可能需要使用一些python编码。在下面的脚本中,我根据下划线_拆分列名,然后根据元素[3](分支名称)和[1](月份值)进行排序。
输入df:

cols = ['store_id',
        'store_name',
        'month_1_branch_A_profit',
        'month_1_branch_B_profit',
        'month_1_branch_C_profit',
        'month_1_branch_D_profit',
        'month_2_branch_A_profit',
        'month_2_branch_B_profit',
        'month_2_branch_C_profit',
        'month_2_branch_D_profit',
        'month_12_branch_A_profit',
        'month_12_branch_B_profit',
        'month_12_branch_C_profit',
        'month_12_branch_D_profit']
df = spark.createDataFrame([], ','.join([f'{c} int' for c in cols]))

脚本:

branch_cols = [c for c in df.columns if c not in{'store_id', 'store_name'}]
d = {tuple(c.split('_')):c for c in branch_cols}
df = df.select(
    'store_id', 'store_name',
    *[d[c] for c in sorted(d, key=lambda x: f'{x[3]}_{int(x[1]):02}')]
)

df.printSchema()

# root

# |-- store_id: integer (nullable = true)

# |-- store_name: integer (nullable = true)

# |-- month_1_branch_A_profit: integer (nullable = true)

# |-- month_2_branch_A_profit: integer (nullable = true)

# |-- month_12_branch_A_profit: integer (nullable = true)

# |-- month_1_branch_B_profit: integer (nullable = true)

# |-- month_2_branch_B_profit: integer (nullable = true)

# |-- month_12_branch_B_profit: integer (nullable = true)

# |-- month_1_branch_C_profit: integer (nullable = true)

# |-- month_2_branch_C_profit: integer (nullable = true)

# |-- month_12_branch_C_profit: integer (nullable = true)

# |-- month_1_branch_D_profit: integer (nullable = true)

# |-- month_2_branch_D_profit: integer (nullable = true)

# |-- month_12_branch_D_profit: integer (nullable = true)

相关问题