PySpark DataFrame: how to map array elements to columns and format a string with the values

8gsdolmq, posted on 2022-10-07 in Spark

I have a PySpark DataFrame that looks like this:

sdf1 = sc.parallelize([["toto", "tata", ["table", "column"], "SELECT {1} FROM {0}"], ["titi", "tutu", ["table", "column"], "SELECT {1} FROM {0}"]]).toDF(["table", "column", "parameters", "statement"])

+-----+------+---------------+-------------------+
|table|column|     parameters|          statement|
+-----+------+---------------+-------------------+
| toto|  tata|[table, column]|SELECT {1} FROM {0}|
| titi|  tutu|[table, column]|SELECT {1} FROM {0}|
+-----+------+---------------+-------------------+

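I am trying to map the elements of the "parameters" array to the columns they name, and then use the values from those columns to format the "statement" string.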

This is the result I expect after the transformation:

sdf2 = sc.parallelize([["toto", "tata", ["table", "column"], "SELECT {1} FROM {0}", "SELECT tata FROM toto"],["titi", "tutu", ["table", "column"], "SELECT {1} FROM {0}", "SELECT tutu FROM titi"]]).toDF(["table", "column", "parameters", "statement", "result"])

+-----+------+---------------+-------------------+---------------------+
|table|column|     parameters|          statement|               result|
+-----+------+---------------+-------------------+---------------------+
| toto|  tata|[table, column]|SELECT {1} FROM {0}|SELECT tata FROM toto|
| titi|  tutu|[table, column]|SELECT {1} FROM {0}|SELECT tutu FROM titi|
+-----+------+---------------+-------------------+---------------------+
3pmvbmvn1#

One approach, using the RDD API.

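def addParamsToQuery(param_ls, query, r):
    # Look up the value of each column named in param_ls on the row r
    new_param_ls = [r[k] for k in param_ls]
    # Substitute the values into the {0}, {1}, ... placeholders
    new_query = query.format(*new_param_ls)
    return new_query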

# data_sdf is the input DataFrame (sdf1 in the question)
columns = data_sdf.columns

(
    data_sdf
    .rdd
    .map(lambda r: [r[c] for c in columns] + [addParamsToQuery(r.parameters, r.statement, r)])
    .toDF(columns + ['result'])
    .show(truncate=False)
)

# +-----+------+---------------+-------------------+---------------------+
# |table|column|parameters     |statement          |result               |
# +-----+------+---------------+-------------------+---------------------+
# |toto |tata  |[table, column]|SELECT {1} FROM {0}|SELECT tata FROM toto|
# |titi |tutu  |[table, column]|SELECT {1} FROM {0}|SELECT tutu FROM titi|
# +-----+------+---------------+-------------------+---------------------+

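The function addParamsToQuery builds a list of parameter values by looking up, on the current row, each column named in "parameters", and then inserts those values into the statement with .format().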
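
To see the lookup-and-format step in isolation, here is a minimal sketch that runs without Spark. It assumes plain dicts as stand-ins for Spark Row objects (both support `row[name]` access); the name `add_params_to_query` and the `rows` list are illustrative and not part of the original answer.

```python
def add_params_to_query(param_names, query, row):
    """Look up each named column in `row` and substitute the values into `query`."""
    # Positional order matters: the i-th name fills the {i} placeholder.
    values = [row[name] for name in param_names]
    return query.format(*values)


# Hypothetical stand-ins for Spark Row objects: dicts also support row[name].
rows = [
    {"table": "toto", "column": "tata",
     "parameters": ["table", "column"], "statement": "SELECT {1} FROM {0}"},
    {"table": "titi", "column": "tutu",
     "parameters": ["table", "column"], "statement": "SELECT {1} FROM {0}"},
]

results = [add_params_to_query(r["parameters"], r["statement"], r) for r in rows]
for line in results:
    print(line)
# SELECT tata FROM toto
# SELECT tutu FROM titi
```

Because "parameters" is ["table", "column"], {0} receives the value of the table column and {1} the value of the column column, which is why the statement reads "SELECT {1} FROM {0}".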
