python 在Pyspark中更新两个嵌套框之间的值

qmelpv7a  于 12个月前  发布在  Python
关注(0)|答案(1)|浏览(71)

使用Databricks PySpark
我有两个 Dataframe
API- API_df -

+-------------+----------+-----------+
|           id|safeNumber|portfolioId|
+-------------+----------+-----------+
|GB-0-08872332|UK09208902|    1697451|
|GB-0-09326157|UK13305196|    1697451|
|GB-0-08581699|UK08891382|    1697451|
|GB-0-07803632|UK08035068|    1697451|
+-------------+----------+-----------+

creditsafe_data -

+---------------+-----------+------------+---------------+-----------+---------+-----------------+-----------------+-----------------+
|merchant_reg_id|safe_number|portfolio_id|notification_id|register_id|data_json|            _in_z|            out_z|         batch_id|
+---------------+-----------+------------+---------------+-----------+---------+-----------------+-----------------+-----------------+
|  GB-0-08697274| UK09017868|     1697451|        1008974|       1101| datadata|13-10-23 11:11:00|14-10-23 11:11:32|13-19-23 11:11:00|
|  GB-0-08872332| UK09208902|     1697451|        1203001|       2247| somedata|22-10-23 12:10:22|25-12-23 12:30:00|22-10-23 17:40:00|
+---------------+-----------+------------+---------------+-----------+---------+-----------------+-----------------+-----------------+

连接后产生的creditsafe_data df

+---------------+-----------+------------+---------------+-----------+---------+-----------------+-----------------+-----------------+
|merchant_reg_id|safe_number|portfolio_id|notification_id|register_id|data_json|            _in_z|            out_z|         batch_id|
+---------------+-----------+------------+---------------+-----------+---------+-----------------+-----------------+-----------------+
|  GB-0-00872332| UK09208902|     1697451|        1008974|       1101| datadata|13-10-23 11:11:00|14-10-23 11:11:32|13-19-23 11:11:00|
|  GB-0-09326157| UK13305196|     1697451|           null|       null|     null|             null|             null|             null|
|  GB-0-08581699| UK08891382|     1697451|           null|       null|     null|             null|             null|             null|
|  GB-0-07803632| UK08035068|     1697451|           null|       null|     null|             null|             null|             null|
+---------------+-----------+------------+---------------+-----------+---------+-----------------+-----------------+-----------------+

注意事项:
1.请注意,API_df.id如何在creditsafe_data.merchant_reg_id中显示值“GB-0-08872332”。如果是这种情况,则不要对creditsafe_data的merchant_reg_id 'GB-0-08872332'行执行任何操作
1.请注意,GB-0-09326157、GB-0-08581699、GB-0-07803632仅存在于API_df中。ID,但不存在于creditsafe_data中。merchant_reg_id,如果是这种情况,则添加GB-0-09326157、GB-0-08581699、GB-0-07803632沿着它们各自的API_df。safeNumber和API_df。portfolioId到creditsafe_data。merchant_reg_id、creditsafe_data。safe_number和creditsafe_data。Portfolio_id
1.请注意,creditsafe_data.merchant_reg_id GB-0-08697274在API_df. id中不存在。如果是这种情况,请从creditsafe_data数据框架中删除整行。因此,当creditsafe_data.display()被执行时,结果框架应该是这样的。
在pyspark中,最有效的方法是什么?请帮帮忙,我刚刚完成了我的Python计算机学位,刚刚开始探索pyspark。发现所谓的creditsafe API,并试图生成警报与免费帐户。这是一个样本测试,我尝试出于我自己的利益。请帮助,因为这将是一个动力,我在这个困难的时候裁员的新生。
编辑-尝试使用这样的东西,随意的代码-

missing_ids = api_df.select("id").subtract(creditsafe_data.select("merchant_reg_id"))
missing_data = api_df.join(missing_ids, "id", "inner")
missing_data = missing_data.select("id", "safeNumber", "portfolioId")
missing_data = missing_data.withColumnRenamed("id", "merchant_reg_id").withColumnRenamed("safeNumber", "safe_number").withColumnRenamed("portfolioId", "portfolio_id")
creditsafe_data = creditsafe_data.union(missing_data)

to_delete = creditsafe_data.select("merchant_reg_id").subtract(api_df.select("id"))
creditsafe_data = creditsafe_data.join(to_delete, "merchant_reg_id", "left_anti")
creditsafe_data.show()

但最后却得到了
分析异常:[NUM_COLUMNS_MISMATCH] UNION只能对列数相同的输入执行,但第一个输入有9列,第二个输入有3列。

mbyulnm0

mbyulnm01#

对我有效的代码是这样的--但是我可能还需要检查其他的用例。但现在,把这个作为答案

creditsafe_data = result_df.select(
    api_df.id.alias("merchant_reg_id"),
    api_df.safeNumber.alias("safe_number"),
    api_df.portfolioId.alias("portfolio_id"),
    creditsafe_data.merchant_reg_id.alias("notification_id"),
    creditsafe_data.register_id,
    creditsafe_data.data_json,
    creditsafe_data._in_z,
    creditsafe_data.out_z,
    creditsafe_data.batch_id
)

相关问题