I have two DataFrames, df1 and df2; the contents of each are shown below.
df1:
+--------------------------+------------------------+--------+
|line_item_usage_account_id|line_item_unblended_cost|    name|
+--------------------------+------------------------+--------+
|              100000000001|                   12.05|account1|
|              200000000001|                    52.0|account2|
|              300000000003|                   12.03|account3|
+--------------------------+------------------------+--------+
df2:
+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
|   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
|   account1|     100000000001|   cluster1| 666666|      55555555|                   10.09|
|   account1|     100000000001|   cluster7| 666660|      55555551|                   11.09|
|   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
+-----------+-----------------+-----------+-------+--------------+------------------------+
I only need to find the ids in df1.line_item_usage_account_id that are not in df2.accountproviderid, and append those rows with the fields df1.line_item_unblended_cost and df1.name, like this:
df3:
+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
|   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
|   account1|     100000000001|   cluster1| 666666|      55555555|                   10.09|
|   account1|     100000000001|   cluster7| 666660|      55555551|                   11.09|
|   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
|   account3|     300000000003|       null|   null|          null|                   12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+
Here is the code that creates the DataFrames. Do you know how to implement this?
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    [100000000001, 12.05, 'account1'],
    [200000000001, 52.00, 'account2'],
    [300000000003, 12.03, 'account3']],
    schema=['line_item_usage_account_id', 'line_item_unblended_cost', 'name'])

df2 = spark.createDataFrame([
    ['account1', 100000000001, 'cluster1', 111111, 11111111, 12.05],
    ['account1', 100000000001, 'cluster1', 666666, 55555555, 10.09],
    ['account1', 100000000001, 'cluster7', 666660, 55555551, 11.09],
    ['account2', 200000000001, 'cluster2', 222222, 22222222, 52.00]],
    schema=['accountname', 'accountproviderid', 'clustername', 'app_pmo', 'app_costcenter', 'line_item_unblended_cost'])
Thanks in advance.
1 Answer
lskq00tm1#
I don't have pyspark installed to check this, but it may help.
It can be done with a join plus a filter, but if your DataFrames can be very large, you will need a different approach.
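Here is a minimal sketch of that join-plus-filter idea, assuming the DataFrames defined above and Spark 2.3+ (for unionByName); the names missing and missing_as_df2 are mine. A left anti join keeps the df1 rows whose id has no match in df2.accountproviderid, and those rows are reshaped to df2's schema with nulls for the unknown columns before being appended:

from pyspark.sql import functions as F

# df1 rows whose id does not appear in df2.accountproviderid (left anti join).
missing = df1.join(
    df2,
    df1.line_item_usage_account_id == df2.accountproviderid,
    'left_anti')

# Reshape those rows to df2's schema; the cluster/app columns are unknown, so fill them with nulls.
missing_as_df2 = missing.select(
    F.col('name').alias('accountname'),
    F.col('line_item_usage_account_id').alias('accountproviderid'),
    F.lit(None).cast('string').alias('clustername'),
    F.lit(None).cast('long').alias('app_pmo'),
    F.lit(None).cast('long').alias('app_costcenter'),
    F.col('line_item_unblended_cost'))

# Append the missing accounts to df2, matching columns by name.
df3 = df2.unionByName(missing_as_df2)
df3.show()

Note that the anti join shuffles both inputs; if one side is small enough to fit in memory, wrapping it in F.broadcast(...) avoids that shuffle.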