你好,我正在创建一个通用函数或类来添加n个数据集,但我无法找到适当的逻辑来做到这一点,我把所有的代码放在下面,并突出显示我想要一些帮助的部分。如果你发现任何问题,在理解我的代码,然后请ping我。
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
data_fact = [["1", "sravan", "company 1","100"],
["2", "ojaswi", "company 1","200"],
["3", "rohith", "company 2","300"],
["4", "sridevi", "company 1","400"],
["5", "bobby", "company 1","500"]]
# specify column names
columns = ['ID', 'NAME', 'Company','Amount']
# creating a dataframe from the lists of data
df_fact = spark.createDataFrame(data_fact, columns)
Department_table = [["1", "45000", "IT"],
["2", "145000", "Manager"],
["6", "45000", "HR"],
["5", "34000", "Sales"]]
# specify column names
columns1 = ['ID', 'salary', 'department']
df_Department = spark.createDataFrame(Department_table, columns1)
Leave_Table = [["1", "Sick Leave"],
["2", "Casual leave"],
["3", "Casual leave"],
["4", "Earned Leave"],
["4", "Sick Leave"] ]
# specify column names
columns2 = ['ID', 'Leave_type']
df_Leave = spark.createDataFrame(Leave_Table, columns2)
Phone_Table = [["1", "Apple"],
["2", "Samsung"],
["3", "MI"],
["4", "Vivo"],
["4", "Apple"] ]
# specify column names
columns3 = ['ID', 'Phone_type']
df_Phone = spark.createDataFrame(Phone_Table, columns3)
Df_join = df_fact.join(df_Department,df_fact.ID ==df_Department.ID,"inner")\
.join(df_Phone,df_fact.ID ==df_Phone.ID,"inner")\
.join(df_Leave,df_fact.ID ==df_Leave.ID,"inner")\
.select(df_fact.Amount,df_Department.ID,df_Department.salary,df_Department.department,df_Phone.Phone_type,df_Leave.Leave_type)
display(Df_join)
基本上,我想把这些东西推广到n个数据集上
Df_join = df_fact.join(df_Department,df_fact.ID ==df_Department.ID,"inner")\
.join(df_Phone,df_fact.ID ==df_Phone.ID,"inner")\
.join(df_Leave,df_fact.ID ==df_Leave.ID,"inner")\
.select(df_fact.Amount,df_Department.ID,df_Department.salary,df_Department.department,df_Phone.Phone_type,df_Leave.Leave_type) ```
1条答案
按热度按时间klr1opcd1#
由于您在所有 Dataframe 中使用
inner
join,如果您想避免代码过于庞大,您可以使用functools中的.reduce()
来执行join并选择所需的列:https://docs.python.org/3/library/functools.html#functools.reduce
编辑1:如果您需要在不同的联接中指示不同的键,假定您已经重命名了列: