I have a PySpark DataFrame named "jobs" that looks like this:
jobs=
id position keywords
5663123 A ["Engineer","Quality"]
5662986 B ['Java']
5663237 C ['Art', 'Paint', 'Director']
5663066 D ["Junior","Motion","Designer"]
5663039 E ['Junior', 'Designer']
5663153 F ["Client","Specialist"]
5663266 G ['Pyhton']
I have another DataFrame named "people":
people=
pid skills
5662321 ["Engineer","L2"]
5663383 ["Quality","Engineer","L2"]
5662556 ["Art","Director"]
5662850 ["Junior","Motion","Designer"]
5662824 ['Designer', 'Craft', 'Junior']
5652496 ["Client","Support","Specialist"]
5662949 ["Community","Manager"]
What I want to do is match the list values in people['skills'] against jobs['keywords']:
if at least two tokens match, i.e. len(list(set(skills) & set(keywords))) >= 2, return the matching jobs['id'] values as a list in a new column people['match'] (there may be more than one matching job), otherwise None.
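For example, checking job A's keywords ["Engineer","Quality"] against the first two people:

keywords = {"Engineer", "Quality"}

len({"Engineer", "L2"} & keywords)             # 1 -> no match with job A
len({"Quality", "Engineer", "L2"} & keywords)  # 2 -> match, so 5663123 is returned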
The final people df should look like this:
people=
pid skills match
5662321 ["Engineer","L2"] None
5663383 ["Quality","Engineer","L2"] [5663123]
5662556 ["Art","Director"] [5663237]
5662850 ["Junior","Motion","Designer"] [5663066,5663039]
5662824 ['Designer', 'Craft', 'Junior'] [5663066,5663039]
5652496 ["Client","Support","Specialist"] [5663153]
5662949 ["Community","Manager"] None
I currently have a solution, but it is not efficient at all: right now I iterate over the Spark DataFrame row by row, which takes a very long time on a large df.
I am also open to a Pandas solution.
1 Answer
This works:
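One way to sketch it in PySpark, assuming Spark 2.4+ so that array_intersect is available (the intermediate names pairs and matched are illustrative, not taken from the original code):

from pyspark.sql import functions as F

# Keep (person, job) pairs that share at least two tokens; the left join also
# keeps people who match no job at all.
pairs = people.join(
    jobs,
    F.size(F.array_intersect(people["skills"], jobs["keywords"])) >= 2,
    "left",
)

# Collect every matching job id per person; collect_list skips the nulls that
# the left join produces for unmatched people, so they end up with an empty list.
matched = pairs.groupBy("pid").agg(
    F.first("skills").alias("skills"),
    F.collect_list("id").alias("match"),
)

# Turn an empty match list into None, as the requirement asks.
matched = matched.withColumn(
    "match",
    F.when(F.size("match") > 0, F.col("match")).otherwise(F.lit(None)),
)

matched.show(truncate=False)

Pushing the overlap test into the join condition lets Spark evaluate it as a (non-equi) join across the cluster instead of iterating rows on the driver, which is what makes the row-by-row approach slow.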
The final withColumn just converts an empty match list to None, as described in the requirement.
Let me know if you run into any problems.
Input: the jobs and people DataFrames shown above.
Output: the people DataFrame with the new match column, as in the expected result above.