I have two large DataFrames: a dictionary frame and an input frame. I want to check the input frame against the dictionary frame.
import pandas as pd

class_dictionary = pd.DataFrame( {
'Subject' : ["qqq", "rrr", "sss", "ttt", "uuu"],
'Class' : ["A type", "B type", "C type", "C type", "A type"],
})
class_dictionary
  Subject   Class
0     qqq  A type
1     rrr  B type
2     sss  C type
3     ttt  C type
4     uuu  A type
My input frame is:
input_db = pd.DataFrame( {
'Obj' : ["name1", "name2", "name3", "name5", "name10million"],
'Subject List' : ["qqq, ttt, ZZZ(not in the dict)", "qqq, ttt, sss", "uuu", "rrr", "uknown"],
})
input_db
             Obj                    Subject List
0          name1  qqq, ttt, ZZZ(not in the dict)
1          name2                   qqq, ttt, sss
2          name3                             uuu
3          name5                             rrr
4  name10million                          uknown
The output should look like this:
sample_output = pd.DataFrame( {
'Obj' : ["name1", "name2", "name10million"],
'Values' : ["qqq, ttt, ZZZ(not in the dict)", "qqq, ttt, sss", "uknown"],
'Calculated (can be different new columns)' : ["A type: qqq, C type: ttt", "A type: qqq, C Type: ttt, C type: sss", "unk"],
'Count of types' : ["2", "2", "0"]
})
sample_output
   Obj            Values                          Calculated (can be different new columns)  Count of types
0  name1          qqq, ttt, ZZZ(not in the dict)  A type: qqq, C type: ttt                   2
1  name2          qqq, ttt, sss                   A type: qqq, C Type: ttt, C type: sss      2
2  name10million  uknown                          unk                                        0
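I know a very lazy way to do this in plain Python, but it does not solve my problem at this scale. I want to do this in PySpark.
I know this is fairly involved, so any help would be greatly appreciated. Thank you.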
1 Answer
You can split and explode the subject list, then join the result with the dictionary DataFrame and aggregate: