Counting co-occurring elements between DataFrame columns

krcsximq  posted on 2022-10-23 in Other

I have a DataFrame that holds a website, its categories, and the keywords for that website.

Url  | categories                                | keywords
Espn | [sport, nba, nfl]                         | [half, touchdown, referee, player, goal]
Tmz  | [entertainment, sport]                    | [gossip, celebrity, player]
Goal | [sport, premier_league, champions_league] | [football, goal, stadium, player, referee]

It can be created with the following code:

import pandas as pd

data = [
    {'Url': 'ESPN', 'categories': ['sport', 'nba', 'nfl'],
     'keywords': ['half', 'touchdown', 'referee', 'player', 'goal']},
    {'Url': 'TMZ', 'categories': ['entertainment', 'sport'],
     'keywords': ['gossip', 'celebrity', 'player']},
    {'Url': 'Goal', 'categories': ['sport', 'premier_league', 'champions_league'],
     'keywords': ['football', 'goal', 'stadium', 'player', 'referee']},
]

df = pd.DataFrame(data)

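For every word in the keywords column, I want the frequency of the categories associated with it. The result might look like this:
{half: {sport: 1, nba: 1, nfl: 1}, touchdown: {sport: 1, nba: 1, nfl: 1}, referee: {sport: 2, nba: 1, nfl: 1, premier_league: 1, champions_league: 1}, player: {sport: 3, nba: 1, nfl: 1, entertainment: 1, premier_league: 1, champions_league: 1}, gossip: {sport: 1, entertainment: 1}, celebrity: {sport: 1, entertainment: 1}, goal: {sport: 2, premier_league: 1, champions_league: 1, nba: 1, nfl: 1}, stadium: {sport: 1, premier_league: 1, champions_league: 1}}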

ebdffaop


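Since the columns contain lists, you can explode them so that a row is repeated for every element of each list, then count each (keyword, category) pair: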

result = (
    df.explode("keywords")
    .explode("categories")
    .groupby(["keywords", "categories"])
    .size()
)
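
The grouped result above is a Series indexed by (keyword, category) pairs, not the nested dictionary the question asks for. Below is a minimal sketch of one way to fold it into that shape, assuming a reasonably recent pandas version. The names `counts` and `nested` are illustrative and do not come from the original answer.

```python
import pandas as pd

data = [
    {"Url": "ESPN", "categories": ["sport", "nba", "nfl"],
     "keywords": ["half", "touchdown", "referee", "player", "goal"]},
    {"Url": "TMZ", "categories": ["entertainment", "sport"],
     "keywords": ["gossip", "celebrity", "player"]},
    {"Url": "Goal", "categories": ["sport", "premier_league", "champions_league"],
     "keywords": ["football", "goal", "stadium", "player", "referee"]},
]
df = pd.DataFrame(data)

# One row per (keyword, category) pair, then count how often each pair occurs.
counts = (
    df.explode("keywords")
    .explode("categories")
    .groupby(["keywords", "categories"])
    .size()
)

# Fold the MultiIndex Series into {keyword: {category: count}}.
# `nested` is a hypothetical name chosen for this sketch.
nested = {}
for (keyword, category), n in counts.items():
    nested.setdefault(keyword, {})[category] = int(n)

print(nested["player"])
```

This counts each pair once per row because no list in the sample data repeats an element. If a list could contain duplicates, deduplicating it before exploding would probably be needed.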
