pandas: Identifying groups of phrases in a Python DataFrame using NLP (NLTK)

yc0p9oo0 posted on 2023-09-29 in Python

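I have a table containing a large amount of patient diagnosis information. I would like to determine which groupings of these diagnoses are most common, e.g. is it "swollen head syndrome" and "loose tongue", or "broken wind", "chronic nosehair" and "corrugated ankles", or some other combination?
The data is structured like this: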

import pandas as pd
import numpy as np

# List of ids
ids = ['id1', 'id2', 'id3','id4','id5'] 

# List of sample sentences 
diagnosis = ["Broken Wind","Chronic Nosehair","Corrugated Ankles","Discrete Itching"]

# Create dataframe
df = pd.DataFrame({'id': ids})

# Generate list of sentences for each id
df['diagnosis'] = df['id'].apply(lambda x: np.random.choice(diagnosis, 5).tolist())

# Explode into separate rows
df = df.explode('diagnosis')

print(df)

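For example, if id2 and id5 both contain "Broken Wind" and "Chronic Nosehair", that combination has a count of 2. If id1, id3 and id4 all contain "Chronic Nosehair", "Corrugated Ankles" and "Discrete Itching", that combination has a count of 3.
The goal is to determine which combination is the most common.
Is there an NLP library, such as NLTK, or a method that can be used to process data stored like this in a pandas DataFrame? Most of what I have found so far is geared towards sentiment analysis or towards analysing single words rather than phrases.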

vbopmzt1 #1

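I would say that what you are trying to do here is not necessarily an NLP problem, but rather a more general frequent pattern mining problem, of the kind often seen in recommendation systems.
You can find the most common diagnosis combinations of any size by using the fpgrowth algorithm in the mlxtend library and looking at the support for each symptom or combination of symptoms: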
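To make the idea concrete, here is a minimal brute-force sketch of frequent pattern counting using only the standard library. It is not FP-growth, just a direct enumeration of every combination each patient has. The sample `transactions` dict and the helper name `combination_support` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: one list of diagnoses per patient id.
transactions = {
    "id1": ["Broken Wind", "Chronic Nosehair"],
    "id2": ["Broken Wind", "Chronic Nosehair", "Corrugated Ankles"],
    "id3": ["Corrugated Ankles", "Discrete Itching"],
    "id4": ["Broken Wind", "Chronic Nosehair", "Discrete Itching"],
}


def combination_support(transactions, min_size=2):
    """Return {combination: fraction of patients whose diagnoses contain it}."""
    counts = Counter()
    for diagnoses in transactions.values():
        # Deduplicate and sort so each combination has one canonical tuple.
        unique = sorted(set(diagnoses))
        for size in range(min_size, len(unique) + 1):
            counts.update(combinations(unique, size))
    n_patients = len(transactions)
    return {combo: count / n_patients for combo, count in counts.items()}


support = combination_support(transactions)
best = max(support, key=support.get)
print(best, support[best])
```

Enumerating every subset grows exponentially with the number of diagnoses per patient, which is the cost that dedicated frequent pattern mining algorithms are designed to avoid.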

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Collect each patient's diagnoses into one list per id
transactions = df.groupby('id')['diagnosis'].apply(list).tolist()

# Encode to wide dataframe with column for each symptom
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
te_df = pd.DataFrame(te_ary, columns=te.columns_)

# Calculate most frequent diagnosis co-occurrences
fp_df = fpgrowth(te_df, min_support=0.01, use_colnames=True)

# Sort and show
fp_df.sort_values(by='support', ascending=False)

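The result is a DataFrame with one row per itemset, where support is the fraction of "transactions" (here, patients) in which that combination occurs: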

| support | itemsets                                                 |
| ------- | -------------------------------------------------------- |
| 0.8     | {'Broken Wind'}                                          |
| 0.6     | {'Corrugated Ankles'}                                    |
| 0.6     | {'Chronic Nosehair'}                                     |
| 0.6     | {'Discrete Itching'}                                     |
| 0.6     | {'Corrugated Ankles', 'Broken Wind'}                     |
| 0.4     | {'Chronic Nosehair', 'Broken Wind'}                      |
| 0.4     | {'Discrete Itching', 'Chronic Nosehair'}                 |
| 0.4     | {'Discrete Itching', 'Broken Wind'}                      |
| 0.2     | {'Corrugated Ankles', 'Discrete Itching'}                |
| 0.2     | {'Discrete Itching', 'Corrugated Ankles', 'Broken Wind'} |
| 0.2     | {'Corrugated Ankles', 'Chronic Nosehair'}                |
| 0.2     | {'Chronic Nosehair', 'Discrete Itching', 'Broken Wind'}  |
| 0.2     | {'Chronic Nosehair', 'Corrugated Ankles', 'Broken Wind'} |
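Since single diagnoses will always have at least as much support as any combination containing them, it may help to filter the result down to itemsets with two or more items. The sketch below assumes `fp_df` has the `support` and `itemsets` columns shown above, with each itemset stored as a `frozenset`. To keep it runnable without mlxtend, it rebuilds a stand-in `fp_df` from a few rows of the table; the `length` column and the `combos` name are illustrative additions.

```python
import pandas as pd

# Stand-in for fp_df: a few rows copied from the table above.
# With the real fpgrowth output, skip this construction step.
fp_df = pd.DataFrame({
    "support": [0.8, 0.6, 0.6, 0.4, 0.2],
    "itemsets": [
        frozenset({"Broken Wind"}),
        frozenset({"Corrugated Ankles"}),
        frozenset({"Corrugated Ankles", "Broken Wind"}),
        frozenset({"Chronic Nosehair", "Broken Wind"}),
        frozenset({"Discrete Itching", "Corrugated Ankles", "Broken Wind"}),
    ],
})

# Number of diagnoses in each itemset.
fp_df["length"] = fp_df["itemsets"].apply(len)

# Keep only true combinations (two or more diagnoses), ranked by support.
combos = (
    fp_df[fp_df["length"] >= 2]
    .sort_values(by="support", ascending=False)
    .reset_index(drop=True)
)
print(combos)
```

The first row of `combos` is then the most common combination; changing the threshold on `length` restricts the result to pairs, triples, and so on.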
