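Problem
I have a DataFrame consisting of texts, each belonging to a category. I want to get the most frequent n-grams (bigrams in the example) for each category. I managed to do this, but the code seems far too long to me.
Sample code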
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
# Sample data
data = {'text': ['sport sport text sample sport sport text sample',
                 'math math text sample math math text sample',
                 'politics politics text sample politics politics text sample'],
        'category': ["sport", "math", "politics"]}
df = pd.DataFrame(data)
# Get text per category
sport = [df[df['category'] == 'sport'].reset_index()['text'].iloc[0]]
math = [df[df['category'] == 'math'].reset_index()['text'].iloc[0]]
politics = [df[df['category'] == 'politics'].reset_index()['text'].iloc[0]]
# Calculate ngrams per category
n = 2
sport_ngrams = []
for sample in sport:
    sport_ngrams.extend(ngrams(nltk.word_tokenize(sample), n))
sport_ngrams_df = pd.DataFrame(pd.Series(sport_ngrams).value_counts()[:10]).reset_index()
sport_ngrams_df['category'] = 'Business & Finance'
math_ngrams = []
for sample in math:
    math_ngrams.extend(ngrams(nltk.word_tokenize(sample), n))
math_ngrams_df = pd.DataFrame(pd.Series(math_ngrams).value_counts()[:10]).reset_index()
math_ngrams_df['category'] = 'Computers & Internet'
politics_ngrams = []
for sample in politics:
    politics_ngrams.extend(ngrams(nltk.word_tokenize(sample), n))
politics_ngrams_df = pd.DataFrame(pd.Series(politics_ngrams).value_counts()[:10]).reset_index()
politics_ngrams_df['category'] = 'Education & Reference'
# Concatenate df's
bigram_df = pd.concat([sport_ngrams_df, math_ngrams_df, politics_ngrams_df
]).rename(columns={"index": "word", 0:'count'})
bigram_df
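Output
| word | count | category |
|------|-------|----------|
| ('sport', 'sport') | 2 | Business & Finance |
| ('sport', 'text') | 2 | Business & Finance |
| ('text', 'sample') | 2 | Business & Finance |
| ('sample', 'sport') | 1 | Business & Finance |
| ('math', 'math') | 2 | Computers & Internet |
| ('math', 'text') | 2 | Computers & Internet |
| ('text', 'sample') | 2 | Computers & Internet |
| ('sample', 'math') | 1 | Computers & Internet |
| ('politics', 'politics') | 2 | Education & Reference |
| ('politics', 'text') | 2 | Education & Reference |
| ('text', 'sample') | 2 | Education & Reference |
| ('sample', 'politics') | 1 | Education & Reference |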
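Question
Is there a more efficient way to build the n-grams, without having to extract the text and create the n-grams separately for each category?
Thanks for your help!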
1 Answer
Sure. The processing is the same for every category, so you can put it in a loop:
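A minimal sketch of such a loop, not necessarily the exact code the answer had in mind. It makes a few assumptions: the helper name `top_ngrams` is hypothetical, tokens are split on whitespace with `str.split()` rather than `nltk.word_tokenize` (so no NLTK data download is needed), n-grams are built with `zip` instead of `nltk.util.ngrams`, and the `category` column holds the real category value rather than the hard-coded labels from the question.

```python
from collections import Counter

import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "text": [
        "sport sport text sample sport sport text sample",
        "math math text sample math math text sample",
        "politics politics text sample politics politics text sample",
    ],
    "category": ["sport", "math", "politics"],
})


def top_ngrams(texts, n=2, k=10):
    """Return the k most common n-grams over an iterable of texts.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for text in texts:
        # Assumption: whitespace tokenization. Swap in
        # nltk.word_tokenize(text) if punctuation handling matters.
        tokens = text.split()
        # Zipping n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


rows = []
# One loop over the groups replaces the three copy-pasted blocks.
# sort=False keeps the categories in order of first appearance.
for category, group in df.groupby("category", sort=False):
    for ngram, count in top_ngrams(group["text"], n=2, k=10):
        rows.append({"word": ngram, "count": count, "category": category})

bigram_df = pd.DataFrame(rows)
print(bigram_df)
```

Because the loop iterates over `df.groupby("category")`, it also handles categories with several rows of text, and adding a new category needs no code change. Changing `n` gives trigrams or longer n-grams.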