pandas: How to efficiently build n-grams based on categories in a DataFrame

Posted by kb5ga3dv on 2022-12-28 in Other

Question

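I have a DataFrame consisting of texts that each belong to a category. I now want to get the most frequent n-grams per category (bigrams in the example). I managed to do this, but the code seems far too long to me.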

Sample code

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Sample data
data  = {'text':['sport sport text sample sport sport text sample', 'math math text sample math math text sample', 
'politics politics text sample politics politics text sample'],
'category' : ["sport", "math", "politics"]}
df = pd.DataFrame(data)

# Get text per category
sport = [df[df['category'] == 'sport'].reset_index()['text'].iloc[0]]
math = [df[df['category'] == 'math'].reset_index()['text'].iloc[0]]
politics = [df[df['category'] == 'politics'].reset_index()['text'].iloc[0]]

# Calculate ngrams per category
n = 2

sport_ngrams = []
for sample in sport:
  sport_ngrams.extend(ngrams(nltk.word_tokenize(sample), n))
sport_ngrams_df = pd.DataFrame(pd.Series(sport_ngrams).value_counts()[:10]).reset_index()
sport_ngrams_df['category'] = 'Business & Finance'

math_ngrams = []
for sample in math:
  math_ngrams.extend(ngrams(nltk.word_tokenize(sample), n))
math_ngrams_df = pd.DataFrame(pd.Series(math_ngrams).value_counts()[:10]).reset_index()
math_ngrams_df['category'] = 'Computers & Internet'

politics_ngrams = []
for sample in politics:
  politics_ngrams.extend(ngrams(nltk.word_tokenize(sample), n))
politics_ngrams_df = pd.DataFrame(pd.Series(politics_ngrams).value_counts()[:10]).reset_index()
politics_ngrams_df['category'] = 'Education & Reference'

# Concatenate df's
bigram_df = pd.concat([sport_ngrams_df, math_ngrams_df, politics_ngrams_df
                       ]).rename(columns={"index": "word", 0:'count'})

bigram_df

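**Output**

| word | count | category |
| ------ | ------ | ------ |
| ('sport', 'sport') | 2 | Business & Finance |
| ('sport', 'text') | 2 | Business & Finance |
| ('text', 'sample') | 2 | Business & Finance |
| ('sample', 'sport') | 1 | Business & Finance |
| ('math', 'math') | 2 | Computers & Internet |
| ('math', 'text') | 2 | Computers & Internet |
| ('text', 'sample') | 2 | Computers & Internet |
| ('sample', 'math') | 1 | Computers & Internet |
| ('politics', 'politics') | 2 | Education & Reference |
| ('politics', 'text') | 2 | Education & Reference |
| ('text', 'sample') | 2 | Education & Reference |
| ('sample', 'politics') | 1 | Education & Reference |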
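
To see where these counts come from, here is a minimal sketch that counts the bigrams of the sport text by hand. It assumes that plain whitespace splitting is close enough to `nltk.word_tokenize` for this punctuation-free sample; that substitution is an assumption, not what the post itself uses.

```python
from collections import Counter

text = "sport sport text sample sport sport text sample"

# Assumption: whitespace splitting stands in for nltk.word_tokenize here.
tokens = text.split()

# A bigram is a pair of adjacent tokens, so 8 tokens give 7 bigrams.
bigrams = list(zip(tokens, tokens[1:]))

counts = Counter(bigrams)
for bigram, count in counts.most_common():
    print(bigram, count)
```

The pair ('sample', 'sport') occurs only once because the text ends with "sample", so the second "sample" has no following token.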

Question

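Is there a more efficient way to build the n-grams, without having to extract the texts and create the n-grams separately for each category?
Thanks for your help!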

Source: https://stackoverflow.com/questions/74932417/how-to-efficiently-build-ngrams-based-on-categories-in-a-dataframe

1 Answer

Sure. The processing is the same for every category, so you can put it in a loop:

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Sample data
data  = {'text':['sport sport text sample sport sport text sample', 'math math text sample math math text sample', 
'politics politics text sample politics politics text sample'],
'category' : ["sport", "math", "politics"]}
df = pd.DataFrame(data)

n = 2
bigram_df = pd.DataFrame()

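for categ in df['category'].unique():  # visit each category once, even if it spans several rows
  # Collect all texts of this category, not only the first one
  text_categ = df.loc[df['category'] == categ, 'text']
  categ_ngrams = []
  for sample in text_categ:
    categ_ngrams.extend(ngrams(nltk.word_tokenize(sample), n))
  # Count once per category, after all samples have been tokenized
  ngrams_df = pd.DataFrame(pd.Series(categ_ngrams).value_counts()[:10]).reset_index()
  ngrams_df['category'] = categ
  bigram_df = pd.concat([bigram_df, ngrams_df])

# Use the same column names as in the question
bigram_df = bigram_df.rename(columns={"index": "word", 0: 'count'})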

bigram_df
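
As a possible alternative to filtering the DataFrame once per category, the loop can be driven by `groupby`. The following is only a sketch under stated assumptions: `top_ngrams` is a hypothetical helper name that does not appear in the post, and whitespace splitting replaces `nltk.word_tokenize` so that the example runs without downloading NLTK data.

```python
from collections import Counter

import pandas as pd


def top_ngrams(texts, n=2, k=10):
    """Hypothetical helper: count n-grams over all texts, keep the k most common."""
    counts = Counter()
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return pd.DataFrame(counts.most_common(k), columns=["word", "count"])


data = {
    "text": [
        "sport sport text sample sport sport text sample",
        "math math text sample math math text sample",
        "politics politics text sample politics politics text sample",
    ],
    "category": ["sport", "math", "politics"],
}
df = pd.DataFrame(data)

# One pass over the groups; sort=False keeps the categories in order of appearance.
parts = [
    top_ngrams(group["text"], n=2).assign(category=category)
    for category, group in df.groupby("category", sort=False)
]
bigram_df = pd.concat(parts, ignore_index=True)
print(bigram_df)
```

Collecting the per-category frames in a list and calling `pd.concat` once avoids growing a DataFrame inside the loop. Swapping `text.split()` for `nltk.word_tokenize(text)` should give the tokenization used in the post, provided the NLTK tokenizer data is installed.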

Answered 2022-12-28