Segregate English and non-English sentences from a DataFrame using Python / Pandas / NLTK

Posted by xvw2m8pv on 2023-03-04 in Python

I am using the CrisisLexT26 dataset for my research project. The DataFrame looks like this:

Tweet Text | Informativeness
local assistance neighbour boulder flood | Related
tourism singapore suffers haze blow | Related
estate chat con hiya wendy queen vive costa | Related

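Column 1 contains the tweet text, and column 2 says whether the tweet is related to a natural disaster.
I want to create two DataFrames: one containing only the English sentences and the other containing the non-English sentences.
For example, tweets 1 and 2 should end up in the first DataFrame, and tweet 3 should end up in the other one, because it is not an English sentence.
I tried a language-detection library and various NLTK methods, but could not get this to work. Can anyone help?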
https://github.com/jeyadosstimothy/ML-on-CrisisLex/blob/master/CrisisLexT26/2012_Colorado_wildfires/2012_Colorado_wildfires-tweets_labeled.csv

python-3.x

Source: https://stackoverflow.com/questions/75633699/segregate-english-and-non-english-sentences-from-a-dataframe-using-python-pand

1 Answer

56lgkhnf1#

from langdetect import detect
# detect() returns a language code such as 'en'; note the leading space in the column name
tweet_df['lang'] = tweet_df[' Tweet Text'].apply(detect)

It takes a while to run, but this works.
TextBlob raised a request error when I tried it.
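
To get from the `lang` column to the two DataFrames the question asks for, something like the following might work. It is only a sketch under a few assumptions: the helper names `make_langdetect_detector` and `split_by_language` are made up for illustration, `"unknown"` is an arbitrary placeholder label, and the seed value is arbitrary. langdetect raises `LangDetectException` on text it cannot extract features from (for example an empty string), and its results can vary between runs unless `DetectorFactory.seed` is set, so both are handled here. Detection on short, preprocessed tweets may still be wrong for some rows.

```python
import pandas as pd


def make_langdetect_detector(seed=0):
    """Build a text -> language-code function backed by langdetect.

    langdetect is imported lazily so split_by_language below can also be
    used with any other detector function.
    """
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # Without a fixed seed, langdetect can give different answers per run.
    DetectorFactory.seed = seed

    def detect_or_unknown(text):
        try:
            return detect(text)
        except LangDetectException:
            # Raised when the text has no usable features (e.g. empty string).
            return "unknown"  # placeholder label, an assumption of this sketch

    return detect_or_unknown


def split_by_language(df, text_col, detector, target="en"):
    """Return (rows detected as `target`, all other rows).

    Both results carry an extra 'lang' column; the input frame is not modified.
    """
    out = df.copy()
    # Missing values become empty strings so the detector always gets a str.
    out["lang"] = out[text_col].fillna("").astype(str).map(detector)
    is_target = out["lang"] == target
    return out[is_target], out[~is_target]


# Usage with the question's data (file name taken from the linked CSV):
# tweet_df = pd.read_csv("2012_Colorado_wildfires-tweets_labeled.csv")
# english_df, other_df = split_by_language(
#     tweet_df, " Tweet Text", make_langdetect_detector()
# )
```

Passing the detector in as a function keeps the split itself independent of langdetect, so a different language-identification library could be swapped in without touching `split_by_language`.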

Answered 2023-03-04
