pandas: How to split text into sentences and create a new DataFrame with one sentence per row?

rn0zuynd  posted on 2023-06-04 in Other

I have a dataframe df with 3 columns containing speech data: filename, president, text.
I split the text data into sentences with the following command:

# Split 'text' column into sentences and create a new 'sentences' column
df['sentences'] = df['text'].apply(lambda x: nltk.sent_tokenize(x))

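However, this code tokenizes the text in such a way that the split sentences still sit in a single row of the 'sentences' column. I want to create a new DataFrame called 'data_sentence' that contains the split sentences together with their respective filename and president, but with one sentence per row.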

data_sentence = pd.DataFrame(columns=['filename', 'president', 'sentnew'])

# Iterate over each row in the original DataFrame 'df'
for index, row in df.iterrows():
    filename = row['filename']
    president = row['president']
    sentences = row['sentences']
    
    # Create a temporary DataFrame for the sentences of the current row
    temp_df = pd.DataFrame({'filename': [filename] * len(sentences),
                            'president': [president] * len(sentences),
                            'sentnew': sentences})
    
    # Concatenate the temporary DataFrame with 'data_sentence'
    data_sentence = pd.concat([data_sentence, temp_df], ignore_index=True)

# Print the resulting DataFrame 'data_sentence'
print(data_sentence)

This code runs, but it does not put one sentence in each row.
Can anyone help?


hm2xizp9 #1

It looks like you just need to `explode` the *sentences* column:

df['sentences'] = df.pop('text').apply(lambda x: nltk.sent_tokenize(x)) # use `df.pop`

data_sentence = df.explode('sentences') # <-- add this line

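Output:
| filename | president | sentences |
| --- | --- | --- |
| file1.txt | A | How to split text into sentences and create a new dataframe with one sentence per row using NLTK and Pandas? |
| file1.txt | A | I have a dataframe df that has 3 columns containing speechdata: 'filename', 'president', 'text'. |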

*Input used:*
import nltk
import pandas as pd
nltk.download("punkt")

df = pd.DataFrame({
    "filename": ["file1.txt"],
    "president": ["A"],
    "text": [
        "How to split text into sentences and create a new dataframe "
        "with one sentence per row using NLTK and Pandas? "
        "I have a dataframe df that has 3 columns containing "
        "speechdata: 'filename', 'president', 'text'.",
    ]
})
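
To try the `explode` step on its own, here is a minimal sketch that does not need the NLTK tokenizer data. It assumes the sentences are already split into lists (the hard-coded lists are hypothetical stand-ins for the output of `nltk.sent_tokenize`), and it adds `reset_index(drop=True)`, which is not part of the answer above, to replace the repeated index that `explode` leaves behind.

```python
import pandas as pd

# Hypothetical data: the 'sentences' column already holds lists of
# sentences, standing in for the result of nltk.sent_tokenize.
df = pd.DataFrame({
    "filename": ["file1.txt", "file2.txt"],
    "president": ["A", "B"],
    "sentences": [
        ["First sentence.", "Second sentence."],
        ["Only sentence."],
    ],
})

# explode() emits one row per list element and repeats the other columns.
# The original index is repeated as well, so reset it to get 0..n-1.
data_sentence = df.explode("sentences").reset_index(drop=True)

print(data_sentence)
```

Each list element ends up in its own row, with `filename` and `president` copied alongside it. If the original row labels matter, leaving out `reset_index` keeps them.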
