Python: using spaCy with Pandas

nom7f22z  posted on 2022-12-02  in  Python

I'm trying to build a multi-class text classifier using spaCy. I have built the model, but I'm having trouble applying it to my full dataset. The model I have built so far is shown in the screenshot:
Screenshot
Below is the code I used to apply it to my full dataset using Pandas:

Messages = pd.read_csv('Messages.csv', encoding='cp1252')
    
Messages['Body'] = Messages['Body'].astype(str)

Messages['NLP_Result'] = nlp(Messages['Body'])._.cats

But it gives me the error:

ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>

The reason I want to use Pandas here is that the dataset has two columns, ID and Body. I want to apply the NLP model only to the Body column, but the final dataset should have three columns: ID, Body, and the NLP result shown in the screenshot above.
Thanks so much
I also tried the Pandas apply method, but had no luck. Code used:

Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats

The error I got: AttributeError: 'Series' object has no attribute '_'
The expected result is the three columns described above.

qhhrdooz (answer #1)

You should pass a callable to Series.apply:

Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)

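Here, each value in the Body column is passed to the lambda as x, one at a time.
nlp(x) creates a spaCy Doc object that carries the attribute you want to access, so nlp(x)._.cats returns the expected value for that row. The results are collected into the new NLP_Result column.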
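To see why the lambda matters, here is a minimal sketch that runs without the trained model. It assumes a hypothetical stand-in, `fake_nlp`, which accepts one string and returns an object exposing `._.cats` the way the real pipeline's Doc does; the keyword rule and the sample rows are invented for illustration only.

```python
from types import SimpleNamespace

import pandas as pd


def fake_nlp(text):
    """Hypothetical stand-in for nlp(): takes ONE string, returns a Doc-like object."""
    if not isinstance(text, str):
        # Mirrors the behaviour in the question: a whole Series is rejected.
        raise ValueError(f"Expected a string, but got: {type(text)}")
    # Toy keyword rule, not a real classifier.
    score = 1.0 if "delivered" in text.lower() else 0.0
    cats = {"Deliveries": score, "Not_Spam": 1.0 - score}
    # The `_` attribute imitates spaCy's extension namespace (doc._.cats).
    return SimpleNamespace(_=SimpleNamespace(cats=cats))


# Invented sample data with the same two columns as Messages.csv.
Messages = pd.DataFrame(
    {"ID": [1, 2], "Body": ["Your parcel was delivered", "Lunch tomorrow?"]}
)

# apply() calls the lambda once per value in Body, so `._.cats` is looked up
# on each Doc-like object, never on the Series itself.
Messages["NLP_Result"] = Messages["Body"].apply(lambda x: fake_nlp(x)._.cats)

print(Messages)
```

By contrast, `Messages['Body'].apply(nlp)._.cats` looks up `_` on the Series that apply returns, which is what raises the AttributeError in the question.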

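import spacy
import classy_classification  # imported for its side effect: makes "text_categorizer" available
import pandas as pd

# Training examples: one text per line in each file
with open('Deliveries.txt', 'r') as d:
    Deliveries = d.read().splitlines()
with open("Not Spam.txt", "r") as n:
    Not_Spam = n.read().splitlines()

# Map each label to its list of example texts
data = {}
data["Deliveries"] = Deliveries
data["Not_Spam"] = Not_Spam

# NLP model
nlp = spacy.blank("en")
nlp.add_pipe("text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

# Load the messages (as in the question) and classify each Body value
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)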
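If the dataset is large, an alternative to calling nlp once per row is spaCy's nlp.pipe, which processes texts in batches and yields one Doc per input, in order. The following is only a sketch under stated assumptions: `toy_categorizer` is a hypothetical component standing in for the trained text_categorizer, it registers the `cats` extension itself, and the sample rows and batch size are invented.

```python
import pandas as pd
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Assumption: the real classifier provides doc._.cats; this toy version
# registers the extension itself so the sketch runs on its own.
if not Doc.has_extension("cats"):
    Doc.set_extension("cats", default=None)


@Language.component("toy_categorizer")
def toy_categorizer(doc):
    # Toy keyword rule, not a trained model.
    score = 1.0 if "delivered" in doc.text.lower() else 0.0
    doc._.cats = {"Deliveries": score, "Not_Spam": 1.0 - score}
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("toy_categorizer")

# Invented sample data with the same two columns as Messages.csv.
Messages = pd.DataFrame(
    {"ID": [1, 2], "Body": ["Your parcel was delivered", "Lunch tomorrow?"]}
)

# nlp.pipe takes an iterable of strings and yields Docs in input order,
# so the resulting list lines up with the DataFrame rows.
docs = nlp.pipe(Messages["Body"].astype(str), batch_size=64)
Messages["NLP_Result"] = [doc._.cats for doc in docs]

print(Messages)
```

With the real pipeline, only the last two statements should need to change hands: replace the toy component with the trained nlp object and keep the list comprehension as is.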
