Given:
A small and simple pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user_ip": ["u7", "u3", "u1", "u9", "u4", "u8", "u1", "u2", "u5"],
        "raw_sentence": ["First sentence!", np.nan, "I go to school everyday!", "She likes chips!", "I go to school everyday!", "This is 1 sample text!", "She likes chips!", "This is the thrid sentence.", "I go to school everyday!"],
    }
)
user_ip raw_sentence
0 u7 First sentence!
1 u3 NaN
2 u1 I go to school everyday!
3 u9 She likes chips!
4 u4 I go to school everyday! <<< duplicate >>>
5 u8 This is 1 sample text!
6 u1 She likes chips! <<< duplicate >>>
7 u2 This is the thrid sentence.
8 u5 I go to school everyday! <<< duplicate >>>
Goal:
I would like to know whether I can avoid calling map, or use some other strategy, to handle the rows that have duplicate (exactly identical) sentences in the raw_sentence column. My aim is to speed up my implementation on a much larger pandas DataFrame (~100K rows).
[Inefficient] solution:
At the moment I use .map() with a lambda to go over every row and call the get_lm() function, which returns the lemmas of the raw input sentence, as follows:
import nltk

nltk.download('all', quiet=True, raise_on_error=True)

STOPWORDS = nltk.corpus.stopwords.words('english')
wnl = nltk.stem.WordNetLemmatizer()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

def get_lm(input_sent: str = "my text!"):
    # tokenize, lowercase, and drop stopwords, single characters, and pure numbers
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    # lemmatize with the POS tag when it maps onto a WordNet POS, otherwise use the default (noun)
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ['a', 's', 'r', 'n', 'v']
           else wnl.lemmatize(w)
           for w, t in nltk.pos_tag(tks)]
    return lms

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')
user_ip raw_sentence lemma
0 u7 First sentence! [first, sentence] <<< 1st occurrence => lemmatization OK! >>>
1 u3 NaN NaN <<< NaN ignored using na_action='ignore' >>>
2 u1 I go to school everyday! [go, school, everyday] <<< 1st occurrence => lemmatization OK! >>>
3 u9 She likes chips! [like, chip] <<< 1st occurrence => lemmatization OK! >>>
4 u4 I go to school everyday! [go, school, everyday] <<< already lemmatized, no need to do it again >>>
5 u8 This is 1 sample text! [sample, text] <<< 1st occurrence => lemmatization OK! >>>
6 u1 She likes chips! [like, chip] <<< already lemmatized, no need to do it again >>>
7 u2 This is the thrid sentence. [thrid, sentence] <<< 1st occurrence => lemmatization OK! >>>
8 u5 I go to school everyday! [go, school, everyday] <<< already lemmatized, no need to do it again >>>
Is there a more efficient way to handle this?
Cheers
2 Answers

Answer 1 (anhgbhbe):
One possibility is to run map only over the unique sentences in the column, and then use the result as a mapping dict-ionary, as sketched below:
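A minimal sketch of that idea, reusing the get_lm() defined in the question; the name lemma_map is purely illustrative:

# lemmatize each distinct sentence exactly once
unique_sentences = df["raw_sentence"].dropna().unique()
lemma_map = {sent: get_lm(input_sent=sent) for sent in unique_sentences}

# look up every row in the precomputed dictionary; NaN rows stay NaN
df["lemma"] = df["raw_sentence"].map(lemma_map, na_action='ignore')

With ~100K rows and many repeated sentences, get_lm() runs only once per distinct sentence instead of once per row.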
Answer 2 (eit6fx6z):
Don't reinvent the wheel, use functools.cache, as sketched below:
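A minimal sketch of that approach, wrapping the question's get_lm() in a memoized helper (get_lm_cached is an illustrative name); functools.cache requires Python 3.9+, with functools.lru_cache(maxsize=None) as the older equivalent:

from functools import cache

@cache
def get_lm_cached(sentence: str):
    # same lemmatization as get_lm(), but each distinct sentence is computed only once
    return get_lm(input_sent=sentence)

df["lemma"] = df["raw_sentence"].map(get_lm_cached, na_action='ignore')

The resulting lemma column is the same as the one shown in the question.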