Given:
A small and simple pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user_ip": ["u7", "u3", "u1", "u9", "u4", "u8", "u1", "u2", "u5"],
        "raw_sentence": ["First sentence!", np.nan, "I go to school everyday!", "She likes chips!", "I go to school everyday!", "This is 1 sample text!", "She likes chips!", "This is the thrid sentence.", "I go to school everyday!"],
    }
)
user_ip raw_sentence
0 u7 First sentence!
1 u3 NaN
2 u1 I go to school everyday!
3 u9 She likes chips!
4 u4 I go to school everyday! <<< duplicate >>>
5 u8 This is 1 sample text!
6 u1 She likes chips! <<< duplicate >>>
7 u2 This is the thrid sentence.
8 u5 I go to school everyday! <<< duplicate >>>
Goal:
I would like to know whether I can avoid calling map, or use some other strategy, to handle the rows that have duplicate (exactly identical) sentences in the raw_sentence column. My aim is to speed up my implementation on a much larger pandas DataFrame (~100K rows).
[Inefficient] solution:
At the moment I use .map() with a lambda to go over every row and call the get_lm() function, which returns the lemmas of the raw input sentence, as follows:
import nltk

nltk.download('all', quiet=True, raise_on_error=True)

STOPWORDS = nltk.corpus.stopwords.words('english')
wnl = nltk.stem.WordNetLemmatizer()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

def get_lm(input_sent: str = "my text!"):
    # tokenize, lowercase, and drop stopwords, single characters, and pure numbers
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    # lemmatize with the POS tag when it maps onto a WordNet POS, otherwise use the default (noun)
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ['a', 's', 'r', 'n', 'v']
           else wnl.lemmatize(w)
           for w, t in nltk.pos_tag(tks)]
    return lms

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')
user_ip raw_sentence lemma
0 u7 First sentence! [first, sentence] <<< 1st occurrence => lemmatization OK! >>>
1 u3 NaN NaN <<< NaN ignored using na_action='ignore' >>>
2 u1 I go to school everyday! [go, school, everyday] <<< 1st occurrence => lemmatization OK! >>>
3 u9 She likes chips! [like, chip] <<< 1st occurrence => lemmatization OK! >>>
4 u4 I go to school everyday! [go, school, everyday] <<< already lemmatized, no need to do it again >>>
5 u8 This is 1 sample text! [sample, text] <<< 1st occurrence => lemmatization OK! >>>
6 u1 She likes chips! [like, chip] <<< already lemmatized, no need to do it again >>>
7 u2 This is the thrid sentence. [thrid, sentence] <<< 1st occurrence => lemmatization OK! >>>
8 u5 I go to school everyday! [go, school, everyday] <<< already lemmatized, no need to do it again >>>
Is there a more efficient way to handle this?
Cheers
2 Answers

Answer 1 (anhgbhbe):
One possibility is to run map only over the unique sentences in the column, and then use the result as a mapping dict-ionary, as sketched below:
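A minimal sketch of that idea, reusing the get_lm() defined in the question; the name lemma_map is purely illustrative:

# lemmatize each distinct sentence exactly once
unique_sentences = df["raw_sentence"].dropna().unique()
lemma_map = {sent: get_lm(input_sent=sent) for sent in unique_sentences}

# look up every row in the precomputed dictionary; NaN rows stay NaN
df["lemma"] = df["raw_sentence"].map(lemma_map, na_action='ignore')

With ~100K rows and many repeated sentences, get_lm() runs only once per distinct sentence instead of once per row.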
Answer 2 (eit6fx6z):
Don't reinvent the wheel, use functools.cache, as sketched below:
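A minimal sketch of that approach, wrapping the question's get_lm() in a memoized helper (get_lm_cached is an illustrative name); functools.cache requires Python 3.9+, with functools.lru_cache(maxsize=None) as the older equivalent:

from functools import cache

@cache
def get_lm_cached(sentence: str):
    # same lemmatization as get_lm(), but each distinct sentence is computed only once
    return get_lm(input_sent=sentence)

df["lemma"] = df["raw_sentence"].map(get_lm_cached, na_action='ignore')

The resulting lemma column is the same as the one shown in the question.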