python-3.x: How to find a stopword-free substring in a text that contains stopwords?

zysjyyx4  posted on 2022-11-26  in Python

I am working on a text substring matching task in Python:

# query (i.e., substring without stopwords)
query = "like clutch master cylinder probably good news damn straight good news scheduled friday called thursday tow guy available dan kind strong gentleman able look day minutes alec called let know indeed master cylinder labor looking amount money times less imagined replace clutch asked look listen"

# complete text
corpus = """gene johnson does it again. the clutch pedal in my car was "acting out" last sunday so it was zero surprise when i got into the car on monday morning and was no longer able to put my gar in gear. wah wah. if one were being honest one would admit that this was a problem years in the making. let she who lives in a glass house cast the first stone. but seriously. i called gene johnson and said, "guys, my clutch is out." and then they ask me some questions and they're like, "you're clutch is not out, but your master cylinder probably is and that's good news." damn straight it's good news! they scheduled me for friday, but called on thursday because their tow guy was available (and dan is a kind and strong gentleman) and they'd be able to look at it that day. not 30 minutes after they had it, alec called to let me know that indeed, it was the master cylinder and with labor i was looking at an amount of money 5 times less than what i had imagined to replace my clutch. i asked him to look at/listen to the awful clunk/crunch/bang that had been happening for some time as well and he called me this morning after having driven it for himself (thank you!) and diagnosed the problem and got that fixed up as well. i was back in business by 2 pm. it is without hyperbole that i say, nay-reiterate, that these are the best, most honest, communicative mechanics in our fair city. my car and i are hopelessly devoted, to gene johnson."""

# stopwords list which I am using
stopwords = ['which', 'your', 'have', 'our', 'haven', 'them', 'an', 'up', 'while', 'her', 'over', 'had', "shan't", 'been', 'because', 'he', "couldn't", 'couldn', 'ours', "mustn't", 'd', "wasn't", 'same', 'won', 'yours', 'more', "you've", 'yourselves', 'will', "you'd", 'no', 'yourself', 'wouldn', 'the', 'didn', 'we', 'is', 'such', 'it', 'can', 'o', "won't", 'hadn', 'a', "shouldn't", 'why', 'hasn', 'these', 'out', 'with', 'any', "weren't", 'other', 'ma', 'once', 'down', 'nor', 'whom', 'ain', 'doesn', 'or', 'do', 'needn', 'isn', 'for', 'under', 'those', 'during', 'me', 'be', 'that', 'from', "isn't", 'has', 'themselves', 'having', 'both', 'own', 'into', 'my', 'she', 'hers', 'mightn', 'than', 'to', 'just', 'm', 'am', 'himself', "doesn't", 'should', 'mustn', "you're", "you'll", 'shouldn', 'after', 'most', "aren't", 'myself', 'and', "she's", 'this', 'll', 'ourselves', 'only', 'theirs', 'again', "hadn't", 'here', 'when', 'what', 'did', 's', 'not', 'too', 'through', 'off', 'each', 'as', "haven't", 'further', 'then', 'they', 'you', "that'll", 'of', 're', 'aren', 'y', 'now', "needn't", 'some', 'were', 'if', 'how', 'him', 'don', 'against', 'about', 'there', 'where', "wouldn't", 'who', 'are', 'at', 'on', 'all', 'few', 've', 'but', "don't", 'wasn', 'in', "didn't", 'below', "should've", 'shan', 'herself', 'i', 'weren', 'doing', 'does', 'by', 'itself', "it's", 'before', 'its', 'their', 'between', "mightn't", 'being', 't', 'above', 'so', 'very', 'was', 'until', "hasn't", 'his']

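I want to match the query string against corpus and find the start and end indices of the query within the corpus text. The main difficulty is that query is a substring of the corpus text with the stopwords removed, so it no longer appears verbatim in corpus.
I tried to find runs of consecutive words, but I could not come up with an approach that is both correct and efficient, and I need to do this on a much larger corpus: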

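import numpy as np

words_query_order = query.split()
words_corpus_order = corpus.split()

indices = []     # corpus positions of words equal to a query word
sw_indices = []  # corpus positions of stopwords
for enx, w_x in enumerate(words_query_order):
    for eny, w_y in enumerate(words_corpus_order):
        if w_y not in stopwords:
            if w_x == w_y:
                indices.append(eny)
        else:
            sw_indices.append(eny)
    # merge matched positions with stopword positions, in corpus order
    indices = list(np.sort(indices + sw_indices))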

Note: I know I could simply remove the stopwords from the corpus text and then match, but that is not my use case here.
Any help would be greatly appreciated.

qjp7pelc  #1

If I understand your question correctly, the built-in string method find is exactly what you need.
A small example:

s = "Happy Birthday"
s2 = "py"

print(s.find(s2))
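
For reference, str.find returns the index of the first occurrence of the substring, or -1 if it does not occur at all. Since the question asks for both a start and an end index, a minimal sketch of how the end could be derived from the same example (the names start and end are introduced here for illustration only):

```python
s = "Happy Birthday"
s2 = "py"

# find gives the start index of the first occurrence, or -1 if absent
start = s.find(s2)
# the exclusive end index follows from the length of the substring
end = start + len(s2) if start != -1 else -1

print(start, end)     # start and (exclusive) end of the match
print(s[start:end])   # slicing with these indices recovers the substring
print(s.find("xyz"))  # a substring that is not present
```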

Link to the documentation
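
Note that str.find only locates an exact, contiguous match, so it would not find the stopword-free query inside the original corpus, where stopwords and punctuation sit between the query words. One possible way around this, sketched here under stated assumptions rather than as a definitive solution: tokenize the corpus while remembering each token's character offsets, skip the stopwords, and look for the query words as a consecutive run among the remaining tokens. The function name find_query_span, the tokenization rule (runs of ASCII letters, lowercased), and the demo strings are all hypothetical. The sketch also assumes the query was produced by the same kind of tokenization.

```python
import re

def find_query_span(corpus, query, stopwords):
    """Return (start, end) character offsets of the query in corpus, or None.

    Assumption: tokens are runs of ASCII letters, compared in lowercase.
    """
    stop = set(stopwords)  # set membership is much faster than a list
    # Keep every non-stopword token together with its span in the ORIGINAL text.
    kept = [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+", corpus)
            if m.group().lower() not in stop]
    words = [w for w, _, _ in kept]
    target = query.lower().split()
    n = len(target)
    if n == 0:
        return None
    # Simple sliding-window comparison over the filtered token list.
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            # start of first matched token, end of last matched token
            return kept[i][1], kept[i + n - 1][2]
    return None

# Hypothetical miniature data, not the full corpus from the question.
corpus_demo = 'They said, "your master cylinder probably is out and that\'s good news." damn straight!'
stop_demo = ["they", "your", "is", "out", "and", "that", "s"]
query_demo = "master cylinder probably good news"

span = find_query_span(corpus_demo, query_demo, stop_demo)
if span is not None:
    start, end = span
    print(start, end)
    print(corpus_demo[start:end])
```

The returned offsets point into the untouched corpus, so the stopwords never have to be removed from the text itself. The sliding window is the simplest option. For a much larger corpus, a linear-time pattern search over the token list (for example Knuth-Morris-Pratt) might be worth considering.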
