Efficient way to check if a large list of words exists in millions of search queries

Posted by roejwanj on 2023-06-25 in Other


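1. I have a list of strings containing 50 million search queries [1-500+ words in each query].
2. I also have a list of strings containing 500 words and phrases, and I need to return the indices of the search queries (1) that contain any word or phrase from (2).
The goal is to keep only the queries related to a certain topic (movies) and then use NLP to cluster these filtered queries (stemming -> tf_idf -> pca -> kmeans).
I tried to filter the queries using nested loops, but it takes more than 10 hours to finish.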

filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        for word in key_words:
            if word in query:
                filtered.append(i)

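I looked into solutions that use a regex (word1|word2|...|wordN), but the problem is that I can't combine the queries into one large string, since I need to filter out the irrelevant queries.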
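One way around this, as a minimal sketch: compile the alternation once and run it against each query separately, so the queries never need to be joined. The sample `key_words` and `lines` below are stand-ins for the real data, and `filter_indices` is a hypothetical helper name. Note that the `\b` anchors make this a whole-word match, which is stricter than the substring test `word in query` in the loop above.

```python
import re

# Stand-in data; the real keyword list and log lines are much larger.
key_words = ['movie', 'movies', 'cinema', 'oscar', 'george lucas', 'netflix']
lines = [
    'the dark knight\t2019-02-17 19:05:12',
    'watch movies\t2019-02-17 19:05:13',
    'news\t2019-02-17 19:05:14',
    'george lucas interview\t2019-02-17 19:05:15',
]

# re.escape keeps any punctuation inside a keyword literal.
alternation = '|'.join(re.escape(word) for word in key_words)
# \b restricts matches to whole words and phrases.
pattern = re.compile(r'\b(?:' + alternation + r')\b')

def filter_indices(lines):
    """Return the indices of lines whose query part contains any keyword."""
    matched = []
    for i, line in enumerate(lines):
        query = line.rstrip('\n').split('\t')[0]
        if pattern.search(query):
            matched.append(i)
    return matched

print(filter_indices(lines))  # prints [1, 3]
```

The pattern is compiled once and each line is scanned once, instead of once per keyword.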
UPDATE: samples of the logs and keywords

search_logs.txt
'query  timestamp\n'
'the dark knight    2019-02-17 19:05:12\n'
'how to do a barrel roll    2019-02-17 19:05:13\n'
'watch movies   2019-02-17 19:05:13\n'
'porn   2019-02-17 19:05:13\n'
'news   2019-02-17 19:05:14\n'
'rami malek 2019-02-17 19:05:14\n'
'Traceback (most recent call last): File "t.py" 2019-02-17 19:05:15\n'
.......... # millions of other search queries

key_words = [
    'movie',
    'movies',
    'cinema',
    'oscar',
    'oscars',
    'george lucas',
    'ben affleck',
    'netflix',
    .... # hundreds of other words and phrases
]

regex

Source: https://stackoverflow.com/questions/55783903/efficient-way-to-check-if-a-large-list-of-words-exists-in-millions-of-search-que

2 Answers


h79rfbju1#

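Set comparison - Jaccard similarity

Jaccard similarity is a measure for comparing sets of words: https://www.statology.org/jaccard-similarity/
I suggest three approaches:
1. Use set comparison: store the keyword list as a set, then convert each query string to a set on the fly and compare it with the keyword set.
For example: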

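# Index the keyword list as a set for constant-time membership tests
s = set(keyword)

# Compare the words of each query against the keyword set.
# Only single-word keywords can match here; a phrase such as
# 'ben affleck' never equals a single token.
idx_list = []
for i, query in enumerate(search_arr):
    if not s.isdisjoint(query.split()):
        idx_list.append(i)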

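Something like this lets you search, but the comparison still has to visit every query, so it takes at least O(N) time for N queries.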
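Since the keyword list in the question also contains phrases, the set comparison can be extended by testing word n-grams of each query rather than single tokens. This is a rough sketch under that assumption; `ngrams` and `matching_indices` are hypothetical helper names, and the sample data is made up.

```python
def ngrams(tokens, n):
    """Yield every run of n consecutive tokens joined by a single space."""
    for start in range(len(tokens) - n + 1):
        yield ' '.join(tokens[start:start + n])

def matching_indices(queries, keywords):
    """Indices of queries containing any keyword or phrase as whole words."""
    # Normalise internal whitespace so phrases compare equal to joined n-grams.
    keyword_set = {' '.join(k.split()) for k in keywords}
    # Only n-gram lengths that actually occur among the keywords are generated.
    lengths = sorted({len(k.split()) for k in keyword_set})
    hits = []
    for i, query in enumerate(queries):
        tokens = query.split()
        if any(gram in keyword_set for n in lengths for gram in ngrams(tokens, n)):
            hits.append(i)
    return hits

queries = ['the dark knight', 'watch movies', 'ben affleck batman', 'news']
keywords = ['movie', 'movies', 'ben affleck', 'netflix']
print(matching_indices(queries, keywords))  # prints [1, 2]
```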
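2. A better approach is therefore to use an inverted index: take all the unique words in the search queries and build a temporary index, then look up the keywords in it to get the list indices.
For example: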

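# Build an inverted index: word -> set of indices of the queries containing it
hmap = dict()
for i in range(len(search_list)):
    txt = search_list[i].split(' ')
    for word in txt:
        if word not in hmap:
            hmap[word] = {i}   # set(i) would raise TypeError: an int is not iterable
        else:
            hmap[word].add(i)  # add to this word's set, not to the dict itself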

This essentially builds your search index, which can then be queried with the keywords as an inverted-index lookup.
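The lookup step is not shown above, so here is one way it might look. The function names (`build_index`, `contains_phrase`, `filter_queries`) and the sample data are assumptions for illustration. Multi-word keywords are handled by intersecting the index entries of their words and then checking adjacency on the few remaining candidates. For 50 million queries the index itself may need a lot of memory.

```python
from collections import defaultdict

def build_index(search_list):
    """Map each word to the set of indices of the queries containing it."""
    index = defaultdict(set)
    for i, query in enumerate(search_list):
        for word in query.split():
            index[word].add(i)
    return index

def contains_phrase(query, phrase_words):
    """True if phrase_words occur as consecutive tokens of query."""
    tokens = query.split()
    n = len(phrase_words)
    return any(tokens[s:s + n] == phrase_words for s in range(len(tokens) - n + 1))

def filter_queries(search_list, key_words):
    """Sorted indices of queries that contain any keyword or phrase."""
    index = build_index(search_list)
    hits = set()
    for keyword in key_words:
        words = keyword.split()
        if not words:
            continue
        # Queries containing every word of the keyword, in any position.
        candidates = set.intersection(*(index.get(w, set()) for w in words))
        if len(words) == 1:
            hits |= candidates
        else:
            # Phrases additionally require the words to be adjacent and in order.
            hits |= {i for i in candidates if contains_phrase(search_list[i], words)}
    return sorted(hits)

search_list = ['the dark knight', 'watch movies', 'ben affleck batman',
               'affleck and ben', 'news']
key_words = ['movies', 'ben affleck', 'netflix']
print(filter_queries(search_list, key_words))  # prints [1, 2]
```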
3. If this is not efficient enough, try LSH (locality-sensitive hashing):
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134

2023-06-25

juud5qan2#

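I suggest using FlashText, which was developed precisely to do this kind of task efficiently. It works as long as the keywords you are searching for are plain strings (rather than complex regular expressions).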
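To give a feel for the idea, here is a simplified word-level trie in plain Python. It is not FlashText's code or API, only a sketch of trie-based keyword matching, and `build_trie` and `contains_keyword` are hypothetical names. With the library itself you would use its `KeywordProcessor` class (`add_keyword`, `extract_keywords`) instead.

```python
END = object()  # marker key meaning "a keyword ends at this node"

def build_trie(keywords):
    """Store each keyword as a path of words in a nested-dict trie."""
    root = {}
    for keyword in keywords:
        node = root
        for word in keyword.split():
            node = node.setdefault(word, {})
        node[END] = True
    return root

def contains_keyword(trie, query):
    """True if any keyword occurs in query as a run of whole words."""
    tokens = query.split()
    for start in range(len(tokens)):
        node = trie
        # Walk the trie for as long as the following tokens keep matching.
        for j in range(start, len(tokens)):
            node = node.get(tokens[j])
            if node is None:
                break
            if END in node:
                return True
    return False

trie = build_trie(['movie', 'movies', 'ben affleck', 'netflix'])
queries = ['the dark knight', 'watch movies', 'ben affleck batman', 'ben stiller']
print([i for i, q in enumerate(queries) if contains_keyword(trie, q)])  # prints [1, 2]
```

The cost per query depends on the query length and the longest keyword, not on the number of keywords, which is why this kind of structure scales better than looping over every keyword.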

2023-06-25