How to remove a word completely from a Word2Vec model in gensim?

3zwtqj6y  posted on 2023-10-15  in Python
Follow(0) | Answers(7) | Views(82)

Given a model, e.g.

from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

It's possible to remove the word from the w2v vocabulary, e.g.

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we do a similarity lookup on other words after deleting graph, we see the word graph popping up, e.g.

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]

How to remove a word completely from the Word2Vec model in gensim?

UPDATED

In response to @vumaasha's comment:
"Could you give some details as to why you want to delete a word?"

  • Say my universe of words is all the words in the corpus, so as to learn dense relationships between all words.
  • But when I want to generate similar words, they should come only from a subset of domain-specific words.
  • It's possible to generate more than enough from .most_similar() and then filter the words, but if the specific domain's space is small, I might be looking for a word that's ranked the 1000th most similar, which is inefficient.
  • It would be better if the word were totally removed from the word vectors, so that .most_similar() won't return words outside of the specific domain.
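The filter-then-rank workaround mentioned in the third bullet can be sketched with a tiny helper (`filter_to_domain` is a hypothetical name, operating on any `most_similar()`-style list of `(word, score)` pairs):

```python
# Hypothetical helper: filter a most_similar()-style result list down to a
# domain-specific word set, keeping the original similarity ranking.
def filter_to_domain(similar, domain_words, topn=10):
    return [(w, s) for w, s in similar if w in domain_words][:topn]

# Mocked most_similar() output for 'graph':
similar = [('binary', 0.678), ('a', 0.628), ('unordered', 0.597),
           ('perceived', 0.561), ('trees', 0.228)]
print(filter_to_domain(similar, {'trees', 'unordered'}))
# -> [('unordered', 0.597), ('trees', 0.228)]
```

As the bullets say, this becomes inefficient when the domain words rank far down the full-vocabulary list, which is why the answers below remove the words from the vectors instead.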

1l5u6lss1#

I have written a function that removes from a KeyedVectors any words that are not in a predefined word list.

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    # Note: w2v.vectors_norm is only populated after a similarity query
    # (or an explicit w2v.init_sims() call); run one before calling this.
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = np.array(new_vectors_norm)

It rewrites all of the variables related to Word2VecKeyedVectors.
Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
 ('lager', 0.7733745574951172),
 ('Beer', 0.71753990650177),
 ('drinks', 0.668931245803833),
 ('lagers', 0.6570086479187012),
 ('Yuengling_Lager', 0.655455470085144),
 ('microbrew', 0.6534324884414673),
 ('Brooklyn_Lager', 0.6501551866531372),
 ('suds', 0.6497018337249756),
 ('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
 ('wine', 0.6217695474624634),
 ('bash', 0.20583480596542358),
 ('computer', 0.06677375733852386),
 ('python', 0.005948573350906372)]


ltskdhd12#

There is no direct way to do what you are looking for. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.
The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with vectors corresponding to the words of your interest. Then you are done.

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)

Update:

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]

If you look at this line, it means that when restrict_vocab is used it restricts the vocabulary to its top n words, which is only meaningful if you have sorted the vocabulary by frequency. If you don't pass restrict_vocab, self.vectors_norm is what goes into limited.
The method most_similar calls another method, init_sims, which initializes the value of self.vectors_norm as below:

self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

So, you can pick the words that you are interested in, prepare their norms, and use them in place of limited. This should work.
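The norm-then-dot idea can be sketched with plain NumPy (toy vectors and hypothetical word labels for illustration, not gensim's actual code):

```python
import numpy as np

# Toy 2-d "word vectors" (hypothetical words, for illustration only).
words = ['x_axis', 'y_axis', 'diagonal']
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

# Same normalization as init_sims(): each row scaled to unit length.
limited = vectors / np.sqrt((vectors ** 2).sum(-1))[..., np.newaxis]

# Cosine similarity of every word to 'diagonal' is then a single dot product.
dists = limited.dot(limited[2])
best = np.argsort(-dists)          # indices, most similar first
print([(words[i], round(float(dists[i]), 2)) for i in best])
# -> [('diagonal', 1.0), ('y_axis', 0.8), ('x_axis', 0.6)]
```

Building `limited` only from the rows of the words you care about restricts every subsequent ranking to that subset.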


um6iljoc3#

Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.
Suppose you only want to keep the top 5000 words in your model.

import numpy as np

wv = w2v_model.wv
words_to_trim = wv.index2word[5000:]
# In OP's case:
# words_to_trim = ['graph']
ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

for w in words_to_trim:
    del wv.vocab[w]

wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
wv.init_sims(replace=True)

for i in sorted(ids_to_trim, reverse=True):
    del(wv.index2word[i])

This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.
The advantage of this is that if you write out the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
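The key step is `np.delete(..., axis=0)`, which drops whole rows (word vectors) by index; a minimal standalone illustration:

```python
import numpy as np

# 4 "words" with 3-dimensional vectors.
vectors = np.arange(12.0).reshape(4, 3)
ids_to_trim = [1, 3]                       # row indices to drop

trimmed = np.delete(vectors, ids_to_trim, axis=0)
print(trimmed.shape)   # (2, 3) -- rows 0 and 2 survive
```

Note that `np.delete` returns a new array rather than modifying `vectors` in place, which is why the answer above assigns the result back to `wv.vectors`.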


izkcnapc4#

Having tried it, I feel the most straightforward way is as follows:
1. Get the Word2Vec embedding in text file format.
2. Identify the lines corresponding to the word vectors that you would like to keep.
3. Write a new text-file Word2Vec embedding model.
4. Load the model and enjoy (save it to binary if you wish, etc.).
My sample code is as follows:

import re

# Note: isLatin(), txtWrite() and txtAppend() are the author's own helpers;
# file_entVecs_txt / file_entVecs_SHORT_txt are paths defined elsewhere.
line_no = 0 # line0 = header
numEntities=0
targetLines = []

with open(file_entVecs_txt,'r') as fp:
    header = fp.readline() # header

    while True:
        line = fp.readline()
        if line == '': #EOF
            break
        line_no += 1

        isLatinFlag = True
        for i_l, char in enumerate(line):
            if not isLatin(char): # Care about entity that is Latin-only
                isLatinFlag = False
                break
            if char==' ': # reached separator
                ent = line[:i_l]
                break

        if not isLatinFlag:
            continue

        # Check for numbers in entity
        if re.search(r'\d', ent):
            continue

        # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
        if re.match(r'^ENTITY/.*#', ent):
            continue

        targetLines.append(line_no)
        numEntities += 1

# Update header with new metadata
header_new = re.sub(r'^\d+', str(numEntities), header, count=1)

# Generate the file
txtWrite('',file_entVecs_SHORT_txt)
txtAppend(header_new,file_entVecs_SHORT_txt)

line_no = 0
ptr = 0
with open(file_entVecs_txt,'r') as fp:
    while ptr < len(targetLines):
        target_line_no = targetLines[ptr]

        while (line_no != target_line_no):
            fp.readline()
            line_no+=1

        line = fp.readline()
        line_no+=1
        ptr+=1
        txtAppend(line,file_entVecs_SHORT_txt)
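The same four-step recipe can be sketched more compactly for the standard word2vec text format (header line `<count> <dim>`, then one word plus its vector per line); `keep_words` here is a hypothetical helper, not the author's code:

```python
import tempfile

# Rewrite a word2vec-format text file, keeping only the chosen words and
# updating the header's word count.
def keep_words(src_path, dst_path, keep):
    with open(src_path) as src:
        _, dim = src.readline().split()
        kept = [line for line in src if line.split(' ', 1)[0] in keep]
    with open(dst_path, 'w') as dst:
        dst.write(f"{len(kept)} {dim}\n")
        dst.writelines(kept)

# Tiny demo embedding file:
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("3 2\ngraph 0.1 0.2\ntrees 0.3 0.4\nbeer 0.5 0.6\n")
keep_words(f.name, f.name + '.short', {'trees', 'beer'})
print(open(f.name + '.short').read())
# -> 2 2
#    trees 0.3 0.4
#    beer 0.5 0.6
```

The resulting file can then be loaded with KeyedVectors.load_word2vec_format() as usual.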

FYI, a FAILED ATTEMPT: I tried @zsozso's method (with the np.array modification suggested by @Taegyung), left it running overnight for at least 12 hrs, and it was still stuck at getting the new words from the restricted set... Perhaps that's because I have a lot of entities... But my text-file method works within an hour.
Code:

# [FAILED] Stuck at Building new vocab...
import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    print('Building new vocab..')

    for i in range(len(w2v.vocab)):

        if (i%int(1e6)==0) and (i!=0):
            print(f'working on {i}')

        word = w2v.index2entity[i]
        vec = np.array(w2v.vectors[i])
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    print('Assigning new vocab')
    w2v.vocab = new_vocab
    print('Assigning new vectors')
    w2v.vectors = np.array(new_vectors)
    print('Assigning new index2entity, index2word')
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    print('Assigning new vectors_norm')
    w2v.vectors_norm = np.array(new_vectors_norm)

tyg4sfes5#

The same as zsozso's answer, but for Gensim 4:

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_index_to_key = []
    new_key_to_index = {}
    new_vectors = []
    for ind, word in enumerate(w2v.index_to_key):
        if word in restricted_word_set:
            new_key_to_index[word] = len(new_index_to_key)
            new_index_to_key.append(word)
            new_vectors.append(w2v.vectors[ind])
    w2v.index_to_key = new_index_to_key
    w2v.key_to_index = new_key_to_index
    w2v.vectors = np.array(new_vectors)

Usage:

from gensim.models import KeyedVectors

restricted_words = ...
vectors = KeyedVectors.load_word2vec_format(input_file)
restrict_w2v(vectors, restricted_words)
vectors.save_word2vec_format(output_file)

It worked for me (Gensim 4.3.1).


irlmq6kh6#

"But when I want to generate similar words, they should come only from a subset of domain-specific words."

You can use most_similar_to_given to get the most similar word from a set of your choice. The method uses cosine similarity under the hood.

Example

import gensim.downloader

w2v = gensim.downloader.load('glove-twitter-50')
w2v.most_similar_to_given("hotel", ["plane", "house", "penguin"]) # yields 'house'
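Conceptually, most_similar_to_given just picks the candidate with the highest cosine similarity to the query word; a standalone sketch with made-up 2-d vectors (not gensim's implementation):

```python
import numpy as np

# Made-up 2-d vectors for illustration only.
vecs = {'hotel': [0.9, 0.1], 'plane': [0.1, 0.9],
        'house': [0.8, 0.3], 'penguin': [-0.5, 0.5]}

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_to_given(word, candidates):
    # The candidate whose vector is closest (by cosine) to the query's.
    return max(candidates, key=lambda c: cosine(vecs[word], vecs[c]))

print(most_similar_to_given('hotel', ['plane', 'house', 'penguin']))  # house
```

This avoids touching the model at all, at the cost of one similarity computation per candidate.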

krugob8w7#

For anyone coming here nowadays, I'd suggest this approach: https://stackoverflow.com/a/74850545
Building the model with a smaller vocabulary in the first place is native to gensim, very fast, and generally good practice.
