Python: importing GoogleNews-vectors-negative300.bin

zhte4eai  posted on 2023-01-04  in  Python

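I am writing code with gensim and am having a hard time troubleshooting a ValueError. I finally managed to extract the GoogleNews-vectors-negative300.bin.gz file so that I could use it in my model. I also tried gzip, without success. The error occurs on the last line of the code. What can I do to fix it? Is there a workaround? And is there a website I can use as a reference?
Thank you very much for your help!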

import gensim
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model

pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    pretrained_embeddings_path, binary=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-23bd96c1d6ab> in <module>()
      1 pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
----> 2 word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

C:\Users\green\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    244                             word.append(ch)
    245                     word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
--> 246                     weights = fromstring(fin.read(binary_len), dtype=REAL)
    247                     add_word(word, weights)
    248             else:

ValueError: string size must be a multiple of element size

clj7thdc1#

Edit: the S3 URL has stopped working. You can download the data from Kaggle or use this Google Drive link (be careful when downloading files from Google Drive).
The commands below no longer work.

brew install wget

wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

This downloads the gzip-compressed file, which you can decompress with the following command:

gzip -d GoogleNews-vectors-negative300.bin.gz
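
The traceback in the question shows a Windows path, where the gzip command-line tool may not be available. As a rough equivalent, here is a minimal sketch that does the same decompression with Python's standard library. The helper name `gunzip` and the chunk size are illustrative choices, not something taken from this thread.

```python
import gzip
import shutil


def gunzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress the gzip file at src_path into dst_path.

    The data is streamed in chunks, so the whole (multi-gigabyte) file
    never has to fit in memory at once.
    """
    with gzip.open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)
    return dst_path


# Example call (assumes the .gz file is in the current directory):
# gunzip("GoogleNews-vectors-negative300.bin.gz",
#        "GoogleNews-vectors-negative300.bin")
```

Unlike `gzip -d`, this sketch leaves the original .gz file in place.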

You can then load the word vectors with the following code:

from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
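
If the load call still raises the ValueError from the question, one possible cause (an assumption, not something confirmed in this thread) is an incomplete or truncated .bin file: the loader then reads fewer bytes than a full vector needs. The sketch below compares the file size with the minimum size implied by the header line of the word2vec binary format. The function name `check_word2vec_bin` is hypothetical, and the check assumes 4-byte float32 values.

```python
import os


def check_word2vec_bin(path):
    """Return (vocab_size, vector_size, looks_complete) for a word2vec .bin file.

    looks_complete is False when the file is smaller than the header says
    it must be, which suggests a partial download or failed extraction.
    """
    with open(path, "rb") as f:
        header = f.readline()  # first line: b"<vocab_size> <vector_size>\n"
    vocab_size, vector_size = (int(x) for x in header.split())
    # Each entry holds at least a 1-byte word, a separating space,
    # and vector_size float32 values (4 bytes each).
    min_size = len(header) + vocab_size * (2 + 4 * vector_size)
    return vocab_size, vector_size, os.path.getsize(path) >= min_size


# Example call (assumes the extracted file is in the current directory):
# print(check_word2vec_bin("GoogleNews-vectors-negative300.bin"))
```

This is only a lower-bound check: a file that passes can still be corrupt, but a file that fails is certainly too short and is worth downloading or extracting again.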

eyh26e7m3#

Try this:

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']

Also, see this link: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py


y0u0uwnf4#

Here is what worked for me. I loaded only part of the model rather than the whole thing, because it is very large.

!pip install wget

import gzip
import wget

# Download the gzip-compressed vectors.
url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
filename = wget.download(url)

# Decompress to a .bin file; the with-block closes both files when done.
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in, \
        open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
    f_out.writelines(f_in)

import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.decomposition import PCA

# limit=100000 reads only the first 100,000 vectors instead of the whole model.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=100000)
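
To illustrate what a partial load means, here is a simplified pure-Python reader for the word2vec binary format that stops after `limit` entries. It is a sketch modeled on the loop visible in the question's traceback, not gensim's actual implementation. The name `read_word2vec_bin` is hypothetical, and the code assumes little-endian float32 values and UTF-8 encoded words.

```python
import struct


def read_word2vec_bin(path, limit=None):
    """Read up to `limit` (word, vector) pairs from a word2vec binary file."""
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <vector_size>\n"
        vocab_size, vector_size = (int(x) for x in f.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        nbytes = 4 * vector_size  # assumption: float32 values
        for _ in range(count):
            # Each word is stored as raw bytes terminated by a single space.
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" ":
                    break
                if ch == b"":
                    raise EOFError("unexpected end of file while reading a word")
                if ch != b"\n":  # skip the newline that may follow a vector
                    chars.append(ch)
            raw = f.read(nbytes)
            if len(raw) != nbytes:
                raise EOFError("unexpected end of file while reading a vector")
            word = b"".join(chars).decode("utf-8")
            vectors[word] = struct.unpack("<%df" % vector_size, raw)
    return vectors
```

Because the reader stops after `limit` entries, only the beginning of the file is ever touched, which is why a limited load needs far less time and memory than a full one.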

qzwqbdag5#

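You can use this URL, which points to the bin.gz file hosted on Google Drive: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
The alternative mirrors (including the S3 one mentioned here) appear to be broken.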
