I am trying to use Elasticsearch from Python to get some values and their frequencies from a local JSON file. The JSON file contains multiple entries, like this:
[{"id": "251088", "tweet": "lorem ipsum", "username": "Ahmet"},
{"id": "251059", "tweet": "bla bla bla", "username": "Ali"},
...
]
The JSON file contains about 500k tweets and their metadata.
My goal is to get term frequencies faster by using Elasticsearch.
import requests, json, os
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
i = 1
f = open("tweets_test.json")
docket_content = f.read()
# print(docket_content)
# only wait for 1 second, regardless of the client's default
es.cluster.health(wait_for_status='yellow', request_timeout=1)
es.index(index='tweets', ignore=[400, 404], doc_type='docket', id=i, body=json.loads(docket_content))
res = es.search(index="tweets", doc_type="docket", body={"query": {"match": {"tweet": "any-word"}}})
print("%d documents found" % res['hits']['total'])
for doc in res['hits']['hits']:
    print("%s) %s" % (doc['_id'], doc['_source']['content']))
The output is:
0 documents found
Process finished with exit code 0
Why doesn't this code work?
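One likely cause, sketched here under assumptions: json.loads returns the whole list of tweets, and es.index sends that list as the body of a single document. Elasticsearch rejects a JSON array as a document with a 400 error, which ignore=[400, 404] silences, so nothing is indexed and the search finds 0 documents. Note also that the documents have a tweet field, not a content field, so doc['_source']['content'] would raise a KeyError once hits are returned. The sketch below indexes one document per tweet with elasticsearch.helpers.bulk and then refreshes the index so that a search issued right afterwards can see the documents. The names tweet_actions and index_tweets and the optional doc_type argument are illustrative and not part of the original code.

```python
import json


def tweet_actions(tweets, index_name="tweets", doc_type=None):
    """Yield one bulk action per tweet instead of one document for the whole array."""
    for tweet in tweets:
        action = {
            "_index": index_name,
            "_id": tweet["id"],
            "_source": {"tweet": tweet["tweet"], "username": tweet["username"]},
        }
        # Assumption: older clusters (as implied by doc_type='docket' in the
        # question) expect a mapping type; omit it on Elasticsearch 7 and later.
        if doc_type is not None:
            action["_type"] = doc_type
        yield action


def index_tweets(path, index_name="tweets", doc_type=None):
    """Load the JSON array from `path` and bulk-index every tweet."""
    # Imported lazily so tweet_actions can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch([{"host": "localhost", "port": 9200}])
    with open(path, encoding="utf-8") as f:
        tweets = json.load(f)
    helpers.bulk(es, tweet_actions(tweets, index_name, doc_type))
    # Make the newly indexed documents visible to search immediately.
    es.indices.refresh(index=index_name)
    return es
```

Calling index_tweets("tweets_test.json", doc_type="docket") would replace the single es.index call in the question. Removing ignore=[400, 404] while debugging also helps, because the client then raises the error instead of hiding it.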
Am I using the wrong platform to get term frequencies?
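Elasticsearch can report term counts through a terms aggregation, so it is a plausible fit. A minimal sketch follows, assuming the tweets are already indexed one per document and that the tweet field allows aggregations (on a text field this requires fielddata to be enabled in the mapping). The helper names term_frequency_query and parse_term_buckets and the aggregation name top_terms are hypothetical. Note that doc_count is the number of documents containing a term, not the total number of occurrences.

```python
def term_frequency_query(field="tweet", size=20):
    """Build a search body that returns the `size` most common terms in `field`."""
    return {
        "size": 0,  # skip the hits, only the aggregation result is needed
        "aggs": {
            "top_terms": {
                "terms": {"field": field, "size": size}
            }
        },
    }


def parse_term_buckets(response, agg_name="top_terms"):
    """Turn the aggregation part of a search response into (term, doc_count) pairs."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]
```

With a client object es as in the question, the call would look like parse_term_buckets(es.search(index="tweets", body=term_frequency_query())).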