How to store custom meta tags in a website's Elasticsearch index using StormCrawler

baubqpgj  posted on 2024-01-04  in Apache

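I am using StormCrawler (v2.10) to crawl an intranet website and store the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages have custom meta tags, and the index mapping is defined as shown below: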

    {
      "settings": {
        "index": {
          "number_of_shards": 5,
          "number_of_replicas": 1,
          "refresh_interval": "5s",
          "default_pipeline": "timestamp"
        }
      },
      "mappings": {
        "_source": {
          "enabled": true
        },
        "properties": {
          "content": {
            "type": "text"
          },
          "description": {
            "type": "text"
          },
          "domain": {
            "type": "keyword"
          },
          "format": {
            "type": "keyword"
          },
          "keywords": {
            "type": "keyword"
          },
          "host": {
            "type": "keyword"
          },
          "title": {
            "type": "text"
          },
          "url": {
            "type": "keyword"
          },
          "timestamp": {
            "type": "date",
            "format": "date_optional_time"
          },
          "metatag": {
            "properties": {
              "article_description": {
                "type": "text"
              },
              "article_heading": {
                "type": "text"
              },
              "article_publisheddate": {
                "type": "date"
              },
              "article_type": {
                "type": "text"
              },
              "article_year": {
                "type": "text"
              }
            }
          }
        }
      }
    }

In jsoupfilters.json I added:

    "parse.article_description": "//META[@name=\"Article_Description\"]/@content",
    "parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
    "parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
    "parse.article_type": "//META[@name=\"Article_Type\"]/@content",
    "parse.article_year": "//META[@name=\"Article_Year\"]/@content"
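
To make the intent of these expressions concrete, here is a minimal Python sketch that pulls the same `content` attributes out of a sample page. It is only a stand-in for checking expectations, not StormCrawler's JSoup-based implementation: the `MetaCollector` class, the `extract_meta` helper, and the sample HTML values are all hypothetical, and the sketch compares the `name` attribute case-sensitively.

```python
from html.parser import HTMLParser

# Meta tag names and metadata keys copied from the jsoupfilters.json entries.
WANTED = {
    "Article_Description": "parse.article_description",
    "Article_Heading": "parse.article_heading",
    "Article_PublishedDate": "parse.article_publisheddate",
    "Article_Type": "parse.article_type",
    "Article_Year": "parse.article_year",
}


class MetaCollector(HTMLParser):
    """Hypothetical helper: collects content values of the wanted meta tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, not attribute values.
        if tag != "meta":
            return
        attr = dict(attrs)
        key = WANTED.get(attr.get("name"))
        if key and attr.get("content") is not None:
            self.found.setdefault(key, []).append(attr["content"])


def extract_meta(html):
    """Return {metadata key: [values]} for the wanted meta tags in html."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.found


# Invented sample page; the values are placeholders, not real intranet data.
SAMPLE_HTML = """
<html><head>
  <meta name="Article_Heading" content="Quarterly update">
  <meta name="Article_Year" content="2023" />
  <meta name="description" content="ignored by this sketch">
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_HTML))
```

If a tag that is present in the page source does not show up in the result, the first things to compare are the exact spelling and case of the `name` attribute.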


In crawler-conf.yaml I added:

    indexer.md.mapping:
      - parse.title=title
      - parse.search=search
      - parse.keywords=keywords
      - parse.description=description
      - parse.article_description=metatag.article_description
      - parse.article_heading=metatag.article_heading
      - parse.article_publisheddate=metatag.article_publisheddate
      - parse.article_type=metatag.article_type
      - parse.article_year=metatag.article_year
      - domain
      - format
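
As a rough illustration of what this mapping is meant to produce, the Python sketch below turns `source=target` entries into a lookup table and builds a document from some metadata. It rests on two assumptions: that a bare entry such as `domain` maps a metadata key to a field of the same name, and that Elasticsearch reads a dotted field name like `metatag.article_heading` as a path into the `metatag` object of the index mapping. The function names and sample values are hypothetical, and this is not StormCrawler's indexer code.

```python
def parse_mapping(entries):
    """Turn 'source=target' entries into a dict; a bare 'name' maps to itself (assumed)."""
    mapping = {}
    for entry in entries:
        source, sep, target = entry.partition("=")
        source = source.strip()
        mapping[source] = target.strip() if sep else source
    return mapping


def build_document(metadata, mapping):
    """Build a nested document, expanding dotted targets into sub-objects."""
    doc = {}
    for source, target in mapping.items():
        values = metadata.get(source)
        if not values:
            continue  # metadata key absent: the field is simply not indexed
        value = values[0] if len(values) == 1 else values
        *parents, leaf = target.split(".")
        node = doc
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return doc


if __name__ == "__main__":
    mapping = parse_mapping([
        "parse.title=title",
        "parse.article_heading=metatag.article_heading",
        "domain",
    ])
    # Placeholder metadata, as it might look after parsing one page.
    metadata = {
        "parse.title": ["Home"],
        "parse.article_heading": ["Quarterly update"],
        "domain": ["example.com"],
    }
    print(build_document(metadata, mapping))
```

The sketch also shows why a field can silently go missing: if the parse step never sets `parse.article_heading`, there is nothing for the mapping to copy.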

muk1a3rh  #1

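I don't see anything obviously wrong with your setup. You can run the class https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a URL to check the extraction. Testing the protocol's output on the command line is also useful; see our recent blog for an example.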
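
Once extraction looks right, it can also help to check the index side. The Python sketch below only builds request bodies for Elasticsearch's `exists` query, which could be sent to the `_search` endpoint of the content index to see which documents carry a given field. The helper names are hypothetical, and the field names assume the `metatag.*` mapping from the question.

```python
import json


def present_field_query(field, size=5):
    """Search body: a few documents in which the field has an indexed value."""
    return {
        "size": size,
        "_source": ["url", field],
        "query": {"exists": {"field": field}},
    }


def missing_field_query(field, size=5):
    """Search body: a few documents in which the field is absent."""
    return {
        "size": size,
        "_source": ["url"],
        "query": {"bool": {"must_not": {"exists": {"field": field}}}},
    }


if __name__ == "__main__":
    # Print the bodies so they can be pasted into Kibana Dev Tools or curl.
    print(json.dumps(present_field_query("metatag.article_heading"), indent=2))
    print(json.dumps(missing_field_query("metatag.article_heading"), indent=2))
```

If the first query returns nothing while the pages do contain the tag, the problem lies before indexing, in the parse filter or in the metadata mapping.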
