I am using Apache Storm 1.2.3 and Elasticsearch 7.5.0. I have successfully crawled data from around 3k news websites and visualized it in Grafana and Kibana. However, a lot of junk content (such as ads) is coming through; I have attached a screenshot of it. Can anyone suggest how I can filter it out? I was thinking of feeding the HTML content stored in ES into a Python package (a rough sketch of that idea follows the configuration below). If that is not a good approach, please suggest a better one. Thanks in advance.
Here is the crawler-conf.yaml file:
config:
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false

  fetcher.threads.number: 50

  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  #  - customMetadataName

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
    - _redirTo
    - error.source
    - isSitemap
    - isFeed

  http.agent.name: "Nitesh Singh"
  http.agent.version: "1.0"
  http.agent.description: "built with StormCrawler Elasticsearch Archetype 1.16"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "nite0sh@gmail.com"

  # The maximum number of bytes for returned HTTP response bodies.
  # The fetched page will be trimmed to 65KB in this case
  # Set -1 to disable the limit.
  http.content.limit: 65536

  # FetcherBolt queue dump => comment out to activate
  # if a file exists on the worker machine with the corresponding port number
  # the FetcherBolt will log the content of its internal queues to the logs
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120

  fetchInterval.error: -1

  # text extraction for JSoupParserBolt
  textextractor.include.pattern:
    - DIV[id="maincontent"]
    - DIV[itemprop="articleBody"]
    - ARTICLE

  textextractor.exclude.tags:
    - STYLE
    - SCRIPT

  # custom fetch interval to be used when a document has the key/value in its metadata
  # and has been fetched successfully (value in minutes)
  # fetchInterval.FETCH_ERROR.isFeed=true: 30
  # fetchInterval.isFeed=true: 10

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
    - parse.description=description
    - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
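A minimal sketch of the post-processing idea mentioned above: read documents from ES, strip boilerplate with a Python package, and write the cleaned text back. It assumes the documents live in an index named "content" and that the raw HTML is also persisted in a field called "html"; both names are assumptions to adapt to your setup, and trafilatura is just one example of such a package (pip install elasticsearch trafilatura).

import trafilatura
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Iterate over every document, extract the main text from the stored HTML,
# and overwrite the "content" field with the cleaned version.
for hit in scan(es, index="content", query={"query": {"match_all": {}}}):
    html = hit["_source"].get("html")  # assumes raw HTML is persisted here
    if not html:
        continue
    text = trafilatura.extract(html)   # returns the main text, or None
    if text:
        es.update(index="content", id=hit["_id"], body={"doc": {"content": text}})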
1 Answer
Have you configured the text extractor?
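For example (this mirrors the textextractor section already present in the configuration above):

textextractor.include.pattern:
  - DIV[id="maincontent"]
  - DIV[itemprop="articleBody"]
  - ARTICLE

textextractor.exclude.tags:
  - STYLE
  - SCRIPT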
This restricts the extracted text to specific elements (if they are found) and/or removes the elements listed in the exclusions. Most news sites use some form of tagging to mark the main content, and the examples you gave are exactly the kind of elements you can add patterns for. Various boilerplate-removal libraries can be embedded in a ParseFilter, but their accuracy varies greatly.
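As a sketch of that route: a custom Java class wrapping such a library would be registered in parsefilters.json along the lines below; com.example.BoilerplateRemovalFilter is a hypothetical class name you would have to implement yourself, not an existing filter.

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.example.BoilerplateRemovalFilter",
      "name": "BoilerplateRemovalFilter"
    }
  ]
}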