Nutch does not fetch any URLs

p8ekf7hl · posted 2021-06-02 in Hadoop
Follow (0) | Answers (0) | Views (224)

I am running Nutch 2 on a Hadoop cluster (2 nodes). I run the crawl command:

bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
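As a point of reference, the Nutch 2.x crawl script is usually invoked as `bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>`, where the first argument is a *directory* of seed files rather than a single file. A sketch of the intended invocation, assuming `urls/` is the directory containing `seed.txt` (the Solr URL and crawl ID are taken from the question):

```sh
# Nutch 2.x crawl script: seed directory, crawl ID, Solr URL, rounds.
# Passing a directory (urls) instead of a file (urls/seed.txt) is the
# documented usage; seed.txt is assumed to hold one URL per line.
bin/crawl urls TestCrawl http://10.130.231.16:8983/solr/nutch 2

# Afterwards, the web table statistics can be inspected with:
bin/nutch readdb -stats
```

If the injector receives an unexpected path it can still report URLs as "injected" while leaving the web table in a state where the generator selects nothing, so checking the seed argument is a cheap first step.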

The screen output shows that 765 URLs were injected after filtering, but the statistics show that nothing was fetched:

14/05/27 01:33:44 INFO crawl.WebTableReader: Statistics for WebTable: 
14/05/27 01:33:44 INFO crawl.WebTableReader: jobs: {db_stats-job_201405261214_0047={jobID=job_201405261214_0047, jobName=db_stats, counters={
    File Input Format Counters={BYTES_READ=0},
    Job Counters={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=10102, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10187},
    Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=231735296, CPU_MILLISECONDS=2570, SPLIT_RAW_BYTES=1017, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=313917440, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2243407872, MAP_OUTPUT_RECORDS=0},
    FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017, FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86},
    File Output Format Counters={BYTES_WRITTEN=86}}}}
14/05/27 01:33:44 INFO crawl.WebTableReader: TOTAL urls:    0

Why is this happening? My regex filter and domain filter are set to allow all domains (I am attempting a whole-web crawl).
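For a whole-web crawl, the last rule in `conf/regex-urlfilter.txt` normally has to be the accept-everything pattern; if the file still ends with a restrictive site-specific `+^http://...` rule, every generated URL is filtered out and the fetch step has nothing to do, which matches the `TOTAL urls: 0` symptom above. A sketch of the relevant tail of the file (a common default layout, not the poster's actual configuration):

```
# conf/regex-urlfilter.txt (excerpt)
# ...default skip rules for images, queries, session ids, etc. above...

# accept anything else -- this must be the final rule for a whole-web crawl
+.
```

Note that when Nutch runs in deployed mode on a Hadoop cluster, the filter files are packaged into the runtime job file, so edits under `conf/` generally take effect only after the job artifact is rebuilt and redeployed to the cluster.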

No answers yet.
