I am running Nutch 2 on a Hadoop cluster (2 nodes). I run the crawl command:
bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
The console output shows that 765 URLs were injected after filtering, but the stats report that nothing was fetched:
14/05/27 01:33:44 INFO crawl.WebTableReader: Statistics for WebTable:
14/05/27 01:33:44 INFO crawl.WebTableReader: jobs: {db_stats-job_201405261214_0047= {jobID=job_201405261214_0047, jobName=db_stats, counters={File Input Format Counters = {BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=10102, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10187}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=231735296, CPU_MILLISECONDS=2570, SPLIT_RAW_BYTES=1017, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=313917440, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2243407872, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017, FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/05/27 01:33:44 INFO crawl.WebTableReader: TOTAL urls: 0
Why is this happening? My regex filter and domain filter are configured to allow all domains (I am attempting a whole-web crawl).
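For reference, this is roughly what a permissive `conf/regex-urlfilter.txt` looks like in my setup (a sketch of a typical allow-all configuration; your exact rules may differ):

```
# Skip URLs containing certain protocols we don't want to fetch
-^(file|ftp|mailto):

# Skip common binary/media file extensions
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# Skip URLs with session-like repeated path segments
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# Accept everything else (allow all domains)
+.
```

The final `+.` rule is what makes the filter accept all remaining URLs; if it were missing or replaced by a restrictive pattern, injection could still report URLs but subsequent fetch rounds would produce nothing.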