Nutch fails at deleting duplicates (on one Solr core, but not on another)

ukqbszuj  posted on 2021-06-03  in Hadoop

I have a Nutch problem that I can't seem to debug.
I started using Nutch to crawl our pages and index them into Solr core 1, and that works well: the job completes without errors.
I then wanted to index our pages into Solr core 0 as well, alongside other items we want indexed there.
Indexing itself is not the problem; crawling and indexing both work fine. On core 0, however, the deduplication task at the end of indexing keeps failing with the error below. As far as I can tell, schema.xml and solrconfig.xml have the same contents in core0 and core1, except that in core0 the url field is no longer required, because the other indexed items have no URL; the id field is the one required field common to all documents. Is that what causes the problem? What is the deduplication job trying to do, and what is blocking it? How can I get past this? Thanks!

2013-07-26 16:55:17,797 INFO solr.SolrIndexWriter - Indexing 157 documents
2013-07-26 16:55:30,407 INFO solr.SolrMappingReader - source: content dest: content
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: title dest: title
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: host dest: host
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: segment dest: segment
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: boost dest: boost
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: digest dest: digest
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: url dest: id
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: url dest: url
2013-07-26 16:55:31,590 INFO indexer.IndexingJob - Indexer: finished at 2013-07-26 16:55:31, elapsed: 00:00:19
2013-07-26 16:55:31,593 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2013-07-26 16:55:31
2013-07-26 16:55:31,593 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://<domain>:<port>/solr/core0/
2013-07-26 16:55:32,043 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-26 16:55:32,043 WARN mapred.LocalJobRunner - job_local1142877999_0055
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:388)
    at org.apache.hadoop.io.Text.set(Text.java:178)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)

b4lqfgs4


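Because those documents have no url field, their id is null, so a NullPointerException is thrown when the method below runs.
Here is the code from the SolrDeleteDuplicates class in the Nutch 1.7 trunk, where a Solr record is deleted by its id field.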

updateRequest.deleteById(solrRecord.id);

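updateRequest => an instance of org.apache.solr.client.solrj.request.UpdateRequest
solrRecord => the Solr document that needs to be deleted.
id => the id of the Solr document, as mapped by solrindex-mapping.xml in the conf folder of the Nutch distribution (if it is null, an exception is thrown)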
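
As a rough illustration of the null check this implies, here is a minimal Java sketch. It is not Nutch or SolrJ code: the class NullSafeDedupSketch and the method keepDocsWithField are hypothetical names, and plain maps stand in for Solr documents. It only shows the idea of skipping documents that lack a value for a field before that value is handed to code that cannot accept null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of Nutch or SolrJ. Maps stand in for Solr documents. */
public class NullSafeDedupSketch {

    /** Returns only the documents that carry a non-null value for the given field. */
    public static List<Map<String, Object>> keepDocsWithField(
            List<Map<String, Object>> docs, String field) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            // A missing key and an explicit null are both treated as "no value".
            if (doc.get(field) != null) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A document as a crawler might index it: both id and url are present.
        Map<String, Object> crawled = new HashMap<>();
        crawled.put("id", "http://example.com/");
        crawled.put("url", "http://example.com/");

        // A document indexed by some other process: it has an id but no url.
        Map<String, Object> other = new HashMap<>();
        other.put("id", "item-42");

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(crawled);
        docs.add(other);

        List<Map<String, Object>> safe = keepDocsWithField(docs, "url");
        System.out.println(safe.size() + " of " + docs.size() + " documents carry a url");
    }
}
```

The field name "url" is used here only because the question says that some core0 documents lack it; the same check applies to any field the deduplication job reads from each document.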
