Nutch/Hadoop: regex-normalize.xml and regex-urlfilter.txt reported as not found even though they exist

vfh0ocws, posted on 2021-05-27 in Hadoop

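I am trying to run Nutch and Hadoop from Eclipse and followed some tutorials to set it up. I am currently stuck on a NullPointerException, which I believe is caused by regex-urlfilter.txt and regex-normalize.xml not being found.
Here are the relevant lines from the logs: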

[LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.conf.Configuration  - regex-normalize.xml not found
4473 [LocalJobRunner Map Task Executor #0] WARN org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer  - Can't load the default rules! 
4477 [LocalJobRunner Map Task Executor #0] DEBUG org.apache.nutch.util.ObjectCache  - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/tmp/hadoop-338737067/mapred/local/localRunner/338737067/job_local1524701719_0001/job_local1524701719_0001.xml, instantiating a new object cache
4486 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.conf.Configuration  - regex-urlfilter.txt not found
4486 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask  - Starting flush of map output
4516 [LocalJobRunner Map Task Executor #0] DEBUG org.apache.hadoop.util.concurrent.ExecutorHelper  - afterExecute in thread: LocalJobRunner Map Task Executor #0, runnable type: java.util.concurrent.FutureTask
4516 [Thread-3] INFO org.apache.hadoop.mapred.LocalJobRunner  - map task executor complete.
4521 [Thread-3] WARN org.apache.hadoop.mapred.LocalJobRunner  - job_local1524701719_0001
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:551)

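Both files are present in the \workspace\apache-nutch-1.16\conf folder, so I am not sure what I am doing wrong. I double-checked that the HADOOP_HOME and HADOOP_BIN environment variables are set and point to the correct directories. What I do not know is which directory is being searched for regex-urlfilter.txt and regex-normalize.xml. Any help resolving this would be appreciated.
I am using Hadoop 3.0.0 and apache-nutch-1.16.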

czq61nw1

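The conf/ folder must be on the Java classpath. The easiest way to achieve this is to run Nutch through one of the provided scripts, bin/nutch or bin/crawl. In the binary package the script is located at apache-nutch-1.16/bin/nutch; in the source package it is at apache-nutch-1.16/runtime/local/bin/nutch once ant runtime has been executed. Using these scripts also lets you keep the configuration files in a different directory and point NUTCH_CONF_DIR at it; the script simply puts that location at the front of the classpath.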
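
When launching from an IDE instead of the scripts, it can help to confirm what the JVM actually sees. The following is a minimal diagnostic sketch, not part of Nutch: the class name ClasspathCheck and the method locate are hypothetical. It assumes that Hadoop's Configuration resolves files such as regex-urlfilter.txt through the class loader, so a file that sits on disk but outside the classpath will still be logged as "not found".

```java
import java.net.URL;

/**
 * Hypothetical diagnostic helper (not a Nutch or Hadoop class): reports whether
 * the configuration files named in the log resolve on the current classpath.
 */
public class ClasspathCheck {

    /** Returns the URL the class loader resolves for the resource, or null if it is absent. */
    public static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName);
    }

    public static void main(String[] args) {
        // Print the effective classpath so a missing conf/ entry is easy to spot.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));

        // File names taken from the log output in the question.
        String[] names = {"regex-normalize.xml", "regex-urlfilter.txt"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "NOT on classpath" : url.toString()));
        }
    }
}
```

Run it with the same launch configuration as the failing job. If both files print "NOT on classpath", the conf/ folder (or the directory NUTCH_CONF_DIR would point to) still needs to be added to that configuration's classpath.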
