hadoop: Apache Nutch not reading a new configuration file when run with a job file

Posted by omvjsjqw on 2022-11-01 in Hadoop


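I have configured Apache Nutch 1.x for web crawling. There is a requirement to add some extra information for each domain to the documents indexed into Solr. That information is configured in a JSON file. I updated the index-basic plugin with the following code and tested it successfully in local mode. The code snippet is as follows: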

this.enable_extra_domain = conf.getBoolean("domain.extraInfo.enable", false);
if (this.enable_extra_domain) {
    String domainExtraInfo = conf.get("domain.extraInfo.file", "conf/domain-extra.json");
    readDomainFile(domainExtraInfo);
    LOG.info("domain.extraInfo.enable is enabled. Using " + domainExtraInfo + " for input.");
}
else {
    LOG.info("domain.extraInfo.enable is disabled.");
}

The function that reads the file is as follows:

private void readDomainFile(String domainExtraInfo) {
    // Instance of our Domain map with extra info
    website_records = new HashMap<String, List<Object>>();

    JSONParser jsonParser = new JSONParser();
    try (FileReader reader = new FileReader(domainExtraInfo))
    {
        Object obj = jsonParser.parse(reader);
        JSONArray DomainList = (JSONArray) obj;

        DomainList.forEach( domain -> parseDomainObject( (JSONObject) domain ) );

    }
    catch (Exception e) {
        // TODO: handle exception
        e.printStackTrace();
    }
}

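This code runs successfully in local mode. But when I run Nutch with the .job file on EMR (or any other Hadoop cluster), I get a java.io.FileNotFoundException. Where is the problem? In local mode my new configuration file is in the conf folder, and on deployment it is added to the .job file.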

hadoop

Source: https://stackoverflow.com/questions/72592718/apache-nutch-not-reading-a-new-configuration-file-when-run-with-job-file

1 Answer


oyt4ldly1#

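"In local mode my new configuration file is in the conf folder, and on deployment it is added to the .job file."
In distributed mode, the file has to be read from the job file that is deployed to the nodes of the Hadoop cluster, not from a path on the local file system. The easiest way to do this is to use the methods provided by the Hadoop Configuration class, for example getConfResourceAsReader(String name). Note: the argument "name" is the file name without the directory part ("domain-extra.json"). You can find many examples in the Nutch source code, for example in one of the URL filters.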
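The lookup by bare file name can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the class DomainExtraLoader and its methods resourceName and readResource are hypothetical names, and the `opener` parameter is a stand-in so the sketch compiles without Hadoop on the classpath. In the plugin, one would presumably pass `conf::getConfResourceAsReader` as the opener; that method returns null when the resource cannot be found, which the sketch turns into an explicit error.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
class DomainExtraLoader {

    // Drops any directory part: "conf/domain-extra.json" -> "domain-extra.json".
    static String resourceName(String configured) {
        return Paths.get(configured).getFileName().toString();
    }

    // Reads the whole resource as text. `opener` stands in for Hadoop's
    // Configuration#getConfResourceAsReader (an assumption of this sketch).
    static String readResource(Function<String, Reader> opener, String configured)
            throws IOException {
        String name = resourceName(configured);
        Reader raw = opener.apply(name);
        if (raw == null) {
            // Fail loudly instead of continuing with an empty domain map.
            throw new IOException("Resource not found: " + name);
        }
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(raw)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Fake opener for the demo; in the plugin use conf::getConfResourceAsReader.
        Function<String, Reader> fakeOpener = name ->
                "domain-extra.json".equals(name) ? new StringReader("[]") : null;
        System.out.println(readResource(fakeOpener, "conf/domain-extra.json"));
    }
}
```

In readDomainFile from the question, the same idea would mean replacing `new FileReader(domainExtraInfo)` with the Reader obtained from the Configuration object and passing that Reader to jsonParser.parse as before. Stripping the directory part keeps the existing default value "conf/domain-extra.json" usable.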

2022-11-01
