AppMaster container requests not working

Posted by disbfnqx on 2021-05-30 in Hadoop

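I am running a local YARN cluster with 8 vcores and 8 GB of total memory.
The workflow is as follows:
The YarnClient submits an application request, which starts the AppMaster in a container.
The AppMaster starts, creates the AMRMClient and NMClient, registers itself with the RM, and then creates 4 container requests for worker threads by calling addContainerRequest on the AMRMClient.
Even though enough resources are available, no containers are allocated (the callback onContainersAllocated is never called). I checked the NodeManager and ResourceManager logs but found no lines related to the container requests. I have followed the Apache documentation closely and cannot see what I am doing wrong.
Here is the AppMaster code for reference: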

@Override
public void run() {
    Map<String, String> envs = System.getenv();

    String containerIdString = envs.get(ApplicationConstants.Environment.CONTAINER_ID.toString());
    if (containerIdString == null) {
        // container id should always be set in the env by the framework
        throw new IllegalArgumentException("ContainerId not set in the environment");
    }
    ContainerId containerId = ConverterUtils.toContainerId(containerIdString);
    ApplicationAttemptId appAttemptID = containerId.getApplicationAttemptId();

    LOG.info("Starting AppMaster Client...");

    YarnAMRMCallbackHandler amHandler = new YarnAMRMCallbackHandler(allocatedYarnContainers);

    // TODO: get heartbeat interval from config instead of the hard-coded 1000 ms
    amClient = AMRMClientAsync.createAMRMClientAsync(1000, this);
    amClient.init(config);
    amClient.start();

    LOG.info("Starting AppMaster Client OK");

    //YarnNMCallbackHandler nmHandler = new YarnNMCallbackHandler();
    containerManager = NMClient.createNMClient();
    containerManager.init(config);
    containerManager.start();

    // Get port, url information. TODO: get tracking url
    String appMasterHostname = NetUtils.getHostname();

    String appMasterTrackingUrl = "/progress";

    // Register self with ResourceManager. This will start heart-beating to the RM
    RegisterApplicationMasterResponse response = null;

    LOG.info("Register AppMaster on: " + appMasterHostname + "...");

    try {
        response = amClient.registerApplicationMaster(appMasterHostname, 0, appMasterTrackingUrl);
    } catch (YarnException | IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
        return;
    }

    LOG.info("Register AppMaster OK");

    // Dump out information about cluster capability as seen by the resource manager
    int maxMem = response.getMaximumResourceCapability().getMemory();
    LOG.info("Max mem capabililty of resources in this cluster " + maxMem);

    int maxVCores = response.getMaximumResourceCapability().getVirtualCores();
    LOG.info("Max vcores capabililty of resources in this cluster " + maxVCores);

    containerMemory = Integer.parseInt(config.get(YarnConfig.YARN_CONTAINER_MEMORY_MB));
    containerCores = Integer.parseInt(config.get(YarnConfig.YARN_CONTAINER_CPU_CORES));

    // A resource ask cannot exceed the max.
    if (containerMemory > maxMem) {
      LOG.info("Container memory specified above max threshold of cluster."
          + " Using max value." + ", specified=" + containerMemory + ", max="
          + maxMem);
      containerMemory = maxMem;
    }

    if (containerCores > maxVCores) {
      LOG.info("Container virtual cores specified above max threshold of  cluster."
        + " Using max value." + ", specified=" + containerCores + ", max=" + maxVCores);
      containerCores = maxVCores;
    }
    List<Container> previousAMRunningContainers = response.getContainersFromPreviousAttempts();
    LOG.info("Received " + previousAMRunningContainers.size()
            + " previous AM's running containers on AM registration.");

    for (int i = 0; i < 4; ++i) {
        ContainerRequest containerAsk = setupContainerAskForRM();
        amClient.addContainerRequest(containerAsk); // NOTHING HAPPENS HERE...
        LOG.info("Available resources: " + amClient.getAvailableResources().toString());
    }

    while(completedYarnContainers != 4) {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    LOG.info("Done with allocation!");

}

@Override
public void onContainersAllocated(List<Container> containers) {
    LOG.info("Got response from RM for container ask, allocatedCnt=" + containers.size());

    for (Container container : containers) {
        LOG.info("Allocated yarn container with id: {}" + container.getId());
        allocatedYarnContainers.push(container);

        // TODO: Launch the container in a thread
    }
}

@Override
public void onError(Throwable error) {
    LOG.error(error.getMessage());
}

@Override
public float getProgress() {
    return (float) completedYarnContainers / allocatedYarnContainers.size();
}

Here is the output of jps:

14594 NameNode
15269 DataNode
17975 Jps
14666 ResourceManager
14702 NodeManager

Here is the AppMaster log covering initialization and the 4 container requests:

23:47:09 YarnAppMaster - Starting AppMaster Client OK
23:47:09 YarnAppMaster - Register AppMaster on: andrei-mbp.local/192.168.1.4...
23:47:09 YarnAppMaster - Register AppMaster OK
23:47:09 YarnAppMaster - Max mem capabililty of resources in this cluster 2048
23:47:09 YarnAppMaster - Max vcores capabililty of resources in this cluster 2
23:47:09 YarnAppMaster - Received 0 previous AM's running containers on AM registration.
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Progress indicator should not be negative

Thanks in advance.

ejk8hzay #1

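Thanks to Alexandre Fonseca for pointing out that getProgress() returns NaN (the result of dividing zero by zero) when it is called before the first allocation, which makes the allocate call to the ResourceManager abort immediately with an exception.
Read more here.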

piv4azn7 #2

I suspect the problem comes precisely from the negative progress:

23:47:11 YarnAppMaster - Progress indicator should not be negative

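Note that since you are using AMRMClientAsync, requests are not sent immediately when you call addContainerRequest. There is a heartbeat function that runs periodically, and it is inside this function that allocate is called and the pending requests are sent. The progress value used by this function starts at 0, but once a response to an allocate is received it is updated with the value returned by your handler.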
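To make the deferred sending concrete, here is a toy model of the pattern just described. It is not Hadoop's implementation: the class `ToyAsyncClient` and all of its methods are hypothetical names made up for illustration. `addRequest` only queues an ask, and `heartbeat` (which a real client would run on a timer) is the only place where queued asks go out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, not the Hadoop API.
final class ToyAsyncClient {
    private final List<String> pending = new ArrayList<>();
    private final List<String> sent = new ArrayList<>();

    // Mirrors the idea of addContainerRequest: the ask is queued, nothing is sent yet.
    void addRequest(String ask) {
        pending.add(ask);
    }

    // Mirrors the idea of the periodic heartbeat: queued asks go out in one batch.
    void heartbeat() {
        sent.addAll(pending);
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }

    int sentCount() {
        return sent.size();
    }

    public static void main(String[] args) {
        ToyAsyncClient client = new ToyAsyncClient();
        for (int i = 0; i < 4; i++) {
            client.addRequest("memory:512,vCores:1");
        }
        System.out.println("before heartbeat: sent=" + client.sentCount());
        client.heartbeat();
        System.out.println("after heartbeat: sent=" + client.sentCount());
    }
}
```

In this model, anything that stops `heartbeat` from completing leaves every ask sitting in the queue, which matches the symptom of requests that never appear in the ResourceManager logs.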
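The first allocate is supposedly done right after the register call, so getProgress is called at that point and replaces the existing progress value. As written, your progress is updated to NaN, because at that moment allocatedYarnContainers is empty and completedYarnContainers is also 0, so the returned progress is the result of 0/0, which is undefined. When the next allocate then checks your progress value, the check fails because NaN returns false in every comparison. As a result, no later allocate call ever communicates with the ResourceManager: each one exits at that first step with an exception.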
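The arithmetic can be checked in isolation, without a cluster. The sketch below assumes completedYarnContainers is an `int`, and it uses a hypothetical `checkProgress` helper that imitates a `progress >= 0` guard with the message seen in the log. The real check lives inside the Hadoop client and may differ between versions.

```java
final class NaNProgressDemo {
    // Same expression shape as getProgress() in the question, with plain ints.
    static float progress(int completed, int allocated) {
        return (float) completed / allocated;
    }

    // Hypothetical stand-in for a "progress >= 0" precondition.
    static void checkProgress(float progress) {
        if (!(progress >= 0)) {
            throw new IllegalArgumentException("Progress indicator should not be negative");
        }
    }

    public static void main(String[] args) {
        float p = progress(0, 0);
        // Float division 0/0 yields NaN instead of throwing ArithmeticException.
        System.out.println("isNaN: " + Float.isNaN(p));
        // Every ordered comparison with NaN is false, so NaN is neither >= 0 nor < 0.
        System.out.println("p >= 0: " + (p >= 0));
        System.out.println("p < 0: " + (p < 0));
        try {
            checkProgress(p);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is why the message talks about a negative value even though the progress is never below zero: a guard written as `progress >= 0` cannot tell NaN apart from a negative number.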
Try changing your progress function to the following:

@Override
public float getProgress() {
    return (float) allocatedYarnContainers.size() / 4.0f;
}
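If the number of workers is not fixed at 4, a more defensive variant is possible. The following is a minimal sketch and not part of the original answer: the helper `safeProgress` and its parameter names are assumptions. It reports completed containers over requested containers, returns 0 when nothing has been requested yet, and clamps the result to the range [0, 1].

```java
final class SafeProgress {
    // Hypothetical helper: never returns NaN, a negative value, or a value above 1.
    static float safeProgress(int completed, int requested) {
        if (requested <= 0) {
            return 0.0f; // nothing requested yet, so report no progress
        }
        float p = (float) completed / requested;
        return Math.max(0.0f, Math.min(1.0f, p));
    }

    public static void main(String[] args) {
        System.out.println(safeProgress(0, 0));
        System.out.println(safeProgress(2, 4));
        System.out.println(safeProgress(5, 4));
    }
}
```

A getProgress() override could delegate to such a helper, passing the completed count and the number of container requests made so far. Whether progress should track allocated or completed containers is a design choice left to the application.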

(Note: copied from here to Stack Overflow for posterity)
