Given a container exit status code, where can I find a more explicit error?

Posted by 2jcobegt on 2021-06-26 in Mesos

I am running Docker containers on a Mesos stack.
Some tasks fail from time to time.
Here are some of the TaskStatus messages and reasons involved:

message: Container exited with status 1 - reason: REASON_COMMAND_EXECUTOR_FAILED
message: Container exited with status 42 - reason: REASON_COMMAND_EXECUTOR_FAILED
message: Container exited with status 137 - reason: REASON_COMMAND_EXECUTOR_FAILED

Is there a table that maps container exit status codes to more explicit TaskStatus error messages?

ujv3wf0j1#

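A command task can fail for many reasons, and it sets its exit code accordingly. For example, Docker 1.10 sets the following exit status codes (from the documentation and this answer):
The exit code from docker run gives information about why the container failed to run or why it exited. When docker run exits with a non-zero code, the exit codes follow the chroot standard, as shown below:
125 if the error is with the Docker daemon itself: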

$ docker run --foo busybox; echo $?

# flag provided but not defined: --foo   See 'docker run --help'.

126 if the contained command cannot be invoked:

$ docker run busybox /etc; echo $?

# docker: Error response from daemon: Container command '/etc' could not be invoked.

127 if the contained command cannot be found:

$ docker run busybox foo; echo $?

# docker: Error response from daemon: Container command 'foo' not found or does not exist.
# 127

Otherwise, the exit code of the contained command:

$ docker run busybox /bin/sh -c 'exit 3'; echo $?

# 3
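The rules quoted above can be sketched as a small lookup. This is only an illustration of the 125/126/127 convention described in the quoted documentation; the function name `classify_docker_run_exit` and the wording of the returned strings are hypothetical, not part of Docker or Mesos.

```python
def classify_docker_run_exit(code: int) -> str:
    """Map a `docker run` exit code to the categories quoted above.

    Hypothetical helper for illustration; the category strings are our own.
    """
    if code == 0:
        return "success"
    if code == 125:
        return "error with the Docker daemon itself"
    if code == 126:
        return "contained command could not be invoked"
    if code == 127:
        return "contained command not found"
    # Any other non-zero value is passed through from the contained command.
    return f"exit code {code} of the contained command"


for code in (125, 126, 127, 3):
    print(code, "->", classify_docker_run_exit(code))
```

Note that a contained command may itself exit with 125, 126, or 127, so this classification is a hint and not a proof of where the failure happened.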

Another set of exit code conventions can be found here:

| Code  |            Meaning             |         Example         |                                                   Comments                                                   |
|-------|--------------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------|
| 1     | Catchall for general errors    | let "var1 = 1/0"        | Miscellaneous errors, such as "divide by zero" and other impermissible operations                            |
| 2     | Misuse of shell builtins       | empty_function() {}     | Missing keyword or command, or permission problem (and diff return code on a failed binary file comparison). |
| 126   | Command invoked cannot execute | /dev/null               | Permission problem or command is not an executable                                                           |
| 127   | "command not found"            | illegal_command         | Possible problem with $PATH or a typo                                                                        |
| 128   | Invalid argument to exit       | exit 3.14159            | exit takes only integer args in the range 0 - 255 (see first footnote)                                       |
| 128+n | Fatal error signal "n"         | kill -9 $PPID of script | $? returns 137 (128 + 9)                                                                                     |
| 130   | Script terminated by Control-C | Ctl-C                   | Control-C is fatal error signal 2, (130 = 128 + 2, see above)                                                |
| 255*  | Exit status out of range       | exit -1                 | exit takes only integer args in the range 0 - 255                                                            |
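As a rough sketch, the table can be turned into a decoder that also resolves the `128+n` row to a signal name. The helper name `describe_exit_status` and the `RESERVED` dictionary are hypothetical; signal names come from Python's standard `signal` module, and the sketch assumes a POSIX system where signal 9 is SIGKILL.

```python
import signal

# Reserved meanings copied from the table above.
RESERVED = {
    1: "catchall for general errors",
    2: "misuse of shell builtins",
    126: "command invoked cannot execute",
    127: "command not found",
    128: "invalid argument to exit",
    255: "exit status out of range",
}


def describe_exit_status(status: int) -> str:
    """Describe an exit status using the shell conventions in the table."""
    if not 0 <= status <= 255:
        raise ValueError(f"exit status out of range: {status}")
    if status == 0:
        return "success"
    if status in RESERVED:
        return RESERVED[status]
    if status > 128:
        # 128 + n means the process was terminated by signal n.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"fatal signal {name} (128 + {signum})"
    return "application-defined exit status"


for status in (1, 42, 137):
    print(status, "->", describe_exit_status(status))
```

The three values in the loop are the statuses from the question; anything that is not reserved and not above 128 is reported as application-defined, because only the application can say what it means.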

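For your examples:
137: the process was killed by SIGKILL, since 137 = 128 + 9 and 9 is the signal number of SIGKILL. The most likely cause is that the container ran out of memory and was killed.
1: the command exited with status 1. This is probably caused by invalid configuration, an internal application error, or invalid input.
42: the answer to the Ultimate Question of Life, the Universe, and Everything. It has no reserved meaning, so what it signifies is up to the application.
If you need more information to interpret a status code, check the message field of the Mesos TaskStatus update; for example, Mesos puts OOM information there. The same information can also be found in the Mesos logs. To debug why a command returned a non-zero code, inspect the files stored in the executor sandbox, especially stderr/stdout or any command-specific logs.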
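To act on the message field programmatically, one option is to extract the numeric status from it. The sketch below assumes the message text looks exactly like the samples in the question ("Container exited with status N"); that wording is not a documented, stable format, and the names `STATUS_RE` and `exit_status_from_message` are hypothetical.

```python
import re
from typing import Optional

# Pattern based only on the sample messages shown in the question.
STATUS_RE = re.compile(r"Container exited with status (\d+)")


def exit_status_from_message(message: str) -> Optional[int]:
    """Return the exit status embedded in a TaskStatus message, if any."""
    match = STATUS_RE.search(message)
    return int(match.group(1)) if match else None


messages = [
    "Container exited with status 1",
    "Container exited with status 42",
    "Container exited with status 137",
]
for message in messages:
    print(message, "->", exit_status_from_message(message))
```

Messages that do not match the pattern yield `None`, so callers can fall back to the raw text and the sandbox logs.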

fruv7luv2#

I guess you want to look at the Reason enum in mesos.proto (copied below):

enum Reason {
    // TODO(jieyu): The default value when a caller doesn't check for
    // presence is 0 and so ideally the 0 reason is not a valid one.
    // Since this is not used anywhere, consider removing this reason.
    REASON_COMMAND_EXECUTOR_FAILED = 0;

    REASON_CONTAINER_LAUNCH_FAILED = 21;
    REASON_CONTAINER_LIMITATION = 19;
    REASON_CONTAINER_LIMITATION_DISK = 20;
    REASON_CONTAINER_LIMITATION_MEMORY = 8;
    REASON_CONTAINER_PREEMPTED = 17;
    REASON_CONTAINER_UPDATE_FAILED = 22;
    REASON_EXECUTOR_REGISTRATION_TIMEOUT = 23;
    REASON_EXECUTOR_REREGISTRATION_TIMEOUT = 24;
    REASON_EXECUTOR_TERMINATED = 1;
    REASON_EXECUTOR_UNREGISTERED = 2;
    REASON_FRAMEWORK_REMOVED = 3;
    REASON_GC_ERROR = 4;
    REASON_INVALID_FRAMEWORKID = 5;
    REASON_INVALID_OFFERS = 6;
    REASON_IO_SWITCHBOARD_EXITED = 27;
    REASON_MASTER_DISCONNECTED = 7;
    REASON_RECONCILIATION = 9;
    REASON_RESOURCES_UNKNOWN = 18;
    REASON_SLAVE_DISCONNECTED = 10;
    REASON_SLAVE_REMOVED = 11;
    REASON_SLAVE_RESTARTED = 12;
    REASON_SLAVE_UNKNOWN = 13;
    REASON_TASK_CHECK_STATUS_UPDATED = 28;
    REASON_TASK_GROUP_INVALID = 25;
    REASON_TASK_GROUP_UNAUTHORIZED = 26;
    REASON_TASK_INVALID = 14;
    REASON_TASK_UNAUTHORIZED = 15;
    REASON_TASK_UNKNOWN = 16;
}
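If a status update only carries the numeric reason, a number-to-name lookup can be built from the enum text itself. This is a minimal sketch that parses a short excerpt of the enum copied above; `REASON_ENUM_EXCERPT`, `ENTRY_RE`, and `parse_reasons` are hypothetical names, and the excerpt should be replaced with the full enum from the mesos.proto of the Mesos version in use.

```python
import re

# Short excerpt of the enum copied above; not the complete list.
REASON_ENUM_EXCERPT = """
    REASON_COMMAND_EXECUTOR_FAILED = 0;
    REASON_EXECUTOR_TERMINATED = 1;
    REASON_CONTAINER_LIMITATION_MEMORY = 8;
    REASON_CONTAINER_LIMITATION_DISK = 20;
"""

# Matches lines of the form `REASON_NAME = <number>;`.
ENTRY_RE = re.compile(r"^\s*(REASON_[A-Z_]+)\s*=\s*(\d+);", re.MULTILINE)


def parse_reasons(proto_text: str) -> dict:
    """Build a {number: name} mapping from protobuf enum entries."""
    return {int(number): name for name, number in ENTRY_RE.findall(proto_text)}


REASONS = parse_reasons(REASON_ENUM_EXCERPT)
for number in sorted(REASONS):
    print(number, REASONS[number])
```

Parsing the text avoids retyping the values by hand, at the cost of depending on the layout of the .proto file.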
