Azure Databricks: Unexpected failure while waiting for the cluster to be ready. Cause: Cluster is unusable since the driver is unhealthy

laik7k3q  posted on 2022-11-25  in Other

I have some scheduled data pipelines that are orchestrated via Azure Data Factory, each with a Databricks activity that runs on a job cluster.
All my Databricks activities are stuck in retry loops and failing with the following error:

Databricks execution failed with error state: InternalError, error message: Unexpected failure while waiting for the cluster <cluster-id> to be ready.Cause Cluster <cluster-id> is unusable since the driver is unhealthy.

My Databricks cluster is not even starting up.
This issue is quite similar to the one posted here:
AWS Databricks cluster start failure
However, there are a few differences:

  1. My pipelines are running on Azure: Azure Data Factory and Azure Databricks
  2. I can spin up my interactive clusters (in the same workspace) without any problem
  3. I have checked with my colleagues who are running similar pipelines on different subscriptions (in the same region), and they are not facing any issues

Any idea what is going on here? Is it just a service interruption of some sort, or is there something I can do to resolve this?

t9aqgxwy1#

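It turns out my pipelines were failing because the init script configured for the cluster was not executing properly.
We maintain an in-house Python package in Azure Artifacts, and installing it requires a DevOps token. The init script contains the command that installs this package on the cluster, and that command was failing because the token had expired.
As a result, the cluster could not start up properly. The error message is quite cryptic, though: "Cause Cluster is unusable since the driver is unhealthy" could mean literally anything.
So if you run into this yourself, check your init scripts.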
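
One way to catch an expired token before it silently breaks a cluster start is a small pre-flight check that runs outside the cluster, for example in a scheduled job. The following is only a minimal sketch, not what we actually ran: the function names, the `index_url` parameter, and the placeholder user name are hypothetical, and it assumes the package feed accepts the token via HTTP Basic authentication and answers 401 or 403 when the token is no longer valid.

```python
import base64
import urllib.error
import urllib.request


def build_auth_header(token: str, user: str = "build") -> str:
    """Build an HTTP Basic auth header value with the token as the password.

    The user name is a placeholder (assumption); only the token matters here.
    """
    raw = f"{user}:{token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def check_feed_token(index_url: str, token: str, timeout: float = 10.0) -> None:
    """Raise a readable error if the feed rejects the token.

    `index_url` is the package index URL of your own feed (hypothetical
    parameter, not taken from the original post).
    """
    request = urllib.request.Request(
        index_url, headers={"Authorization": build_auth_header(token)}
    )
    try:
        # A successful response means the feed accepted the credentials.
        with urllib.request.urlopen(request, timeout=timeout):
            return
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            raise RuntimeError(
                "Package feed rejected the token; it may have expired. "
                "Renew it before the next cluster start."
            ) from err
        # Any other HTTP error is unrelated to the token, so re-raise it.
        raise
```

Calling `check_feed_token(...)` with your own feed URL and token then fails with a clear message instead of the cryptic "driver is unhealthy" error.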
Note: another hint was in the cluster event log, where the time between the INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED events was very long, much longer than it should have been.
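
To make that gap easy to spot, the duration can be computed from the event list. This is a rough sketch under stated assumptions: it expects each event as a dictionary with a `type` string and a `timestamp` in epoch milliseconds (the shape I would expect from an export of the cluster event log), the function name is hypothetical, and the sample events are invented purely for illustration.

```python
from typing import Iterable, List, Mapping


def init_script_durations(events: Iterable[Mapping]) -> List[float]:
    """Return the seconds between each INIT_SCRIPTS_STARTED event and the
    next INIT_SCRIPTS_FINISHED event, in chronological order."""
    durations: List[float] = []
    started_at = None
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["type"] == "INIT_SCRIPTS_STARTED":
            started_at = event["timestamp"]
        elif event["type"] == "INIT_SCRIPTS_FINISHED" and started_at is not None:
            # Timestamps are assumed to be in milliseconds.
            durations.append((event["timestamp"] - started_at) / 1000.0)
            started_at = None
    return durations


# Invented sample data, only to show the expected input shape.
sample_events = [
    {"type": "CREATING", "timestamp": 1_669_370_000_000},
    {"type": "INIT_SCRIPTS_STARTED", "timestamp": 1_669_370_060_000},
    {"type": "INIT_SCRIPTS_FINISHED", "timestamp": 1_669_370_960_000},
]

print(init_script_durations(sample_events))  # [900.0]
```

An init script that normally finishes in seconds but shows a duration of many minutes is a good reason to look at the script's own logs.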
