Apache Spark ETL connector not loading in AWS

8zzbczxx  posted on 2023-04-21  in  Apache

I am using the AWS Glue BigQuery connector in my Glue job. It was working fine a few days ago, but now it suddenly fails with the error below:

LAUNCH ERROR | Glue ETL Marketplace - failed to download connector.Please refer logs for details.

Below is the full error I get in CloudWatch:

2021-11-08T11:33:02.045+05:00   Traceback (most recent call last): File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main

2021-11-08T11:33:02.070+05:00   "__main__", mod_spec) File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code exec(code, run_globals) File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 361, in <module>

2021-11-08T11:33:02.070+05:00   main() File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 351, in main

2021-11-08T11:33:02.070+05:00   res += download_jars_per_connection(conn, region, endpoint, proxy) File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 304, in download_jars_per_connection

2021-11-08T11:33:02.070+05:00   download_and_unpack_docker_layer(ecr_url, layer["digest"], dir_prefix, http_header) File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 168, in download_and_unpack_docker_layer

2021-11-08T11:33:02.070+05:00   layer = send_get_request(layer_url, header) File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 80, in send_get_request

2021-11-08T11:33:02.070+05:00   

2021-11-08T11:33:02.070+05:00   response.raise_for_status() File "/home/spark/.local/lib/python3.7/site-packages/requests/models.py", line 765, in raise_for_status

2021-11-08T11:33:02.071+05:00   raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 400 Client Error: Bad Request

2021-11-08T11:33:02.119+05:00   Glue ETL Marketplace - failed to download connector, activation script exited with code

xnifntxz1#

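When a Glue job uses a connector, it has to download the connector as a container image. The connector's container is available in an Amazon public ECR repository. To pull the container from that AWS repository, you must add the "AmazonEC2ContainerRegistryFullAccess" policy to your IAM role. You can also restrict the access to read-only.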
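If you would rather not attach a managed policy at all, a narrower inline policy can be sketched like this. This is only a sketch: the function name `build_ecr_pull_policy` and its `repository_arn` parameter are made up for illustration, and I am assuming that the four standard ECR pull actions are enough for the Glue activation script, which the thread does not confirm. Note that `ecr:GetAuthorizationToken` does not support resource-level restrictions, so it gets its own statement with `"Resource": "*"`.

```python
import json

# Actions needed to fetch image layers and manifests from a repository.
ECR_LAYER_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
]


def build_ecr_pull_policy(repository_arn: str = "*") -> dict:
    """Return an IAM policy document allowing ECR login and image pull.

    repository_arn is a hypothetical parameter: pass "*" or the ARN of
    the repository you want to limit the pull permissions to.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # GetAuthorizationToken cannot be scoped to a repository.
                "Sid": "EcrLogin",
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                "Sid": "EcrPull",
                "Effect": "Allow",
                "Action": ECR_LAYER_ACTIONS,
                "Resource": repository_arn,
            },
        ],
    }


# Print the document so it can be pasted into the IAM console.
print(json.dumps(build_ecr_pull_policy(), indent=4))
```

If this turns out to be too narrow for your job, the managed AmazonEC2ContainerRegistryReadOnly policy covers these actions plus a few read-only extras.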

mitkmikd2#

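I ran into this problem in an organization setup while trying to use the BigQuery Marketplace connector. I was explicitly denied the use of GetAuthorizationToken in non-EU regions. As a result, the Glue job failed in a way similar to what the OP describes, because it tries to download the Docker image at runtime from here: https://709825985650.dkr.ecr.us-east-1.amazonaws.com/amazon-web-services/glue/bigquery:0.22.0-glue3.0-2
A possible workaround is to push a copy of the image to your private ECR. Then, when creating the Glue connection, set CONNECTOR_URL in connection_properties to your private ECR URL. This resolves problems of this kind. It also seems more reasonable than adding a broad policy such as AmazonEC2ContainerRegistryFullAccess (as Sparkian suggested), because you can grant granular access on that specific ECR repository.
Edit: if AmazonEC2ContainerRegistryFullAccess solves your problem, you can also use AmazonEC2ContainerRegistryReadOnly. It adds a more restricted set of permissions and achieves the same result. This is also described in the AWS Marketplace connector troubleshooting guide.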
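To make the CONNECTOR_URL part concrete, here is a rough sketch of how the private URL could be derived from the Marketplace one. The helper names, the account ID `123456789012`, and the region `eu-west-1` are placeholders I made up. It also assumes you pushed the copy to a private repository with the same repository path and tag as the original, and that the other connection properties stay unchanged.

```python
from urllib.parse import urlparse


def to_private_ecr_url(source_url: str, account_id: str, region: str) -> str:
    """Swap the registry host of an ECR image URL for a private registry.

    Keeps the repository path and tag unchanged (assumption: the private
    copy was pushed under the same repository name and tag).
    """
    if "://" not in source_url:
        source_url = "https://" + source_url
    repo_and_tag = urlparse(source_url).path.lstrip("/")
    return f"https://{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_and_tag}"


def connection_properties(private_url: str) -> dict:
    """Build only the property discussed above; other keys are omitted."""
    return {"CONNECTOR_URL": private_url}


# URL quoted in the answer above.
marketplace_url = (
    "https://709825985650.dkr.ecr.us-east-1.amazonaws.com/"
    "amazon-web-services/glue/bigquery:0.22.0-glue3.0-2"
)

# Hypothetical account ID and region for the private registry.
private_url = to_private_ecr_url(marketplace_url, "123456789012", "eu-west-1")
print(connection_properties(private_url))
```

The image itself still has to be copied first (pull, re-tag, push); the snippet only covers the URL you then put into the connection.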

au9on6nz3#

This is most likely a permissions issue. I ran into it, temporarily granted permissive permissions, and that seemed to resolve it.

daolsyd04#

I hit the same problem on Glue 3.0.
The error log was:

Glue ETL Marketplace - Requesting ECR authorization token for registryIds=709825985650 and region_name=us-east-1.
...
...
socket.timeout: timed out
...
...
(<botocore.awsrequest.AWSHTTPSConnection object at 0x7f1136778a90>, 'Connection to api.ecr.us-east-1.amazonaws.com timed out. (connect timeout=60)')

...

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://api.ecr.us-east-1.amazonaws.com/"
Glue ETL Marketplace - failed to download connector, activation script exited with code 1
LAUNCH ERROR | Glue ETL Marketplace - failed to download connector.Please refer logs for details.

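I was able to resolve this by adding AmazonEC2ContainerRegistryFullAccess to the service role, along with a policy that allows fetching a random password. My guess is that it needs to generate a temporary random password when it tries to pull the ECR image.
This is the IAM policy I added to the IAM role: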

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "secretsmanager:GetRandomPassword",
            "Resource": "*"
        }
    ]
}

After that it worked.

ulydmbyx5#

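I ran into this issue when using the GCP BigQuery connector. Some jobs can run the connector and some cannot, even though all of them have the same permissions and settings. The problem seems to occur when the request times out after the ECR authorization token is requested.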
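Since the symptom here is a connect timeout rather than an HTTP error, it may help to rule out plain network reachability before touching permissions. Below is a small sketch of a TCP check. The helper name `can_connect` is made up, and the idea that the failing jobs run somewhere without a route to the ECR endpoint is only my assumption; nothing in this thread confirms it.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout.

    DNS failures, refused connections and timeouts are all reported as
    False, since each of them raises a subclass of OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (run it from the same network as the failing job):
#   can_connect("api.ecr.us-east-1.amazonaws.com")
```

If this returns False from the environment of a failing job and True from a working one, the difference is more likely network configuration than IAM.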
