Hadoop times out when attempting to write to Cassandra in an AWS multi-region setup

Posted by but5z9lq on 2021-06-02 in Hadoop

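I am running a multi-DC Cassandra cluster (open source, not DSE) in AWS, with one DC (us-west-2) used for analytics and the other (us-east) for transactional storage. I am using NetworkTopologyStrategy with the EC2 snitch, and a consistency level of LOCAL_ONE in the Hadoop configuration. Hadoop can read from Cassandra without any problems, but attempts to write produce timeout exceptions.
Running nodetool status shows that the DCs are configured correctly: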

Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns   Host ID                               Token                                    Rack
UN  x.x.x.x       1.01 GB     9.9%   9e7f4393-7ac9-4559-b3ff-de48be50016f  -9127921345534057723                     2a
UN  x.x.x.x       1001.16 MB  11.4%  d0760383-c3dd-474c-9261-239b71dba3f1  -9221279003374097975                     2b
UN  x.x.x.x       1.05 GB     11.7%  3f09fbf5-0d85-4283-9009-0ec0e29223c0  -9140104347498952504                     2c
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns   Host ID                               Token                                    Rack
UN  x.x.x.x       1.1 GB     11.3%  5bbd2de4-e1d2-4a17-9f40-034f60b35954  -9061054426204373981                     1b
UN  x.x.x.x       1.15 GB    11.5%  e34c590e-6176-45b2-a8f9-18b4a9a80032  -9216519687724118609                     1c
UN  x.x.x.x       1.18 GB    10.9%  fa0b0a1a-f156-40fc-a267-970d1eb9cddb  -9207673937991303291                     1a
UN  x.x.x.x       1.46 GB    10.7%  b18ae406-c9ec-42b7-a365-b0c6e2fe582f  -9206671929961171506                     1a
UN  x.x.x.x       1.13 GB    11.4%  1ac9c1c5-55ad-4048-b1ba-3b9768933ecc  -9146100851344467112                     1c
UN  x.x.x.x       1.53 GB    11.2%  dad665bb-68d9-4811-b421-f33333261867  -9178920986366339267                     1b

Stack trace using ColumnFamilyOutputFormat:

java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:224)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:215)
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

... and using CqlOutputFormat:

java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

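Both traces ultimately point to AbstractColumnFamilyOutputFormat.createAuthenticatedClient(host, port, conf).
I then opened that source file and added some detail to the exception so that it would print the hostname it was connecting to, which produced the following trace: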

java.io.IOException: java.lang.Exception: Unable to connect to host [hostname]
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: java.lang.Exception: Unable to connect to host [hostname]
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:139)
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:124)
    ... 1 more
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

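The problem is that [hostname] is a machine that is not in the analytics cluster (it is in us-east). Why does the client not know this automatically, especially when reads work fine? It appears to be trying every node in the ring, regardless of DC.
For the record, writes fail using CqlOutputFormat, ColumnFamilyOutputFormat, and via Pig using CqlStorage as well as CassandraStorage.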
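Since every trace bottoms out in a plain TCP connect timeout inside TSocket.open, one way to see which ring members are unreachable from a Hadoop worker is to probe each node's Thrift port directly. The following is a minimal sketch, not part of the Cassandra tooling: the function names are made up for illustration, and the port assumes Cassandra's default rpc_port of 9160.

```python
import socket

# Assumption: Thrift RPC listens on Cassandra's default rpc_port (9160).
THRIFT_PORT = 9160


def can_connect(host, port=THRIFT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connect timeouts, refused connections, and DNS failures alike.
        return False


def probe(hosts, port=THRIFT_PORT, timeout=3.0):
    """Map each host to whether its Thrift port is reachable from this machine."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Run from a Hadoop worker, a call such as probe([...]) with the addresses that the ring reports should return False for exactly the hosts that appear in the "Unable to connect to host" message.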


Answer #1 (qgzx9mmu)

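The problem came down to two things:
1. For a multi-region EC2 setup, Cassandra requires broadcast_address to be set to the public IP and listen_address to the internal IP. In most cases you want rpc_address to be the internal IP as well, but this can break Cassandra's Hadoop client, which decides which endpoints to talk to based on broadcast_address.
2. Cassandra's Hadoop client (specifically RingCache) does not support datacenter-aware node discovery and instead tries to discover every node in the ring, including non-local ones. It does respect the consistency level for the actual writes, but in our case it never got that far because of #1.
I filed a ticket and submitted a patch to address these issues: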
https://issues.apache.org/jira/browse/cassandra-7252
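To illustrate the second point, here is a minimal sketch of what datacenter-aware discovery means: restrict the set of candidate endpoints to the local DC before opening any connections. This is not the patch attached to the ticket and does not use Cassandra's classes; Endpoint and local_endpoints are hypothetical names, and the addresses are placeholders (the DC and rack names follow the nodetool output in the question).

```python
from collections import namedtuple

# Hypothetical record describing one ring member; not a Cassandra class.
Endpoint = namedtuple("Endpoint", ["address", "datacenter", "rack"])


def local_endpoints(ring, local_dc):
    """Keep only the ring members that live in `local_dc`."""
    return [ep for ep in ring if ep.datacenter == local_dc]


# Placeholder addresses for illustration only.
ring = [
    Endpoint("10.0.1.1", "us-west-2", "2a"),
    Endpoint("10.0.1.2", "us-west-2", "2b"),
    Endpoint("10.0.2.1", "us-east", "1a"),
]

analytics_nodes = local_endpoints(ring, "us-west-2")
print([ep.address for ep in analytics_nodes])
```

With a filter like this in front of the connection step, a writer in us-west-2 would never attempt to open a socket to the us-east node, so an unreachable address in the remote region could not stall the job.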


Answer #2 (mefy6pfw)

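I would try setting write_request_timeout_in_ms in cassandra.yaml to a very high number and see whether that helps. There may be a problem with the node itself, where it is unresponsive but still shows as up. If it still times out, restart the service on the node you suspect is causing the problem.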
