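So, I'm running into a really strange error when restoring HBase from an S3 snapshot onto an Amazon EMR cluster.
The HBase restore sometimes works and sometimes doesn't, which is the puzzling part. It doesn't seem to depend on my instance type or number of nodes, and it happens so sporadically that I can't pin down what exactly is failing (other than being unable to obtain the master lock and timing out). Every attempt to Google the problem has come up empty...
My workflow is as follows: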
Master: 1x m1.xl
Core: 15x m1.xl spot instances
Bootstrap: setup-hbase (s3://elasticmapreduce/bootstrap-actions/setup-hbase)
Step 1: Start HBase (/home/hadoop/lib/hbase.jar emr.hbase.backup.Main --start-master)
Step 2: Restore HBase (/home/hadoop/lib/hbase.jar emr.hbase.backup.Main --restore --backup-dir s3://mybackupdir --backup-version mybackupversion)
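During the restore step, the restore either fails or succeeds (seemingly at random, but I suspect there may be a timeout/latency issue here).
The restores that time out seem to keep trying to acquire the master lock, and after about 10 minutes of failing, the step fails: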
2014-04-02 17:46:57,028 WARN emr.hbase.backup.HBaseConnector (main): Master is not running, proceeding
2014-04-02 17:46:57,029 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-02 17:47:07,039 INFO emr.hbase.backup.Main (main): Unable to obtain master lock, attempting to shutdown master. java.lang.RuntimeException: Timeout while performing operation, expireTime=1396460818029 msg=obtaining write lock waiting for notification
2014-04-02 17:47:07,039 INFO emr.hbase.backup.HBaseConnector (main): Listing nodes at beginning of shutdown
2014-04-02 17:47:07,039 INFO emr.hbase.backup.HBaseConnector (main): Get master
2014-04-02 17:47:07,043 INFO emr.hbase.backup.ZooKeeperConnection (main-EventThread): Event received WatchedEvent state:SyncConnected type:None path:null
2014-04-02 17:48:18,370 WARN emr.hbase.backup.HBaseConnector (main): Master is not running, proceeding
2014-04-02 17:48:18,370 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-02 17:48:18,370 INFO emr.hbase.backup.Main (main): Releasing the lock
2014-04-02 17:48:18,374 FATAL emr.hbase.backup.Main (main): Exception raised in main
java.lang.RuntimeException: Timeout while performing operation, expireTime=1396460846811 msg=Attempting to shutdown master
at emr.hbase.fs.Utils.throwIfExpired(Utils.java:67)
at emr.hbase.backup.PerformBackup.restore(PerformBackup.java:201)
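On the other hand, when it works, kicking off the restore takes only about 3 minutes, even though acquiring the master lock still times out a few times along the way: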
2014-04-01 19:29:43,720 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-01 19:29:53,730 INFO emr.hbase.backup.Main (main): Unable to obtain master lock, attempting to shutdown master. java.lang.RuntimeException: Timeout while performing operation, expireTime=1396380584720 msg=obtaining write lock waiting for notification
2014-04-01 19:29:53,730 INFO emr.hbase.backup.HBaseConnector (main): Listing nodes at beginning of shutdown
2014-04-01 19:29:53,731 INFO emr.hbase.backup.HBaseConnector (main): Get master
2014-04-01 19:29:53,734 INFO emr.hbase.backup.ZooKeeperConnection (main-EventThread): Event received WatchedEvent state:SyncConnected type:None path:null
2014-04-01 19:30:32,963 WARN emr.hbase.backup.HBaseConnector (main): Master is not running, proceeding
2014-04-01 19:30:32,963 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-01 19:30:33,028 INFO emr.hbase.backup.Main (main): Distributed copy from s3://myhbasebackup
2014-03-17 16:30:14,502 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0%
2014-03-17 16:30:22,645 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 0%
2014-03-17 16:30:33,753 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 1%
2014-03-17 16:30:36,778 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 2%
2014-03-17 16:30:39,809 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 5%
2014-03-17 16:30:40,817 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 8%
*...and it works...*
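To compare the expireTime values in the exceptions with the log timestamps, here is a minimal sketch. It assumes expireTime is epoch milliseconds and that the log timestamps are in UTC; neither is confirmed anywhere in the logs. The helper names are made up for illustration, and pairing the second expireTime with the 17:47:07,039 line (the start of the shutdown attempt) is also my guess.

```python
from datetime import datetime, timedelta, timezone

# Assumption: expireTime is milliseconds since the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_expire_time(expire_ms):
    """Convert an expireTime value to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=expire_ms)


def parse_log_time(stamp):
    """Parse a log timestamp such as '2014-04-02 17:46:57,029' (assumed UTC)."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_until_expiry(log_stamp, expire_ms):
    """Seconds between a log line and the expiry reported for that operation."""
    delta = decode_expire_time(expire_ms) - parse_log_time(log_stamp)
    return delta.total_seconds()


# (log line where the operation appears to start, expireTime from the exception)
PAIRS = [
    ("2014-04-02 17:46:57,029", 1396460818029),  # failing run, lock attempt
    ("2014-04-02 17:47:07,039", 1396460846811),  # failing run, master shutdown
    ("2014-04-01 19:29:43,720", 1396380584720),  # successful run, lock attempt
]

for stamp, expire_ms in PAIRS:
    print(stamp, "->", decode_expire_time(expire_ms).isoformat(),
          seconds_until_expiry(stamp, expire_ms))
```

If those assumptions hold, both lock-attempt expireTime values land exactly one second after their "Attempting to aquire the master lock" lines, in the failing run and the successful run alike.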
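I know there are timeout parameters I could change, such as the ZooKeeper timeout, but I'm not sure the timeout limit is the real problem, because I've seen this fail once and then work when I simply retried with exactly the same settings.
Any help is appreciated. Thanks!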
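Since a plain retry with identical settings has worked, a stopgap could be to wrap the restore in a retry loop. This is only a sketch and does not explain the failure. It assumes the restore can be launched from the master node with `hadoop jar` (I normally run it as an EMR step), and `run_with_retries` and its parameters are made-up names.

```python
import subprocess
import time

# Assumption: running the same jar and arguments as Step 2 via `hadoop jar`.
RESTORE_CMD = [
    "hadoop", "jar", "/home/hadoop/lib/hbase.jar", "emr.hbase.backup.Main",
    "--restore",
    "--backup-dir", "s3://mybackupdir",
    "--backup-version", "mybackupversion",
]


def run_with_retries(cmd, attempts=3, delay_seconds=60,
                     runner=subprocess.call, sleep=time.sleep):
    """Run cmd until it exits with 0; return the attempt number that succeeded.

    runner and sleep are injectable so the loop can be exercised without
    actually launching a restore or waiting.
    """
    for attempt in range(1, attempts + 1):
        exit_code = runner(cmd)
        if exit_code == 0:
            return attempt
        if attempt < attempts:
            # Give the master and ZooKeeper some time before trying again.
            sleep(delay_seconds)
    raise RuntimeError("command failed after %d attempts" % attempts)


# Usage on the master node (not executed here):
#     run_with_retries(RESTORE_CMD, attempts=3, delay_seconds=60)
```

Whether a second restore attempt is safe after a partially completed one is something I have not verified.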