Hadoop job fails on big data when using native SimString C code

epfja78i posted on 2021-06-03 in Hadoop

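I am having a problem running a big-data (~15 GB) job on a Hadoop cluster with the SimString native library. The same job runs fine on small and medium datasets (~200 MB). During the job, SimString first creates a file-based database of the strings to match against, then matches each given string against the strings in that database. When the job finishes, it deletes the file-based database. The job runs multithreaded (100 threads).
About 22 mappers are created for the job, and each mapper runs 100 threads. The machine has 4 GB of RAM in total.
The error log is as follows: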

14/02/12 00:15:53 INFO mapred.JobClient:  map 0% reduce 0%
14/02/12 00:16:13 INFO mapred.JobClient:  map 4% reduce 0%
14/02/12 00:16:24 INFO mapred.JobClient: Task Id : attempt_201402091522_0059_m_000001_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # A fatal error has been detected by the Java Runtime Environment:
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: #  SIGSEGV (0xb) at pc=0x00007f6f1cd8827b, pid=21146, tid=140115055609600
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # JRE version: 6.0_45-b06
attempt_201402091522_0059_m_000001_0: # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.45-b01 mixed mode linux-amd64 compressed oops)
attempt_201402091522_0059_m_000001_0: # Problematic frame:
attempt_201402091522_0059_m_000001_0: # C  [libSimString.so+0x6c27b][thread 140115045103360 also had an error]
attempt_201402091522_0059_m_000001_0:   cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # An error report file with more information is saved as:
attempt_201402091522_0059_m_000001_0: # /app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201402091522_0059/attempt_201402091522_0059_m_000001_0/work/hs_err_pid21146.log
attempt_201402091522_0059_m_000001_0: [thread 140115070318336 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140114919028480 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140115089229568 also had an error]
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # If you would like to submit a bug report, please visit:
attempt_201402091522_0059_m_000001_0: #   http://java.sun.com/webapps/bugreport/crash.jsp
attempt_201402091522_0059_m_000001_0: # The crash happened outside the Java Virtual Machine in native code.
attempt_201402091522_0059_m_000001_0: # See problematic frame for where to report the bug.

The problem appears to be caused by native code, namely this frame:

cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f

However, I do not understand why this causes no problems on small datasets. I run the job with the following hadoop command:

hadoop jar hadoopjobs/job.jar Job -D mapred.child.java.opts=-Xss500k -D mapred.reduce.child.java.opts=-Xmx200m -files file1,file2,/home/hduser/libs/libSim/x64/libSimString.so -libjars /home/hduser/libs/Simstring.jar /datasources/XXX/spool/input datasources/XXX/spool/output

References: SimString library: http://www.chokkan.org/software/simstring/
Source code of cdbpp::cdbpp_base::get(void const*, unsigned long, unsigned long*) const+0x16f: https://gitorious.org/copy-paste/copy-paste/commit/5d9c6b5b29fb2b1b8dd571260e7d50d9c42db9f9

iq0todco #1

As mentioned in the question, the problem was in the call from Java into the following method:

cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f

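I was using 100 threads per mapper, with 22 mappers in total, 2 of which ran in parallel. The problem arose because a static reader called the method above without any synchronization, so many threads entered the native code at the same time. Surrounding the method call with a synchronized block fixed the issue.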
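The fix described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `SharedReader` and its `retrieve` method are hypothetical stand-ins for the shared native reader and are not the SimString Java API, and the lambda syntax needs Java 8 or later (the original job ran on Java 6, where an anonymous `Callable` would be used instead).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SynchronizedLookupSketch {

    /** Hypothetical stand-in for the shared native reader; NOT the SimString API. */
    static class SharedReader {
        private int callsInFlight = 0;
        private boolean overlapSeen = false;

        // Pretend this forwards to native code that is not thread-safe.
        List<String> retrieve(String query) {
            callsInFlight++;
            if (callsInFlight > 1) {
                overlapSeen = true; // two calls were inside the reader at once
            }
            List<String> result = new ArrayList<String>();
            result.add(query.toLowerCase());
            callsInFlight--;
            return result;
        }

        boolean overlapSeen() {
            return overlapSeen;
        }
    }

    // One reader shared by every worker thread in the mapper JVM.
    static final SharedReader READER = new SharedReader();
    // Dedicated lock object guarding every call into the reader.
    static final Object READER_LOCK = new Object();

    // All lookups go through here, so only one thread is in the reader at a time.
    static List<String> lookup(String query) {
        synchronized (READER_LOCK) {
            return READER.retrieve(query);
        }
    }

    // Runs perThread lookups on each of the given number of threads
    // and returns how many results came back in total.
    static int runLookups(int threads, int perThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
        for (int t = 0; t < threads; t++) {
            futures.add(pool.submit(() -> {
                int n = 0;
                for (int i = 0; i < perThread; i++) {
                    n += lookup("Q" + i).size();
                }
                return n;
            }));
        }
        int total = 0;
        for (Future<Integer> f : futures) {
            total += f.get(); // also makes the workers' writes visible here
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("lookups completed: " + runLookups(8, 1000));
        System.out.println("overlapping calls seen: " + READER.overlapSeen());
    }
}
```

The lock serializes every lookup, so the 100 threads gain no parallelism inside the native call itself. That trades throughput for correctness, which is acceptable when the native reader cannot safely be shared.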

yuvru6vn #2

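The problem is probably not the murmur hash itself (murmurhash2 in the stack frame) but the native library and the way it allocates memory.
I have no experience with JNI calls, but they are known to be troublesome when it comes to memory usage (each such call allocates stack and heap space). We cannot be sure that the GC kicks in properly (read the horror stories about GZIPInputStream).
You say you have created 22*100 threads, each of which may allocate some stack for JNI calls, and the machine has only 4 GB of memory. That machine seems rather crowded. My guess is that the limit here is CPU and memory access rather than long external waits, so only a few of the threads are really active in parallel?
What happens when you radically reduce the number of threads? How is the SimString library being used? Does it have an internal threading model that should be respected (for example, only one thread allowed to query at a time)?
I am afraid JNI is single-threaded here.
Read more about how native calls allocate memory.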
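To put a rough number on that concern, here is a back-of-the-envelope sketch. It assumes the 500 KB thread stack size from the `-Xss500k` option in the question's command, and it counts only reserved Java thread stacks; memory allocated by the native library itself is not included, and resident usage is usually lower than the reserved amount. The class and method names are made up for this illustration.

```java
public class StackBudgetSketch {

    // Upper bound on memory reserved for thread stacks, in bytes.
    static long stackBytes(int mappers, int threadsPerMapper, long stackKiB) {
        return (long) mappers * threadsPerMapper * stackKiB * 1024L;
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Worst case: all 22 mappers alive at once, 100 threads each.
        long all = stackBytes(22, 100, 500);
        // If only 2 mappers run concurrently on the machine.
        long two = stackBytes(2, 100, 500);
        System.out.printf("22 mappers: %.1f MiB of thread stacks%n", toMiB(all));
        System.out.printf(" 2 mappers: %.1f MiB of thread stacks%n", toMiB(two));
    }
}
```

With all 22 mappers alive the stacks alone could reserve about 1 GiB of a 4 GB machine, while with 2 concurrent mappers the figure is under 100 MiB, so how many mappers actually run at once matters a great deal for this estimate.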
