java.lang.OutOfMemoryError: Java heap space when running seq2sparse in Mahout

bbuxkriu posted on 2021-06-03 in Hadoop

I am trying to cluster some hand-made data with k-means in Mahout. I created 6 files, each containing barely 1 to 2 words of text, and built a sequence file from them with ./mahout seqdirectory. When I try to convert the sequence file to vectors with the ./mahout seq2sparse command, I get a java.lang.OutOfMemoryError: Java heap space error. The sequence file is only .215KB in size.
Command: ./mahout seq2sparse -i mokha/output -o mokha/vector -ow
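
For context, here is a minimal sketch of the two-step workflow described above. The seqdirectory input directory (mokha/input) is an assumption; only mokha/output and mokha/vector appear in the question.

    # Hypothetical reproduction of the workflow (mokha/input is assumed):
    # 1) convert the raw text files into a Hadoop SequenceFile
    ./mahout seqdirectory -i mokha/input -o mokha/output
    # 2) convert the SequenceFile into sparse vectors (this step fails with the OOM below)
    ./mahout seq2sparse -i mokha/output -o mokha/vector -ow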
Error log:

    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/bitnami/mahout/mahout-distribution-0.5/mahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/bitnami/mahout/mahout-distribution-0.5/lib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    Apr 24, 2013 2:25:11 AM org.slf4j.impl.JCLLoggerAdapter warn
    WARNING: No seq2sparse.props found on classpath, will use command-line arguments only
    Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Maximum n-gram size is: 1
    Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Deleting mokha/vector
    Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Minimum LLR value: 1.0
    Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Number of reduce tasks: 1
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
    INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Running job: job_local_0001
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task 'attempt_local_0001_m_000000_0' to mokha/vector/tokenized-documents
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task sendDone
    INFO: Task 'attempt_local_0001_m_000000_0' done.
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: map 100% reduce 0%
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Job complete: job_local_0001
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: Counters: 5
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: FileSystemCounters
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: FILE_BYTES_READ=1471400
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: FILE_BYTES_WRITTEN=1496783
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: Map-Reduce Framework
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: Map input records=6
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: Spilled Records=0
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log
    INFO: Map output records=6
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
    INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Running job: job_local_0002
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: io.sort.mb = 100
    Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
    WARNING: job_local_0002
    java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
    Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: map 0% reduce 0%
    Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Job complete: job_local_0002
    Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.Counters log
    INFO: Counters: 0
    Apr 24, 2013 2:25:14 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
    INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
    Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Running job: job_local_0003
    Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: io.sort.mb = 100
    Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
    WARNING: job_local_0003
    java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: map 0% reduce 0%
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Job complete: job_local_0003
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.Counters log
    INFO: Counters: 0
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
    INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 0
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Running job: job_local_0004
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 0
    Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
    WARNING: job_local_0004
    java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
    Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: map 0% reduce 0%
    Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Job complete: job_local_0004
    Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.Counters log
    INFO: Counters: 0
    Apr 24, 2013 2:25:17 AM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Deleting mokha/vector/partial-vectors-0
    Apr 24, 2013 2:25:17 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
    INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
    Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/home/bitnami/mahout/mahout-distribution-0.5/bin/mokha/vector/tf-vectors
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:350)
        at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.processTfIdf(TFIDFConverter.java:151)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:262)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:52)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)

Answer 1 (inkz8wg9)

The bin/mahout script reads the environment variable 'MAHOUT_HEAPSIZE' (in megabytes) and, if it is set, uses it to set the 'JAVA_HEAP_MAX' variable. The Mahout version I am using (0.8) sets JAVA_HEAP_MAX to 3g. Executing

    export MAHOUT_HEAPSIZE=10000m

before the clustering run seemed to let me get further on a single machine. However, I think the best solution is to switch to running on a cluster.
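
To illustrate, here is a simplified sketch of the kind of logic the bin/mahout launcher uses to turn MAHOUT_HEAPSIZE into a heap flag; it is not the exact script, and the details may differ between Mahout versions.

    # Simplified sketch, not the exact bin/mahout contents:
    JAVA_HEAP_MAX=-Xmx3g                        # built-in default mentioned above
    if [ "$MAHOUT_HEAPSIZE" != "" ]; then
      JAVA_HEAP_MAX="-Xmx${MAHOUT_HEAPSIZE}m"   # value interpreted as megabytes
    fi
    # ...the script then launches Java roughly like:
    # exec "$JAVA" $JAVA_HEAP_MAX ... org.apache.mahout.driver.MahoutDriver "$@"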
For reference, there is another related post: Mahout runs out of heap space.
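
If the jobs are moved to a real Hadoop cluster (instead of the local job runner shown in the log above), each map/reduce task runs in its own JVM whose heap is governed by the Hadoop 1.x property mapred.child.java.opts rather than by the client JVM's heap. As a hedged example, assuming the mahout wrapper forwards Hadoop's generic -D options (this can vary by version):

    # Raise the per-task heap for a single cluster run (verify that -D generic
    # options reach the underlying Hadoop jobs in your setup):
    ./mahout seq2sparse -Dmapred.child.java.opts=-Xmx1024m -i mokha/output -o mokha/vector -ow
    # Alternatively, set the same property permanently in conf/mapred-site.xml.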


Answer 2 (h9a6wy2h)

I don't know whether you have tried this, but I'm posting it in case you missed it.

    Set the environment variable 'MAVEN_OPTS' to allow for more memory via 'export MAVEN_OPTS=-Xmx1024m'

See here (the FAQ section).
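
As a side note (my understanding, not stated in the FAQ quote above): MAVEN_OPTS only affects the JVM that Maven itself starts, for example when building Mahout from source or running its tests, while runs of bin/mahout are governed by MAHOUT_HEAPSIZE from the first answer. An illustrative combination:

    # Illustrative only: MAVEN_OPTS applies to mvn builds/tests,
    # MAHOUT_HEAPSIZE (in MB) applies to bin/mahout runs.
    export MAVEN_OPTS=-Xmx1024m
    export MAHOUT_HEAPSIZE=2000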
