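I am retraining the Stanford NER model on my own training data to extract organizations. However, whether I use a machine with 4 GB or 8 GB of RAM, I get the same Java heap space error.
Can anyone tell me what general machine configuration would let me retrain the model without these memory problems?
I used the following command: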

java -mx4g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop newdata_retrain.prop

I am working with training data (multiple files, each with about 15,000 lines in the following format), with one word and its category per line:
She O
is O
working O
at O
Microsoft ORGANIZATION
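Is there anything else we can do to make these models train reliably? I did try reducing the number of classes in the training data, but that hurts extraction accuracy. For example, some locations or other entities get classified as organization names. Can we reduce the number of classes without affecting accuracy?
One data set I am using is Alan Ritter's twitter_nlp data: https://github.com/aritter/twitter_nlp/tree/master/data/annotated/ner.txt
The properties file looks like this: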

#location of the training file
trainFile = ner.txt
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model-twitter.ser.gz

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true

The error I get, with its stack trace, is:

CRFClassifier invoked on Mon Dec 01 02:55:22 UTC 2014 with arguments:
   -prop twitter_retrain.prop
usePrevSequences=true
useClassFeature=true
useTypeSeqs2=true
useSequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk=true
useTypeySequences=true
useDisjunctive=true
noMidNGrams=true
serializeTo=ner-model-twitter.ser.gz
maxNGramLeng=6
useNGrams=true
usePrev=true
useNext=true
maxLeft=1
trainFile=ner.txt
map=word=0,answer=1
useWord=true
useTypeSeqs=true
[1000][2000]numFeatures = 215032
setting nodeFeatureIndicesMap, size=149877
setting edgeFeatureIndicesMap, size=65155
Time to convert docs to feature indices: 4.4 seconds
numClasses: 21 [0=O,1=B-facility,2=I-facility,3=B-other,4=I-other,5=B-company,6=B-person,7=B-tvshow,8=B-product,9=B-sportsteam,10=I-person,11=B-geo-loc,12=B-movie,13=I-movie,14=I-tvshow,15=I-company,16=B-musicartist,17=I-musicartist,18=I-geo-loc,19=I-product,20=I-sportsteam]
numDocuments: 2394
numDatums: 46469
numFeatures: 215032
Time to convert docs to data/labels: 2.5 seconds
Writing feature index to temporary file.
numWeights: 31880772
QNMinimizer called on double function of 31880772 variables, using M = 25.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:923)
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:885)
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:879)
        at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:91)
        at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1911)
        at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1718)
        at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:759)
        at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:747)
        at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2937)

6 Answers

piah890a1#

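One way to reduce the number of classes is to drop the B-/I- notation, for example by merging B-facility and I-facility into a single facility class. The other option, of course, is to use a machine with more memory.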
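The label merging described above can be sketched as a small preprocessing script. This is only an illustration, not part of Stanford NER: the function names `collapse_bio_label` and `collapse_bio_lines` are hypothetical, and the sketch assumes a two-column file (word, then label) with blank lines between sentences.

```python
def collapse_bio_label(label):
    """Map 'B-facility' or 'I-facility' to 'facility'; leave 'O' and other labels unchanged."""
    if label.startswith(("B-", "I-")):
        return label[2:]
    return label


def collapse_bio_lines(lines):
    """Rewrite 'word label' lines with merged labels, keeping blank sentence separators."""
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            out.append("")  # keep the blank line that ends a sentence
            continue
        parts = stripped.rsplit(None, 1)  # split off the last column (the label)
        if len(parts) != 2:
            out.append(stripped)  # unexpected single-column line: leave it untouched
            continue
        word, label = parts
        out.append(f"{word}\t{collapse_bio_label(label)}")
    return out


if __name__ == "__main__":
    # Made-up sample rows, using label names from the log in the question.
    sample = [
        "Felix\tB-person",
        "visited\tO",
        "Stanford\tB-facility",
        "Hospital\tI-facility",
        "",
    ]
    for row in collapse_bio_lines(sample):
        print(row)
```

One trade-off to keep in mind: without the B-/I- prefixes, two adjacent entities of the same type can no longer be told apart and will be read as one entity.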

2023-03-21

hc2pp10m2#

Shouldn't that be **-Xmx4g** rather than -mx4g?
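Independently of how the flag is spelled, the training log in the question gives enough numbers for a rough heap estimate. The sketch below is not taken from the Stanford NER source. It rests on two assumptions: (a) the weight count is node features × classes + edge features × classes², which happens to reproduce the numWeights value printed in the log, and (b) the optimizer keeps M pairs of history vectors of 8-byte doubles, as textbook L-BFGS does. The function names are hypothetical.

```python
def crf_num_weights(node_features, edge_features, num_classes):
    """Assumed weight count: one weight per (node feature, class)
    plus one per (edge feature, class pair)."""
    return node_features * num_classes + edge_features * num_classes ** 2


def lbfgs_history_bytes(num_weights, history_size, bytes_per_double=8):
    """Assumed lower bound: L-BFGS keeps `history_size` pairs of vectors,
    each holding `num_weights` doubles."""
    return 2 * history_size * num_weights * bytes_per_double


if __name__ == "__main__":
    # Values copied from the log: nodeFeatureIndicesMap size,
    # edgeFeatureIndicesMap size, numClasses, and M.
    n = crf_num_weights(149877, 65155, 21)
    print(n)  # 31880772, the same as numWeights in the log
    gib = lbfgs_history_bytes(n, 25) / 2 ** 30
    print(round(gib, 1))  # 11.9
```

Under these assumptions the optimizer history alone needs about 11.9 GiB, which would explain why neither the 4 GB nor the 8 GB machine is enough. With 11 classes and the same feature counts (again an assumption), the same formulas give 9,532,402 weights and roughly 3.6 GiB.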

kninwzqo3#

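Sorry, I'm a bit late to this! I suspect the problem is the input format of the file; in particular, my first guess is that the file is being treated as one single long sentence.
The expected format of the training file is CoNLL format: each line of the file is a new token, and the end of a sentence is marked by a double newline (a blank line). So, for example, a file might look like this: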

Cats  O
have  O
tails  O
.  O

Felix  ANIMAL
is  O
a  O
cat  O
.  O

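Can you let me know whether the file is actually in this format? If it is, could you include the stack trace of the error and the properties file you are using? Does it work if you run on only the first few sentences of the file?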
--Gabor
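A quick way to check the format question raised above is to count the blank-line-separated sentences in the training file and look at their lengths. This is a minimal sketch with a hypothetical helper, `sentence_lengths`; it assumes one token per line and that any whitespace-only line ends a sentence.

```python
def sentence_lengths(lines):
    """Return the number of tokens in each blank-line-separated sentence."""
    lengths = []
    current = 0
    for line in lines:
        if line.strip():
            current += 1  # a non-blank line is one token
        elif current:
            lengths.append(current)  # a blank line closes the current sentence
            current = 0
    if current:
        lengths.append(current)  # file ended without a trailing blank line
    return lengths


if __name__ == "__main__":
    # The two example sentences from the answer above.
    sample = [
        "Cats  O", "have  O", "tails  O", ".  O",
        "",
        "Felix  ANIMAL", "is  O", "a  O", "cat  O", ".  O",
    ]
    lengths = sentence_lengths(sample)
    print(len(lengths), max(lengths))  # 2 5
```

To run it on a real file, pass an open file object, for example `sentence_lengths(open("ner.txt", encoding="utf-8"))`. A single very large length would point to missing sentence breaks. For what it is worth, the log in the question reports numDocuments: 2394 and numDatums: 46469, which suggests that file was already being split into many short sequences.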

a64a0gku4#

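If you are analyzing non-transactional data sets, you may want to use another tool, such as Elasticsearch (simpler) or Hadoop (considerably more complex); MongoDB is also a good middle ground.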

8zzbczxx5#

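First, uninstall your existing Java JDK and reinstall it.
After that, you can use as large a heap as your hard disk size allows.
In "-mx4g", the 4g is not RAM but the heap size.
I ran into the same error at first too. After doing this, it went away.
I also initially mistook the 4g for RAM.
Now I can start my server even with a 100g heap.
Next, instead of a custom NER model, I suggest using a custom RegexNER model, where you can add millions of words for the same entity name in a single document.
These are the two mistakes I made at the beginning.
If you have any questions, please comment below.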

dgenwo3n6#

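You can try reducing the number of threads, since each thread uses a lot of memory. Put multiThreadGrad=4 in your properties file and experiment with the number: the fewer threads you use, the slower training will be, but you will most likely be able to train with the labels you want.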

java Memory requirements for retraining Stanford NER
