TensorFlow Graph Execution Error when using Google Colab
I'm running into a problem when trying to train a model: I get a TensorFlow Graph execution error.
# The model class imported here has to match the one used below
from transformers import BertTokenizer, TFBertForSequenceClassification
# One output class per distinct label in the dataset
num_classes = len(data.label.unique())
tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name)
model = TFBertForSequenceClassification.from_pretrained(args.model_name, num_labels=num_classes)
The model throws the error below during training. I used the same code and dataset about two weeks ago and it worked. What could be the problem?
InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_200' defined at (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.9/dist-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/usr/local/lib/python3.9/dist-packages/traitlets/config/application.py", line 992, in launch_instance
app.start()
File "/usr/local/lib/python3.9/dist-packages/ipykernel/kernelapp.py", line 619, in start
self.io_loop.start()
File "/usr/local/lib/python3.9/dist-packages/tornado/platform/asyncio.py", line 215, in start
self.asyncio_loop.run_forever()
File "/usr/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/usr/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/usr/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.9/dist-packages/tornado/ioloop.py", line 687, in <lambda>
lambda f: self._run_callback(functools.partial(callback, future))
File "/usr/local/lib/python3.9/dist-packages/tornado/ioloop.py", line 740, in _run_callback
ret = callback()
File "/usr/local/lib/python3.9/dist-packages/tornado/gen.py", line 821, in inner
self.ctx_run(self.run)
File "/usr/local/lib/python3.9/dist-packages/tornado/gen.py", line 782, in run
yielded = self.gen.send(value)
File "/usr/local/lib/python3.9/dist-packages/ipykernel/kernelbase.py", line 361, in process_one
yield gen.maybe_future(dispatch(*args))
File "/usr/local/lib/python3.9/dist-packages/tornado/gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "/usr/local/lib/python3.9/dist-packages/ipykernel/kernelbase.py", line 261, in dispatch_shell
yield gen.maybe_future(handler(stream, idents, msg))
File "/usr/local/lib/python3.9/dist-packages/tornado/gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "/usr/local/lib/python3.9/dist-packages/ipykernel/kernelbase.py", line 539, in execute_request
self.do_execute(
File "/usr/local/lib/python3.9/dist-packages/tornado/gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py", line 302, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python3.9/dist-packages/ipykernel/zmqshell.py", line 539, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py", line 2975, in run_cell
result = self._run_cell(
File "/usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py", line 3030, in _run_cell
return runner(coro)
File "/usr/local/lib/python3.9/dist-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner
coro.send(None)
File "/usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py", line 3257, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
File "/usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py", line 3473, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File "/usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-20-261e44759480>", line 1, in <cell line: 1>
history=model.fit(
File "/usr/local/lib/python3.9/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/keras/engine/training.py", line 1685, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.9/dist-packages/keras/engine/training.py", line 1284, in train_function
return step_function(self, iterator)
File "/usr/local/lib/python3.9/dist-packages/keras/engine/training.py", line 1268, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.9/dist-packages/keras/engine/training.py", line 1249, in run_step
outputs = model.train_step(data)
File "/usr/local/lib/python3.9/dist-packages/transformers/modeling_tf_utils.py", line 1571, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/usr/local/lib/python3.9/dist-packages/keras/optimizers/optimizer.py", line 543, in minimize
self.apply_gradients(grads_and_vars)
File "/usr/local/lib/python3.9/dist-packages/keras/optimizers/optimizer.py", line 1174, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/usr/local/lib/python3.9/dist-packages/keras/optimizers/optimizer.py", line 650, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/usr/local/lib/python3.9/dist-packages/keras/optimizers/optimizer.py", line 1200, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File "/usr/local/lib/python3.9/dist-packages/keras/optimizers/optimizer.py", line 1250, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/usr/local/lib/python3.9/dist-packages/keras/optimizers/optimizer.py", line 1245, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_200'
RET_CHECK failure (tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:618) dnn != nullptr
[[{{node StatefulPartitionedCall_200}}]] [Op:__inference_train_function_35086]
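The problem is specific to the GPU runtime, which crashes during model training. I need to train on a different dataset, but it always throws the same error, even for the original model I trained on Colab two weeks ago. I'm not sure where the problem lies, so I'd appreciate any clarification. I'm paying for this service, so I'd like to get this resolved as soon as possible.
Edit: it works on the CPU runtime.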
2 answers
hivapdat1#
We're seeing the same error with the same model on Google Colab Pro. Is the problem in Google Colab rather than in the code itself?
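One way to check whether the environment changed rather than the code is to record the package versions in the Colab runtime and compare them with a run that worked. A minimal sketch, assuming the three packages of interest are `tensorflow`, `keras` and `transformers` (the helper name `installed_version` is made up for illustration):

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Collect the versions that matter for this traceback and print them.
versions = {
    name: installed_version(name)
    for name in ("tensorflow", "keras", "transformers")
}
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

If the numbers differ from the ones in use two weeks earlier, that points to a runtime upgrade on Colab's side and not to the training code.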
xpszyzbs2#
I have the same problem. It seems to be an issue with Google Colab itself in combination with the latest TensorFlow version.
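A possible workaround, offered as an assumption and not confirmed by anyone in this thread: the traceback ends in the optimizer's `_update_step_xla` and the failure comes from XLA's GPU compiler, so turning off XLA compilation of the optimizer update step may avoid the crash. In Keras versions where `Adam` accepts a `jit_compile` argument, passing `False` skips that path. The sketch below uses a hypothetical helper, `build_adam_without_xla`, and a hypothetical learning rate of `2e-5`; it only passes `jit_compile` when the optimizer class actually declares it.

```python
import inspect


def build_adam_without_xla(adam_cls, learning_rate=2e-5):
    """Instantiate an optimizer class with XLA compilation disabled when supported.

    adam_cls is expected to be something like tf.keras.optimizers.Adam.
    The learning rate default is a placeholder, not a value from the question.
    """
    params = inspect.signature(adam_cls.__init__).parameters
    kwargs = {"learning_rate": learning_rate}
    # Only pass jit_compile if this Keras version's optimizer declares it.
    if "jit_compile" in params:
        kwargs["jit_compile"] = False
    return adam_cls(**kwargs)


# Intended use in the notebook (not executed here, requires TensorFlow):
#   import tensorflow as tf
#   optimizer = build_adam_without_xla(tf.keras.optimizers.Adam)
#   model.compile(optimizer=optimizer)
#   history = model.fit(...)
```

If the error persists, pinning TensorFlow to the version that worked two weeks earlier would be the next thing to try.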