Paddle: 1) Is my installation environment correct? 2) Fine-tuning on a local GPU fails with an out-of-memory error

wwwo4jvm  posted on 2022-10-20 in Other

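Environment and problems:

1) PaddlePaddle version:
In an Anaconda virtual environment I installed the GPU build of paddlepaddle 1.7 with conda,
following the official Paddle site: conda install paddlepaddle-gpu cudatoolkit=10.0.
The post-install verification passed.
I then installed paddlehub 1.6.0 with pip. After that, conda list shows: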
paddlehub 1.6.0 pypi_0 pypi
paddlepaddle-gpu 1.7.0.post107 pypi_0 pypi
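My question here: why is paddlepaddle-gpu now listed as installed from pypi?
Also, in Jupyter, every time I restart the kernel, the first import paddle.fluid pops up an error dialog (screenshot not preserved).

Is all this caused by a messed-up environment?
2) GPU: 940MX 2G; CUDA: V10.0.130; cuDNN: 7.6.5.32
3) OS: Windows 10 64-bit Home edition (can paddlepaddle actually be installed through conda on the Home edition?)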
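
A minimal sketch for reading the conda list rows quoted above. In conda list output the columns are name, version, build and channel, and packages recorded through pip appear with the channel pypi. The helper name pip_installed is hypothetical, and the sketch only reports which packages are pip-managed; it does not explain how they got that way.

```python
# Sketch: find the pip-managed packages in `conda list` output.
# SAMPLE holds the two rows quoted in the question.
SAMPLE = """\
paddlehub 1.6.0 pypi_0 pypi
paddlepaddle-gpu 1.7.0.post107 pypi_0 pypi
"""


def pip_installed(conda_list_text):
    """Return {name: version} for rows whose channel column is 'pypi'."""
    found = {}
    for line in conda_list_text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines conda prints.
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Columns: name, version, build, and (optionally) channel.
        if len(cols) >= 4 and cols[3] == "pypi":
            found[cols[0]] = cols[1]
    return found


if __name__ == "__main__":
    print(pip_installed(SAMPLE))
```

If both packages show up here, one way to get a clean state is a fresh environment in which only the conda package is installed first, which is also what the first reply suggests.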

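Out-of-resource problem when fine-tuning:

With this consumer-grade card I can run prediction on the GPU with the pretrained senta-bilstm model.
Fine-tuning, however, fails. Even with
os.environ['FLAGS_fraction_of_gpu_memory_to_use']='0.95',
batchsize=2
it reports insufficient resources.
The full code and error messages are in the attachment ERRLOG.ZIP:
ERRLOG.zip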

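I tried the same code in the AI Studio GPU environment, with os.environ['FLAGS_fraction_of_gpu_memory_to_use']='0.01',
batchsize=32
and fine-tuning succeeded.
If the cause is that the 2 GB of memory on my local card is not enough, why does fine-tuning work on AI Studio even with
such a small fraction of GPU memory allowed? Thanks.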

iih3973s #1

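> Windows 10 64-bit Home edition (can paddlepaddle actually be installed through conda on the Home edition?)

Yes, it can.

> Is all this caused by a messed-up environment?

Possibly. Please check with the following step:

  1. First install only Paddle with conda, without paddlehub, and see whether the error still appears.

> in Jupyter, every time I restart the kernel, the first import paddle.fluid pops up

Does the problem also occur when you run the script directly with python?

> it reports insufficient resources.

Is your local GPU running any other tasks? I did not find any error message in ERRLOG.zip.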
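
One way to answer the question about other GPU tasks is to ask nvidia-smi for memory figures before starting the script. The sketch below is an illustration, not part of the thread: the helper names are hypothetical, and it assumes nvidia-smi is on PATH and supports the --query-gpu and --format=csv,noheader,nounits options (with nounits the memory values are in MiB).

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn 'index, used, total' rows (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue
        try:
            index, used, total = (int(field) for field in fields)
        except ValueError:
            continue  # a field the driver reports as not available
        rows.append({"gpu": index, "used_mib": used,
                     "total_mib": total, "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Run nvidia-smi if present; return [] when it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for row in query_gpu_memory():
        print(row)
```

A large used_mib value before the training script starts would point to another process holding GPU memory.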

zrfyljdw #2

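> Does the problem also occur when you run the script directly with python?

I just tried it; no problem.

> ......follow this person's method......

Following the method you pointed to, I searched the Anaconda folder and found copies of pythoncom37.dll with two different dates. I replaced the older DLL in Anaconda3\Library\bin with the newer one, and Jupyter no longer shows the "procedure entry point could not be located" error.

> Is your local GPU running any other tasks? I did not find any error message in ERRLOG.zip.

No other tasks are running.

The code is as follows: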


# -*- coding: utf-8 -*-

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import os
os.environ['FLAGS_fraction_of_gpu_memory_to_use']='0.95'

import paddlehub as hub 
from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

class DemoDataset(BaseNLPDataset):
    """DemoDataset"""
    def __init__(self):
        self.dataset_dir = r"./"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_file="train.tsv",
            dev_file="dev.tsv",
            test_file="test.tsv",
            predict_file="predict.tsv",
            train_file_with_header=False,
            dev_file_with_header=False,
            test_file_with_header=False,
            predict_file_with_header=False,
            label_list=["0", "1"])
dataset = DemoDataset()

module = hub.Module(name="senta_bilstm")

reader = hub.reader.LACClassifyReader(
    dataset=dataset,     
    vocab_path=module.get_vocab_path())

strategy = hub.AdamWeightDecayStrategy(
    weight_decay=0.01,
    warmup_proportion=0.1,
    learning_rate=5e-5)

config = hub.RunConfig(
    use_cuda=True,
    enable_memory_optim=False,  # True also reports insufficient resources
    use_pyreader=False,
    num_epoch=1,
    use_data_parallel=False,
    checkpoint_dir="tmpcheckpoint",
    batch_size=2,
    eval_interval=50,
    strategy=strategy)

inputs, outputs, program = module.context(trainable=True)

sent_feature = outputs["sentence_feature"]

feed_list = [inputs["words"].name]

cls_task = hub.TextClassifierTask(
    data_reader=reader,
    feature=sent_feature,
    feed_list=feed_list,
    num_classes=dataset.num_labels,
    config=config)

run_states = cls_task.finetune_and_eval()

The output is as follows:

[2020-03-31 17:05:10,502] [    INFO] - Installing senta_bilstm module
[2020-03-31 17:05:10,851] [    INFO] - Module senta_bilstm already installed in F:\PaddleHubHome\.paddlehub\modules\senta_bilstm
[2020-03-31 17:05:27,859] [    INFO] - Dataset label map = {'0': 0, '1': 1}
[2020-03-31 17:05:27,867] [    INFO] - Installing lac module
[2020-03-31 17:05:27,877] [    INFO] - Module lac already installed in F:\PaddleHubHome\.paddlehub\modules\lac
[2020-03-31 17:05:31,576] [ WARNING] - The memory optimization feature has been dropped! PaddleHub now doesn't optimize the memory of the program.
[2020-03-31 17:05:31,579] [    INFO] - Checkpoint dir: tmpcheckpoint
[2020-03-31 17:05:34,137] [    INFO] - processing train data now... this may take a few minutes
[2020-03-31 17:05:37,548] [    INFO] - Strategy with warmup, linear decay, slanted triangle learning rate, weight decay regularization, 
d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddle\fluid\executor.py:804: UserWarning: There are no operators in the program to be executed. If you pass Program manually, please use fluid.program_guard to ensure the current Program is being used.
  warnings.warn(error_info)
[2020-03-31 17:05:49,254] [    INFO] - Try loading checkpoint from tmpcheckpoint\ckpt.meta
[2020-03-31 17:05:49,257] [    INFO] - PaddleHub model checkpoint not found, start from scratch...
d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddle\fluid\executor.py:782: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-7372f9c29fb2> in <module>
     55     config=config)
     56 
---> 57 run_states = cls_task.finetune_and_eval()

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddlehub\finetune\task\base_task.py in finetune_and_eval(self)
    861 
    862     def finetune_and_eval(self):
--> 863         return self.finetune(do_eval=True)
    864 
    865     def finetune(self, do_eval=False):

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddlehub\finetune\task\base_task.py in finetune(self, do_eval)
    876         # Start to finetune
    877         with self.phase_guard(phase="train"):
--> 878             self.init_if_necessary()
    879             self._finetune_start_event()
    880             run_states = []

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddlehub\finetune\task\base_task.py in init_if_necessary(self)
    365         if not self.is_checkpoint_loaded:
    366             if not self.load_checkpoint():
--> 367                 self.exe.run(self._base_startup_program)
    368             self.is_checkpoint_loaded = True
    369             self.is_best_model_loaded = False

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddle\fluid\executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    781                 warnings.warn(
    782                     "The following exception is not an EOF exception.")
--> 783             six.reraise(*sys.exc_info())
    784 
    785     def _run_impl(self, program, feed, fetch_list, feed_var_name,

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\six.py in reraise(tp, value, tb)
    701             if value.__traceback__ is not tb:
    702                 raise value.with_traceback(tb)
--> 703             raise value
    704         finally:
    705             value = None

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddle\fluid\executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    776                 scope=scope,
    777                 return_numpy=return_numpy,
--> 778                 use_program_cache=use_program_cache)
    779         except Exception as e:
    780             if not isinstance(e, core.EOFException):

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddle\fluid\executor.py in _run_impl(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    829                 scope=scope,
    830                 return_numpy=return_numpy,
--> 831                 use_program_cache=use_program_cache)
    832 
    833         program._compile(scope, self.place)

d:\Anaconda3\envs\ohpy37_01\lib\site-packages\paddle\fluid\executor.py in _run_program(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    903         if not use_program_cache:
    904             self._default_executor.run(program.desc, scope, 0, True, True,
--> 905                                        fetch_var_name)
    906         else:
    907             self._default_executor.run_prepared_ctx(ctx, scope, False, False,

RuntimeError: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
Windows not support stack backtrace yet.

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 613.577393MB memory on GPU 0, available memory is only 24.053124MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please try one of the following suggestions:
   1) Decrease the batch size of your model.
   2) FLAGS_fraction_of_gpu_memory_to_use is 0.95 now, please set it to a higher value but less than 1.0.
      The command is `export FLAGS_fraction_of_gpu_memory_to_use=xxx`.

 at (D:\1.7.0\paddle\paddle\fluid\memory\detail\system_allocator.cc:151)

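nvidia-smi output before running the code (screenshot not preserved).

nvidia-smi output after running the code and hitting the out-of-resource error (screenshot not preserved).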

6qfn3psc #3

> Out of memory error on GPU 0. Cannot allocate 613.577393MB memory on GPU 0, available memory is only 24.053124MB.

Judging from the error and the nvidia-smi screenshots, the GPU really has run out of memory.
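
The figures in that error line can be pulled out and compared directly. This is a small sketch with hypothetical names (OOM_PATTERN, oom_shortfall); the regular expression is written against the exact wording of the message quoted in this thread and may not match other Paddle versions.

```python
import re

# Pattern written for the message text quoted in this thread.
OOM_PATTERN = re.compile(
    r"Cannot allocate (?P<need>[\d.]+)MB memory on GPU (?P<gpu>\d+), "
    r"available memory is only (?P<free>[\d.]+)MB"
)

MESSAGE = ("Out of memory error on GPU 0. Cannot allocate 613.577393MB memory "
           "on GPU 0, available memory is only 24.053124MB.")


def oom_shortfall(message):
    """Return requested, available and missing MB, or None if no match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    need = float(match.group("need"))
    free = float(match.group("free"))
    return {"gpu": int(match.group("gpu")), "need_mb": need,
            "free_mb": free, "short_mb": need - free}


if __name__ == "__main__":
    print(oom_shortfall(MESSAGE))
```

For the message above, the single failed request is roughly 590 MB larger than what was still free on the 2 GB card at that moment.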

> [2020-03-31 17:05:31,576] [ WARNING] - The memory optimization feature has been dropped! PaddleHub now doesn't optimize the memory of the program.

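This warning means that PaddleHub currently does not support the memory optimization scheme, so setting enable_memory_optim=True in your config has no effect.

We will pass on the request for PaddleHub to support it.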

xvw2m8pv #4

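> Judging from the error and the nvidia-smi screenshots, the GPU really has run out of memory.

One follow-up question, if I may: I tried the same code in the AI Studio GPU environment, with os.environ['FLAGS_fraction_of_gpu_memory_to_use']='0.01',
batchsize=32
and fine-tuning succeeded. Does the GPU memory usage setting actually take effect on AI Studio?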

hujrc8aj #5

First check whether your local machine and AI Studio both have version 1.7 installed:

>>> import paddle
>>> print(paddle.__version__)
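  • In version 1.7 the auto_growth strategy is the default, so FLAGS_fraction_of_gpu_memory_to_use no longer takes effect.
  • Judging from your local error, you still have version 1.6 installed; please confirm with the command above.

5cg8jx4n #6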

> First check whether your local machine and AI Studio both have version 1.7 installed

Local: paddlepaddle 1.7.0, PaddleHub 1.6.0
AI Studio: paddlepaddle 1.7.1, PaddleHub 1.5.0
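
The version comparison can also be done in code rather than by eye. The sketch below is illustrative only: the function names are hypothetical, and the 1.7 threshold is simply the claim from reply #5 that auto_growth is the default from 1.7 on.

```python
def version_tuple(version):
    """'1.7.0.post107' -> (1, 7, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def fraction_flag_ignored_by_default(version):
    # Assumption taken from reply #5: from 1.7 on, auto_growth is the default
    # strategy, so FLAGS_fraction_of_gpu_memory_to_use is not applied.
    return version_tuple(version) >= (1, 7)


if __name__ == "__main__":
    for v in ("1.7.0.post107", "1.7.1", "1.6.3"):
        print(v, fraction_flag_ignored_by_default(v))
```

In a real session the argument would be paddle.__version__ instead of a literal string.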

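>   • In version 1.7 the auto_growth strategy is the default, so FLAGS_fraction_of_gpu_memory_to_use no longer takes effect.

Later I found that AI Studio lets you view GPU memory usage directly in its performance monitor (screenshot not preserved).

It turns out that fine-tuning senta-bilstm uses more than 4 GB of GPU memory even with batchsize set to just 2. So I can now be sure that the out-of-resource error on my local 2 GB card really is caused by too little GPU memory.

The original question of this thread is resolved. Thanks.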

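PS: I also noticed another issue on AI Studio.
I wanted to limit my project's GPU memory usage there. Since the default is auto_growth, I first changed FLAGS_allocator_strategy to naive_best_fit. But whether I enter in a terminal
export FLAGS_allocator_strategy=naive_best_fit
export FLAGS_fraction_of_gpu_memory_to_use=0.125
or set at the top of the notebook
os.environ['FLAGS_allocator_strategy']='naive_best_fit'
os.environ['FLAGS_fraction_of_gpu_memory_to_use']='0.125'
neither seems to take effect; the memory monitor shows actual usage still exceeding the configured value.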
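
One possible explanation, which the thread does not confirm, is timing. An export in a terminal only affects processes started from that shell, not an already running notebook kernel, and, as far as I understand, Paddle reads FLAGS_* environment variables when the package is first imported. The sketch below uses a hypothetical helper, set_paddle_flags, that refuses to set the flags once paddle is already loaded in the current process.

```python
import os
import sys

# Values taken from the post above.
FLAGS = {
    "FLAGS_allocator_strategy": "naive_best_fit",
    "FLAGS_fraction_of_gpu_memory_to_use": 0.125,
}


def set_paddle_flags(flags, module_name="paddle"):
    """Export flags to the environment, but only before the module is imported."""
    if module_name in sys.modules:
        # Assumption: flags set after import are not picked up.
        raise RuntimeError(
            module_name + " is already imported; restart the kernel "
            "and set the flags first")
    for name, value in flags.items():
        os.environ[name] = str(value)


if __name__ == "__main__":
    set_paddle_flags(FLAGS)
    # import paddle.fluid as fluid  # import only after the flags are in place
```

Even when the flags are applied, my understanding is that FLAGS_fraction_of_gpu_memory_to_use sizes the initially reserved chunk rather than imposing a hard cap, so usage above the configured fraction would not by itself prove the flag was ignored.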
