Strange "CUDA out of memory" problem when using PyTorch
0 votes
January 10, 2020

I am trying to run a basic model and am getting the following error:

  Traceback (most recent call last):
  File "train_demo.py", line 199, in <module>
    main()
  File "train_demo.py", line 190, in main
    train_iter=opt.train_iter, val_iter=opt.val_iter, bert_optim=bert_optim)
  File "/home/wh/FewRel/fewshot_re_kit/framework.py", line 219, in train
    N_for_train, K, Q * N_for_train + na_rate * Q)
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/FewRel/models/proto.py", line 31, in forward
    support_emb = self.sentence_encoder(support) # (B * N * K, D), where D is the hidden size
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 148, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    res = scatter_map(inputs)
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/wh/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 147, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: out of memory (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:241)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f99524d8813 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1cb50 (0x7f9952719b50 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1de6e (0x7f995271ae6e in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x279 (0x7f98943ee7e9 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x4225538 (0x7f9892e3b538 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x3cdec28 (0x7f98928f4c28 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x1c34521 (0x7f989084a521 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) + 0x272 (0x7f989084aec2 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x1f5db40 (0x7f9890b73b40 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #9: <unknown function> + 0x3af0873 (0x7f9892706873 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #10: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<c10::cuda::CUDAStream>, std::allocator<c10::optional<c10::cuda::CUDAStream> > > > const&) + 0x4db (0x7f989323110b in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #11: <unknown function> + 0x77fd8f (0x7f9953401d8f in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x211014 (0x7f9952e93014 in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: THPFunction_apply(_object*, _object*) + 0xa4f (0x7f99531214af in /home/wh/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

The training data is only 400 KB, and I have tried limiting the batch size to 1, but it does not work. Even stranger, my GPU memory is not fully used:

[screenshot: nvidia-smi output showing per-GPU memory usage]

As you can see, GPU no. 2 is not fully used (my username is wh). I also added os.environ["CUDA_VISIBLE_DEVICES"] = '2' to my code, but it still occupies several GPUs.
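
One thing worth noting (my assumption, not something stated in the traceback): CUDA_VISIBLE_DEVICES is read only when CUDA is initialized, so assigning it after torch has already touched a GPU has no effect. A minimal sketch of the ordering that should pin the process to GPU 2:

    import os
    # Must be set before the first `import torch` (or at least before any
    # CUDA call); the already-initialized driver ignores later changes.
    os.environ["CUDA_VISIBLE_DEVICES"] = "2"

    import torch

    # PyTorch now sees exactly one device: physical GPU 2 appears as cuda:0.
    print(torch.cuda.device_count())  # expected output: 1
    device = torch.device("cuda:0")

With only one device visible, nn.DataParallel (whose scatter step raises the error above) has nothing to distribute across, so the model stays on that single GPU.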

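For reference, a minimal sketch of checking from inside the process how much memory PyTorch itself holds on each visible GPU; nvidia-smi also counts other users' processes, so its numbers can be higher:

    import torch

    # Per-device memory held by this process. On PyTorch of this vintage the
    # cache counter is memory_cached(); it was renamed memory_reserved() in 1.4+.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 2**20
        cached = torch.cuda.memory_cached(i) / 2**20
        print(f"cuda:{i}: {allocated:.1f} MiB allocated, {cached:.1f} MiB cached")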