CUDA out of memory exception with DGL and PyTorch
asked 25 February 2020

I have a graph with 88830 nodes and 1.8M edges.

I am following the DGL tutorial on unsupervised GraphSAGE. The feature size is 2048.

I am getting a CUDA out of memory exception.

DGLGraph(num_nodes=88830, num_edges=1865430,
         ndata_schemes={}
         edata_schemes={})
Traceback (most recent call last):
  File "graphsage.py", line 233, in <module>
    train()
  File "graphsage.py", line 213, in train
    loss = model(train_g, img_features, color_features, neg_sample_size)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "graphsage.py", line 125, in forward
    pos_score = score_func(pos_g, emb)
  File "graphsage.py", line 89, in score_func
    pos_tails = emb[dst_nid]
RuntimeError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 0; 11.17 GiB total capacity; 9.14 GiB already allocated; 1018.06 MiB free; 9.15 GiB reserved in total by PyTorch)
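
The allocation fails at pos_tails = emb[dst_nid], i.e. while gathering embedding rows for all 1.8M positive edges in one go. A minimal sketch of the kind of workaround I tried, assuming score_func computes a dot-product score per edge (my assumption; chunked_scores is a hypothetical helper, not from the tutorial):

import torch

def chunked_scores(emb, src_nid, dst_nid, chunk_size=200_000):
    # Gather and score the edges in chunks so no single intermediate
    # tensor has to hold embedding rows for all 1.8M edges at once.
    scores = []
    for i in range(0, len(src_nid), chunk_size):
        heads = emb[src_nid[i:i + chunk_size]]
        tails = emb[dst_nid[i:i + chunk_size]]
        scores.append((heads * tails).sum(dim=1))  # assumed dot-product score
    return torch.cat(scores)

Chunking only caps the size of any single allocation; during training the gathered chunks are still retained for the backward pass, so a mini-batch approach (sketched at the end of this post) is probably the more robust fix.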

Since the machine has 8 GPUs, I tried wrapping the model with model = nn.DataParallel(model). I then get the following error.

Traceback (most recent call last):
  File "graphsage.py", line 233, in <module>
    train()
  File "graphsage.py", line 213, in train
    loss = model(train_g, img_features, color_features, neg_sample_size)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
dgl._ffi.base.DGLError: Caught DGLError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "graphsage.py", line 123, in forward
    emb = self.gconv_model(g, features)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "graphsage.py", line 44, in forward
    h = layer(g, h)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/nn/pytorch/conv/sageconv.py", line 113, in forward
    graph.ndata['h'] = feat
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/view.py", line 65, in __setitem__
    self._graph.set_n_repr({key : val}, self._nodes)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/graph.py", line 1790, in set_n_repr
    ' Got %d and %d instead.' % (nfeats, num_nodes))
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 11104 and 88830 instead.

I can see that the data is being split into 8 chunks (88830 / 8 ≈ 11104), but I am not sure how to get around this error.
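
As far as I understand, nn.DataParallel scatters tensor arguments along dim 0 across the replicas, while non-tensor arguments such as the DGLGraph are replicated whole, so every replica sees the full 88830-node graph but only ~11104 feature rows. A minimal sketch that reproduces the split (Probe is a hypothetical helper module, not part of my model):

import torch
import torch.nn as nn

class Probe(nn.Module):
    def forward(self, feats):
        # Report how many feature rows this replica received;
        # DataParallel gathers the per-replica outputs on device 0.
        return torch.tensor([feats.shape[0]], device=feats.device)

feats = torch.randn(88830, 8).cuda()
print(nn.DataParallel(Probe().cuda())(feats))
# On 8 GPUs this prints eight counts of roughly 11104 rows each,
# matching the number in the DGLError above.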

Can someone help me with this?
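
For context, the alternative I am considering instead of DataParallel is mini-batch training with neighbor sampling, so the full graph never has to fit on one GPU. A rough sketch, assuming the dgl.contrib.sampling.NeighborSampler API that shipped with DGL 0.4 (batch size and fan-out are illustrative; the model would also need to be rewritten to operate on NodeFlow blocks rather than the full graph):

from dgl.contrib.sampling import NeighborSampler

# Iterate over sampled computation subgraphs (NodeFlows) instead of
# running the model on the full 88830-node graph at once.
for nf in NeighborSampler(train_g, 1024,
                          expand_factor=10,   # sample 10 in-neighbors per node
                          num_hops=2,         # one hop per SAGEConv layer
                          neighbor_type='in',
                          shuffle=True):
    nf.copy_from_parent()  # pull node features from the parent graph
    # run the forward/backward pass on the NodeFlow here; each sampled
    # block is small enough to fit in a single GPU's memory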
