I have a graph with 88,830 nodes and 1.8M edges.
I am following the DGL unsupervised GraphSAGE tutorial. The feature size is 2048.
I am getting a CUDA out of memory exception.
DGLGraph(num_nodes=88830, num_edges=1865430,
ndata_schemes={}
edata_schemes={})
Traceback (most recent call last):
File "graphsage.py", line 233, in <module>
train()
File "graphsage.py", line 213, in train
loss = model(train_g, img_features, color_features, neg_sample_size)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "graphsage.py", line 125, in forward
pos_score = score_func(pos_g, emb)
File "graphsage.py", line 89, in score_func
pos_tails = emb[dst_nid]
RuntimeError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 0; 11.17 GiB total capacity; 9.14 GiB already allocated; 1018.06 MiB free; 9.15 GiB reserved in total by PyTorch)
Since the machine has 8 GPUs, I tried wrapping the model with model = nn.DataParallel(model), but then I get the following error:
Traceback (most recent call last):
File "graphsage.py", line 233, in <module>
train()
File "graphsage.py", line 213, in train
loss = model(train_g, img_features, color_features, neg_sample_size)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
dgl._ffi.base.DGLError: Caught DGLError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "graphsage.py", line 123, in forward
emb = self.gconv_model(g, features)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "graphsage.py", line 44, in forward
h = layer(g, h)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/nn/pytorch/conv/sageconv.py", line 113, in forward
graph.ndata['h'] = feat
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/view.py", line 65, in __setitem__
self._graph.set_n_repr({key : val}, self._nodes)
File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/graph.py", line 1790, in set_n_repr
' Got %d and %d instead.' % (nfeats, num_nodes))
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 11104 and 88830 instead.
I can see that the features are being split into 8 chunks (88830 / 8 ~= 11104), while the graph still has all 88830 nodes.
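To double-check where 11104 comes from, here is a minimal sketch of the arithmetic. It assumes the wrapper splits the first dimension into ceil-sized chunks, which is my reading of the error rather than something I have confirmed in the PyTorch source. The helper name chunk_sizes is hypothetical and only for illustration.

```python
import math


def chunk_sizes(num_rows, num_chunks):
    """Split num_rows into ceil-sized chunks (hypothetical helper).

    Assumption: this mirrors how the feature tensor is divided
    across GPUs along its first dimension.
    """
    size = math.ceil(num_rows / num_chunks)
    sizes = []
    remaining = num_rows
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


sizes = chunk_sizes(88830, 8)
# The first chunk is what replica 0 receives as `features`.
print(sizes[0], len(sizes), sum(sizes))
```

Under that assumption the first replica gets 11104 feature rows, which matches the number in the DGLError, while the graph passed to it is still the full one.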
But I am not sure how to get around this error.
Can anyone help me with this?