Я явно пытаюсь установить версию mxnet БЕЗ поддержки CUDA. При установке с поддержкой cuda я могу запустить этот пример здесь. Я следую за keras & Руководство по установке mxnet здесь.
Действия по воспроизведению успешного keras-mxnet с поддержкой CUDA:
Вот мои конфиги gpu от nvcc --version
:
~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
Убедитесь, что у вас не установлена mxnet
.
pip install mxnet-cu80
pip install keras-mxnet
Запуск кода на jupyter дает мне:
60000 train samples
10000 test samples
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 512) 401920
_________________________________________________________________
activation_1 (Activation) (None, 512) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 512) 0
_________________________________________________________________
dense_2 (Dense) (None, 512) 262656
_________________________________________________________________
activation_2 (Activation) (None, 512) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 512) 0
_________________________________________________________________
dense_3 (Dense) (None, 10) 5130
_________________________________________________________________
activation_3 (Activation) (None, 10) 0
=================================================================
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Train on 60000 samples, validate on 10000 samples
Epoch 1/20
6400/60000 [==>...........................] - ETA: 39s - loss: 2.1718 - acc: 0.2587
/usr/local/lib/python3.6/dist-packages/mxnet/module/bucketing_module.py:408: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
force_init=force_init)
60000/60000 [==============================] - 6s 103us/step - loss: 1.2105 - acc: 0.6957 - val_loss: 0.5334 - val_acc: 0.8728
Epoch 2/20
60000/60000 [==============================] - 2s 27us/step - loss: 0.5280 - acc: 0.8515 - val_loss: 0.3749 - val_acc: 0.8996
Epoch 3/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.4239 - acc: 0.8786 - val_loss: 0.3213 - val_acc: 0.9098
Epoch 4/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.3740 - acc: 0.8911 - val_loss: 0.2923 - val_acc: 0.9162
Epoch 5/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.3437 - acc: 0.9008 - val_loss: 0.2704 - val_acc: 0.9218
Epoch 6/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.3195 - acc: 0.9079 - val_loss: 0.2539 - val_acc: 0.9263
Epoch 7/20
60000/60000 [==============================] - 2s 29us/step - loss: 0.2965 - acc: 0.9151 - val_loss: 0.2393 - val_acc: 0.9312
Epoch 8/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2792 - acc: 0.9190 - val_loss: 0.2264 - val_acc: 0.9342
Epoch 9/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2641 - acc: 0.9239 - val_loss: 0.2173 - val_acc: 0.9363
Epoch 10/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2520 - acc: 0.9277 - val_loss: 0.2064 - val_acc: 0.9413
Epoch 11/20
60000/60000 [==============================] - 2s 29us/step - loss: 0.2409 - acc: 0.9306 - val_loss: 0.1983 - val_acc: 0.9425
Epoch 12/20
60000/60000 [==============================] - 2s 30us/step - loss: 0.2307 - acc: 0.9331 - val_loss: 0.1894 - val_acc: 0.9447
Epoch 13/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2209 - acc: 0.9362 - val_loss: 0.1813 - val_acc: 0.9463
Epoch 14/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2106 - acc: 0.9396 - val_loss: 0.1756 - val_acc: 0.9478
Epoch 15/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2044 - acc: 0.9410 - val_loss: 0.1687 - val_acc: 0.9501
Epoch 16/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1963 - acc: 0.9424 - val_loss: 0.1625 - val_acc: 0.9528
Epoch 17/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1912 - acc: 0.9436 - val_loss: 0.1576 - val_acc: 0.9542
Epoch 18/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1842 - acc: 0.9472 - val_loss: 0.1544 - val_acc: 0.9541
Epoch 19/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1782 - acc: 0.9482 - val_loss: 0.1490 - val_acc: 0.9553
Epoch 20/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1729 - acc: 0.9494 - val_loss: 0.1447 - val_acc: 0.9570
Test score: 0.144698123593
Test accuracy: 0.957
Действия по воспроизведению неудачного только для процессора keras-mxnet:
Сделайте то же, что и раньше, но вместо установки mxnet-cu80
установите mxnet
:
pip uninstall mxnet-cu80
pip install mxnet
Выполнение кода на ноутбуке Jupyter теперь дает мне:
60000 train samples
10000 test samples
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_4 (Dense) (None, 512) 401920
_________________________________________________________________
activation_4 (Activation) (None, 512) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 512) 0
_________________________________________________________________
dense_5 (Dense) (None, 512) 262656
_________________________________________________________________
activation_5 (Activation) (None, 512) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 512) 0
_________________________________________________________________
dense_6 (Dense) (None, 10) 5130
_________________________________________________________________
activation_6 (Activation) (None, 10) 0
=================================================================
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Train on 60000 samples, validate on 10000 samples
Epoch 1/20
---------------------------------------------------------------------------
MXNetError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/mxnet/symbol/symbol.py in simple_bind(self, ctx, grad_req, type_dict, stype_dict, group2ctx, shared_arg_names, shared_exec, shared_buffer, **kwargs)
1512 shared_exec_handle,
-> 1513 ctypes.byref(exe_handle)))
1514 except MXNetError as e:
/usr/local/lib/python3.6/dist-packages/mxnet/base.py in check_call(ret)
148 if ret != 0:
--> 149 raise MXNetError(py_str(_LIB.MXGetLastError()))
150
MXNetError: [04:19:54] src/storage/storage.cc:123: Compile with USE_CUDA=1 to enable GPU usage
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c05f2) [0x7f737ac845f2]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c0bd8) [0x7f737ac84bd8]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d7d3cd) [0x7f737d8413cd]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d8141d) [0x7f737d84541d]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d83206) [0x7f737d847206]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2831) [0x7f737d266831]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2984) [0x7f737d266984]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27aecec) [0x7f737d272cec]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27b55f8) [0x7f737d2795f8]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27c163a) [0x7f737d28563a]
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-4-c71d8965f0f3> in <module>()
49 history = model.fit(X_train, Y_train,
50 batch_size=batch_size, epochs=nb_epoch,
---> 51 verbose=1, validation_data=(X_test, Y_test))
52 score = model.evaluate(X_test, Y_test, verbose=0)
53 print('Test score:', score[0])
/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1042 initial_epoch=initial_epoch,
1043 steps_per_epoch=steps_per_epoch,
-> 1044 validation_steps=validation_steps)
1045
1046 def evaluate(self, x=None, y=None,
/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
197 ins_batch[i] = ins_batch[i].toarray()
198
--> 199 outs = f(ins_batch)
200 if not isinstance(outs, list):
201 outs = [outs]
/usr/local/lib/python3.6/dist-packages/keras/backend/mxnet_backend.py in train_function(inputs)
4794 def train_function(inputs):
4795 self._check_trainable_weights_consistency()
-> 4796 data, label, _, data_shapes, label_shapes = self._adjust_module(inputs, 'train')
4797
4798 batch = mx.io.DataBatch(data=data, label=label, bucket_key='train',
/usr/local/lib/python3.6/dist-packages/keras/backend/mxnet_backend.py in _adjust_module(self, inputs, phase)
4746 self._set_weights()
4747 else:
-> 4748 self._module.bind(data_shapes=data_shapes, label_shapes=None, for_training=True)
4749 self._set_weights()
4750 self._module.init_optimizer(kvstore=self._kvstore, optimizer=self.optimizer)
/usr/local/lib/python3.6/dist-packages/mxnet/module/bucketing_module.py in bind(self, data_shapes, label_shapes, for_training, inputs_need_grad, force_rebind, shared_module, grad_req)
341 compression_params=self._compression_params)
342 module.bind(data_shapes, label_shapes, for_training, inputs_need_grad,
--> 343 force_rebind=False, shared_module=None, grad_req=grad_req)
344 self._curr_module = module
345 self._curr_bucket_key = self._default_bucket_key
/usr/local/lib/python3.6/dist-packages/mxnet/module/module.py in bind(self, data_shapes, label_shapes, for_training, inputs_need_grad, force_rebind, shared_module, grad_req)
428 fixed_param_names=self._fixed_param_names,
429 grad_req=grad_req, group2ctxs=self._group2ctxs,
--> 430 state_names=self._state_names)
431 self._total_exec_bytes = self._exec_group._total_exec_bytes
432 if shared_module is not None:
/usr/local/lib/python3.6/dist-packages/mxnet/module/executor_group.py in __init__(self, symbol, contexts, workload, data_shapes, label_shapes, param_names, for_training, inputs_need_grad, shared_group, logger, fixed_param_names, grad_req, state_names, group2ctxs)
263 self.num_outputs = len(self.symbol.list_outputs())
264
--> 265 self.bind_exec(data_shapes, label_shapes, shared_group)
266
267 def decide_slices(self, data_shapes):
/usr/local/lib/python3.6/dist-packages/mxnet/module/executor_group.py in bind_exec(self, data_shapes, label_shapes, shared_group, reshape)
359 else:
360 self.execs.append(self._bind_ith_exec(i, data_shapes_i, label_shapes_i,
--> 361 shared_group))
362
363 self.data_shapes = data_shapes
/usr/local/lib/python3.6/dist-packages/mxnet/module/executor_group.py in _bind_ith_exec(self, i, data_shapes, label_shapes, shared_group)
637 type_dict=input_types, shared_arg_names=self.param_names,
638 shared_exec=shared_exec, group2ctx=group2ctx,
--> 639 shared_buffer=shared_data_arrays, **input_shapes)
640 self._total_exec_bytes += int(executor.debug_str().split('\n')[-3].split()[1])
641 return executor
/usr/local/lib/python3.6/dist-packages/mxnet/symbol/symbol.py in simple_bind(self, ctx, grad_req, type_dict, stype_dict, group2ctx, shared_arg_names, shared_exec, shared_buffer, **kwargs)
1517 error_msg += "%s: %s\n" % (k, v)
1518 error_msg += "%s" % e
-> 1519 raise RuntimeError(error_msg)
1520
1521 # update shared_buffer
RuntimeError: simple_bind error. Arguments:
/dense_4_input1: (128, 784)
[04:19:54] src/storage/storage.cc:123: Compile with USE_CUDA=1 to enable GPU usage
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c05f2) [0x7f737ac845f2]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c0bd8) [0x7f737ac84bd8]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d7d3cd) [0x7f737d8413cd]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d8141d) [0x7f737d84541d]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d83206) [0x7f737d847206]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2831) [0x7f737d266831]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2984) [0x7f737d266984]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27aecec) [0x7f737d272cec]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27b55f8) [0x7f737d2795f8]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27c163a) [0x7f737d28563a]
Что именно говорит эта ошибка? Как я могу это исправить?