cuDNN crash in TF 2.x after many training epochs
1 vote
/ May 8, 2020

I am getting more and more desperate about my TensorFlow project. Installing TensorFlow took many hours until I realized that PyCharm, Python 3.7 and TF 2.x are somehow incompatible. Now it runs, but after many epochs of training I get a really unspecific cuDNN error. Do you know whether my code is wrong or whether this is, for example, an installation error? Could you point me in a direction? I also did not find anything specific when searching.

My setup [in brackets: what I also tried]:

  • HW: i7-4790K, 32 GB RAM and GeForce 2070 Super 8 GB
  • OS: Windows 10 64-bit
  • Python: 3.6.8 [and 3.7 (where tf could not be installed)]
  • IDE: PyCharm 2020.1.1 [and 2020.1]
  • Driver: latest "Studio" driver 442.92 [and also the latest "gaming" driver]
  • CUDA: 10.1 + the latest cuDNN DLLs for that version [I also tried 10.2, but tf does not detect it]
  • TF: 2.2.0 RC4 [also 2.0.x and 2.1.5]

All packages were installed through PyCharm (and therefore pip).
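A setup like the one listed above is easier to compare between reinstalls when it is collected by a script. The following is only a sketch: the helper name `collect_env_report` is made up for illustration, and the TensorFlow part is skipped when the package cannot be imported. It relies on `tf.test.is_built_with_cuda()` and `tf.config.list_physical_devices()`, which should be available in recent TF 2.x releases.

```python
import platform
import sys


def collect_env_report():
    """Return a dict describing the Python/TensorFlow/GPU setup."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "executable": sys.executable,
        "tensorflow": None,
        "built_with_cuda": None,
        "gpus": [],
    }
    # Only query TensorFlow if it can be imported in this interpreter.
    try:
        import tensorflow as tf
    except ImportError:
        return report
    report["tensorflow"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    report["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print("{:>16}: {}".format(key, value))
```

Running it from the same virtualenv that PyCharm uses shows which interpreter and which TensorFlow build are actually in play.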

This error occurs after about 3 hours of training. In other runs (or with other parametrizations of the net) the error occurs much earlier. Here you can see the full output of the code snippet below:

C:\Users\Fhnx\.virtualenvs\Processing-TA9ofq3q\Scripts\python.exe C:/Users/Fhnx/.../playground/AI_Predictor_Test.py
2020-05-08 11:47:25.924424: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Starting training sweep with Epochs: 10000, LRstart: 0.01, LRend: 5e-05
2020-05-08 11:47:27.887135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-05-08 11:47:27.912998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.815GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 11:47:27.913212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-08 11:47:27.921203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-08 11:47:27.930115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-08 11:47:27.932760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-08 11:47:27.944938: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-08 11:47:27.952321: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-08 11:47:27.960042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-08 11:47:27.960698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-08 11:47:27.961058: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-05-08 11:47:27.969636: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2df4e1dcd00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-08 11:47:27.969831: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-08 11:47:27.970579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.815GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 11:47:27.970964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-08 11:47:27.971208: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-08 11:47:27.971389: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-08 11:47:27.971602: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-08 11:47:27.971839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-08 11:47:27.972112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-08 11:47:27.972324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-08 11:47:27.973322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-08 11:47:28.530960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-08 11:47:28.531109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-05-08 11:47:28.531180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-05-08 11:47:28.532337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6213 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-08 11:47:28.534819: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2df7aeb31a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-08 11:47:28.534946: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, 22)]         0
__________________________________________________________________________________________________
tf_op_layer_ExpandDims (TensorF [(None, 22, 1)]      0           input_1[0][0]
__________________________________________________________________________________________________
dense (Dense)                   (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
__________________________________________________________________________________________________
dense_12 (Dense)                (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
__________________________________________________________________________________________________
dense_15 (Dense)                (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
__________________________________________________________________________________________________
gaussian_dropout (GaussianDropo (None, 22, 64)       0           dense[0][0]
__________________________________________________________________________________________________
gaussian_dropout_2 (GaussianDro (None, 22, 64)       0           dense_3[0][0]
__________________________________________________________________________________________________
gaussian_dropout_4 (GaussianDro (None, 22, 64)       0           dense_6[0][0]
__________________________________________________________________________________________________
gaussian_dropout_6 (GaussianDro (None, 22, 64)       0           dense_9[0][0]
__________________________________________________________________________________________________
gaussian_dropout_8 (GaussianDro (None, 22, 64)       0           dense_12[0][0]
__________________________________________________________________________________________________
gaussian_dropout_10 (GaussianDr (None, 22, 64)       0           dense_15[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 22, 16)       4672        gaussian_dropout[0][0]
__________________________________________________________________________________________________
bidirectional_2 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_2[0][0]
__________________________________________________________________________________________________
bidirectional_4 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_4[0][0]
__________________________________________________________________________________________________
bidirectional_6 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_6[0][0]
__________________________________________________________________________________________________
bidirectional_8 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_8[0][0]
__________________________________________________________________________________________________
bidirectional_10 (Bidirectional (None, 22, 16)       4672        gaussian_dropout_10[0][0]
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 22, 16)       1600        bidirectional[0][0]
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) (None, 22, 16)       1600        bidirectional_2[0][0]
__________________________________________________________________________________________________
bidirectional_5 (Bidirectional) (None, 22, 16)       1600        bidirectional_4[0][0]
__________________________________________________________________________________________________
bidirectional_7 (Bidirectional) (None, 22, 16)       1600        bidirectional_6[0][0]
__________________________________________________________________________________________________
bidirectional_9 (Bidirectional) (None, 22, 16)       1600        bidirectional_8[0][0]
__________________________________________________________________________________________________
bidirectional_11 (Bidirectional (None, 22, 16)       1600        bidirectional_10[0][0]
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 20, 13)       1780        bidirectional_1[0][0]
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, 20, 13)       1780        bidirectional_3[0][0]
__________________________________________________________________________________________________
conv1d_8 (Conv1D)               (None, 20, 13)       1780        bidirectional_5[0][0]
__________________________________________________________________________________________________
conv1d_12 (Conv1D)              (None, 20, 13)       1780        bidirectional_7[0][0]
__________________________________________________________________________________________________
conv1d_16 (Conv1D)              (None, 20, 13)       1780        bidirectional_9[0][0]
__________________________________________________________________________________________________
conv1d_20 (Conv1D)              (None, 20, 13)       1780        bidirectional_11[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 20, 10)       1620        conv1d[0][0]
__________________________________________________________________________________________________
conv1d_5 (Conv1D)               (None, 20, 10)       1620        conv1d_4[0][0]
__________________________________________________________________________________________________
conv1d_9 (Conv1D)               (None, 20, 10)       1620        conv1d_8[0][0]
__________________________________________________________________________________________________
conv1d_13 (Conv1D)              (None, 20, 10)       1620        conv1d_12[0][0]
__________________________________________________________________________________________________
conv1d_17 (Conv1D)              (None, 20, 10)       1620        conv1d_16[0][0]
__________________________________________________________________________________________________
conv1d_21 (Conv1D)              (None, 20, 10)       1620        conv1d_20[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 20, 7)        1620        conv1d_1[0][0]
__________________________________________________________________________________________________
conv1d_6 (Conv1D)               (None, 20, 7)        1620        conv1d_5[0][0]
__________________________________________________________________________________________________
conv1d_10 (Conv1D)              (None, 20, 7)        1620        conv1d_9[0][0]
__________________________________________________________________________________________________
conv1d_14 (Conv1D)              (None, 20, 7)        1620        conv1d_13[0][0]
__________________________________________________________________________________________________
conv1d_18 (Conv1D)              (None, 20, 7)        1620        conv1d_17[0][0]
__________________________________________________________________________________________________
conv1d_22 (Conv1D)              (None, 20, 7)        1620        conv1d_21[0][0]
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 20, 4)        1620        conv1d_2[0][0]
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 20, 4)        1620        conv1d_6[0][0]
__________________________________________________________________________________________________
conv1d_11 (Conv1D)              (None, 20, 4)        1620        conv1d_10[0][0]
__________________________________________________________________________________________________
conv1d_15 (Conv1D)              (None, 20, 4)        1620        conv1d_14[0][0]
__________________________________________________________________________________________________
conv1d_19 (Conv1D)              (None, 20, 4)        1620        conv1d_18[0][0]
__________________________________________________________________________________________________
conv1d_23 (Conv1D)              (None, 20, 4)        1620        conv1d_22[0][0]
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 20, 4)        16          conv1d_3[0][0]
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 20, 4)        16          conv1d_7[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 20, 4)        16          conv1d_11[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 20, 4)        16          conv1d_15[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 20, 4)        16          conv1d_19[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 20, 4)        16          conv1d_23[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 20, 128)      640         batch_normalization[0][0]
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 20, 128)      640         batch_normalization_1[0][0]
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 20, 128)      640         batch_normalization_2[0][0]
__________________________________________________________________________________________________
dense_10 (Dense)                (None, 20, 128)      640         batch_normalization_3[0][0]
__________________________________________________________________________________________________
dense_13 (Dense)                (None, 20, 128)      640         batch_normalization_4[0][0]
__________________________________________________________________________________________________
dense_16 (Dense)                (None, 20, 128)      640         batch_normalization_5[0][0]
__________________________________________________________________________________________________
gaussian_dropout_1 (GaussianDro (None, 20, 128)      0           dense_1[0][0]
__________________________________________________________________________________________________
gaussian_dropout_3 (GaussianDro (None, 20, 128)      0           dense_4[0][0]
__________________________________________________________________________________________________
gaussian_dropout_5 (GaussianDro (None, 20, 128)      0           dense_7[0][0]
__________________________________________________________________________________________________
gaussian_dropout_7 (GaussianDro (None, 20, 128)      0           dense_10[0][0]
__________________________________________________________________________________________________
gaussian_dropout_9 (GaussianDro (None, 20, 128)      0           dense_13[0][0]
__________________________________________________________________________________________________
gaussian_dropout_11 (GaussianDr (None, 20, 128)      0           dense_16[0][0]
__________________________________________________________________________________________________
flatten (Flatten)               (None, 2560)         0           gaussian_dropout_1[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 2560)         0           gaussian_dropout_3[0][0]
__________________________________________________________________________________________________
flatten_2 (Flatten)             (None, 2560)         0           gaussian_dropout_5[0][0]
__________________________________________________________________________________________________
flatten_3 (Flatten)             (None, 2560)         0           gaussian_dropout_7[0][0]
__________________________________________________________________________________________________
flatten_4 (Flatten)             (None, 2560)         0           gaussian_dropout_9[0][0]
__________________________________________________________________________________________________
flatten_5 (Flatten)             (None, 2560)         0           gaussian_dropout_11[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            2561        flatten[0][0]
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 1)            2561        flatten_1[0][0]
__________________________________________________________________________________________________
dense_8 (Dense)                 (None, 1)            2561        flatten_2[0][0]
__________________________________________________________________________________________________
dense_11 (Dense)                (None, 1)            2561        flatten_3[0][0]
__________________________________________________________________________________________________
dense_14 (Dense)                (None, 1)            2561        flatten_4[0][0]
__________________________________________________________________________________________________
dense_17 (Dense)                (None, 1)            2561        flatten_5[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 6)            0           dense_2[0][0]
                                                                 dense_5[0][0]
                                                                 dense_8[0][0]
                                                                 dense_11[0][0]
                                                                 dense_14[0][0]
                                                                 dense_17[0][0]
==================================================================================================
Total params: 97,542
Trainable params: 97,494
Non-trainable params: 48
__________________________________________________________________________________________________
***** Training Net ForkedConvLSTM_D64_LSTM2x8_Conv4x20x4_D1x128_dr0.40 now *****
BatchSize: 2108, NumNetParams: 97542, Feature shape: (500000, 22), Output shape: (500000, 6), In/Out Elem.: 14.0000M with est. size: 448.0000 MB
Epoch 1/10000
2020-05-08 11:47:57.675309: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-08 11:47:57.962354: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-08 11:47:59.216097: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
238/238 [==============================] - 21s 90ms/step - loss: 0.3145 - val_loss: 0.0846 - lr: 0.0100
Epoch 2/10000
238/238 [==============================] - 15s 62ms/step - loss: 0.0851 - val_loss: 0.0837 - lr: 0.0100
[...]
Epoch 694/10000
238/238 [==============================] - 14s 61ms/step - loss: 0.0833 - val_loss: 0.0836 - lr: 5.0000e-05
Epoch 695/10000
  6/238 [..............................] - ETA: 12s - loss: 0.08302020-05-08 14:39:02.141015: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1986): 'cudnnRNNBackwardData( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, output_desc.handles(), output_data.opaque(), output_desc.handles(), output_backprop_data.opaque(), output_h_desc.handle(), output_h_backprop_data.opaque(), output_c_desc.handle(), output_c_backprop_data.opaque(), rnn_desc.params_handle(), params.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), input_desc.handles(), input_backprop_data->opaque(), input_h_desc.handle(), input_h_backprop_data->opaque(), input_c_desc.handle(), input_c_backprop_data->opaque(), workspace.opaque(), workspace.size(), reserve_space_data->opaque(), reserve_space_data->size())'
2020-05-08 14:39:02.141642: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at cudnn_rnn_ops.cc:1922 : Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 16, 8, 1, 22, 2108, 8]
2020-05-08 14:39:02.141037: F tensorflow/stream_executor/cuda/cuda_dnn.cc:189] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.
20
Process finished with exit code -1073740791 (0xC0000409)
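The exit code on the last line of the log is the signed 32-bit form of the hexadecimal value in parentheses. A quick way to check that conversion (the helper name `to_unsigned32` is just for illustration):

```python
def to_unsigned32(code):
    """Map a signed 32-bit process exit code to its unsigned value."""
    return code & 0xFFFFFFFF


exit_code = -1073740791  # value reported by PyCharm in the log above
print(hex(to_unsigned32(exit_code)))  # 0xc0000409
```

As far as I know, Windows reports 0xC0000409 when a process terminates itself through a fail-fast abort. That would fit the fatal `Check failed` line from `cuda_dnn.cc:189` better than a Python-level exception.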

Here is some code that should be runnable and produce the output above:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# from os import environ
# environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

from tensorflow.keras.models import *
from tensorflow.keras.layers import *
from tensorflow.keras.optimizers import *
import tensorflow as tf
import numpy as np
import logging
import sys


def build_model_simple(inputLength=1, outputLength=1, lr=0.0001, device="/gpu:0",
                       dropoutRate=0.4,
                       nNeuFirstDense=64,
                       numLSTM=2, nNeuLSTM=8,
                       numConv=4, nFiltConv=20, szConvKernel=4,
                       numDenseInner=1, nNeuDenseInner=128):
    tf.keras.backend.set_floatx('float32')
    with tf.device(device):
        input = Input(shape=(inputLength,), dtype=tf.float32)
        inputExp = tf.expand_dims(input, -1)
        allInner = []
        for _ in range(outputLength):
            inner = Dense(nNeuFirstDense, activation="linear")(inputExp)
            inner = GaussianDropout(rate=dropoutRate)(inner)

            if numLSTM and nNeuLSTM:
                for _ in range(numLSTM):
                    inner = (Bidirectional(LSTM(nNeuLSTM, return_sequences=True))(inner))

            if numConv:
                for _ in range(numConv):
                    inner = Conv1D(filters=nFiltConv, kernel_size=szConvKernel,
                                   strides=1, padding='valid',
                                   data_format='channels_first')(inner)
                inner = BatchNormalization()(inner)

            if numDenseInner:
                for _ in range(numDenseInner):
                    inner = Dense(nNeuDenseInner, activation="linear")(inner)
                    inner = GaussianDropout(rate=dropoutRate)(inner)
            inner = Flatten()(inner)
            inner = Dense(1, activation="linear")(inner)
            allInner.append(inner)
        out = Concatenate()(allInner)
        # out = outTmp * outTmp * outTmp
        model = Model(inputs=input, outputs=out)

        model.compile(loss="mse", optimizer=Adam(lr=lr))
        # model.compile(loss="mse", optimizer=Adadelta())
        return model, 'ForkedConvLSTM_D{}_LSTM{}x{}_Conv{}x{}x{}_D{}x{}_dr{:.2f}'.format(
            nNeuFirstDense,
            numLSTM, nNeuLSTM,
            numConv, nFiltConv, szConvKernel,
            numDenseInner, nNeuDenseInner,
            dropoutRate)


def scheduler(epoch, lrStart, lrEnd, lrDecay=0.05, lrNStable=10):
    lr = lrStart
    if epoch > lrNStable:
        fac = tf.math.exp(lrDecay * (lrNStable - epoch))
        lr = lrStart * fac + lrEnd * (1 - fac)
    return lr


if __name__ == '__main__':
    numFeatures = 22
    numOutputs = 6

    trainIn = np.random.rand(500000, numFeatures)
    trainOut = np.random.rand(500000, numOutputs)
    valiIn = np.random.rand(12000, numFeatures)
    valiOut = np.random.rand(12000, numOutputs)

    numDataElements = trainIn.shape[0] * (trainIn.shape[1] + trainOut.shape[1])
    sizeCalc = numDataElements * sys.getsizeof(trainIn[0][0])

    EPOCHS = 10000
    LEARNING_RATE_START = 0.01
    LEARNING_RATE_END = 0.00005
    LEARNING_DECAY = 0.05

    print("Starting training sweep with Epochs: {}, LRstart: {}, LRend: {}".format(
        EPOCHS, LEARNING_RATE_START, LEARNING_RATE_END))

    network, nwName = build_model_simple(inputLength=numFeatures, outputLength=numOutputs)

    netWeights = network.get_weights()
    numNetPrams = np.sum([np.prod(ele.shape) for ele in netWeights])

    # Estimate of the batch size: GPU RAM * RAM factor / number of net params = ~75k. This is divided by 35 to get a
    # good rough estimate for the batch size
    BATCH_SIZE = int(np.floor(8 * 1e9 * 0.9 / numNetPrams / 35))
    network.summary()

    print("***** Training Net {} now *****".format(nwName))
    print("BatchSize: {}, NumNetParams: {}, Feature shape: {}, Output shape: "
                 "{}, In/Out Elem.: {:.4f}M with est. size: {:.4f} MB".format(
        BATCH_SIZE, numNetPrams, trainIn.shape, trainOut.shape,
        numDataElements / 1e6, sizeCalc / 1e6))

    callback = tf.keras.callbacks.LearningRateScheduler(
        lambda x: scheduler(x, LEARNING_RATE_START, LEARNING_RATE_END, LEARNING_DECAY))
    fitRes = network.fit(trainIn, trainOut, batch_size=BATCH_SIZE, epochs=EPOCHS,
                         validation_data=(valiIn, valiOut),
                         callbacks=[callback, tf.keras.callbacks.TerminateOnNaN()],
                         verbose=1)

    logging.info("FINISHED")
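The batch size, step count and final learning rate seen in the log can be reproduced without a GPU, which helps rule out unexpected values coming from the script itself. The sketch below mirrors the formulas of the script with plain `math` instead of TensorFlow. The function names are illustrative, the parameter count 97542 is taken from the model summary, and the epoch index is assumed to be zero-based (so "Epoch 694" corresponds to index 693).

```python
import math


def estimate_batch_size(num_params, gpu_bytes=8e9, ram_factor=0.9, divisor=35):
    """Same heuristic as in the script: usable GPU memory / params / divisor."""
    return int(math.floor(gpu_bytes * ram_factor / num_params / divisor))


def schedule(epoch, lr_start, lr_end, lr_decay=0.05, lr_n_stable=10):
    """Hold lr_start for the first epochs, then decay exponentially to lr_end."""
    if epoch <= lr_n_stable:
        return lr_start
    fac = math.exp(lr_decay * (lr_n_stable - epoch))
    return lr_start * fac + lr_end * (1 - fac)


batch_size = estimate_batch_size(97542)
steps_per_epoch = math.ceil(500000 / batch_size)
print(batch_size, steps_per_epoch)   # 2108 238
print(schedule(0, 0.01, 5e-05))      # 0.01
print(schedule(693, 0.01, 5e-05))    # very close to 5e-05
```

These values match the `BatchSize: 2108`, `238/238` and `lr: 5.0000e-05` entries in the log, so the run was in a steady state with a constant learning rate when the crash happened.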

1 Answer

1 vote
/ May 27, 2020

For those who come after me:

I played around a lot with different versions. I even tried to get CUDA 10.2 working by linking the new DLLs under the old names. But even that did not fix the error.

I finally got it working by uninstalling everything from NVIDIA (including the drivers) and installing the newest CUDA 10.1 release (from late 2019) together with the studio drivers from that release. So driver version 431.86 instead of the latest studio version 441.66.

I don't think the previous installations were faulty, so my guess is that the driver version was the problem. time ...
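Since the fix apparently came down to the driver version, it can be worth recording the installed driver before and after such a reinstall. A minimal sketch, assuming `nvidia-smi` is on the `PATH` (it normally ships with the NVIDIA driver). The function name `get_driver_version` is hypothetical, and it simply returns `None` when the tool is unavailable or fails.

```python
import shutil
import subprocess


def get_driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True, timeout=30, check=True)
    except (subprocess.SubprocessError, OSError):
        return None
    # One line per GPU; the first one is enough for a single-GPU machine.
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    print("NVIDIA driver:", get_driver_version())
```

Logging this value at the start of each training run makes it easier to correlate crashes with a particular driver release.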

...