=== environment ===
docker: centos7
GPU: GeForce GTX 780 (3 GB)
python: 3.7
CUDA Version: 10.0
tensorflow-gpu: 1.14 keras: 2.3.1
I downloaded the keras 2.3.1 source code from GitHub, and running the example file example/keras.py directly works fine. I want to use nvprof to analyze it, so I run the following command:
nvprof --metrics flop_count_sp python3 keras_cnn.py
The screen prints "==6114== Error: Internal profiling error 4055:34." nvprof does not give much more information, so I have no way to track down the problem. Here is the full log printed during the run:
(base) [root@e2777d6bdd1f ~]# nvprof --metrics flop_count_sp python3 keras_cnn.py
Using TensorFlow backend.
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
WARNING:tensorflow:From /opt/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:4070: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
Using real-time data augmentation.
2020-04-08 09:16:35.817449: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-08 09:16:35.821501: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
2020-04-08 09:16:35.822202: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5576890fcd50 executing computations on platform Host. Devices:
2020-04-08 09:16:35.822214: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2020-04-08 09:16:35.823010: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
==6114== NVPROF is profiling process 6114, command: python3 keras_cnn.py
2020-04-08 09:16:35.979708: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-08 09:16:35.980316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 780 major: 3 minor: 5 memoryClockRate(GHz): 1.0325
2020-04-08 09:16:36.066954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2405 MB memory) -> physical GPU (device: 0, name: GeForce GTX 780, pci bus id: 0000:01:00.0, compute capability: 3.5)
2020-04-08 09:16:36.068262: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55768e474c40 executing computations on platform CUDA. Devices:
2020-04-08 09:16:36.068274: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 780, Compute Capability 3.5
Saved trained model at /root/saved_models/keras_cifar10_trained_model.h5
WARNING:tensorflow:From /opt/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
==6114== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
32/32 [==============================] - 7s 230ms/step
Test loss: 2.3130927085876465
Test accuracy: 0.03125
==6114== Error: Internal profiling error 4055:34.
======== Profiling result:
======== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GeForce GTX 780 (0)"
Kernel: void cudnn::winograd_nonfused::winogradForwardOutput4x4<float, float>(cudnn::winograd_nonfused::WinogradOutputParams<float, float>)
4 flop_count_sp Floating Point Operations(Single Precision) 8200192 17825792 13000704
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
7 flop_count_sp Floating Point Operations(Single Precision) 1 1179648 178592
Kernel: cudnn_convolve_sgemm_sm35_ldg_nn_64x16x64x16x16
4 flop_count_sp Floating Point Operations(Single Precision) 136314880 1063526400 467283968
Kernel: void flip_filter<float, float>(float*, float const *, int, int, int, int)
4 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorReshapingOp<Eigen::IndexList<int> const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer>>, Eigen::TensorReductionOp<Eigen::internal::SumReducer<float>, Eigen::IndexList<Eigen::type2index<long=1>> const , Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_exp_op<float>, Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const , Eigen::MakePointer> const > const , Eigen::GpuDevice>, long>(int, )
1 flop_count_sp Floating Point Operations(Single Precision) 3008 3008 3008
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
1 flop_count_sp Floating Point Operations(Single Precision) 32 32 32
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorEvalToOp<Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_log_op<float>, Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const , Eigen::MakePointer> const , Eigen::GpuDevice>, long>(float, Eigen::internal::scalar_log_op<float>)
1 flop_count_sp Floating Point Operations(Single Precision) 792 792 792
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<float, Eigen::TensorMap<Eigen::Tensor<int const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
2 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorEvalToOp<Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float const , float const >, Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, long>, int=16, Eigen::MakePointer> const > const , Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float, float>, Eigen::TensorBroadcastingOp<Eigen::IndexList<Eigen::type2index<long=1>> const , Eigen::TensorForcedEvalOp<Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_log_op<float>, Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::MakePointer> const , Eigen::GpuDevice>, long>(float const , float const )
1 flop_count_sp Floating Point Operations(Single Precision) 640 640 640
Kernel: void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
3 flop_count_sp Floating Point Operations(Single Precision) 147062784 294125568 198180864
Kernel: void pointwise_mult_and_sum_complex<float2, int=8, int=4>(float2*, float2*, float2*, int, int, int, int, int, float2)
1 flop_count_sp Floating Point Operations(Single Precision) 56229888 56229888 56229888
Kernel: void tensorflow::BiasNHWCKernel<float>(int, float const *, float const , tensorflow::BiasNHWCKernel<float>*, int)
2 flop_count_sp Floating Point Operations(Single Precision) 320 16384 8352
Kernel: void cudnn::winograd::winograd3x3Kernel<float, float, int=1, int=4, int=8, bool=1>(cudnn::maxwell::winograd::KernelParams)
1 flop_count_sp Floating Point Operations(Single Precision) 77070336 77070336 77070336
Kernel: void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=0, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
6 flop_count_sp Floating Point Operations(Single Precision) 22599680 23545856 23096661
Kernel: void sgemm_largek_lds64<bool=0, bool=0, int=5, int=5, int=4, int=4, int=4, int=32>(float*, float const *, float const *, int, int, int, int, int, int, float const *, float const *, float, float, int, int, int*, int*)
1 flop_count_sp Floating Point Operations(Single Precision) 75644928 75644928 75644928
Kernel: void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>*)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorReductionOp<Eigen::internal::SumReducer<float>, Eigen::IndexList<Eigen::type2index<long=1>> const , Eigen::TensorForcedEvalOp<Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float const , float const >, Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, long>, int=16, Eigen::MakePointer> const > const , Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float, float>, Eigen::TensorBroadcastingOp<Eigen::IndexList<Eigen::type2index<long=1>> const , Eigen::TensorForcedEvalOp<Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_log_op<float>, Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::MakePointer> const > const , Eigen::GpuDevice>, long>(float, int=1)
1 flop_count_sp Floating Point Operations(Single Precision) 448 448 448
Kernel: sgemm_sm35_ldg_nn_64x16x64x16x16
145 flop_count_sp Floating Point Operations(Single Precision) 2195456 8781824 4959126
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_max_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, long>(float, int=1)
5 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
2 flop_count_sp Floating Point Operations(Single Precision) 1 1 1
Kernel: void fft2d_r2c_64x64<float>(float2*, float const *, int, int, int, int, int, int, int, int)
2 flop_count_sp Floating Point Operations(Single Precision) 12404736 12404736 12404736
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorReshapingOp<Eigen::IndexList<int> const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer>>, Eigen::TensorReductionOp<Eigen::internal::MaxReducer<float>, Eigen::IndexList<Eigen::type2index<long=1>> const , Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, long>, int=16, Eigen::MakePointer> const > const , Eigen::MakePointer> const > const , Eigen::GpuDevice>, long>(int, )
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: cgemm_strided_batched_sm35_ldg_nt_64x8x64x16x16
7 flop_count_sp Floating Point Operations(Single Precision) 173801472 1172045824 438641810
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float, float>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_quotient_op<float, float>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_exp_op<float>, Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const , Eigen::TensorBroadcastingOp<Eigen::IndexList<Eigen::type2index<long=1>> const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, long>(float, int=2)
1 flop_count_sp Floating Point Operations(Single Precision) 7680 7680 7680
Kernel: void tensorflow::_GLOBAL__N__64_tmpxft_000046f0_00000000_11_softmax_op_gpu_cu_compute_70_cpp1_ii_0bfd2d72::GenerateNormalizedProb<float, float>(float const *, float const *, float const , tensorflow::_GLOBAL__N__64_tmpxft_000046f0_00000000_11_softmax_op_gpu_cu_compute_70_cpp1_ii_0bfd2d72::GenerateNormalizedProb<float, float>*, int, int, bool)
1 flop_count_sp Floating Point Operations(Single Precision) 7680 7680 7680
Kernel: void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
12 flop_count_sp Floating Point Operations(Single Precision) 68157440 531763200 316217344
Kernel: void fft2d_r2c_32x32<float, bool=0, unsigned int=0, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
6 flop_count_sp Floating Point Operations(Single Precision) 2167296 23117824 9150805
Kernel: void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const *, float*, int)
4 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void cudnn::winograd_nonfused::winogradForwardData4x4<float, float>(cudnn::winograd_nonfused::WinogradDataParams<float, float>)
4 flop_count_sp Floating Point Operations(Single Precision) 1474560 15728640 7249920
Kernel: void cudnn::winograd::winograd3x3Kernel<float, float, int=4, int=1, int=8, bool=0>(cudnn::maxwell::winograd::KernelParams)
2 flop_count_sp Floating Point Operations(Single Precision) 143130624 268369920 205750272
Kernel: void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=1, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
3 flop_count_sp Floating Point Operations(Single Precision) 45447168 45649920 45582336
Kernel: compute_gemm_pointers(float2**, float2 const *, int, float2 const *, int, float2 const *, int, int)
3 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void fft2d_c2r_64x64<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
1 flop_count_sp Floating Point Operations(Single Precision) 129171456 129171456 129171456
Kernel: void tensorflow::functor::ShuffleInTensor3Simple<float, int=2, int=1, int=0, bool=0>(int, float const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::ShuffleInTensor3Simple<float, int=2, int=1, int=0, bool=0>*)
4 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<float, Eigen::TensorMap<Eigen::Tensor<bool const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void fft2d_r2c_16x16<float>(float2*, float const *, int, int, int, int, int, int, int, int)
2 flop_count_sp Floating Point Operations(Single Precision) 9027584 18055168 13541376
Kernel: void scal_kernel<float, float, int=1, bool=1, int=6, int=5, int=5, int=3>(cublasTransposeParams<float>, float const *, float*, float const *)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_quotient_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
2 flop_count_sp Floating Point Operations(Single Precision) 15 15 15
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float const , float const >, Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, long>, int=16, Eigen::MakePointer> const > const , Eigen::TensorBroadcastingOp<Eigen::IndexList<Eigen::type2index<long=1>> const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, long>(float, int=2)
1 flop_count_sp Floating Point Operations(Single Precision) 320 320 320
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_left<float, float, Eigen::internal::scalar_sum_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
6 flop_count_sp Floating Point Operations(Single Precision) 864 1179648 208357
Kernel: void cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>(cudnnTensorStruct, float const *, cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>, cudnnTensorStruct*, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
2 flop_count_sp Floating Point Operations(Single Precision) 73728 230400 152064
Kernel: void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
5 flop_count_sp Floating Point Operations(Single Precision) 23117824 46235648 32364953
Kernel: void cudnn::winograd_nonfused::winogradForwardFilter4x4<float, float>(cudnn::winograd_nonfused::WinogradFilterParams<float, float>)
4 flop_count_sp Floating Point Operations(Single Precision) 7776 331776 147096
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<bool, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::equal_to<__int64>, Eigen::TensorMap<Eigen::Tensor<__int64 const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<__int64 const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(bool, int=1)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void tensorflow::BiasNCHWKernel<float>(int, float const *, float const , tensorflow::BiasNCHWKernel<float>*, int, int)
4 flop_count_sp Floating Point Operations(Single Precision) 346112 1048576 694272
Kernel: void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
4 flop_count_sp Floating Point Operations(Single Precision) 14848 237568 107648
Kernel: void tensorflow::functor::RowReduceKernel<cub::TransformInputIterator<float, tensorflow::_GLOBAL__N__64_tmpxft_000046f0_00000000_11_softmax_op_gpu_cu_compute_70_cpp1_ii_0bfd2d72::SubtractAndExpFunctor<float, float>, cub::CountingInputIterator<int, long>, long>, float*, cub::Sum>(float, float, int, int, float, std::iterator_traits<tensorflow::functor::RowReduceKernel<cub::TransformInputIterator<float, tensorflow::_GLOBAL__N__64_tmpxft_000046f0_00000000_11_softmax_op_gpu_cu_compute_70_cpp1_ii_0bfd2d72::SubtractAndExpFunctor<float, float>, cub::CountingInputIterator<int, long>, long>, float*, cub::Sum>>::value_type)
1 flop_count_sp Floating Point Operations(Single Precision) 3680 3680 3680
Kernel: void tensorflow::functor::RowReduceKernel<float const *, float*, cub::Max>(float const *, float*, int, int, cub::Max, std::iterator_traits<tensorflow::functor::RowReduceKernel<float const *, float*, cub::Max>>::value_type)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void cudnn::winograd::winograd3x3Kernel<float, float, int=2, int=2, int=8, bool=0>(cudnn::maxwell::winograd::KernelParams)
1 flop_count_sp Floating Point Operations(Single Precision) 282591232 282591232 282591232
Kernel: void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void tensorflow::functor::FillPhiloxRandomKernelLaunch<tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>>(tensorflow::random::PhiloxRandom, tensorflow::random::PhiloxRandomResultElementType*, __int64, tensorflow::functor::FillPhiloxRandomKernelLaunch<tensorflow::random::UniformDistribution<tensorflow::random::PhiloxRandom, float>>)
6 flop_count_sp Floating Point Operations(Single Precision) 99168 1277952 306661
Kernel: void fft2d_c2r_16x16<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
1 flop_count_sp Floating Point Operations(Single Precision) 9406464 9406464 9406464
Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<__int64, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<__int64, Eigen::TensorTupleReducerOp<Eigen::internal::ArgMaxTupleReducer<Eigen::Tuple<long, float>>, Eigen::array<long, unsigned long=1> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, long>(__int64, int=1)
2 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
Kernel: void tensorflow::functor::BlockReduceKernel<float*, float*, int=256, tensorflow::functor::Sum<float>>(float*, float*, int, float, std::iterator_traits<tensorflow::functor::BlockReduceKernel<float*, float*, int=256, tensorflow::functor::Sum<float>>>::value_type)
2 flop_count_sp Floating Point Operations(Single Precision) 1250 1250 1250
Kernel: void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=1>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
4 flop_count_sp Floating Point Operations(Single Precision) 2167296 92471296 40998016
Kernel: void tensorflow::functor::BlockReduceKernel<int*, int*, int=256, tensorflow::functor::Prod<int>>(int*, int*, int, int, std::iterator_traits<tensorflow::functor::BlockReduceKernel<int*, int*, int=256, tensorflow::functor::Prod<int>>>::value_type)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
======== Error: CUDA profiling error.
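
While the cause of the error is being investigated, the metric rows that nvprof did manage to print can still be totaled. Below is a minimal sketch, assuming the output was saved as text in the same layout as the log above: a "Kernel:" line followed by a row of the form "invocations flop_count_sp description min max avg". The names `ROW`, `total_flops` and `sample` are hypothetical helpers, not part of nvprof or Keras. Because the Avg column is printed as a rounded integer and the run ended with a profiling error, the result should be read as an approximate and possibly incomplete count.

```python
import re

# Matches one metric row in the layout seen in the log above:
# "<invocations> flop_count_sp <description> <min> <max> <avg>"
ROW = re.compile(r"^\s*(\d+)\s+flop_count_sp\s+.*?\s(\d+)\s+(\d+)\s+(\d+)\s*$")


def total_flops(log_text):
    """Return (approximate total FLOPs, [(kernel name, FLOPs), ...])."""
    per_kernel = []
    kernel = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Kernel:"):
            # Remember the kernel name; its metric row follows on the next line.
            kernel = stripped[len("Kernel:"):].strip()
            continue
        match = ROW.match(stripped)
        if match and kernel is not None:
            invocations = int(match.group(1))
            avg = int(match.group(4))
            # invocations * avg approximates the sum over all launches.
            per_kernel.append((kernel, invocations * avg))
            kernel = None
    return sum(flops for _, flops in per_kernel), per_kernel


# Two rows copied from the log above, used as a small self-check.
sample = """\
Kernel: sgemm_sm35_ldg_nn_64x16x64x16x16
145 flop_count_sp Floating Point Operations(Single Precision) 2195456 8781824 4959126
Kernel: void flip_filter<float, float>(float*, float const *, int, int, int, int)
4 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
"""

total, rows = total_flops(sample)
print("approximate total FLOPs:", total)
for name, flops in sorted(rows, key=lambda r: r[1], reverse=True):
    print(flops, name[:60])
```

To run this on the whole log, read the saved text from a file and pass it to `total_flops` in place of `sample`. Sorting by FLOPs puts the convolution and GEMM kernels at the top of the list.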