Python script killed by the OOM killer on EC2 Amazon Linux
0 votes
/ February 12, 2020

I'm trying to run a training session with Keras. The whole dataset (1,000 images of 80 x 80) is very small (only about 20 MB), and I'm running on a free-tier Amazon EC2 instance (1 GB of memory). However, the process gets killed after model.fit() has run for 2 epochs (the number varies, sometimes it gets as far as 15). I'm trying to disable the OOM killer or find a workaround. Any suggestions? Below you'll find a memory trace (which doesn't show any alarming numbers, so I'm wondering why the script was killed in the first place).

Error (reproduced on an instance with 1 GB of memory):

 64/870 [=>............................] - ETA: 12s - loss: 0.4477 - accuracy: 0.8750Traceback (most recent call last):
  File "image_classifier.py", line 990, in <module>
    clf.predict_folder_k_cnn(folder_path='test_photos_2/', label='One', epochs=50)
  File "image_classifier.py", line 951, in predict_folder_k_cnn
    model.fit(self.x_train, self.y_train, epochs=epochs, batch_size=batch_size, **(model_fit_args or {}))
  File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3510, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 572, in __call__
    return self._call_flat(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 671, in _call_flat
    outputs = self._inference_function.call(ctx, args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 445, in call
    ctx=ctx)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[64,80,80,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node gradients/max_pool/MaxPool_grad/MaxPoolGrad (defined at /home/ec2-user/.local/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_keras_scratch_graph_1638]

Function call stack:
keras_scratch_graph
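
For scale, the size of the tensor named in the error can be worked out from its shape. This is only a rough sketch: it assumes "type float" means 4-byte float32 elements, and `tensor_bytes` is a hypothetical helper, not part of Keras or TensorFlow.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor (4 bytes per element assumes float32)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

# Shape taken from the ResourceExhaustedError message: [batch, height, width, channels]
failed = tensor_bytes([64, 80, 80, 32])
print(failed, "bytes =", failed / 2**20, "MiB")  # 52428800 bytes = 50.0 MiB

# The same activation with a smaller (hypothetical) batch size of 16
smaller = tensor_bytes([16, 80, 80, 32])
print(smaller / 2**20, "MiB")  # 12.5 MiB
```

A single activation of this shape is about 50 MiB, and training normally keeps several such buffers alive at once (activations plus their gradients), so the footprint grows with the batch size rather than with the 20 MB size of the dataset on disk.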

dmesg output:

t:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:16kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  504.825883] lowmem_reserve[]: 0 932 932 932
[  504.829525] Node 0 DMA32 free:44316kB min:44316kB low:55392kB high:66468kB active_anon:892184kB inactive_anon:256kB active_file:24kB inactive_file:0kB unevictable:0kB writepending:0kB present:1032192kB managed:991368kB mlocked:0kB kernel_stack:1952kB pagetables:7124kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
[  504.851094] lowmem_reserve[]: 0 0 0 0
[  504.854427] Node 0 DMA: 10*4kB (UME) 11*8kB (UME) 13*16kB (UME) 15*32kB (UE) 9*64kB (UE) 8*128kB (UME) 6*256kB (UME) 1*512kB (E) 0*1024kB 0*2048kB 0*4096kB = 4464kB
[  504.865932] Node 0 DMA32: 1101*4kB (UE) 781*8kB (UE) 458*16kB (UE) 317*32kB (UE) 121*64kB (UME) 46*128kB (UME) 6*256kB (U) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 44316kB
[  504.877626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  504.884964] 103 total pagecache pages
[  504.888296] 0 pages in swap cache
[  504.891399] Swap cache stats: add 0, delete 0, find 0/0
[  504.895970] Free swap  = 0kB
[  504.898881] Total swap = 0kB
[  504.901907] 262045 pages RAM
[  504.904737] 0 pages HighMem/MovableOnly
[  504.908299] 10227 pages reserved
[  504.911383] 0 pages hwpoisoned
[  504.914445] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  504.921393] [ 1931]     0  1931    10278       97      28       3        0             0 systemd-journal
[  504.928934] [ 1961]     0  1961    29191       67      28       4        0             0 lvmetad
[  504.936328] [ 2655]     0  2655    16041      149      30       3        0         -1000 auditd
[  504.943150] [ 2683]    81  2683    15123      118      35       3        0          -900 dbus-daemon
[  504.950385] [ 2686]    32  2686    18423      178      38       3        0             0 rpcbind
[  504.957604] [ 2690]   999  2690     3152       41      12       3        0             0 lsmd
[  504.964760] [ 2691]     0  2691     3274       28      12       3        0             0 rngd
[  504.972138] [ 2693]     0  2693     7117       89      19       3        0             0 systemd-logind
[  504.979632] [ 2700]   997  2700    30649      135      33       3        0             0 chronyd
[  504.987111] [ 2716]     0  2716    24457      163      35       3        0             0 gssproxy
[  504.994331] [ 2920]     0  2920    25156      514      48       3        0             0 dhclient
[  505.001383] [ 2961]     0  2961    25156      510      48       3        0             0 dhclient
[  505.008709] [ 3105]     0  3105    22545      262      44       3        0             0 master
[  505.015992] [ 3109]    89  3109    22567      253      44       3        0             0 pickup
[  505.022854] [ 3110]    89  3110    22586      256      46       3        0             0 qmgr
[  505.029730] [ 3157]     0  3157   117174      442      30       6        0             0 amazon-ssm-agen
[  505.037492] [ 3159]     0  3159    54140      270      41       3        0             0 rsyslogd
[  505.044641] [ 3199]     0  3199    30322       32      12       3        0             0 agetty
[  505.051767] [ 3200]     0  3200     2634       33      11       3        0             0 agetty
[  505.059124] [ 3333]     0  3333    38138      334      76       3        0             0 sshd
[  505.066299] [ 3371]     0  3371     1065       26       8       3        0             0 acpid
[  505.073401] [ 3414]  1000  3414    38175      390      73       3        0             0 sshd
[  505.082220] [ 3415]  1000  3415    31219      269      16       3        0             0 bash
[  505.089459] [ 3564]     0  3564    11355      132      24       3        0         -1000 systemd-udevd
[  505.097212] [ 4261]     0  4261    28182      254      59       4        0         -1000 sshd
[  505.103965] [ 4396]     0  4396    33767      158      21       4        0             0 crond
[  505.110852] [ 4421]     0  4421     6968       50      19       3        0             0 atd
[  505.118310] [22988]  1000 22988    33586       64      21       3        0             0 screen
[  505.125710] [22989]  1000 22989    33621      128      19       3        0             0 screen
[  505.132826] [22990]  1000 22990    31215      270      16       3        0             0 bash
[  505.140153] [23011]  1000 23011   568549   219738     812       5        0             0 python3
[  505.147922] Out of memory: Kill process 23011 (python3) score 875 or sacrifice child
[  505.154309] Killed process 23011 (python3) total-vm:2274196kB, anon-rss:878952kB, file-rss:0kB, shmem-rss:0kB
[  505.195909] oom_reaper: reaped process 23011 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ec2-user@ip-172-31-95-14 ~]$ python3 image_classifier.py 
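
The rss column in the process table above is counted in pages, not kilobytes. A quick conversion, assuming the usual 4 kB page size, shows how much of the instance the python3 process held when it was killed. The variable names are illustrative; the numbers are copied from the log.

```python
PAGE_KB = 4  # assumption: 4 kB pages, the common x86-64 default

python_rss_pages = 219738  # rss column for pid 23011 (python3)
total_ram_pages = 262045   # "262045 pages RAM"
reserved_pages = 10227     # "10227 pages reserved"

rss_kb = python_rss_pages * PAGE_KB
print(rss_kb)  # 878952, the same value as anon-rss in the "Killed process" line

usable_pages = total_ram_pages - reserved_pages
share = python_rss_pages / usable_pages
print(f"python3 held about {share:.0%} of usable RAM")
```

With no swap configured ("Total swap = 0kB"), the kernel has nowhere to move those pages, which is consistent with it choosing python3 as the process to kill.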

Memory trace (1 epoch):

image_classifier.py:263: size=159 MiB, count=3, average=53.1 MiB
/home/ec2-user/.local/lib/python3.7/site-packages/tables/atom.py:1224: size=20.1 MiB, count=3715, average=5675 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/lines.py:380: size=2597 KiB, count=1205, average=2207 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:147: size=2546 KiB, count=26034, average=100 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:179: size=1783 KiB, count=18009, average=101 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:93: size=1487 KiB, count=16326, average=93 B
<frozen importlib._bootstrap_external>:525: size=1171 KiB, count=10792, average=111 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/artist.py:75: size=1170 KiB, count=2859, average=419 B
/usr/lib64/python3.7/contextlib.py:82: size=791 KiB, count=5773, average=140 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:131: size=608 KiB, count=22225, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/cbook/__init__.py:795: size=565 KiB, count=61, average=9483 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:136: size=552 KiB, count=20184, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:365: size=498 KiB, count=2520, average=202 B
/usr/lib64/python3.7/abc.py:143: size=462 KiB, count=3773, average=125 B
<__array_function__ internals>:6: size=342 KiB, count=6058, average=58 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:180: size=317 KiB, count=6156, average=53 B
/home/ec2-user/.local/lib/python3.7/site-packages/numpy/core/_asarray.py:85: size=294 KiB, count=3926, average=77 B
/home/ec2-user/.local/lib/python3.7/site-packages/cycler.py:227: size=278 KiB, count=3253, average=87 B
...
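
The format of the trace above looks like output from the standard-library `tracemalloc` module, although the question does not name the tool, so treat that as an assumption. If so, a minimal sketch of how such a listing is produced is below. `tracemalloc` only records allocations made through Python's own allocators, so buffers allocated inside TensorFlow's native runtime would not show up, which may be why these totals are far below the roughly 880 MB that dmesg reports. The list of byte strings is a stand-in for the real training call.

```python
import resource
import tracemalloc

tracemalloc.start()

# Stand-in workload; in the real script this would be the model.fit(...) call.
data = [bytes(1024) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)  # e.g. "file.py:LINE: size=..., count=..., average=..."

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Peak resident set size of the whole process (kilobytes on Linux),
# which also includes memory allocated by native extensions.
max_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"traced by tracemalloc: {peak / 1024:.0f} kB, process peak RSS: {max_rss_kb} kB")
```

Comparing the two printed numbers after a training epoch gives a rough idea of how much memory sits outside what `tracemalloc` can see.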