How to fix "Cannot rerun task because there are no rerun attempts left" in a MATLAB parfor job on an AWS cluster
1 vote
/ April 12, 2019

I submitted a batch job to an AWS cluster, and unfortunately it failed after about 2 hours. Before submitting it, I had run the job on a local cluster with a reduced number of loop iterations, where it worked fine. The error message was:

Task with properties:

ID: 1
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:14
Running Duration: 0 days 1h 42m 50s

Error: All workers aborted during execution of the parfor loop.
Error Stack: parallel_function (line 607)
generic_adaptation (line 75)
Warnings: List warnings
Task with properties:

ID: 2
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 3
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 4
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 5
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 6
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 40s

Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
The worker MATLAB exited or was stopped during task evaluation. MATLAB ended with exit status 9.
Warnings:
Task with properties:

ID: 7
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 8
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 9
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 10
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 11
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 12
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 13
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 14
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 15
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 16
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:

Since everything worked on the local cluster on my PC, I suspect that the code itself is fine and that the cause of the error lies elsewhere (perhaps the connection to the AWS EC2 cluster, or an internal error in the cluster?).
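The task report above is long, but most entries only say that the job was cancelled because task 2 failed. A minimal sketch of how the report could be condensed, so that the tasks carrying their own cancel message stand out, is given below. It is not part of the original question: the names `summarize_tasks` and `SAMPLE` are hypothetical, the sample is an abridged copy of four tasks from the report, and the parsing assumes the plain-text layout shown above.

```python
import re
from collections import defaultdict

# Abridged copy of four tasks from the report (hypothetical variable name).
SAMPLE = """\
Task with properties:

ID: 1
State: finished
Error: All workers aborted during execution of the parfor loop.
Error Stack: parallel_function (line 607)
Warnings: List warnings
Task with properties:

ID: 2
State: finished
Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 3
State: finished
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 6
State: finished
Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
The worker MATLAB exited or was stopped during task evaluation. MATLAB ended with exit status 9.
Warnings:
"""


def summarize_tasks(report: str) -> dict:
    """Group task IDs by the message that best explains why each task ended."""
    groups = defaultdict(list)
    # Every task entry begins with the header line "Task with properties:".
    for block in report.split("Task with properties:")[1:]:
        id_match = re.search(r"^\s*ID:\s*(\d+)", block, re.MULTILINE)
        if id_match is None:
            continue  # not a task entry we can identify
        task_id = int(id_match.group(1))

        cancelled_by = re.search(r"task with ID (\d+) terminated abnormally", block)
        cancel = re.search(r"Original cancel message:\s*\n\s*(.+)", block)
        error = re.search(r"^\s*Error:\s*(.*)$", block, re.MULTILINE)

        if cancelled_by:
            # This task was only a victim of another task's failure.
            key = f"cancelled because task {cancelled_by.group(1)} failed"
        elif cancel:
            # This task carries its own cancel message.
            key = cancel.group(1).strip()
        elif error:
            key = error.group(1).strip()
        else:
            key = "no error recorded"
        groups[key].append(task_id)
    return dict(groups)


for message, task_ids in summarize_tasks(SAMPLE).items():
    print(task_ids, message)
```

Applied to the full report, this kind of grouping would separate tasks 2 and 6, which each carry their own cancel message (task 6 reports that the worker MATLAB ended with exit status 9), from the tasks that were merely cancelled as a consequence of task 2.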
