I ran a batch job on an AWS cluster, and it unfortunately failed after about 2 hours. Before submitting the job, I ran it on a local cluster with a reduced number of loop iterations, where it works fine. The error message was:
Task with properties:
ID: 1
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:14
Running Duration: 0 days 1h 42m 50s
Error: All workers aborted during execution of the parfor loop.
Error Stack: parallel_function (line 607)
generic_adaptation (line 75)
Warnings: List warnings
Task with properties:
ID: 2
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 3
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 4
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 5
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 6
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 40s
Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
The worker MATLAB exited or was stopped during task evaluation. MATLAB ended with exit status 9.
Warnings:
Task with properties:
ID: 7
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 8
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 9
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 10
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 11
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 12
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 13
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 14
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 15
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 16
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Since everything worked on the local cluster on my PC, I suspect that the code itself is fine and that the cause of the error lies elsewhere (perhaps the connection to the AWS EC2 cluster, or an internal error in the cluster?).
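
Most entries in the report repeat the same text, so it helps to separate the tasks that failed on their own from those that were only cancelled because another task died. Below is a rough Python sketch for doing that on the report pasted as plain text. The function name `summarize_task_errors` is made up for illustration, and the parsing assumes the exact layout shown above (a `Task with properties:` header, an `ID:` line, and an `Error:` field that runs until `Error Stack:` or `Warnings:`). It is not part of MATLAB or of any AWS tooling.

```python
import re

# Matches the wording MATLAB uses when a task was cancelled because of another task.
CASCADE = re.compile(r"cancelled because the task with ID (\d+)")


def summarize_task_errors(report):
    """Split a pasted task report into independent failures and cascaded cancellations.

    Returns (primary, cascaded):
      primary  maps task ID -> error text for tasks that failed on their own
      cascaded maps culprit task ID -> list of task IDs cancelled because of it
    """
    primary = {}
    cascaded = {}
    # Every task block begins with the same header line; drop whatever precedes the first one.
    for block in report.split("Task with properties:")[1:]:
        id_match = re.search(r"^\s*ID:\s*(\d+)", block, re.MULTILINE)
        # The error text may span several lines; it ends at "Error Stack:" or "Warnings:".
        err_match = re.search(
            r"^\s*Error:\s*(.*?)^\s*(?:Error Stack|Warnings):",
            block,
            re.MULTILINE | re.DOTALL,
        )
        if not id_match or not err_match:
            continue
        task_id = int(id_match.group(1))
        error = " ".join(err_match.group(1).split())  # collapse line breaks
        cause = CASCADE.search(error)
        if cause:
            cascaded.setdefault(int(cause.group(1)), []).append(task_id)
        else:
            primary[task_id] = error
    return primary, cascaded
```

Applied to the report above, this should leave tasks 1, 2 and 6 as independent failures (the `parfor` abort, the `PostJobEvaluate` error, and the worker that ended with exit status 9) and attribute the other thirteen tasks to the cancellation triggered by task 2.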