Job stealing by new nodes with Ignite Compute
28 November 2018

I am trying to run a batch of tasks on an Ignite cluster whose nodes use a job stealing policy.

Everything works fine except when a new node joins the cluster after the batch has already started: the new node seems unable to steal any tasks from the batch that is already running. I get the following message:

'SEVERE: Failed to send job stealing message to node: TcpDiscoveryNode [...]'

I think this is an existing issue, described here: https://issues.apache.org/jira/browse/IGNITE-1267

That issue appears to be fixed upstream, but the problem is still present in Ignite 2.6.0.

Here is my compute configuration:

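    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.collision.jobstealing.JobStealingCollisionSpi;
    import org.apache.ignite.spi.failover.jobstealing.JobStealingFailoverSpi;

    // cfg was not shown in the original snippet; a default configuration is assumed here.
    IgniteConfiguration cfg = new IgniteConfiguration();

    JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
    spi.setWaitJobsThreshold(1);        // start stealing when fewer jobs than this are waiting
    spi.setMessageExpireTime(1000);     // ms to wait for a reply to a stealing request
    spi.setMaximumStealingAttempts(10); // how many times a single job may be stolen
    spi.setActiveJobsThreshold(1);      // jobs allowed to run in parallel on a node
    spi.setStealingEnabled(true);

    // Stolen jobs are rerouted through failover, so the matching failover SPI is set as well.
    JobStealingFailoverSpi failoverSpi = new JobStealingFailoverSpi();
    cfg.setCollisionSpi(spi);
    cfg.setFailoverSpi(failoverSpi);

    Ignite ignite = Ignition.start(cfg);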
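
For reference, the behavior I expect with these thresholds can be sketched as the toy model below. It is plain Java with no Ignite dependency, and it is only an assumption about the intended semantics, not Ignite's actual stealing protocol: the `Node` class, the `stealOnce` method and the rule "take from the busiest node while it holds more than `waitJobsThreshold` waiting jobs" are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

/** Toy model of threshold-based job stealing. NOT Ignite's implementation. */
public class StealingSketch {
    /** Hypothetical stand-in for a cluster node with a queue of waiting jobs. */
    static final class Node {
        final String name;
        final Deque<String> waiting = new ArrayDeque<>();

        Node(String name) {
            this.name = name;
        }
    }

    /**
     * One stealing pass: while the thief has fewer than waitJobsThreshold
     * waiting jobs, move one job from the node with the longest queue,
     * provided that node has more than waitJobsThreshold jobs waiting.
     * Returns the number of jobs moved.
     */
    static int stealOnce(Node thief, List<Node> others, int waitJobsThreshold) {
        int moved = 0;
        while (thief.waiting.size() < waitJobsThreshold) {
            Node victim = others.stream()
                .max(Comparator.comparingInt((Node n) -> n.waiting.size()))
                .orElse(null);
            // Stop when nobody is overloaded enough to give a job away.
            if (victim == null || victim.waiting.size() <= waitJobsThreshold) {
                break;
            }
            thief.waiting.addLast(victim.waiting.pollLast());
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        Node busy = new Node("busy");
        for (int i = 0; i < 5; i++) {
            busy.waiting.addLast("job-" + i);
        }
        Node joined = new Node("joined"); // node that joins mid-batch

        int moved = stealOnce(joined, Arrays.asList(busy), 1);
        System.out.println("moved=" + moved
            + ", busy=" + busy.waiting.size()
            + ", joined=" + joined.waiting.size());
        // prints "moved=1, busy=4, joined=1"
    }
}
```

Under this model, a node that joins with an empty queue would pick up one waiting job per pass when the threshold is 1.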

Am I doing something wrong?

EDIT: I tried to reproduce it, but now it seems to work as intended. This is really strange behavior!

EDIT 2: I managed to reproduce the problem, though only intermittently. Here is the stack trace:

class org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote node: TcpDiscoveryNode [id=f54e6f43-620c-418d-a840-bce51ad1f5f5, addrs=[0:0:0:0:0:0:0:1%lo, 10.36.3.4, 127.0.0.1], sockAddrs=[/10.36.3.4:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1543917557221, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2718)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2651)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1643)
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendToCustomTopic(GridIoManager.java:1703)
    at org.apache.ignite.internal.managers.GridManagerAdapter$1.send(GridManagerAdapter.java:422)
    at org.apache.ignite.spi.collision.jobstealing.JobStealingCollisionSpi.checkIdle(JobStealingCollisionSpi.java:1074)
    at org.apache.ignite.spi.collision.jobstealing.JobStealingCollisionSpi.onCollision(JobStealingCollisionSpi.java:722)
    at org.apache.ignite.internal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:119)
    at org.apache.ignite.internal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:712)
    at org.apache.ignite.internal.processors.job.GridJobProcessor.access$3000(GridJobProcessor.java:111)
    at org.apache.ignite.internal.processors.job.GridJobProcessor$JobDiscoveryListener.onEvent(GridJobProcessor.java:2008)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1384)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:873)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:858)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:341)
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:307)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2703)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:2920)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2732)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=f54e6f43-620c-418d-a840-bce51ad1f5f5, addrs=[/10.36.3.4:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3422)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2958)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2841)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2692)
    ... 20 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/10.36.3.4:47100, err=Failed to read remote node recovery handshake (connection closed).]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3425)
        ... 23 more
    Caused by: class org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeTcpHandshake(TcpCommunicationSpi.java:3737)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3276)
        ... 23 more
...
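
To rule out a plain connectivity problem on the communication port (47100 in the trace), a check along these lines could be run from the node that logs the error. This is a minimal sketch using only `java.net`; the class name `PortCheck`, the method `canConnect` and the 2-second timeout are illustrative assumptions and not part of Ignite. Note that the trace reports the handshake connection being closed rather than refused, so a successful connect here would not by itself explain the failure.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether a plain TCP connection to a host and port can be opened. */
public class PortCheck {
    /** Returns true if host:port accepts a TCP connection within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unreachable: all count as "cannot connect".
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are the address and port taken from the stack trace above.
        String host = args.length > 0 ? args[0] : "10.36.3.4";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 47100;
        System.out.println(host + ":" + port + " reachable: "
            + canConnect(host, port, 2000));
    }
}
```

If this reports the port as reachable while the stealing message still fails, the problem is more likely in the handshake between the nodes than in the network path.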