Cassandra nodetool repair sometimes gets stuck
July 14, 2020

I am running nodetool repair -pr -full my_ks my_tbl on our Cassandra cluster (it has two datacenters). Sometimes the repair hangs, and the debug log shows the lines below. It works again after I restart the Cassandra process. Any hints about the root cause of this issue?

DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - Ignoring interval time of 2571566434 for /10.22.38.223
DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - Ignoring interval time of 2495429260 for /10.22.38.26
DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - Ignoring interval time of 2571592685 for /10.32.146.85
INFO  [Thread-181] 2020-07-13 14:04:22,900 RepairRunnable.java:125 - Starting repair command #2, repairing keyspace my_ks with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 1, ColumnFamilies: [my_tbl], dataCenters: [], hosts: [], # of ranges: 256)
INFO  [HANDSHAKE-/10.32.146.85] 2020-07-13 14:04:23,460 OutboundTcpConnection.java:515 - Handshaking version with /10.32.146.85
DEBUG [GossipStage:1] 2020-07-13 14:04:23,716 FailureDetector.java:456 - Ignoring interval time of 2000838464 for /10.22.38.27
DEBUG [GossipStage:1] 2020-07-13 14:04:23,716 FailureDetector.java:456 - Ignoring interval time of 2000923736 for /10.22.38.68
DEBUG [GossipStage:1] 2020-07-13 14:04:23,815 FailureDetector.java:456 - Ignoring interval time of 2100571952 for /10.32.253.232
DEBUG [GossipStage:1] 2020-07-13 14:04:25,247 FailureDetector.java:456 - Ignoring interval time of 2429005356 for /10.32.144.198
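
To make the gossip lines above easier to read, here is a minimal sketch that pulls the host and interval out of each "Ignoring interval time" message. It assumes the logged value is in nanoseconds, which is my reading of the numbers and not something the log states. The regex and the helper name ignored_intervals are only illustrative and are based on the sample lines, not on any Cassandra API.

```python
import re

# Pattern derived from the sample FailureDetector lines above (an assumption,
# not an official log format specification).
PATTERN = re.compile(r"Ignoring interval time of (\d+) for /(\S+)")


def ignored_intervals(lines):
    """Yield (host, seconds) for each 'Ignoring interval time' log line."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # Assumption: the logged interval is expressed in nanoseconds.
            yield match.group(2), int(match.group(1)) / 1e9


sample = [
    "DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - "
    "Ignoring interval time of 2571566434 for /10.22.38.223",
    "DEBUG [GossipStage:1] 2020-07-13 14:04:23,716 FailureDetector.java:456 - "
    "Ignoring interval time of 2000838464 for /10.22.38.27",
]

for host, seconds in ignored_intervals(sample):
    print(f"{host}: {seconds:.3f} s")
```

Under that assumption, every interval in the excerpt is a little over two seconds.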

I am using Cassandra 3.9.

Edit: with tracing enabled, I see the following logs:

INFO  [HANDSHAKE-/10.32.168.76] 2020-07-18 02:12:41,253 OutboundTcpConnection.java:515 - Handshaking version with /10.32.168.76
INFO  [HANDSHAKE-/10.32.142.195] 2020-07-18 02:12:42,260 OutboundTcpConnection.java:515 - Handshaking version with /10.32.142.195
INFO  [HANDSHAKE-/10.32.144.198] 2020-07-18 02:12:42,260 OutboundTcpConnection.java:515 - Handshaking version with /10.32.144.198
ERROR [RepairTracePolling] 2020-07-18 02:12:45,836 CassandraDaemon.java:226 - Exception in thread Thread[RepairTracePolling,5,RMI Runtime]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.repair.RepairRunnable$4.runMayThrow(RepairRunnable.java:412) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.9.jar:3.9]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_72]
...