I have a four-node ES cluster (64 vCores, 60 GB RAM) with 28 GB allocated to the ES heap. I have 21 million documents that I need to index. The documents are fairly complex and contain many nested documents.
I bulk-index these documents with elasticsearch-hadoop from a Spark application, using 140 tasks, each sending 2 MB of data per bulk request.
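For reference, the write side of the Spark job is wired up roughly like this (a simplified sketch, not the actual application code; the RDD contents and the document type are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs

// Sketch of the bulk-indexing setup (placeholder data and names).
val conf = new SparkConf()
  .setAppName("bulk-index-customers")
  .set("es.nodes", "10.132.15.199,10.132.15.200,10.132.15.201,10.132.15.202")
  .set("es.port", "9200")
  .set("es.batch.size.bytes", "2mb")   // ~2 MB of data per bulk request
  .set("es.batch.size.entries", "0")   // flush by request size only, not by document count
val sc = new SparkContext(conf)

val docs = sc.parallelize(Seq(Map("id" -> 1, "name" -> "example")))  // placeholder documents
docs.repartition(140).saveToEs("customers_201850/_doc")              // 140 concurrent writer tasks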
From time to time I get the following exception:
Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.132.15.200:9200, 10.132.15.201:9200, 10.132.15.202:9200, 10.132.15.199:9200]]
My assumption is that during these periods all the nodes are busy with stop-the-world garbage collection and therefore cannot respond to indexing requests.
The exception does not fail the application; indexing resumes after a few seconds.
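Presumably the connector (or Spark's task retry) recovers on its own. elasticsearch-hadoop also exposes retry and timeout settings that may be relevant here; they could be tuned on the same SparkConf as above (the values below are illustrative, not what I currently use):

// Illustrative retry/timeout tuning for elasticsearch-hadoop; the retries
// apply to bulk documents that Elasticsearch rejects while it is overloaded.
conf
  .set("es.batch.write.retry.count", "6")   // default 3; a negative value retries indefinitely
  .set("es.batch.write.retry.wait", "30s")  // default 10s between retries
  .set("es.http.timeout", "2m")             // default 1m; give busy nodes more time to respond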
I also started watching the cluster logs on one of the nodes to see what is going on.
[2019-03-13T07:19:52,377][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16762] overhead, spent [583ms] collecting in the last [1s]
[2019-03-13T07:19:53,821][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16763] overhead, spent [939ms] collecting in the last [1.4s]
[2019-03-13T07:20:56,995][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16826] overhead, spent [395ms] collecting in the last [1s]
[2019-03-13T07:20:57,995][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][16827] overhead, spent [481ms] collecting in the last [1s]
[2019-03-13T07:23:54,591][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17003] overhead, spent [303ms] collecting in the last [1s]
[2019-03-13T07:24:15,864][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17024] overhead, spent [542ms] collecting in the last [1.2s]
[2019-03-13T07:24:25,866][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17034] overhead, spent [266ms] collecting in the last [1s]
[2019-03-13T07:24:34,223][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17042] overhead, spent [454ms] collecting in the last [1.3s]
[2019-03-13T07:25:35,255][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17103] overhead, spent [264ms] collecting in the last [1s]
[2019-03-13T07:26:01,835][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17129] overhead, spent [682ms] collecting in the last [1.5s]
[2019-03-13T07:26:04,915][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17132] overhead, spent [326ms] collecting in the last [1s]
[2019-03-13T07:26:52,089][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17179] overhead, spent [375ms] collecting in the last [1s]
[2019-03-13T07:27:38,249][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17225] overhead, spent [277ms] collecting in the last [1s]
[2019-03-13T07:28:02,429][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17249] overhead, spent [540ms] collecting in the last [1s]
[2019-03-13T07:28:03,430][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17250] overhead, spent [415ms] collecting in the last [1s]
[2019-03-13T07:28:09,508][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17256] overhead, spent [274ms] collecting in the last [1s]
[2019-03-13T07:28:43,642][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17290] overhead, spent [660ms] collecting in the last [1s]
[2019-03-13T07:28:44,659][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17291] overhead, spent [260ms] collecting in the last [1s]
[2019-03-13T07:29:18,766][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17325] overhead, spent [284ms] collecting in the last [1s]
[2019-03-13T07:31:10,090][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17436] overhead, spent [275ms] collecting in the last [1s]
[2019-03-13T07:31:59,359][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17485] overhead, spent [252ms] collecting in the last [1s]
[2019-03-13T07:32:24,453][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17510] overhead, spent [339ms] collecting in the last [1s]
[2019-03-13T07:33:08,570][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17554] overhead, spent [411ms] collecting in the last [1s]
[2019-03-13T07:35:19,122][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][young][17684][9041] duration [881ms], collections [1]/[1.3s], total [881ms]/[14.1m], memory [75.4gb]->[74.7gb]/[117.6gb], all_pools {[young] [1.2gb]->[3.2mb]/[2.7gb]}{[survivor] [306.3mb]->[357.7mb]/[357.7mb]}{[old] [73.8gb]->[74.3gb]/[114.5gb]}
[2019-03-13T07:35:19,122][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17684] overhead, spent [881ms] collecting in the last [1.3s]
[2019-03-13T07:35:26,209][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17691] overhead, spent [346ms] collecting in the last [1s]
[2019-03-13T07:36:02,609][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17727] overhead, spent [361ms] collecting in the last [1.3s]
[2019-03-13T07:36:15,642][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] now throttling indexing: numMergesInFlight=10, maxNumMerges=9
[2019-03-13T07:36:19,649][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] stop throttling indexing: numMergesInFlight=8, maxNumMerges=9
So, after reading the logs I have a few questions.
Does the following log entry mean that ES spent 339 ms of the last 1000 ms on garbage collection?
[2019-03-13T07:32:24,453][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][17510] overhead, spent [339ms] collecting in the last [1s]
This is clearly a point where a GC runs and memory is reclaimed. Am I right?
[2019-03-13T07:35:19,122][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-2-elastic-vm-3] [gc][young][17684][9041] duration [881ms], collections [1]/[1.3s], total [881ms]/[14.1m], memory [75.4gb]->[74.7gb]/[117.6gb], all_pools {[young] [1.2gb]->[3.2mb]/[2.7gb]}{[survivor] [306.3mb]->[357.7mb]/[357.7mb]}{[old] [73.8gb]->[74.3gb]/[114.5gb]}
And this is where ES throttles indexing because of segment merges.
[2019-03-13T07:36:15,642][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] now throttling indexing: numMergesInFlight=10, maxNumMerges=9
[2019-03-13T07:36:19,649][INFO ][o.e.i.e.InternalEngine$EngineMergeScheduler] [elasticsearch-2-elastic-vm-3] [customers_201850][3] stop throttling indexing: numMergesInFlight=8, maxNumMerges=9
How can we make sure that we minimize stop-the-world GC, and how can we minimize these merges by slowing down the indexing process?
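Is shrinking the bulk size and the number of concurrent writer tasks on the Spark side the right way to slow it down, e.g. something along these lines (the values are guesses, not tested recommendations)?

// Possible way to throttle the load from the Spark side (untested guesses):
// fewer concurrent writers and smaller bulk requests should mean less heap
// pressure and fewer new segments being produced at once on the ES nodes.
conf
  .set("es.batch.size.bytes", "1mb")        // halve the per-request payload
  .set("es.batch.write.refresh", "false")   // don't force an index refresh after every bulk
docs.repartition(70).saveToEs("customers_201850/_doc")  // half as many concurrent writer tasks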