Weird behaviour when setting up Flink metrics reporting with InfluxDB
0 votes
/ 06 February 2020

While trying to run a Flink job on a Flink (1.9) cluster in Kubernetes, with metrics written to an InfluxDB time-series database, a "series" of very strange things happened.

Let's say we have this extremely simple job:

import java.util.Properties;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// setup Kafka consumer
Properties kafkaConsumerProps = new Properties();
kafkaConsumerProps.setProperty("bootstrap.servers", ...);
kafkaConsumerProps.setProperty("group.id", ...);

FlinkKafkaConsumer<String> myConsumer =
      new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), kafkaConsumerProps);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setAutoWatermarkInterval(1000L);

// create direct kafka stream
DataStream<String> trafficEventStream = env.addSource(myConsumer);

trafficEventStream.map(new RichMapFunction<String, String>() {
    @Override
    public String map(String value) throws Exception {
          return value;
    }
});

env.execute("Traffic");

NOTE: the job itself doesn't really matter; it has been stripped down to the bare bones.

The cluster configuration was set up in flink-conf.yaml according to the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html):

metrics.reporter.influxdb.class: org.apache.flink.metrics.influxdb.InfluxdbReporter
metrics.reporter.influxdb.host: influxdb
metrics.reporter.influxdb.port: 8086
metrics.reporter.influxdb.db: flink
metrics.reporter.influxdb.username: flink-metrics
metrics.reporter.influxdb.password: qwerty
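
For reference, the endpoint and credentials above can be sanity-checked outside of Flink with the plain influxdb-java client, which is the same library the reporter shades (visible in the stack trace further down). A minimal standalone sketch, assuming the reporter host resolves as http://influxdb:8086; the measurement name "connectivity_check" is made up for illustration:

import java.util.concurrent.TimeUnit;

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

public class InfluxConnectivityCheck {
    public static void main(String[] args) {
        // Same endpoint and credentials as in flink-conf.yaml above
        InfluxDB influxDB = InfluxDBFactory.connect("http://influxdb:8086", "flink-metrics", "qwerty");
        influxDB.setDatabase("flink");
        // Write a single well-formed point to confirm the user has write access
        influxDB.write(Point.measurement("connectivity_check")
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .addField("value", 1.0)
                .build());
        influxDB.close();
    }
}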

However, when the job is submitted to the cluster, the logs get polluted with the following error messages:

2020-02-05 21:45:13,135 WARN  org.apache.flink.runtime.metrics.MetricRegistryImpl - Error while reporting metrics
 org.apache.flink.metrics.influxdb.shaded.org.influxdb.InfluxDBException$UnableToParseException: partial write: unable to parse 'taskmanager_job_task_operator_reauthentication-latency-avg,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=45fd8a6e0dbae699a4fd810d5fecc65f,job_name=Traffic,operator_id=cbc357ccb763df2852fee8c4fc7d55f2,operator_name=Source:\ Custom\ Source,subtask_index=0,task_attempt_id=8fda43cec7b39138c4a5cc6f8738971f,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a value=� 1580939113059000000': invalid boolean
unable to parse 'taskmanager_job_task_operator_KafkaConsumer_sync-time-avg,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=45fd8a6e0dbae699a4fd810d5fecc65f,job_name=Traffic,operator_id=cbc357ccb763df2852fee8c4fc7d55f2,operator_name=Source:\ Custom\ Source,subtask_index=0,task_attempt_id=8fda43cec7b39138c4a5cc6f8738971f,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a value=� 1580939113059000000': invalid boolean
unable to parse 'taskmanager_job_task_operator_commit-latency-avg,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=45fd8a6e0dbae699a4fd810d5fecc65f,job_name=Traffic,operator_id=cbc357ccb763df2852fee8c4fc7d55f2,operator_name=Source:\ Custom\ Source,subtask_index=0,task_attempt_id=8fda43cec7b39138c4a5cc6f8738971f,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a value=� 1580939113059000000': invalid boolean
unable to parse 'taskmanager_job_task_operator_commit-latency-max,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=45fd8a6e0dbae699a4fd810d5fecc65f,job_name=Traffic,operator_id=cbc357ccb763df2852fee8c4fc7d55f2,operator_name=Source:\ Custom\ Source,subtask_index=0,task_attempt_id=8fda43cec7b39138c4a5cc6f8738971f,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a value=� 1580939113059000000': invalid boolean

This goes on for quite a long time, and then:

    at org.apache.flink.metrics.influxdb.shaded.org.influxdb.InfluxDBException.buildExceptionFromErrorMessage(InfluxDBException.java:147)
    at org.apache.flink.metrics.influxdb.shaded.org.influxdb.InfluxDBException.buildExceptionForErrorState(InfluxDBException.java:173)
    at org.apache.flink.metrics.influxdb.shaded.org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:796)
    at org.apache.flink.metrics.influxdb.shaded.org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:455)
    at org.apache.flink.metrics.influxdb.InfluxdbReporter.report(InfluxdbReporter.java:101)
    at org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:436)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
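
Note that every field that fails to parse is a Kafka consumer metric (reauthentication-latency-avg, sync-time-avg, commit-latency-avg/max). My working assumption, not confirmed anywhere in the docs, is that the Kafka connector re-exposes Kafka's own metrics to Flink as gauges, and before the consumer has recorded a sample those gauges return NaN or some non-numeric object; serialized verbatim into line protocol, such a value is not a valid float, so InfluxDB's parser falls back to reading it as a boolean and fails. A rough sketch of that failure mode (the gauge wiring below is a simplification, not the connector's actual code):

import org.apache.flink.metrics.Gauge;

public class InvalidBooleanSketch {
    public static void main(String[] args) {
        // Stand-in for a Kafka metric that has no recorded sample yet
        Gauge<Object> kafkaStyleGauge = () -> Double.NaN;

        // Serializing the gauge value verbatim yields a field InfluxDB
        // cannot parse as a number, producing the "invalid boolean" error
        String line = "taskmanager_job_task_operator_sync-time-avg,host=h"
                + " value=" + kafkaStyleGauge.getValue() + " 1580939113059000000";
        System.out.println(line); // value=NaN is not a valid line-protocol float
    }
}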

Looking at InfluxDB, running show series from the CLI against the flink database lists the following extremely strange series names:

jobmanager_Status_JVM_CPU_Load,host=flink-jobmanager
jobmanager_Status_JVM_CPU_Time,host=flink-jobmanager
jobmanager_Status_JVM_ClassLoader_ClassesLoaded,host=flink-jobmanager
jobmanager_Status_JVM_ClassLoader_ClassesUnloaded,host=flink-jobmanager
jobmanager_Status_JVM_GarbageCollector_Copy_Count,host=flink-jobmanager
jobmanager_Status_JVM_GarbageCollector_Copy_Time,host=flink-jobmanager
jobmanager_Status_JVM_GarbageCollector_MarkSweepCompact_Count,host=flink-jobmanager
jobmanager_Status_JVM_GarbageCollector_MarkSweepCompact_Time,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Direct_Count,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Direct_MemoryUsed,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Direct_TotalCapacity,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Heap_Committed,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Heap_Max,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Heap_Used,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Mapped_Count,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Mapped_MemoryUsed,host=flink-jobmanager
jobmanager_Status_JVM_Memory_Mapped_TotalCapacity,host=flink-jobmanager
jobmanager_Status_JVM_Memory_NonHeap_Committed,host=flink-jobmanager
jobmanager_Status_JVM_Memory_NonHeap_Max,host=flink-jobmanager
jobmanager_Status_JVM_Memory_NonHeap_Used,host=flink-jobmanager
jobmanager_Status_JVM_Threads_Count,host=flink-jobmanager
jobmanager_job_downtime,host=flink-jobmanager,job_id=078a136e99f5028671744fe4da4ef814,job_name=Traffic
jobmanager_job_downtime,host=flink-jobmanager,job_id=2b4aa6d82aea381721d435fa36f56afc,job_name=Traffic
jobmanager_job_downtime,host=flink-jobmanager,job_id=45fd8a6e0dbae699a4fd810d5fecc65f,job_name=Traffic
jobmanager_job_downtime,host=flink-jobmanager,job_id=613e99628c06c38211fb63d31afe8f0f,job_name=Traffic
jobmanager_job_downtime,host=flink-jobmanager,job_id=d4ead0ce5e17397c03969a6c89790f54,job_name=Traffic
jobmanager_job_downtime,host=flink-jobmanager,job_id=e977b10e9ed0f236bb154515c708682b,job_name=Traffic
jobmanager_job_fullRestarts,host=flink-jobmanager,job_id=078a136e99f5028671744fe4da4ef814,job_name=Traffic

and the further down the list you go, the stranger it gets:

taskmanager_job_task_Shuffle_Netty_Output_numBuffersInRemotePerSecond,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=d4ead0ce5e17397c03969a6c89790f54,job_name=Traffic,subtask_index=0,task_attempt_id=ef8aae2b44854021f18aeb4707babd06,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a
taskmanager_job_task_Shuffle_Netty_Output_numBuffersInRemotePerSecond,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=e977b10e9ed0f236bb154515c708682b,job_name=Traffic,subtask_index=0,task_attempt_id=2c96da26392011d06220871513bd5f8b,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a
taskmanager_job_task_Shuffle_Netty_Output_numBytesInLocal,host=flink-taskmanager-6484bdf6c5-kzq2h,job_id=2b4aa6d82aea381721d435fa36f56afc,job_name=Traffic,subtask_index=0,task_attempt_id=886a48ce0db510645a48a5541038fe89,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=7e175a73b9a3ae1c3a8d748982530aa2
taskmanager_job_task_Shuffle_Netty_Output_numBytesInLocal,host=flink-taskmanager-6484bdf6c5-kzq2h,job_id=e977b10e9ed0f236bb154515c708682b,job_name=Traffic,subtask_index=0,task_attempt_id=fc071b097d73d8ed9a1688a3d8251cf4,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=7e175a73b9a3ae1c3a8d748982530aa2
taskmanager_job_task_Shuffle_Netty_Output_numBytesInLocal,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=078a136e99f5028671744fe4da4ef814,job_name=Traffic,subtask_index=0,task_attempt_id=e791473d96872ce886a60aa547f139e4,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a
taskmanager_job_task_Shuffle_Netty_Output_numBytesInLocal,host=flink-taskmanager-6484bdf6c5-ssmdc,job_id=45fd8a6e0dbae699a4fd810d5fecc65f,job_name=Traffic,subtask_index=0,task_attempt_id=8fda43cec7b39138c4a5cc6f8738971f,task_attempt_num=0,task_id=cbc357ccb763df2852fee8c4fc7d55f2,task_name=Source:\ Custom\ Source\ ->\ Map,tm_id=12f6e13572c00ec73a98734e4c5d307a
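
Part of this, at least, may simply be InfluxDB's data model rather than anything Flink-specific: a series is the measurement name plus its full tag set, so every resubmission of the job (new job_id) and every restart (new task_attempt_id) mints a brand-new series. A small sketch with the influxdb-java client, with tag values shortened for illustration:

import org.influxdb.dto.Point;

public class SeriesCardinalitySketch {
    public static void main(String[] args) {
        // Two points differing only in the job_id tag...
        Point first = Point.measurement("jobmanager_job_downtime")
                .tag("host", "flink-jobmanager")
                .tag("job_id", "078a136e...")
                .addField("value", 0L)
                .build();
        Point second = Point.measurement("jobmanager_job_downtime")
                .tag("host", "flink-jobmanager")
                .tag("job_id", "2b4aa6d8...")
                .addField("value", 0L)
                .build();
        // ...land in two distinct series, which is why show series lists
        // one entry per job submission
        System.out.println(first.lineProtocol());
        System.out.println(second.lineProtocol());
    }
}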

Any idea what could be causing this?

...