COUNT (*) возвращает несколько строк в улье вместо одной - PullRequest
0 голосов
/ 15 апреля 2020

Почему COUNT (*) возвращает 2 счетчика в Hive. Замечено, что суммирование обоих возвращенных значений равно фактическому количеству записей в таблице. Обратите внимание, что я не использую любое выражение where, group by в запрос. Ниже приведен мой запрос на создание

     CREATE TABLE `employee`(                    
   `file_id` string,                                
   `record_id` bigint,                              
   `emp_id` bigint,                            
   `dept` string,                                                 
   `salary` bigint                           
 )                       
 PARTITIONED BY (                                   
   `load_date` string)                              
 CLUSTERED BY (                                     
   `emp_id`)                                     
 INTO 7 BUCKETS                                     
 ROW FORMAT SERDE                                   
   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      
 STORED AS INPUTFORMAT                              
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  
 OUTPUTFORMAT                                       
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 
 LOCATION                                           
   'hdfs://data/schema/emp/employee' 
 TBLPROPERTIES (                                    
   'bucketing_version'='2',                         
   'compactor.mapreduce.map.memory.mb'='2048',      
   'compactorthreshold.hive.compactor.delta.num.threshold'='4',  
   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',  
   'discover.partitions'='true',                    
   'last_modified_by'='pately',                     
   'last_modified_time'='1579080979',               
   'orc.compress'='ZLIB',                           
   'transactional'='true',                          
   'transactional_properties'='default',            
   'transient_lastDdlTime'='1579080979');

Я пытаюсь выбрать количество, как показано ниже

select count(*) from `employee`; 

Сокращение заданий установлено на 1

set mapred.reduce.tasks;
+-------------------------+
|           set           |
+-------------------------+
| mapred.reduce.tasks=-1  |
+-------------------------+

Результат:

INFO  : Compiling command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428): select count(*) from employee
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428); Time taken: 0.176 seconds
INFO  : Executing command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428): select count(*) from employee
INFO  : Query ID = dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428
INFO  : Session is already open
INFO  : Dag name: select count(*) from employee (Stage-1)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1586778720046_0798)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      2          2        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 6.14 s
----------------------------------------------------------------------------------------------
INFO  : Status: DAG finished successfully in 6.12 seconds
INFO  :
INFO  : Query Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : ----------------------------------------------------------------------------------------------
INFO  : Compile Query                           0.18s
INFO  : Prepare Plan                            0.06s
INFO  : Get Query Coordinator (AM)              0.00s
INFO  : Submit Plan                             5.75s
INFO  : Start DAG                               0.89s
INFO  : Run DAG                                 6.12s
INFO  : ----------------------------------------------------------------------------------------------
INFO  :
INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1           2587.00         15,520            147          12,506                0
INFO  : ----------------------------------------------------------------------------------------------
INFO  :
INFO  : org.apache.tez.common.counters.DAGCounter:
INFO  :    NUM_SUCCEEDED_TASKS: 2
INFO  :    TOTAL_LAUNCHED_TASKS: 2
INFO  :    RACK_LOCAL_TASKS: 2
INFO  :    AM_CPU_MILLISECONDS: 4030
INFO  :    AM_GC_TIME_MILLIS: 0
INFO  : File System Counters:
INFO  :    HDFS_BYTES_READ: 3956004
INFO  :    HDFS_BYTES_WRITTEN: 232
INFO  :    HDFS_READ_OPS: 363
INFO  :    HDFS_WRITE_OPS: 4
INFO  :    HDFS_OP_CREATE: 2
INFO  :    HDFS_OP_GET_FILE_STATUS: 125
INFO  :    HDFS_OP_OPEN: 238
INFO  :    HDFS_OP_RENAME: 2
INFO  : org.apache.tez.common.counters.TaskCounter:
INFO  :    GC_TIME_MILLIS: 147
INFO  :    TASK_DURATION_MILLIS: 4661
INFO  :    CPU_MILLISECONDS: 15520
INFO  :    PHYSICAL_MEMORY_BYTES: 4294967296
INFO  :    VIRTUAL_MEMORY_BYTES: 14757040128
INFO  :    COMMITTED_HEAP_BYTES: 4294967296
INFO  :    INPUT_RECORDS_PROCESSED: 119
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 3884610
INFO  :    OUTPUT_RECORDS: 0
INFO  : HIVE:
INFO  :    CREATED_FILES: 2
INFO  :    DESERIALIZE_ERRORS: 0
INFO  :    RECORDS_IN_Map_1: 12506
INFO  :    RECORDS_OUT_0: 2
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_1: 0
INFO  :    RECORDS_OUT_OPERATOR_FS_11: 2
INFO  :    RECORDS_OUT_OPERATOR_GBY_10: 2
INFO  :    RECORDS_OUT_OPERATOR_MAP_0: 0
INFO  :    RECORDS_OUT_OPERATOR_SEL_9: 12506
INFO  :    RECORDS_OUT_OPERATOR_TS_0: 12506
INFO  : TaskCounter_Map_1_INPUT_employee:
INFO  :    INPUT_RECORDS_PROCESSED: 119
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 3884610
INFO  : TaskCounter_Map_1_OUTPUT_out_Map_1:
INFO  :    OUTPUT_RECORDS: 0
INFO  : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO  :    GROUPED_INPUT_SPLITS_Map_1: 2
INFO  :    INPUT_DIRECTORIES_Map_1: 33
INFO  :    INPUT_FILES_Map_1: 119
INFO  :    RAW_INPUT_SPLITS_Map_1: 119
INFO  : savedToCache: true
INFO  : Completed executing command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428); Time taken: 12.819 seconds
INFO  : OK
+-------+
|  _c0  |
+-------+
| 6949  |
| 5557  |
+-------+
2 rows selected (13.073 seconds)

Может кто-нибудь помочь, почему при агрегации fxn like count выводится несколько строк

Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...