I am working on optimizing Hive (1.4-cdh) code running on MapReduce. In my project we use a lot of COUNT(DISTINCT) operations together with a GROUP BY clause; a sample HQL is shown below.
DROP TABLE IF EXISTS testdb.NewTable PURGE;
CREATE TABLE testdb.NewTable AS
SELECT a.*
FROM (
    SELECT col1,
           COUNT(DISTINCT col2) AS col2,
           COUNT(DISTINCT col3) AS col3,
           COUNT(DISTINCT col4) AS col4,
           COUNT(DISTINCT col5) AS col5
    FROM BaseTable
    GROUP BY col1
) a
WHERE a.col3 > 1 OR a.col4 > 1 OR a.col2 > 1 OR a.col5 > 1;
Could you suggest a better approach here, so that the query's processing time is minimized?
Adding the EXPLAIN plans for the CountDistinct and CollectSet variants:
CountDistinct EXPLAIN plan:
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: BaseTable
            Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: col1 (type: string), col2 (type: decimal(3,0)), col3 (type: string), col4 (type: string), col5 (type: string)
              outputColumnNames: col1, col2, col3, col4, col5
              Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(DISTINCT col5), count(DISTINCT col2), count(DISTINCT col4), count(DISTINCT col3)
                keys: col1 (type: string), col5 (type: string), col2 (type: decimal(3,0)), col4 (type: string), col3 (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: decimal(3,0)), _col3 (type: string), _col4 (type: string)
                  sort order: +++++
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0), count(DISTINCT KEY._col1:2._col0), count(DISTINCT KEY._col1:3._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4
          Statistics: Num rows: 8431554627 Data size: 1306983356533 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: ((((_col2 > 1) or (_col3 > 1)) or (_col1 > 1)) or (_col4 > 1)) (type: boolean)
            Statistics: Num rows: 8431554627 Data size: 1306983356533 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 8431554627 Data size: 1306983356533 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
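For reference, the CollectSet plan below comes from a variant that replaces each COUNT(DISTINCT x) with SIZE(COLLECT_SET(x)). I am reconstructing the query text from that plan (the Select Operator applies size() to the arrays and the Filter compares the sizes to 1), so treat it as a sketch rather than the exact statement that was run:

DROP TABLE IF EXISTS testdb.NewTable PURGE;
CREATE TABLE testdb.NewTable AS
SELECT a.*
FROM (
    SELECT col1,
           SIZE(COLLECT_SET(col2)) AS col2,
           SIZE(COLLECT_SET(col3)) AS col3,
           SIZE(COLLECT_SET(col4)) AS col4,
           SIZE(COLLECT_SET(col5)) AS col5
    FROM BaseTable
    GROUP BY col1
) a
WHERE a.col3 > 1 OR a.col4 > 1 OR a.col2 > 1 OR a.col5 > 1;

Unlike the CountDistinct plan, this one does its map-side partial aggregation with only col1 as the key, but it has to carry whole array<...> values through the shuffle and hold them in reducer memory, as the value expressions in the plan show.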
CollectSet EXPLAIN plan:
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: BaseTable
            Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: col1 (type: string), col2 (type: decimal(3,0)), col3 (type: string), col4 (type: string), col5 (type: string)
              outputColumnNames: col1, col2, col3, col4, col5
              Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: collect_set(col5), collect_set(col2), collect_set(col4), collect_set(col3)
                keys: col1 (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4
                Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 16863109255 Data size: 2613966713222 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: array<string>), _col2 (type: array<decimal(3,0)>), _col3 (type: array<string>), _col4 (type: array<string>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: collect_set(VALUE._col0), collect_set(VALUE._col1), collect_set(VALUE._col2), collect_set(VALUE._col3)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4
          Statistics: Num rows: 8431554627 Data size: 1306983356533 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), size(_col1) (type: int), size(_col2) (type: int), size(_col3) (type: int), size(_col4) (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4
            Statistics: Num rows: 8431554627 Data size: 1306983356533 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((((_col2 > 1) or (_col3 > 1)) or (_col1 > 1)) or (_col4 > 1)) (type: boolean)
              Statistics: Num rows: 8431554627 Data size: 1306983356533 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 8431554627 Data size: 1306983356533 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
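One rewrite that is sometimes suggested when several COUNT(DISTINCT) aggregations share a GROUP BY key is to unpivot the columns into (col1, tag, val) rows, deduplicate, and then count rows per tag, so the heavy deduplication is partitioned across reducers by (col1, tag, val) rather than by col1 alone. I have not benchmarked this on BaseTable; it is only a sketch, it assumes casting col2 (decimal(3,0)) to STRING is acceptable for counting, and the tag/val/t/d names are just aliases for the example:

-- Untested sketch: unpivot the four columns, deduplicate, count per tag.
DROP TABLE IF EXISTS testdb.NewTable PURGE;
CREATE TABLE testdb.NewTable AS
SELECT col1,
       -- only one row per distinct non-NULL value survives the inner DISTINCT,
       -- so counting rows per tag reproduces COUNT(DISTINCT colN)
       SUM(CASE WHEN tag = 'col2' THEN 1 ELSE 0 END) AS col2,
       SUM(CASE WHEN tag = 'col3' THEN 1 ELSE 0 END) AS col3,
       SUM(CASE WHEN tag = 'col4' THEN 1 ELSE 0 END) AS col4,
       SUM(CASE WHEN tag = 'col5' THEN 1 ELSE 0 END) AS col5
FROM (
    SELECT DISTINCT col1, tag, val
    FROM BaseTable
    -- unpivot: one output row per (input row, column); col2 is cast so all
    -- map values share a single type, which map() requires
    LATERAL VIEW explode(map(
        'col2', CAST(col2 AS STRING),
        'col3', col3,
        'col4', col4,
        'col5', col5)) t AS tag, val
    WHERE val IS NOT NULL   -- COUNT(DISTINCT) ignores NULLs, so drop them here too
) d
GROUP BY col1
HAVING SUM(CASE WHEN tag = 'col2' THEN 1 ELSE 0 END) > 1
    OR SUM(CASE WHEN tag = 'col3' THEN 1 ELSE 0 END) > 1
    OR SUM(CASE WHEN tag = 'col4' THEN 1 ELSE 0 END) > 1
    OR SUM(CASE WHEN tag = 'col5' THEN 1 ELSE 0 END) > 1;

The explode quadruples the number of shuffled rows, so whether this actually beats the collect_set variant on a table of this size would have to be measured.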