BigQuery: how to efficiently combine two result sets by unique ID and keep only the latest rows - PullRequest
0 votes
/ November 28, 2018

What is an efficient way to achieve the same result as the query below?

WITH
data_one AS (
  SELECT "abc" as id, 100 as value, TIMESTAMP("2018-11-26T14:39:51") as created UNION ALL
  SELECT "def" as id, 111 as value, TIMESTAMP("2018-11-27T14:39:51") as created
),
data_two AS (
  SELECT "abc" as id, 203 as value, TIMESTAMP("2018-11-28T14:39:51") as created UNION ALL
  SELECT "ghi" as id, 418 as value, TIMESTAMP("2018-11-28T14:39:51") as created
),
data AS (
  SELECT * FROM data_one do
  UNION ALL
  SELECT * FROM data_two dt
)
SELECT id, value, created FROM (
  SELECT *,
  rank() over(partition by id order by created desc) rank
  FROM data
) WHERE rank = 1

The result is:

+-----+-------+-------------------------+
| id  | value | created                 |
+-----+-------+-------------------------+
| abc | 203   | 2018-11-28 14:39:51 UTC |
+-----+-------+-------------------------+
| def | 111   | 2018-11-27 14:39:51 UTC |
+-----+-------+-------------------------+
| ghi | 418   | 2018-11-28 14:39:51 UTC |
+-----+-------+-------------------------+

What if the data is really large? Is this a reasonable approach, or is there a better one?
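
A side note on the query above, offered as a sketch rather than a definitive fix: RANK() gives the same rank to rows that share the same created value, so an id with two rows tied for the latest timestamp would come back twice. If exactly one row per id is wanted, ROW_NUMBER() with a tie-breaker is one option. The tied row with value 204 and the choice of value DESC as the tie-breaker are assumptions made purely for illustration.

```sql
#standardSQL
WITH data AS (
  SELECT "abc" AS id, 100 AS value, TIMESTAMP("2018-11-26T14:39:51") AS created UNION ALL
  SELECT "abc" AS id, 203 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created UNION ALL
  -- hypothetical row that ties with the previous one on created
  SELECT "abc" AS id, 204 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created
)
SELECT id, value, created FROM (
  SELECT *,
    -- ROW_NUMBER() never repeats a number within a partition;
    -- value DESC is an assumed tie-breaker that makes the choice deterministic
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY created DESC, value DESC) AS rn
  FROM data
) WHERE rn = 1
```

With RANK() in place of ROW_NUMBER() and no tie-breaker, both of the 2018-11-28 rows for "abc" would pass the filter.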

1 Answer

0 votes
/ November 28, 2018

An alternative would be:

#standardSQL
WITH data_one AS (
  SELECT "abc" AS id, 100 AS value, TIMESTAMP("2018-11-26T14:39:51") AS created UNION ALL
  SELECT "def" AS id, 111 AS value, TIMESTAMP("2018-11-27T14:39:51") AS created
), data_two AS (
  SELECT "abc" AS id, 203 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created UNION ALL
  SELECT "ghi" AS id, 418 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created
), data AS (
  SELECT * FROM data_one do
  UNION ALL
  SELECT * FROM data_two dt
)
SELECT id, 
  ARRAY_AGG(
    STRUCT<value INT64, created TIMESTAMP>(value, created) 
    ORDER BY created DESC LIMIT 1
  )[OFFSET(0)].*
FROM data t
GROUP BY id   

or, if you want to avoid declaring the STRUCT explicitly (for example, when there are many columns, or for more generic use):

#standardSQL
WITH data_one AS (
  SELECT "abc" AS id, 100 AS value, TIMESTAMP("2018-11-26T14:39:51") AS created UNION ALL
  SELECT "def" AS id, 111 AS value, TIMESTAMP("2018-11-27T14:39:51") AS created
), data_two AS (
  SELECT "abc" AS id, 203 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created UNION ALL
  SELECT "ghi" AS id, 418 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created
), data AS (
  SELECT * FROM data_one do
  UNION ALL
  SELECT * FROM data_two dt
)
SELECT * FROM data WHERE FALSE
UNION ALL
SELECT id, 
  ARRAY_AGG(
    (value, created) ORDER BY created DESC LIMIT 1
  )[OFFSET(0)].*
FROM data t
GROUP BY id  

In both cases the result is:

Row id  value   created  
1   abc 203     2018-11-28 14:39:51 UTC  
2   ghi 418     2018-11-28 14:39:51 UTC  
3   def 111     2018-11-27 14:39:51 UTC  
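
As a further variation on the same idea, the whole row can be aggregated instead of listing its columns, which keeps the query unchanged when columns are added. This is a minimal sketch under the assumption that every column of data should be returned; the alias latest is a name introduced here for illustration.

```sql
#standardSQL
WITH data_one AS (
  SELECT "abc" AS id, 100 AS value, TIMESTAMP("2018-11-26T14:39:51") AS created UNION ALL
  SELECT "def" AS id, 111 AS value, TIMESTAMP("2018-11-27T14:39:51") AS created
), data_two AS (
  SELECT "abc" AS id, 203 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created UNION ALL
  SELECT "ghi" AS id, 418 AS value, TIMESTAMP("2018-11-28T14:39:51") AS created
), data AS (
  SELECT * FROM data_one
  UNION ALL
  SELECT * FROM data_two
)
SELECT latest.*  -- expand the struct back into id, value, created
FROM (
  SELECT
    -- the table alias t is used as a STRUCT holding the whole row;
    -- ORDER BY ... LIMIT 1 keeps only the most recent row per id
    ARRAY_AGG(t ORDER BY t.created DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM data t
  GROUP BY t.id
)
```

Because of LIMIT 1 inside ARRAY_AGG, each group only needs to retain a single row rather than a fully ranked partition, which is the usual reason this pattern is suggested for large inputs. The order of the output rows is not guaranteed without an outer ORDER BY.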