Повторная выборка полетных данных временных рядов в SQL - PullRequest
0 голосов
/ 04 апреля 2019

Я занимаюсь анализом исторических данных с открытым исходным кодом (7 ТБ) данных полета opensky. Документация [1]: https://opensky -network.org / data / impala .

Структура таблицы имеет следующую структуру:

[hadoop-2:21000] > describe state_vectors_data4;
Query: describe state_vectors_data4

+---------------+------------+-----------------------------+
| name          | type       | comment                     |
+---------------+------------+-----------------------------+
| time          | int        | Inferred from Parquet file. |
| icao24        | string     | Inferred from Parquet file. |
| lat           | double     | Inferred from Parquet file. |
| lon           | double     | Inferred from Parquet file. |
| velocity      | double     | Inferred from Parquet file. |
| heading       | double     | Inferred from Parquet file. |
| vertrate      | double     | Inferred from Parquet file. |
| callsign      | string     | Inferred from Parquet file. |
| onground      | boolean    | Inferred from Parquet file. |
| alert         | boolean    | Inferred from Parquet file. |
| spi           | boolean    | Inferred from Parquet file. |
| squawk        | string     | Inferred from Parquet file. |
| baroaltitude  | double     | Inferred from Parquet file. |
| geoaltitude   | double     | Inferred from Parquet file. |
| lastposupdate | double     | Inferred from Parquet file. |
| lastcontact   | double     | Inferred from Parquet file. |
| serials       | array<int> | Inferred from Parquet file. |
| hour          | int        |                             |
+---------------+------------+-----------------------------+

Один образец выглядит следующим образом:

[hadoop-2:21000] > select * from state_vectors_data4 where hour=1480759200 order by rand() limit 1;
Query: select * from state_vectors_data4 where hour=1480759200 order by rand() limit 1
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
| time       | icao24 | lat               | lon                | velocity          | heading           | vertrate | callsign | onground | alert | spi   | squawk | baroaltitude      | geoaltitude       | lastposupdate  | lastcontact    | hour       |
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
| 1480760792 | a0d724 | 37.89463883739407 | -88.93331113068955 | 190.8504039695975 | 265.8263544365708 | 0        | UPS858   | false    | false | false | 7775   | 9144              | 9342.120000000001 | 1480760704.359 | 1480760709.997 | 1480759200 |
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
Fetched 1 row(s) in 1.48s

Столбец часа добавлен для ускорения запросов.

Я использую следующий запрос для фильтрации по времени (время / час) и месту (широта / долгота) при повторной выборке со скоростью 180 ((время / 180 AS INT)):

SELECT sv.time,
       sv.icao24,
       sv.callsign,
       sv.hour
  FROM state_vectors_data4 sv
       JOIN
       (
           SELECT CAST (time / 180 AS INT) AS minute,
                  MAX(time) AS recent,
                  icao24
             FROM state_vectors_data4
            WHERE hour BETWEEN 1483228800 AND 1485907200 AND 
                  time BETWEEN 1483228800 AND 1485907200
            GROUP BY icao24,
                     minute
       )
       AS m ON sv.icao24 = m.icao24 AND 
               sv.time = m.recent
 WHERE lat BETWEEN 50.1 AND 54.1 AND 
       lon BETWEEN 2.2 AND 8.2 AND 
       hour BETWEEN 1483228800 AND 1485907200 AND 
       time BETWEEN 1483228800 AND 1485907200;

Однако запрос очень медленный, есть ли более быстрый способ достижения того же результата?

EDIT: Поясненная строка добавлена:

+---------------------------------------------------------------------------------------------------------------------------+
| Explain String                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=104.00MB                                                                        |
| Per-Host Resource Estimates: Memory=3.37GB                                                                                |
| WARNING: The following tables are missing relevant table and/or column statistics.                                        |
| opensky.state_vectors_data4                                                                                               |
|                                                                                                                           |
| PLAN-ROOT SINK                                                                                                            |
| |                                                                                                                         |
| 07:EXCHANGE [UNPARTITIONED]                                                                                               |
| |                                                                                                                         |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                                                                      |
| |  hash predicates: sv.time = max(time), sv.icao24 = icao24                                                               |
| |  runtime filters: RF000 <- max(time), RF001 <- icao24                                                                   |
| |                                                                                                                         |
| |--06:EXCHANGE [BROADCAST]                                                                                                |
| |  |                                                                                                                      |
| |  05:AGGREGATE [FINALIZE]                                                                                                |
| |  |  output: max:merge(time)                                                                                             |
| |  |  group by: icao24, CAST(time / 180 AS INT)                                                                           |
| |  |  having: max(time) <= 1485907200, max(time) >= 1483228800                                                            |
| |  |                                                                                                                      |
| |  04:EXCHANGE [HASH(icao24,CAST(time / 180 AS INT))]                                                                     |
| |  |                                                                                                                      |
| |  02:AGGREGATE [STREAMING]                                                                                               |
| |  |  output: max(time)                                                                                                   |
| |  |  group by: icao24, CAST(time / 180 AS INT)                                                                           |
| |  |                                                                                                                      |
| |  01:SCAN HDFS [opensky.state_vectors_data4]                                                                             |
| |     partitions=745/28054 files=745 size=109.64GB                                                                        |
| |     predicates: time <= 1485907200, time >= 1483228800                                                                  |
| |                                                                                                                         |
| 00:SCAN HDFS [opensky.state_vectors_data4 sv]                                                                             |
|    partitions=745/28054 files=745 size=109.64GB                                                                           |
|    predicates: sv.lat <= 54.1, sv.lat >= 50.1, sv.lon <= 8.2, sv.lon >= 2.2, sv.time <= 1485907200, sv.time >= 1483228800 |
|    runtime filters: RF000 -> sv.time, RF001 -> sv.icao24                                                                  |
+---------------------------------------------------------------------------------------------------------------------------+
[hadoop-29:21000] > 
...