PostgreSQL: query latency on a huge dataset
October 4, 2019

I am working with a really huge dataset of per-second user GPS logs stored in Postgres, covering almost 3 months of data.

Now I want to copy (as CSV) the GPS logs of user trips for 15 days of April. This query has been running for almost 24 hours and is still not finished.

Query:

\COPY (SELECT session_id, trip_id, gpstime AS timestamp, lat, lon,  
CASE 
    WHEN travel_mode = 'foot' THEN 0 
    WHEN travel_mode = 'bike' THEN 1 
    WHEN travel_mode = 'bus'  THEN 2 
    WHEN travel_mode = 'car'  THEN 3 
    WHEN travel_mode = 'metro' THEN 4 
    ELSE 999 END AS t_mode
FROM location_pre_filtered
INNER JOIN trips_with_travel_mode
ON session_id = ANY(session_ids) 
WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end) 
    AND (lon BETWEEN SYMMETRIC lon_start AND lon_end) 
    AND to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-15' 
ORDER BY session_id, timestamp) 
TO '~/april_15_days.csv' 
    WITH (FORMAT CSV, HEADER);

The location_pre_filtered table contains the raw GPS logs; it is JOINed with the trips_with_travel_mode table of user trips.

Similarly, EXPLAIN ANALYZE ... on this query takes so long that I had to cancel the operation. What is the right approach for querying such a huge dataset? Is this latency acceptable?
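One way to get plan information without waiting for the full run, sketched under assumptions: plain EXPLAIN only plans the query and does not execute it, while EXPLAIN (ANALYZE, BUFFERS) executes it, so a session-level statement_timeout can guard the latter. The 10-minute limit and the one-day window below are arbitrary illustrative choices, not values from the original setup.

```sql
-- Cancel any statement in this session that runs longer than 10 minutes
-- (the limit is an arbitrary choice for illustration).
SET statement_timeout = '10min';

-- Plain EXPLAIN: planner estimates only, the query is not executed.
EXPLAIN
SELECT session_id, trip_id, gpstime AS timestamp, lat, lon
FROM location_pre_filtered
INNER JOIN trips_with_travel_mode
ON session_id = ANY(session_ids)
WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end)
    AND (lon BETWEEN SYMMETRIC lon_start AND lon_end)
    AND to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-15';

-- EXPLAIN (ANALYZE, BUFFERS) really runs the query and reports actual
-- timings; a single day is tried here first (assumed window).
EXPLAIN (ANALYZE, BUFFERS)
SELECT session_id, trip_id, gpstime AS timestamp, lat, lon
FROM location_pre_filtered
INNER JOIN trips_with_travel_mode
ON session_id = ANY(session_ids)
WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end)
    AND (lon BETWEEN SYMMETRIC lon_start AND lon_end)
    AND to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-02';

-- Restore the default timeout for the rest of the session.
RESET statement_timeout;
```

Narrowing the date range does not necessarily make the run shorter if the filter cannot be pushed down to the base tables, so the timeout is the actual safeguard here.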

EDIT

Below is the DDL for the location_pre_filtered and trips_with_travel_mode tables, as requested in the comments. Just to clarify: the ANALYZE query took so long that I aborted it. It is the data copy query that actually ran for almost 27 hours, copying 937542 rows into a 44.2 MB file.

citysense=> \d location_pre_filtered
    View "citysense.location_pre_filtered"
     Column      |       Type       | Modifiers
-----------------+------------------+-----------
 session_id      | integer          |
 seconds         | integer          |
 millis          | smallint         |
 gpstime         | integer          |
 gpsmillis       | smallint         |
 nsats           | smallint         |
 geo             | geography        |
 lat             | double precision |
 lon             | double precision |
 alt             | real             |
 track           | real             |
 speed           | real             |
 climb           | real             |
 acc             | real             |
 gps_provider    | boolean          |
 notmoving       | boolean          |
 wrongclock      | boolean          |
 duplicate       | boolean          |
 stoppoint       | boolean          |
 locationproblem | boolean          |
 teleportproblem | boolean          |

citysense=> \d trips_with_travel_mode
      View "citysense.trips_with_travel_mode"
     Column      |          Type          | Modifiers
-----------------+------------------------+-----------
 trip_id         | integer                |
 daily_user_id   | integer                |
 session_ids     | integer[]              |
 seconds_start   | integer                |
 lat_start       | double precision       |
 lon_start       | double precision       |
 seconds_end     | integer                |
 lat_end         | double precision       |
 lon_end         | double precision       |
 distance        | double precision       |
 segments_length | double precision       |
 foot            | real                   |
 bike            | real                   |
 car             | real                   |
 bus             | real                   |
 metro           | real                   |
 travel_mode     | text                   |
 role            | character varying(500) |
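Note that \d reports both relations as views, so each query against them re-runs the views' underlying joins. A possible mitigation, sketched under assumptions, is to materialize the smaller trips side once into a temporary table and join against that instead. The table name trips_tmp is hypothetical, and whether this helps depends on the view definitions, which are not shown here.

```sql
-- Hypothetical temp table: compute the trips view once and keep only the
-- columns the export query needs.
CREATE TEMP TABLE trips_tmp AS
SELECT trip_id, session_ids, lat_start, lat_end, lon_start, lon_end, travel_mode
FROM trips_with_travel_mode;

-- Collect statistics so the planner has row estimates for the new table.
ANALYZE trips_tmp;

-- The export query can then join against trips_tmp instead of the view.
SELECT session_id, trip_id, gpstime AS timestamp, lat, lon, travel_mode
FROM location_pre_filtered
INNER JOIN trips_tmp
ON session_id = ANY(session_ids)
WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end)
    AND (lon BETWEEN SYMMETRIC lon_start AND lon_end)
    AND to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-15';
```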

EDIT-2

Here is the EXPLAIN plan for the SELECT:

citysense=> EXPLAIN SELECT session_id, trip_id, gpstime AS timestamp, lat, lon,  CASE WHEN travel_mode = 'foot' THEN 0 WHEN travel_mode = 'bike' THEN 1 WHEN travel_mode = 'bus'  THEN 2 WHEN travel_mode = 'car'  THEN 3 WHEN travel_mode = 'metro' THEN 4 ELSE 999 END AS t_mode FROM location_pre_filtered INNER JOIN trips_with_travel_mode ON session_id = ANY(session_ids) WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end) AND (lon BETWEEN SYMMETRIC lon_start AND lon_end) AND to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-15' ORDER BY session_id, timestamp;
 Sort  (cost=12707773.62..12707816.96 rows=17335 width=32)
   Sort Key: (COALESCE(gslocation.session_id, gps.session_id)), (COALESCE(gps.gpstime, gslocation.gpstime))
   ->  Nested Loop  (cost=11849626.11..12706553.12 rows=17335 width=32)
         Join Filter: ((((COALESCE(gps.lat, gslocation.lat) >= trips.lat_start) AND (COALESCE(gps.lat, gslocation.lat) <= trips.lat_end)) OR ((COALESCE(gps.lat, gslocation.lat) >= trips.lat_end) AND (COALESCE(gps.lat, gslocation.lat) <= trips.lat_start))) AND (((COALESCE(gps.lon, gslocation.lon) >= trips.lon_start) AND (COALESCE(gps.lon, gslocation.lon)<= trips.lon_end)) OR ((COALESCE(gps.lon, gslocation.lon) >= trips.lon_end) AND (COALESCE(gps.lon, gslocation.lon) <= trips.lon_start))) AND (COALESCE(gslocation.session_id, gps.session_id) = ANY (trips.session_ids)))
         ->  Merge Join  (cost=6210477.11..6272281.03 rows=1 width=48)
               Merge Cond: (((COALESCE(gslocation.session_id, gps.session_id)) = annotations.session_id) AND ((COALESCE(gslocation.seconds, gps.seconds)) = annotations.seconds) AND ((COALESCE(gslocation.millis, gps.millis)) = annotations.millis))
               ->  Sort  (cost=3728059.98..3742038.90 rows=5591568 width=60)
                     Sort Key: (COALESCE(gslocation.session_id, gps.session_id)), (COALESCE(gslocation.seconds, gps.seconds)), (COALESCE(gslocation.millis, gps.millis))
                     ->  Hash Full Join  (cost=2529248.16..2891158.99 rows=5591568 width=60)
                           Hash Cond: ((gps.session_id = gslocation.session_id) AND (gps.seconds = gslocation.seconds) AND (gps.millis = gslocation.millis))
                           Filter: ((to_timestamp((COALESCE(gps.gpstime, gslocation.gpstime))::double precision) >= '2016-04-01 00:00:00+01'::timestamp with time zone) AND (to_timestamp((COALESCE(gps.gpstime, gslocation.gpstime))::double precision) <= '2016-04-15 00:00:00+01'::timestamp with time zone))
                           ->  Seq Scan on gps  (cost=0.00..17.10 rows=710 width=30)
                           ->  Hash  (cost=1304563.15..1304563.15 rows=50324115 width=30)
                                 ->  Seq Scan on gslocation  (cost=0.00..1304563.15 rows=50324115 width=30)
               ->  Sort  (cost=2482417.13..2483889.19 rows=588823 width=10)
                     Sort Key: annotations.session_id, annotations.seconds, annotations.millis
                     ->  Seq Scan on annotations  (cost=0.00..2425985.88 rows=588823 width=10)
                           Filter: ((((status)::integer & 2) <= 0) AND (((status)::integer & 4) <= 0) AND (((status)::integer & 256) <= 0) AND (((status)::integer & 512) <= 0))
         ->  Hash Join  (cost=5639149.00..6011446.94 rows=8049685 width=657)
               Hash Cond: (trip_with_travel.trip_id = trips.trip_id)
               CTE trip_with_travel
                 ->  Unique  (cost=4933060.39..5637407.83 rows=8049685 width=36)
                       ->  GroupAggregate  (cost=4933060.39..5617283.62 rows=8049685 width=36)
                             Group Key: trips_1.trip_id, demography.role
                             ->  Sort  (cost=4933060.39..4953184.61 rows=8049685 width=52)
                                   Sort Key: trips_1.trip_id, demography.role
                                   ->  Hash Left Join  (cost=3350824.21..3734602.86 rows=8049685 width=52)
                                         Hash Cond: (segments.session_id = session.session_id)
                                         ->  Merge Join  (cost=3338852.98..3611948.47 rows=8049685 width=52)
                                               Merge Cond: ((travelmode_profile.session_id = segments.session_id) AND (travelmode_profile.segment_id = segments.segment_id))
                                               ->  Index Scan using tm_profile_key on travelmode_profile  (cost=0.42..36573.63 rows=757173 width=48)
                                               ->  Materialize  (cost=3338849.31..3415021.19 rows=15234377 width=16)
                                                     ->  Sort  (cost=3338849.31..3376935.25 rows=15234377 width=16)
                                                         Sort Key: segments.session_id, segments.segment_id
                                                           ->  Nested Loop  (cost=0.43..1260970.62 rows=15234377 width=16)
                                                                 ->  Seq Scan on trips trips_1  (cost=0.00..1159.41 rows=46541 width=29)
                                                                 ->  Index Scan using segments_pkey on segments  (cost=0.43..23.80 rows=327 width=12)
                                                                       Index Cond: (session_id = ANY (trips_1.session_ids))
                                                                       Filter: movement
                                         ->  Hash  (cost=10979.58..10979.58 rows=79332 width=8)
                                               ->  Hash Left Join  (cost=7299.77..10979.58 rows=79332 width=8)
                                                     Hash Cond: (session.user_id = demography.user_id)
                                                     ->  Seq Scan on session  (cost=0.00..2705.65 rows=79332 width=8)
                                                           Filter: ((instance)::text ~~ 'citysense%'::text)
                                                     ->  Hash  (cost=7292.52..7292.52 rows=580 width=8)
                                                           ->  Subquery Scan on demography  (cost=7283.82..7292.52 rows=580 width=8)
                                                                 ->  Unique  (cost=7283.82..7286.72 rows=580 width=8)
                                                                       ->  Sort  (cost=7283.82..7285.27 rows=580 width=8)
                                                                             Sort Key: session_1.user_id
                                                                             ->  Nested Loop  (cost=0.29..7257.20 rows=580 width=8)
                                                                                   ->  Seq Scan on response  (cost=0.00..4206.15 rows=580 width=8)
                                                                                         Filter: (question_id = 24)
                                                                                   ->  Index Scan using session_pkey1 on session session_1  (cost=0.29..5.25 rows=1 width=8)
                                                                                         Index Cond: (session_id = response.session_id)
                                                                                         Filter: ((instance)::text ~~ 'citysense%'::text)
               ->  CTE Scan on trip_with_travel  (cost=0.00..160993.70 rows=8049685 width=24)
               ->  Hash  (cost=1159.41..1159.41 rows=46541 width=61)
                     ->  Seq Scan on trips  (cost=0.00..1159.41 rows=46541 width=61)
(58 rows)
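In the plan above, the date filter appears as to_timestamp(COALESCE(gps.gpstime, gslocation.gpstime)) on the Hash Full Join, above a Seq Scan on gslocation with an estimated 50324115 rows. Since gpstime is an integer column holding epoch seconds, one option is to convert the two date bounds to integers once and compare the raw column, rather than converting every row. This is only a sketch: whether the planner can then use an index on gpstime depends on the view definition and on which indexes exist, neither of which is shown here.

```sql
-- Same date range as the original query, but the bounds are converted to
-- epoch seconds once instead of calling to_timestamp() on every row.
-- The timestamptz literals are interpreted in the session time zone,
-- just like the string literals in the original predicate.
SELECT session_id, trip_id, gpstime AS timestamp, lat, lon,
CASE
    WHEN travel_mode = 'foot' THEN 0
    WHEN travel_mode = 'bike' THEN 1
    WHEN travel_mode = 'bus'  THEN 2
    WHEN travel_mode = 'car'  THEN 3
    WHEN travel_mode = 'metro' THEN 4
    ELSE 999 END AS t_mode
FROM location_pre_filtered
INNER JOIN trips_with_travel_mode
ON session_id = ANY(session_ids)
WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end)
    AND (lon BETWEEN SYMMETRIC lon_start AND lon_end)
    AND gpstime BETWEEN extract(epoch FROM timestamptz '2016-04-01')::integer
                    AND extract(epoch FROM timestamptz '2016-04-15')::integer
ORDER BY session_id, timestamp;
```

The cast to integer keeps the comparison in the column's own type, which is what an ordinary b-tree index on an integer column would need.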
...