I am working with a really huge dataset of per-second user GPS logs stored in Postgres, covering almost 3 months of data. Now I want to copy (as CSV) the GPS logs of user trips for 15 days of April; the query has been running for almost 24 hours now and is still not finished.

The query:
\COPY (SELECT session_id, trip_id, gpstime AS timestamp, lat, lon,
CASE
WHEN travel_mode = 'foot' THEN 0
WHEN travel_mode = 'bike' THEN 1
WHEN travel_mode = 'bus' THEN 2
WHEN travel_mode = 'car' THEN 3
WHEN travel_mode = 'metro' THEN 4
ELSE 999 END AS t_mode
FROM location_pre_filtered
INNER JOIN trips_with_travel_mode
ON session_id = ANY(session_ids)
WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end)
AND (lon BETWEEN SYMMETRIC lon_start AND lon_end)
AND to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-15'
ORDER BY session_id, timestamp)
TO '~/april_15_days.csv'
WITH (FORMAT CSV, HEADER);
The location_pre_filtered table contains the raw GPS logs; it is joined with the trips_with_travel_mode table.

Likewise, EXPLAIN ANALYZE on this query takes so long that I had to cancel the operation. What is the right approach for querying such a huge dataset? Is this delay acceptable?
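For context, one detail of my own query that might matter: wrapping gpstime in to_timestamp() evaluates the function for every row, so an ordinary index on the raw column could not be used. A sargable sketch of just the date filter (my assumption: gpstime holds Unix epoch seconds, which its integer type suggests) would look like:

```sql
-- Compare the raw epoch column against precomputed constant bounds
-- instead of calling to_timestamp() per row; with an index on the
-- underlying gpstime column, this predicate becomes index-usable.
SELECT session_id, gpstime, lat, lon
FROM location_pre_filtered
WHERE gpstime >= extract(epoch FROM timestamptz '2016-04-01')::integer
  AND gpstime <  extract(epoch FROM timestamptz '2016-04-15')::integer;
```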
EDIT
Below is the DDL for the location_pre_filtered and trips_with_travel_mode tables, as requested in the comments. Just to clear things up: it is not the EXPLAIN ANALYZE query that took long, since I aborted it; it is the copy query itself that actually ran for almost 27 hours, copying 937542 rows into a 44.2 MB file.
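Since location_pre_filtered and trips_with_travel_mode are actually views (see the \d output below), one thing I could try to make the plan tractable (a sketch I have not run yet) is to materialize the date-filtered GPS rows into a temp table first and join against that:

```sql
-- Expand the view only once for the April window, instead of
-- re-evaluating it inside the big join.
CREATE TEMP TABLE april_gps AS
SELECT session_id, gpstime, lat, lon
FROM location_pre_filtered
WHERE to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-15';

ANALYZE april_gps;  -- give the planner statistics for the temp table
```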
citysense=> \d location_pre_filtered
View "citysense.location_pre_filtered"
Column | Type | Modifiers
-----------------+------------------+-----------
session_id | integer |
seconds | integer |
millis | smallint |
gpstime | integer |
gpsmillis | smallint |
nsats | smallint |
geo | geography |
lat | double precision |
lon | double precision |
alt | real |
track | real |
speed | real |
climb | real |
acc | real |
gps_provider | boolean |
notmoving | boolean |
wrongclock | boolean |
duplicate | boolean |
stoppoint | boolean |
locationproblem | boolean |
teleportproblem | boolean |
citysense=> \d trips_with_travel_mode
View "citysense.trips_with_travel_mode"
Column | Type | Modifiers
-----------------+------------------------+-----------
trip_id | integer |
daily_user_id | integer |
session_ids | integer[] |
seconds_start | integer |
lat_start | double precision |
lon_start | double precision |
seconds_end | integer |
lat_end | double precision |
lon_end | double precision |
distance | double precision |
segments_length | double precision |
foot | real |
bike | real |
car | real |
bus | real |
metro | real |
travel_mode | text |
role | character varying(500) |
EDIT-2
Here is the EXPLAIN output for the SELECT:
citysense=> EXPLAIN SELECT session_id, trip_id, gpstime AS timestamp, lat, lon, CASE WHEN travel_mode = 'foot' THEN 0 WHEN travel_mode = 'bike' THEN 1 WHEN travel_mode = 'bus' THEN 2 WHEN travel_mode = 'car' THEN 3 WHEN travel_mode = 'metro' THEN 4 ELSE 999 END AS t_mode FROM location_pre_filtered INNER JOIN trips_with_travel_mode ON session_id = ANY(session_ids) WHERE (lat BETWEEN SYMMETRIC lat_start AND lat_end) AND (lon BETWEEN SYMMETRIC lon_start AND lon_end) AND to_timestamp(gpstime) BETWEEN '2016-04-01' AND '2016-04-15' ORDER BY session_id, timestamp;
Sort (cost=12707773.62..12707816.96 rows=17335 width=32)
Sort Key: (COALESCE(gslocation.session_id, gps.session_id)), (COALESCE(gps.gpstime, gslocation.gpstime))
-> Nested Loop (cost=11849626.11..12706553.12 rows=17335 width=32)
Join Filter: ((((COALESCE(gps.lat, gslocation.lat) >= trips.lat_start) AND (COALESCE(gps.lat, gslocation.lat) <= trips.lat_end)) OR ((COALESCE(gps.lat, gslocation.lat) >= trips.lat_end) AND (COALESCE(gps.lat, gslocation.lat) <= trips.lat_start))) AND (((COALESCE(gps.lon, gslocation.lon) >= trips.lon_start) AND (COALESCE(gps.lon, gslocation.lon)<= trips.lon_end)) OR ((COALESCE(gps.lon, gslocation.lon) >= trips.lon_end) AND (COALESCE(gps.lon, gslocation.lon) <= trips.lon_start))) AND (COALESCE(gslocation.session_id, gps.session_id) = ANY (trips.session_ids)))
-> Merge Join (cost=6210477.11..6272281.03 rows=1 width=48)
Merge Cond: (((COALESCE(gslocation.session_id, gps.session_id)) = annotations.session_id) AND ((COALESCE(gslocation.seconds, gps.seconds)) = annotations.seconds) AND ((COALESCE(gslocation.millis, gps.millis)) = annotations.millis))
-> Sort (cost=3728059.98..3742038.90 rows=5591568 width=60)
Sort Key: (COALESCE(gslocation.session_id, gps.session_id)), (COALESCE(gslocation.seconds, gps.seconds)), (COALESCE(gslocation.millis, gps.millis))
-> Hash Full Join (cost=2529248.16..2891158.99 rows=5591568 width=60)
Hash Cond: ((gps.session_id = gslocation.session_id) AND (gps.seconds = gslocation.seconds) AND (gps.millis = gslocation.millis))
Filter: ((to_timestamp((COALESCE(gps.gpstime, gslocation.gpstime))::double precision) >= '2016-04-01 00:00:00+01'::timestamp with time zone) AND (to_timestamp((COALESCE(gps.gpstime, gslocation.gpstime))::double precision) <= '2016-04-15 00:00:00+01'::timestamp with time zone))
-> Seq Scan on gps (cost=0.00..17.10 rows=710 width=30)
-> Hash (cost=1304563.15..1304563.15 rows=50324115 width=30)
-> Seq Scan on gslocation (cost=0.00..1304563.15 rows=50324115 width=30)
-> Sort (cost=2482417.13..2483889.19 rows=588823 width=10)
Sort Key: annotations.session_id, annotations.seconds, annotations.millis
-> Seq Scan on annotations (cost=0.00..2425985.88 rows=588823 width=10)
Filter: ((((status)::integer & 2) <= 0) AND (((status)::integer & 4) <= 0) AND (((status)::integer & 256) <= 0) AND (((status)::integer & 512) <= 0))
-> Hash Join (cost=5639149.00..6011446.94 rows=8049685 width=657)
Hash Cond: (trip_with_travel.trip_id = trips.trip_id)
CTE trip_with_travel
-> Unique (cost=4933060.39..5637407.83 rows=8049685 width=36)
-> GroupAggregate (cost=4933060.39..5617283.62 rows=8049685 width=36)
Group Key: trips_1.trip_id, demography.role
-> Sort (cost=4933060.39..4953184.61 rows=8049685 width=52)
Sort Key: trips_1.trip_id, demography.role
-> Hash Left Join (cost=3350824.21..3734602.86 rows=8049685 width=52)
Hash Cond: (segments.session_id = session.session_id)
-> Merge Join (cost=3338852.98..3611948.47 rows=8049685 width=52)
Merge Cond: ((travelmode_profile.session_id = segments.session_id) AND (travelmode_profile.segment_id = segments.segment_id))
-> Index Scan using tm_profile_key on travelmode_profile (cost=0.42..36573.63 rows=757173 width=48)
-> Materialize (cost=3338849.31..3415021.19 rows=15234377 width=16)
-> Sort (cost=3338849.31..3376935.25 rows=15234377 width=16)
Sort Key: segments.session_id, segments.segment_id
-> Nested Loop (cost=0.43..1260970.62 rows=15234377 width=16)
-> Seq Scan on trips trips_1 (cost=0.00..1159.41 rows=46541 width=29)
-> Index Scan using segments_pkey on segments (cost=0.43..23.80 rows=327 width=12)
Index Cond: (session_id = ANY (trips_1.session_ids))
Filter: movement
-> Hash (cost=10979.58..10979.58 rows=79332 width=8)
-> Hash Left Join (cost=7299.77..10979.58 rows=79332 width=8)
Hash Cond: (session.user_id = demography.user_id)
-> Seq Scan on session (cost=0.00..2705.65 rows=79332 width=8)
Filter: ((instance)::text ~~ 'citysense%'::text)
-> Hash (cost=7292.52..7292.52 rows=580 width=8)
-> Subquery Scan on demography (cost=7283.82..7292.52 rows=580 width=8)
-> Unique (cost=7283.82..7286.72 rows=580 width=8)
-> Sort (cost=7283.82..7285.27 rows=580 width=8)
Sort Key: session_1.user_id
-> Nested Loop (cost=0.29..7257.20 rows=580 width=8)
-> Seq Scan on response (cost=0.00..4206.15 rows=580 width=8)
Filter: (question_id = 24)
-> Index Scan using session_pkey1 on session session_1 (cost=0.29..5.25 rows=1 width=8)
Index Cond: (session_id = response.session_id)
Filter: ((instance)::text ~~ 'citysense%'::text)
-> CTE Scan on trip_with_travel (cost=0.00..160993.70 rows=8049685 width=24)
-> Hash (cost=1159.41..1159.41 rows=46541 width=61)
-> Seq Scan on trips (cost=0.00..1159.41 rows=46541 width=61)
(58 rows)