Как разделить интервалы событий по месяцам с помощью dplyr - PullRequest
1 голос
/ 08 ноября 2019

У меня есть набор событий, идентифицируемых отдельным «тегом» их местоположения «местоположение», начало события «StartDateTime_UTC» и конец события «EndDateTime_UTC»

Первые 50 строк: (Iразместим полную информацию ниже)

   Tag     location   StartDateTime_UTC   EndDateTime_UTC    
   <chr>   <fct>      <dttm>              <dttm>             
 1 5004.24 IN         2014-09-30 05:30:00 2014-10-17 14:12:19
 2 5004.24 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43
 3 5004.24 OUT        2014-10-17 14:15:43 2014-11-04 10:05:31
 4 5004.24 Unresolved 2014-11-04 10:05:31 2014-11-04 10:08:06
 5 5004.24 IN         2014-11-04 10:08:06 2014-11-12 10:50:28
 6 5004.24 Unresolved 2014-11-12 10:50:28 2014-11-12 10:51:33
 7 5004.24 OUT        2014-11-12 10:51:33 2014-12-24 04:20:28
 8 5004.24 Unresolved 2014-12-24 04:20:28 2014-12-24 04:31:27
 9 5004.24 IN         2014-12-24 04:31:27 2015-02-08 22:30:27
10 5004.24 Unresolved 2015-02-08 22:30:27 2015-02-08 22:31:31
11 5004.24 OUT        2015-02-08 22:31:31 2015-03-01 11:18:32
12 5004.24 Unresolved 2015-03-01 11:18:32 2015-03-01 19:34:36
13 5004.24 IN         2015-03-01 19:34:36 2015-03-09 10:11:55
14 5004.24 Unresolved 2015-03-09 10:11:55 2015-03-09 10:48:02
15 5004.24 OUT        2015-03-09 10:48:02 2015-03-23 09:10:25
16 5004.24 Unresolved 2015-03-23 09:10:25 2015-03-23 09:18:07
17 5004.24 IN         2015-03-23 09:18:07 2015-04-09 23:19:26
18 5004.24 Unresolved 2015-04-09 23:19:26 2015-04-09 23:21:40
19 5004.24 OUT        2015-04-09 23:21:40 2015-04-21 20:11:59
20 5004.24 Unresolved 2015-04-21 20:11:59 2015-04-22 16:33:54
21 5004.24 IN         2015-04-22 16:33:54 2015-06-10 09:19:12
22 5010.04 IN         2014-05-09 18:26:00 2015-04-21 18:28:16
23 5011.03 IN         2016-06-17 15:29:00 2016-07-21 14:23:34
24 5011.07 IN         2014-06-05 17:58:00 2014-12-08 15:15:07
25 5017.06 IN         2013-09-20 03:00:00 2016-08-18 18:13:04
26 5025.22 IN         2013-06-11 14:20:00 2013-06-11 14:20:01
27 5025.26 IN         2016-12-29 22:38:00 2018-01-08 16:51:42
28 5032.24 IN         2014-07-18 18:04:00 2015-06-13 12:44:02
29 5038.04 IN         2014-12-16 17:51:07 2015-04-10 23:28:12
30 5039.03 IN         2016-04-01 17:26:00 2016-07-02 04:39:15
31 5039.07 IN         2014-04-08 19:28:04 2014-04-08 19:53:06
32 5045    IN         2013-08-07 17:00:00 2015-04-08 18:28:43
33 5053.26 IN         2016-11-14 18:12:00 2017-12-24 20:21:31
34 5066.04 IN         2014-02-25 21:32:00 2014-02-25 21:32:01
35 5067.07 IN         2014-10-02 16:39:00 2014-12-08 16:23:50
36 5073.06 IN         2013-05-22 14:45:00 2013-05-22 14:45:01
37 5074.16 IN         2016-02-11 19:02:00 2016-03-09 18:21:45
38 5074.16 Unresolved 2016-03-09 18:21:45 2016-03-09 19:50:22
39 5074.16 OUT        2016-03-09 19:50:22 2016-03-12 23:26:45
40 5088.11 IN         2015-06-02 16:29:00 2015-08-08 11:07:46
41 5094.04 IN         2014-10-17 16:04:00 2014-10-31 16:39:28
42 5101.06 IN         2013-09-20 03:00:00 2013-09-20 03:00:01
43 5101.06 Unresolved 2013-09-20 03:00:01 2014-02-03 07:15:08
44 5101.06 OUT        2014-02-03 07:15:08 2014-04-17 12:15:34
45 5116.24 IN         2014-03-07 20:59:00 2015-03-07 22:21:52
46 5123.03 IN         2016-01-14 18:55:00 2016-03-09 19:21:20
47 5123.03 Unresolved 2016-03-09 19:21:20 2016-03-09 20:24:36
48 5123.03 OUT        2016-03-09 20:24:36 2016-03-09 21:51:51
49 5123.07 IN         2014-01-21 18:54:00 2014-05-20 15:23:09
50 5150.04 IN         2014-05-22 16:17:00 2016-03-07 14:10:00

То, что я хотел бы сделать, это разделить общее время для каждого события на количество времени, которое каждый человек провел в каждом месте (IN, OUT или Unresolved), за каждый месяц каждого года.

Например, первые 3 строки выше станут:

Tag        location    Month    Year   Duration_hrs
5004.24    IN          9        2014   18.50
5004.24    IN          10       2014   328.20
5004.24    Unresolved  10       2014   .06
5004.24    OUT         10       2014   345.74
5004.24    OUT         11       2014   82.09

Я еще не знаком с методом для этого. Есть ли способ сделать это в dplyr или другом пакете, который был бы полезен?

dput:

structure(list(Tag = c("5004.24", "5004.24", "5004.24", "5004.24", 
"5004.24", "5004.24", "5004.24", "5004.24", "5004.24", "5004.24", 
"5004.24", "5004.24", "5004.24", "5004.24", "5004.24", "5004.24", 
"5004.24", "5004.24", "5004.24", "5004.24", "5004.24", "5010.04", 
"5011.03", "5011.07", "5017.06", "5025.22", "5025.26", "5032.24", 
"5038.04", "5039.03", "5039.07", "5045", "5053.26", "5066.04", 
"5067.07", "5073.06", "5074.16", "5074.16", "5074.16", "5088.11", 
"5094.04", "5101.06", "5101.06", "5101.06", "5116.24", "5123.03", 
"5123.03", "5123.03", "5123.07", "5150.04", "5157.06", "5158.16", 
"5165.22", "5172.24", "5178.31", "5179.03", "5179.07", "5186.16", 
"5186.16", "5186.16", "5186.16", "5186.16", "5186.16", "5186.16", 
"5186.16", "5186.16", "5186.16", "5186.16", "5186.16", "5186.16", 
"5186.16", "5186.16", "5200.11", "5206.31", "5214.16", "5228.24", 
"5235", "5242.16", "5242.16", "5242.16", "5242.16", "5242.16", 
"5249", "5249", "5249", "5256.24", "5256.24", "5256.24", "5256.24", 
"5256.24", "5256.24", "5256.24", "5263.03", "5263.07", "5270.16", 
"5284.11", "5290.31", "5290.31", "5290.31", "5290.31", "5298.16", 
"5298.16", "5298.16", "5298.16", "5298.16", "5312.11", "5318.04", 
"5325.06", "5326.16", "5326.16", "5326.16", "5326.16", "5326.16", 
"5326.16", "5340.11", "5346.31", "5347.07", "5347.07", "5347.07", 
"5347.07", "5347.07", "5347.07", "5347.07", "5347.07", "5347.07", 
"5347.07", "5347.07", "5347.07", "5347.07", "5361.26", "5368.24", 
"5374.04", "5374.04", "5374.04", "5374.04", "5374.04", "5375.03", 
"5375.07", "5381.06", "5402.31", "5402.31", "5402.31", "5403.07", 
"5403.07", "5403.07", "5403.07", "5403.07", "5403.07", "5403.07", 
"5431.07", "5438.16", "5445.26", "5445.26", "5445.26", "5452.11", 
"5465.06", "5465.06", "5465.06", "5480.24", "5487.03", "5487.03", 
"5487.03", "5487.07", "5487.07", "5487.07", "5487.07", "5487.07", 
"5487.07", "5487.07", "5487.07", "5487.07", "5487.07", "5487.07", 
"5493.06", "5493.06", "5493.06", "5493.06", "5493.06", "5493.06", 
"5493.06", "5493.06", "5493.06", "5501.22", "5508.11", "5514.31", 
"5536.11", "5542.31", "5542.31", "5542.31", "5550.16", "5557.22", 
"5564.24", "5570.04", "5571.03", "5571.07", "5577", "5585.26", 
"5592.11", "5599.15", "5599.15"), location = structure(c(1L, 
3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 
3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 3L, 1L, 2L, 3L, 1L, 
3L, 2L, 1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 3L, 
1L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 
2L, 3L, 1L, 1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 3L, 1L, 
2L, 1L, 1L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 
1L, 1L, 1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 
3L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 1L, 3L, 
2L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 
1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c("IN", "OUT", "Unresolved"
), class = "factor"), StartDateTime_UTC = structure(c(1412055000, 
1413555139, 1413555343, 1415095531, 1415095686, 1415789428, 1415789493, 
1419394828, 1419395487, 1423434627, 1423434691, 1425208712, 1425238476, 
1425895915, 1425898082, 1427101825, 1427102287, 1428621566, 1428621700, 
1429647119, 1429720434, 1399659960, 1466177340, 1401991080, 1379646000, 
1370960400, 1483051080, 1405706640, 1418752267, 1459531560, 1396985284, 
1375894800, 1479147120, 1393363920, 1412267940, 1369233900, 1455217320, 
1457547705, 1457553022, 1433262540, 1413561840, 1379646000, 1379646001, 
1391411708, 1394225940, 1452797700, 1457551280, 1457555076, 1390330440, 
1400775420, 1373994000, 1477593060, 1374166980, 1401299940, 1447954320, 
1456854360, 1393888860, 1472234940, 1489406928, 1489519785, 1493754358, 
1493768930, 1493920595, 1495185477, 1495192085, 1496331260, 1496332668, 
1502330689, 1504055913, 1504056565, 1504471969, 1504484502, 1446829680, 
1450458660, 1478808060, 1391558460, 1375894800, 1483040340, 1489421085, 
1489524492, 1500821550, 1500840149, 1379430000, 1397144701, 1397145001, 
1418064072, 1425499236, 1425499420, 1426550367, 1426705419, 1456257069, 
1456260660, 1471541820, 1407518280, 1477072440, 1443805740, 1425491220, 
1426570277, 1445317719, 1445333300, 1452196320, 1464295591, 1464296126, 
1465051334, 1465057497, 1438967700, 1398371160, 1380208260, 1472238360, 
1473868480, 1473870631, 1500912421, 1500916070, 1501200103, 1450372080, 
1429632420, 1417450780, 1425319859, 1425319875, 1425476848, 1425477093, 
1425490129, 1425490424, 1425497300, 1425497428, 1425497915, 1425499161, 
1426058600, 1426059054, 1458149820, 1416942495, 1399653120, 1411058361, 
1411058655, 1425581564, 1425582554, 1463157660, 1399655460, 1373994000, 
1423859700, 1426434141, 1426438127, 1418054706, 1423434226, 1423434240, 
1424431972, 1424432003, 1425225584, 1425226167, 1413566460, 1463156700, 
1457457262, 1464341296, 1464345439, 1438964700, 1385483040, 1392229993, 
1392230720, 1397856480, 1454698803, 1457554029, 1457557132, 1392415980, 
1394995835, 1394996053, 1394999500, 1394999554, 1398891904, 1398892060, 
1400546392, 1410656812, 1410657169, 1412681302, 1383667200, 1394063203, 
1394064827, 1394534414, 1394546963, 1394903943, 1394904559, 1404290854, 
1404295213, 1374166980, 1450460220, 1444934640, 1443814080, 1423862520, 
1425398953, 1425403026, 1482517054, 1382107500, 1406303160, 1401292020, 
1459530420, 1397852580, 1375806600, 1459355400, 1436547420, 1434128700, 
1455181152), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    EndDateTime_UTC = structure(c(1413555139, 1413555343, 1415095531, 
    1415095686, 1415789428, 1415789493, 1419394828, 1419395487, 
    1423434627, 1423434691, 1425208712, 1425238476, 1425895915, 
    1425898082, 1427101825, 1427102287, 1428621566, 1428621700, 
    1429647119, 1429720434, 1433927952, 1429640896, 1469111014, 
    1418051707, 1471543984, 1370960401, 1515430302, 1434199442, 
    1428708492, 1467434355, 1396986786, 1428517723, 1514146891, 
    1393363921, 1418055830, 1369233901, 1457547705, 1457553022, 
    1457825205, 1439032066, 1414773568, 1379646001, 1391411708, 
    1397736934, 1425766912, 1457551280, 1457555076, 1457560311, 
    1400599389, 1457359800, 1499495745, 1477593061, 1374166981, 
    1401396204, 1447954321, 1462853353, 1402068722, 1489406928, 
    1489519785, 1493754358, 1493768930, 1493915356, 1495185477, 
    1495192085, 1496331260, 1496332668, 1502328647, 1504055913, 
    1504056565, 1504471969, 1504484502, 1510842039, 1449871257, 
    1467438121, 1489417066, 1399047138, 1402529845, 1489421085, 
    1489524492, 1500821550, 1500840149, 1501331232, 1397144701, 
    1397145001, 1397146277, 1425499236, 1425499420, 1426550367, 
    1426550808, 1456257069, 1456260660, 1457306762, 1471541821, 
    1407522783, 1489406240, 1447974368, 1426563298, 1445317719, 
    1445333300, 1460371267, 1464295591, 1464296126, 1465051334, 
    1465057497, 1475823163, 1438969110, 1429569640, 1380208261, 
    1473868480, 1473870631, 1500912421, 1500916070, 1501192164, 
    1522212002, 1455503331, 1429632421, 1425319859, 1425319875, 
    1425476848, 1425477093, 1425490129, 1425490424, 1425497300, 
    1425497428, 1425497915, 1425499161, 1426058600, 1426059054, 
    1435973349, 1518286934, 1424550436, 1411058361, 1411058655, 
    1425581564, 1425582554, 1448200301, 1467374409, 1411930669, 
    1485696012, 1426434141, 1426438127, 1426439051, 1423434226, 
    1423434240, 1424431972, 1424432003, 1425225584, 1425226167, 
    1426387273, 1414861317, 1465669005, 1464341296, 1464345439, 
    1464345554, 1438964701, 1392229993, 1392230720, 1402254843, 
    1403572223, 1457554029, 1457557132, 1457575616, 1394995835, 
    1394996053, 1394999500, 1394999554, 1398891904, 1398892060, 
    1400546392, 1410656812, 1410657169, 1412681302, 1412892331, 
    1394063203, 1394064827, 1394534414, 1394546963, 1394903943, 
    1394904559, 1404290854, 1404295213, 1414516048, 1374166981, 
    1450460221, 1445287124, 1451876530, 1425398953, 1425403026, 
    1433699058, 1503067936, 1394903394, 1406514937, 1411706590, 
    1460054294, 1413725736, 1375806601, 1485062993, 1448650763, 
    1455181152, 1455181271), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -200L))

1 Ответ

2 голосов
/ 09 ноября 2019

В этих ситуациях мне нравится писать небольшую вспомогательную функцию, которая принимает время начала и окончания и делит ее на несколько меньших периодов при заданном размере блока:

library(tidyverse)
library(lubridate)

split_interval <- function(start, end, unit) {
  breaks <- seq(floor_date(start, unit), ceiling_date(end, unit), by = unit)
  timeline <- c(start, breaks[breaks > start & breaks < end], end)
  tibble(.start = head(timeline, -1), .end = tail(timeline, -1))
}

split_interval(now(), now() + hours(2), unit = "hours")
#> # A tibble: 3 x 2
#>   .start              .end               
#>   <dttm>              <dttm>             
#> 1 2019-11-08 23:33:18 2019-11-09 00:00:00
#> 2 2019-11-09 00:00:00 2019-11-09 01:00:00
#> 3 2019-11-09 01:00:00 2019-11-09 01:33:18

С этимв свою очередь, мы можем разделить интервалы в каждой строке, давая столбец списка, содержащий кадры данных периодов, в которые попадает каждая строка:

tbl_split <- tbl %>% 
  mutate(periods = map2(StartDateTime_UTC, EndDateTime_UTC,
                        split_interval, unit = "months"))

tbl_split
#> # A tibble: 3 x 5
#>   Tag     location   StartDateTime_UTC   EndDateTime_UTC     periods       
#>   <chr>   <fct>      <dttm>              <dttm>              <list>        
#> 1 5004.24 IN         2014-09-30 05:30:00 2014-10-17 14:12:19 <tibble [2 x ~
#> 2 5004.24 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43 <tibble [1 x ~
#> 3 5004.24 OUT        2014-10-17 14:15:43 2014-11-04 10:05:31 <tibble [2 x ~

(я использую только первые три строки в tblчтобы уменьшить вывод. Код в конце.)

Затем мы можем unnest() создать более длинный фрейм данных, в котором каждая исходная строка теперь разбита на несколько строк более коротких периодов:

tbl_split_long <- tbl_split %>% unnest(periods)

tbl_split_long %>% 
  select(location, .start, .end)
#> # A tibble: 5 x 3
#>   location   .start              .end               
#>   <fct>      <dttm>              <dttm>             
#> 1 IN         2014-09-30 05:30:00 2014-10-01 00:00:00
#> 2 IN         2014-10-01 00:00:00 2014-10-17 14:12:19
#> 3 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43
#> 4 OUT        2014-10-17 14:15:43 2014-11-01 00:00:00
#> 5 OUT        2014-11-01 00:00:00 2014-11-04 10:05:31

Теперь мы можем использовать начало каждого периода, чтобы определить год и месяц, и тогда для получения конечного результата достаточно сводной сводки:

tbl_split_long %>% 
  group_by(Tag, location, year = year(.start), month = month(.start)) %>% 
  summarize(duration = sum(difftime(.end, .start, units = "hours")))
#> # A tibble: 5 x 5
#> # Groups:   Tag, location, year [3]
#>   Tag     location    year month duration          
#>   <chr>   <fct>      <dbl> <dbl> <drtn>            
#> 1 5004.24 IN          2014     9  18.50000000 hours
#> 2 5004.24 IN          2014    10 398.20527778 hours
#> 3 5004.24 OUT         2014    10 345.73805556 hours
#> 4 5004.24 OUT         2014    11  82.09194444 hours
#> 5 5004.24 Unresolved  2014    10   0.05666667 hours

Данные:

tbl <- structure(
  list(
    Tag = c("5004.24", "5004.24", "5004.24"),
    location = structure(
      c(1L, 3L, 2L),
      .Label = c("IN", "OUT", "Unresolved"),
      class = "factor"
    ),
    StartDateTime_UTC = structure(
      c(1412055000, 1413555139, 1413555343),
      class = c("POSIXct", "POSIXt"),
      tzone = "UTC"
    ),
    EndDateTime_UTC = structure(
      c(1413555139, 1413555343, 1415095531),
      class = c("POSIXct", "POSIXt"),
      tzone = "UTC"
    )
  ),
  row.names = c(NA, 3L),
  class = c("tbl_df", "tbl", "data.frame")
)
...