Как извлекать из большого набора данных почасовые ежедневные средние значения по уровню из факторной переменной и оценивать другие статистические данные - PullRequest
1 голос
/ 10 июля 2019

Фрейм данных df1 суммирует различные даты, когда люди посещали общественный туалет в течение определенного периода времени (скажем, между "2017-06-01" и "2017-06-30").Столбец Zone указывает область, где был размещен туалет, и имеет коэффициент с двумя уровнями: A (зона для вечеринок) или B (зона для проживания).

Я показываю ниже воспроизводимый пример того, что у меня есть.Этот пример содержит всего два дня, чтобы уменьшить размер набора данных примера.Чтобы создать df1, мне сначала нужно было создать 4 отдельных фрейма данных, а затем связать их для создания фрейма данных df1 (у меня возникла ошибка при попытке создать df1 сразу).df1 имеет 193 строки.

options(digits.secs=3)
day_1_A<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 00:22:43.456","2017-06-01 00:34:43.456","2017-06-01 00:45:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:17:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
                 ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))

day_1_B<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:15:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
                 ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))

day_2_A<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:17:43.456","2017-06-02 00:48:43.456","2017-06-02 01:15:23.275","2017-06-02 01:52:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
                 ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))

day_2_B<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:04:17.986","2017-06-02 01:15:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
                 ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))


df1<- rbind(day_1_A,day_1_B,day_2_A,day_2_B)
df1

> df1
                   Datetime ToiletZone
1   2017-06-01 00:04:17.986          A
2   2017-06-01 00:17:43.455          A
3   2017-06-01 00:22:43.455          A
4   2017-06-01 00:34:43.455          A
5   2017-06-01 00:45:43.455          A
6   2017-06-01 01:15:23.275          A
.               .                    .
.               .                    .
.               .                    .
193 2017-06-02 23:41:32.608          B

По некоторым причинам, которые я не буду здесь объяснять, мне нужно рассчитать для КАЖДОГО ДНЯ и для КАЖДОЙ ЗОНЫ статистику под названием θ, которую можно определить каккоэффициент деления «среднечасового количества посещений туалета в течение дня» (Hourly_daily_μ) на «среднечасовое количество посещений за весь интересующий период» (Overall_hourly_μ).

Я показываю на рисунке то, что ожидал от предыдущего примера (столбцы Hourly_daily_μ_A, Hourly_daily_μ_B, Overall_hourly_μ_A и Overall_hourly_μ_A включены для пояснения расчетов. Мне действительно нужны столбцы θ_A иθ_B): enter image description here

Почему Hourly_daily_μ_A это 51/24 на 2017-06-01?Потому что в этот день в туалет пришли 51 человек.Следовательно, если мы разделим между 24, мы получим среднечасовое число людей, которые ходили в туалет в этот день.

Почему Overall_hourly_μ_A одинаково для каждой зоны в разные дни?Потому что это общее среднее значение для каждой зоны.Здесь мы хотим знать, каково общее количество людей, которые ходят в туалет в час.В этом примере мы знаем, что 99 человек ходили в туалет в период с 1 июня по 2 июня в зоне А. Таким образом, мы делим это на общее количество часов (в примере 48 часов) и получаем общее среднее почасовое значение.людей, которые ходят в туалет в зоне А. Это уникальное значение для каждой зоны.

Почему θ_A равно (51 * 48) / (24 * 99) на 2017-06-01?Потому что результат деления Hourly_daily_μ_A (51/24) на Overall_hourly_μ_A (99/48).

Кто-нибудь знает, как это сделать?Мой фрейм данных довольно большой, поэтому я думаю, что пакет data.table может быть хорошим вариантом.

Ответы [ 2 ]

1 голос
/ 10 июля 2019

Я думаю, вам нужно только указать даты в единицах дня, а затем использовать их для группировки.С data.table:

setDT(df1)

df1[, Date := floor_date(Datetime, "day")]

daily <- df1[, .(DailyCount = .N, DailyAvg = .N / 24), by = .(ToiletZone, Date)]
overall <- daily[, .(Total = sum(DailyCount) / (.N * 24)), by = .(ToiletZone)]

overall[daily, .(ToiletZone, Date, Theta = DailyAvg / Total), on = "ToiletZone"]
   ToiletZone       Date     Theta
1:          A 2017-06-01 1.0303030
2:          B 2017-06-01 1.0212766
3:          A 2017-06-02 0.9696970
4:          B 2017-06-02 0.9787234

И ежечасно будет похоже, просто измените floor_date и настройте некоторые знаменатели:

df1[, Date := floor_date(Datetime, "hour")]

hourly <- df1[, .(HourlyCount = .N), by = .(ToiletZone, Date)]
overall <- hourly[, .(Total = sum(HourlyCount) / .N), by = "ToiletZone"]

ans <- overall[hourly, .(ToiletZone, Date, Theta = HourlyCount / Total), on = "ToiletZone"]

Кстати, последние строки являются объединением, вы можетедумайте о них как о левом соединении, соответственно, daily и hourly как о левой таблице.

1 голос
/ 10 июля 2019

Опция будет делать группирование по частоте, делать некоторые вычисления, чтобы получить ожидаемый результат

library(dplyr)
library(tidyr)
library(lubridate)
df1 %>% 
  mutate(Date = floor_date(Datetime, "hour")) %>% 
  group_by(ToiletZone, Date) %>% 
  mutate(hourlyCount = n(), HourlyAvg = hourlyCount/24) %>% 
  group_by(ToiletZone) %>% 
  mutate(Total = sum(hourlyCount)/ n() * 24) %>% 
  group_by(Date = as.Date(Date), add = TRUE) %>% 
  summarise(Theta = hourlyCount[1]/Total[1]) %>%
  spread(ToiletZone, Theta)
...