извлечение информации из Excel в списки в R - PullRequest
0 голосов
/ 10 марта 2020

Привет всем, у меня есть этот набор данных:

> dput(test1)
structure(list(startdate = c("2019-11-06", "2019-11-06", "2019-11-06", 
"2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", 
"2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", 
"2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", 
"2019-11-06", "2019-11-06", "2019-11-06", "2019-11-27", "2019-11-27", 
"2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", 
"2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", 
"2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", 
"2019-11-27", "2019-11-27", "2019-11-01", "2019-11-05", "2019-11-15", 
"2019-11-16", "2019-11-17", "2019-11-18", "2019-11-19", "2019-11-20", 
"2019-11-21", NA), id = c("POL55", "POL56", "POL57", "POL58", 
"POL59", "POL60", "POL61", "POL62", "POL63", "POL64", "POL65", 
"POL66", "POL67", "POL68", "POL69", "POL56", "POL57", "POL58", 
"POL59", "POL60", "POL61", "POL55", "POL56", "POL57", "POL58", 
"POL59", "POL60", "POL61", "POL55", "POL56", "POL57", "POL58", 
"POL59", "POL60", "POL61", "POL55", "POL56", "POL57", "POL58", 
"POL59", "POL60", "POL61", "POL62", "POL63", "POL64", "POL65", 
"POL66", "POL67", "POL68", NA), m0_9 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 
33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), m10_19 = c(NA, 
NA, NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 
3, 98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), m20_29 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, 
NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, 
NA, NA, NA, NA, NA), m30_39 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 
98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), m40_49 = c(32, 34, NA, NA, 
NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), m50_59 = c(NA, 
NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 
7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), m60_69 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 
1, 65, 3, 98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), m70 = c(NA, NA, NA, NA, NA, NA, 32, 
34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), f0_9 = c(32, 34, NA, 
NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), f10_19 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA, 55, 
3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), f20_29 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 
98, 33, NA, NA, NA), f30_39 = c(NA, NA, NA, 32, 34, NA, NA, NA, 
NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), f40_49 = c(NA, NA, NA, NA, 
NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 
98, 33, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA, 
55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA), f50_59 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA, 
55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), f60_69 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA, 
55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), f70 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 
98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))

Я хотел бы создать список с именем ageCat. Этот список должен содержать несколько списков. Количество списков - это количество возрастных категорий. Затем для каждой возрастной категории я хотел бы извлечь следующую информацию startAge, endAge, maleCount,femaleCount, totalCount.

Кроме того, я хочу подвести итог только тем лицам, которые имеют одинаковый идентификатор и дату начала. На данный момент я написал это:

создать список возраста

createLists <- function(startdate, id){
  testFiltered = test1[policyid == id & start == startdate]

  ageGroup <- vector("list", length == 8)
  names(ageGroup) <- as.character(seq_along(ageGroup))

  for(ageCat in seq_along(ageGroup)){
    ageGroup[[ageCat]] <- getAgeInfo(testFiltered, ageCat)
  }

  getAgeInfo <- function(testFiltered, ageCat){
   start =
    end = 
    nomales =
    nofemales = 
  }

  ageGroup <- list(startAge = start,
                   endAge = end , 
                   maleCount = nomales ,
                   femaleCount = nofemales)
}

Я жестко закодировал длину vecor ageGroup. Как я могу сделать это без жесткого кодирования, ака. посмотреть, сколько столбцов с возрастными категориями у меня есть для каждого пола?

Во-вторых, как мне извлечь информацию startAge, endAge, maleCount,femaleCount, totalCount

1 Ответ

2 голосов
/ 10 марта 2020

Вместо работы со списками я предлагаю преобразовать ваш data.frame в длинный формат, избавиться от пропущенных значений и выделить пол и возраст. Подход 'tidyverse' может выглядеть следующим образом:

library(dplyr)
library(tidyr)
library(tibble)

df <- tibble(
  startdate = c(
    "2019-11-06", "2019-11-06", "2019-11-06",
    "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06",
    "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06",
    "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-06",
    "2019-11-06", "2019-11-06", "2019-11-06", "2019-11-27", "2019-11-27",
    "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27",
    "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27",
    "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27", "2019-11-27",
    "2019-11-27", "2019-11-27", "2019-11-01", "2019-11-05", "2019-11-15",
    "2019-11-16", "2019-11-17", "2019-11-18", "2019-11-19", "2019-11-20",
    "2019-11-21", NA
  ),
  id = c(
    "POL55", "POL56", "POL57", "POL58",
    "POL59", "POL60", "POL61", "POL62", "POL63", "POL64", "POL65",
    "POL66", "POL67", "POL68", "POL69", "POL56", "POL57", "POL58",
    "POL59", "POL60", "POL61", "POL55", "POL56", "POL57", "POL58",
    "POL59", "POL60", "POL61", "POL55", "POL56", "POL57", "POL58",
    "POL59", "POL60", "POL61", "POL55", "POL56", "POL57", "POL58",
    "POL59", "POL60", "POL61", "POL62", "POL63", "POL64", "POL65",
    "POL66", "POL67", "POL68", NA
  ),
  m0_9 = c(
    NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98,
    33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ),
  m10_19 = c(
    NA,
    NA, NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65,
    3, 98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ),
  m20_29 = c(
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA,
    NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA,
    NA, NA, NA, NA, NA
  ),
  m30_39 = c(
    NA, NA, NA, NA, NA, NA, NA, NA,
    NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3,
    98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA
  ),
  m40_49 = c(
    32, 34, NA, NA,
    NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ),
  m50_59 = c(
    NA,
    NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA,
    7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ), m60_69 = c(
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9,
    1, 65, 3, 98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA
  ), m70 = c(
    NA, NA, NA, NA, NA, NA, 32,
    34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ), f0_9 = c(
    32, 34, NA,
    NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ), f10_19 = c(
    NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA, 55,
    3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ), f20_29 = c(
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3,
    98, 33, NA, NA, NA
  ), f30_39 = c(
    NA, NA, NA, 32, 34, NA, NA, NA,
    NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA
  ), f40_49 = c(
    NA, NA, NA, NA,
    NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3,
    98, 33, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA,
    55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA
  ), f50_59 = c(
    NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA,
    55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
  ), f60_69 = c(
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, 34, NA, NA, NA, NA,
    55, 3, NA, NA, NA, 7, 9, 1, 65, 3, 98, 33, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA
  ), f70 = c(
    NA, NA, NA, NA, NA, NA, NA, NA,
    NA, 32, 34, NA, NA, NA, NA, 55, 3, NA, NA, NA, 7, 9, 1, 65, 3,
    98, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA
  )
)

# Convert to tidy data frame
df_age <- df %>%
  gather(age_sex, count, -startdate, -id) %>%
  filter(!is.na(count)) %>%
  extract(age_sex, into = c("sex", "start_age", "end_age"), regex = "(m|f)(\\d+)_?(\\d+)?", remove = FALSE) %>%
  mutate(ageg = paste0(start_age, "_", end_age))
df_age
#> # A tibble: 187 x 8
#>    startdate  id    age_sex sex   start_age end_age count ageg 
#>    <chr>      <chr> <chr>   <chr> <chr>     <chr>   <dbl> <chr>
#>  1 2019-11-27 POL55 m0_9    m     0         9          32 0_9  
#>  2 2019-11-27 POL56 m0_9    m     0         9          34 0_9  
#>  3 2019-11-27 POL61 m0_9    m     0         9          55 0_9  
#>  4 2019-11-27 POL55 m0_9    m     0         9           3 0_9  
#>  5 2019-11-27 POL59 m0_9    m     0         9           7 0_9  
#>  6 2019-11-27 POL60 m0_9    m     0         9           9 0_9  
#>  7 2019-11-27 POL61 m0_9    m     0         9           1 0_9  
#>  8 2019-11-27 POL55 m0_9    m     0         9          65 0_9  
#>  9 2019-11-27 POL56 m0_9    m     0         9           3 0_9  
#> 10 2019-11-27 POL57 m0_9    m     0         9          98 0_9  
#> # ... with 177 more rows

# df back to nested list by startdate and ageg
df_list <- df_age %>%
  # Count by startdate, ageg, start_age, end_age, sex
  count(startdate, ageg, start_age, end_age, sex, wt = count) %>% 
  # male and female counts back in columns
  spread(sex, n, fill = 0) %>% 
  # split by startdate
  split(.$startdate) %>% 
  # ... and split each startdate list by ageg
  lapply(function(x) split(x, x$ageg))

Создан в 2020-03-10 пакетом Представить (v0.3.0)

Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...