Data manipulation, a kind of downsampling
0 votes
/ February 21, 2020

I have a large CSV file; a sample of the data is below. I will use an eight-match sample to illustrate.

home_team    away_team      home_score       away_score         year
belgium      france         2                2                  1990
brazil       uruguay        3                1                  1990
italy        belgium        1                2                  1990
sweden       mexico         3                1                  1990

france       chile          3                1                  1991
brazil       england        2                1                  1991
italy        belgium        1                2                  1991
chile        switzerland    2                2                  1991

My data spans many years. I would like the total score of each team for each year; see the example below.

team            total_scores          year
belgium         4                     1990
france          2                     1990
brazil          3                     1990
uruguay         1                     1990
italy           1                     1990
sweden          3                     1990
mexico          1                     1990

france          3                     1991
chile           3                     1991
brazil          2                     1991
england         1                     1991
italy           1                     1991
belgium         2                     1991
switzerland     2                     1991

Any thoughts?

Answers [ 4 ]

2 votes
/ February 21, 2020

Here is a solution using the tidyverse (dplyr and tidyr), in particular the pivot functions from tidyr:

library(tidyverse)

df %>% pivot_longer(cols = -year,   # split non-year columns into home/away and type columns
                    names_to = c("homeaway", "type"), 
                    names_sep = "_", 
                    values_to = "value", 
                    # coerce teams and scores to character so they can share one column;
                    # values_transform needs tidyr >= 1.1.0 (the values_ptypes cast used
                    # in the original answer may be rejected by newer versions)
                    values_transform = list(value = as.character)) %>% 
  select(-homeaway) %>%             # remove home/away
  pivot_wider(names_from = "type",  # restore team and score columns (as list columns)
              values_from = "value") %>% 
  unnest(cols = c(team, score)) %>% # unnest the list columns to year, team, score
  group_by(year, team) %>% 
  summarise(total_goals = sum(as.numeric(score)))

# A tibble: 14 x 3
# Groups:   year [2]
    year team        total_goals
   <int> <chr>             <dbl>
 1  1990 belgium               4
 2  1990 brazil                3
 3  1990 france                2
 4  1990 italy                 1
 5  1990 mexico                1
 6  1990 sweden                3
 7  1990 uruguay               1
 8  1991 belgium               2
 9  1991 brazil                2
10  1991 chile                 3
11  1991 england               1
12  1991 france                3
13  1991 italy                 1
14  1991 switzerland           2
1 vote
/ February 21, 2020

Adding a solution that uses only dplyr.

 library(dplyr)

 bind_rows(
   select(df, team = home_team, score = home_score, year),
   select(df, team = away_team, score = away_score, year)
 ) %>% 
   group_by(team, year) %>% 
   summarise(total_scores = sum(score))
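
For comparison, the same stack-then-sum idea can be sketched in base R with no packages. This is only an illustration, not part of the answer above: it assumes the sample data from the question is stored in a data frame `df`, and the names `long` and `totals` are made up for the example.

```r
# Sample data from the question (assumed to be stored in `df`)
df <- data.frame(
  home_team  = c("belgium", "brazil", "italy", "sweden",
                 "france", "brazil", "italy", "chile"),
  away_team  = c("france", "uruguay", "belgium", "mexico",
                 "chile", "england", "belgium", "switzerland"),
  home_score = c(2, 3, 1, 3, 3, 2, 1, 2),
  away_score = c(2, 1, 2, 1, 1, 1, 2, 2),
  year       = rep(c(1990, 1991), each = 4),
  stringsAsFactors = FALSE
)

# Stack home and away records: one row per team per match
long <- data.frame(
  team  = c(df$home_team, df$away_team),
  score = c(df$home_score, df$away_score),
  year  = c(df$year, df$year),
  stringsAsFactors = FALSE
)

# Sum the scores within each (team, year) pair
totals <- aggregate(score ~ team + year, data = long, FUN = sum)
names(totals)[names(totals) == "score"] <- "total_scores"

# Order by year, then team, to match the layout of the other answers
totals <- totals[order(totals$year, totals$team), ]
print(totals, row.names = FALSE)
```

The key step in both versions is the reshaping: once each match contributes two rows (one per team), the yearly totals are an ordinary grouped sum.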
1 vote
/ February 21, 2020

Here is another solution in R.

#Packages needed
library(dplyr)
library(magrittr)
library(tidyr)

#Your data
home_team <- c("belgium", "brazil", "italy", "sweden",
               "france", "brazil", "italy", "chile")
away_team <- c("france", "uruguay", "belgium", "mexico",
               "chile", "england", "belgium", "switzerland")
home_score <- c(2,3,1,3,
                3,2,1,2)
away_score <- c(2,1,2,1,
                1,1,2,2)
year <- c(1990, 1990, 1990, 1990,
          1991, 1991, 1991, 1991)

df <- data.frame(home_team, away_team, home_score, away_score, year, stringsAsFactors = FALSE)

df

#   home_team   away_team home_score away_score year
# 1   belgium      france          2          2 1990
# 2    brazil     uruguay          3          1 1990
# 3     italy     belgium          1          2 1990
# 4    sweden      mexico          3          1 1990
# 5    france       chile          3          1 1991
# 6    brazil     england          2          1 1991
# 7     italy     belgium          1          2 1991
# 8     chile switzerland          2          2 1991


#Column names for the new data.frames
my_colnames <- c("team", "score", "year")

#Using select() to create separate home and away datasets
df_home <- df %>% select(matches("home|year")) %>% setNames(my_colnames) %>% mutate(game_where = "home")
df_away <- df %>% select(matches("away|year")) %>% setNames(my_colnames) %>% mutate(game_where = "away")

#rbind()'ing both data.frames
#Grouping the rows together first by the team and then by the year
#Summing up the scores for the aforementioned groupings
#Sorting the newly produced data.frame by year
df_1 <- rbind(df_home, df_away) %>% group_by(team, year) %>% tally(score) %>% arrange(year)

df_1 

 #   team         year     n
 #   <chr>       <dbl> <dbl>
 # 1 belgium      1990     4
 # 2 brazil       1990     3
 # 3 france       1990     2
 # 4 italy        1990     1
 # 5 mexico       1990     1
 # 6 sweden       1990     3
 # 7 uruguay      1990     1
 # 8 belgium      1991     2
 # 9 brazil       1991     2
 #10 chile        1991     3
 #11 england      1991     1
 #12 france       1991     3
 #13 italy        1991     1
 #14 switzerland  1991     2
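
One way to sanity-check a summary like `df_1`: every goal belongs to exactly one team in one year, so the per-year sum of the team totals should equal the per-year sum of `home_score + away_score` in the original table. A small base R sketch of that check follows; the summary values are copied by hand from the output above, and the names `expected`, `observed`, and `check` are hypothetical.

```r
# Per-year goals in the original match table (home + away), values from the question
home_score <- c(2, 3, 1, 3, 3, 2, 1, 2)
away_score <- c(2, 1, 2, 1, 1, 1, 2, 2)
year       <- rep(c(1990, 1991), each = 4)
expected   <- tapply(home_score + away_score, year, sum)

# Per-team totals copied from the summary output above
summary_year <- rep(c(1990, 1991), each = 7)
summary_n    <- c(4, 3, 2, 1, 1, 3, 1,   # 1990
                  2, 2, 3, 1, 3, 1, 2)   # 1991
observed     <- tapply(summary_n, summary_year, sum)

# TRUE when no goal was dropped or double-counted
check <- all(observed == expected)
print(check)
```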
1 vote
/ February 21, 2020

You can try:

library(dplyr)

setNames(rbind(df[,c(1,3,5)], 
               setNames(df[,c(2,4,5)], names(df[,c(1,3,5)]))), 
         c("Country", "Goals", "Year")) %>%
  group_by(Year, Country) %>% 
  summarize(Total = sum(Goals))
#> # A tibble: 14 x 3
#> # Groups:   Year [2]
#>     Year Country     Total
#>    <int> <chr>       <int>
#>  1  1990 belgium         4
#>  2  1990 brazil          3
#>  3  1990 france          2
#>  4  1990 italy           1
#>  5  1990 mexico          1
#>  6  1990 sweden          3
#>  7  1990 uruguay         1
#>  8  1991 belgium         2
#>  9  1991 brazil          2
#> 10  1991 chile           3
#> 11  1991 england         1
#> 12  1991 france          3
#> 13  1991 italy           1
#> 14  1991 switzerland     2

Created on 2020-02-21 by the reprex package (v0.3.0)

...