R Ограниченная выборка - PullRequest
0 голосов
/ 08 февраля 2020
DATA1 = data.frame("ID" = c("f","e","b","a","e","d","e","f","c","d","d","c","d","b","e","b","d","b","e","e","b","a","b","e","a","d","a","d","b","f","b","e","b","d","e","d","b","e","f","a","b","b","f","e","c","a","b","d","c","d","e","e","f","e","a","b","b","c","b","a","b","f","a","b","c","e","d","a","e","d","a","f","b","d","e","b","f","e","f","f","c","b","f","c","b","e","e","f","e","b","f","f","b","e","c","a","e","c","d","b"),
                   "CLASS" = c(4,1,2,4,1,4,3,2,1,1,2,4,2,2,3,1,4,1,2,4,2,2,1,1,1,3,4,4,4,3,3,2,3,2,2,2,3,4,1,2,4,1,1,3,3,2,2,2,4,4,3,3,1,1,4,2,3,2,4,1,4,3,2,3,4,3,3,2,3,4,4,1,4,1,2,3,4,1,2,1,3,4,4,3,1,2,4,2,3,2,4,2,2,1,1,4,1,3,1,1),
                   "SCORE" = c(59,65,61,64,91,91,70,90,64,87,51,54,92,76,75,78,55,99,66,57,88,89,77,66,100,92,80,84,52,66,59,71,56,88,51,97,65,89,65,67,52,57,51,63,67,79,51,90,79,54,90,55,90,72,64,52,95,61,87,54,91,75,80,93,53,81,87,85,84,84,81,93,100,51,70,64,51,54,83,96,65,61,53,80,68,73,52,57,96,55,63,97,94,77,63,98,85,97,65,77))


DATA2 = data.frame("CLASS" = c(1,2,3,4),
                   "S" = c(2,5,3,1))

У меня есть набор данных 'data1' и 'data2'.

Я хочу go в 'data1' и foreach 'id' в 'data1' Я хочу t ":

     RANDOM SAMPLE 'S' ROWS FOR EACH 'CLASS' WHERE VALUE FOR 'S' IS GOTTEN FROM 'DATA2'. 

Например, затем - в классе 'data2' = 2, s = 5. поэтому для каждого 'id' в 'data1' я хочу выбрать случайную выборку из 5 строк, где class = 2

1 Ответ

1 голос
/ 08 февраля 2020

Решение с использованием dplyr:

library(dplyr)
DATA1 %>% 
  left_join(DATA2) %>% 
  group_by(CLASS, S) %>% 
  sample_n(S) %>% 
  select(ID, CLASS, SCORE)

возвращает:

  # A tibble: 11 x 4
  # Groups:   CLASS, S [4]
        S ID    CLASS SCORE
    <dbl> <fct> <dbl> <dbl>
  1     2 f         1    90
  2     2 b         1    78
  3     5 f         2    83
  4     5 a         2    85
  5     5 b         2    55
  6     5 b         2    51
  7     5 e         2    70
  8     3 c         3    97
  9     3 e         3    96
 10     3 c         3    67
 11     1 b         4    52

Редактировать:

Я стремлюсь реализовать это для каждого идентификатора

Это будет невозможно, поскольку не всегда есть S наблюдений на комбинацию ID-CLASS, как показано в выходных данных:

DATA1 %>% 
  left_join(DATA2) %>% 
  group_by(ID, CLASS, S) %>% 
  summarise(N=n()) %>% 
  mutate(test = ifelse(S > N, "S is to large", ""))

, которые выдают:

   # A tibble: 23 x 5
   # Groups:   ID, CLASS [23]
      ID    CLASS     S     N test           
      <fct> <dbl> <dbl> <int> <chr>          
    1 a         1     2     2 ""             
    2 a         2     5     5 ""             
    3 a         4     1     5 ""             
    4 b         1     2     6 ""             
    5 b         2     5     7 ""             
    6 b         3     3     6 ""             
    7 b         4     1     6 ""             
    8 c         1     2     2 ""             
    9 c         2     5     1 "S is to large"
   10 c         3     3     4 "" 

В противном случае решение будет также группироваться по ID:

DATA1 %>% 
  left_join(DATA2) %>% 
  group_by(ID, CLASS, S) %>% 
  sample_n(S) %>% 
  select(ID, CLASS, SCORE)
...