Фиксированный эффект страны-года (только с одним уровнем) в цикле регрессии - PullRequest
0 голосов
/ 05 сентября 2018

Я использовал следующий код для запуска регрессии:

res <- lm (c241 ~ x + matchcode , data = df)
summary(res)

Где matchcode - это переменная, которая представляет собой комбинацию кода Iso3c и года. Для России 2005 это, например, RUS 2005 (см. Первую переменную таблицы). Идея состоит в том, чтобы использовать этот код совпадения как фиксированный эффект, как lm выше. Применение lm выше отлично работает

Поскольку у меня огромные наборы данных (всего более 4000 переменных):

# A tibble: 450 x 546
   matchcode idstd year  country wt    region income industry sector ownership exporter c201  c202  c203a c203b c203c c203d c2041 c2042 c205a  c205b1 c205b2 c205b3 c205b4 c205b5 c205b6 c205b7 c205b8 c205b9 c205b10 c205c c205d c206a c206b c2071
   <chr+lbl> <dbl> <dbl> <chr+l> <dbl> <dbl+> <dbl+> <dbl+lb> <dbl+> <dbl+lbl> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 "BGD 200~  2128 2002  Bangla~ NA    6      1       8       1      2         2        1988  4     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1    NA    2    
 2 "BGD 200~  2926 2002  Bangla~ NA    6      1       1       1      2         2        2000  1     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1    NA    NA    1    
 3 "BGD 200~  2931 2002  Bangla~ NA    6      1       1       1      2         1        1993  4     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1    NA    NA    2    
 4 "BRA 200~ 15303 2003  Brazil~ NA    4      2       9       1      2         2        1946  2     100     0   0      0     2    NA     18.72  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       2     2     1     2    5    
 5 "BRA 200~ 14917 2003  Brazil~ NA    4      2       8       1      2         2        1984  2     100     0   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA      1       1     1     1     2    3    
 6 "BRA 200~ 14212 2003  Brazil~ NA    4      2      11       1      2         2        1998  2     100     0   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA      1       1     1     1     2    2    
 7 "KHM 200~ 16067 2003  Cambod~ NA    2      1      23       2      1         1        1993  4      50    50   0      0     2    NA    100.00  1     NA      1      1     NA     NA     NA     NA     NA     NA       1     1     1     1    1    
 8 "KHM 200~ 16233 2003  Cambod~ NA    2      1      10       4      2         2        1989  4     100     0   0      0     2    NA    100.00  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1     2    3    
 9 "KHM 200~ 16002 2003  Cambod~ NA    2      1       3       1      1         1        1990  5       0   100   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1     1    1    
10 "CHN 200~ 17987 2002  China2~ NA    2      2       8       1      1         2        1993  6      55    45   0     NA    NA    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA

Я хотел перебрать переменные следующим образом LINK ;

dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(y~ x, na.action=na.omit)))), .SDcols = -1]

Тем не менее, в data.table создается переменная data.table NULL для переменной matchcode (а также для любых других символьных переменных).

Example of NULL

Когда я пытаюсь добавить matchcode к регрессии:

dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]

или я использую лапу с matchcode:

df <- lapply( df[,-1], function(x) summary(lm(df$c241~ x + df$matchcode)) )

выдает следующую «знаменитую» ошибку:

 Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels 

Хотя я читал, что эта ошибка может означать что угодно, у моего фактора действительно только один уровень, но, похоже, он отлично работает в единственной регрессии (также хорошо добавлять другие переменные, которые не являются символами). При использовании цикла data.table или lapply это не так. У меня вопрос двоякий:

1) Почему переменная matchcode работает в первой ситуации (res <- lm (c241 ~ x + matchcode , data = df), а не во второй dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]?

2) Что я могу сделать, чтобы обойти это? Поскольку переменная имеет первостепенное значение для модели.

Может быть, я смогу преобразовать символьную переменную или каким-то образом перекодировать ее?

ОБНОВЛЕНИЕ: Я использовал код по этой ссылке: ССЫЛКА , чтобы преобразовать символы в фактор с одним уровнем, что в итоге привело к той же ошибке.

 ES1sample <- dput(head(ES1sample[, ],10))
structure(list(matchcode = structure(c("BGD 2002 ", "BRA 2003 ", 
"KHM 2003 ", "CHN 2002 ", "CHN 2003 ", "ECU 2003 ", "ERI 2002 ", 
"ETH 2002 ", "GTM 2003 ", "HND 2003 "), label = "", class = c("labelled", 
"character")), idstd = structure(c(2760, 14273, 16104, 17039, 
19095, 22207, 23063, 24046, 25420, 26212), label = "WEB STD FIRMID", format.stata = "%5.0f", class = c("labelled", 
"numeric")), year = structure(c(2002, 2003, 2003, 2002, 2003, 
2003, 2002, 2002, 2003, 2003), format.stata = "%9.0g", label = "", class = c("labelled", 
"numeric")), country = structure(c("Bangladesh2002", "Brazil2003", 
"Cambodia2003", "China2002", "China2003", "Ecuador2003", "Eritrea2002", 
"Ethiopia2002", "Guatemala2003", "Honduras2003"), label = "Country", format.stata = "%21s", class = c("labelled", 
"character")), wt = structure(c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), label = "locations and sectors weights", format.stata = "%9.0g", class = c("labelled", 
"numeric")), region = structure(c(6, 4, 2, 2, 2, 4, 1, 1, 4, 
4), label = "", class = c("labelled", "numeric")), income = structure(c(1, 
2, 1, 2, 2, 2, 1, 1, 2, 2), label = "income grouping for survey year", class = c("labelled", 
"numeric")), industry = structure(c(1, 1, 20, 3, 20, 12, 7, 3, 
NA, 3), label = "Industry", class = c("labelled", "numeric")), 
    sector = structure(c(1, 1, 2, 1, 2, 1, 1, 1, 2, 1), label = "Sector", class = c("labelled", 
    "numeric")), ownership = structure(c(2, 2, 2, 2, 2, 2, 2, 
    2, 1, 1), label = "Ownership", class = c("labelled", "numeric"
    )), exporter = structure(c(2, 2, 2, 2, 2, 1, 2, 2, 2, 1), label = "Export", class = c("labelled", 
    "numeric")), c201 = structure(c(1991, 1993, 1999, 1979, 1998, 
    1997, 1996, 1998, 1990, 1998), label = "Year firm began operations in this country", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c202 = structure(c(2, 2, 4, 6, 6, 2, 6, NA, 
    NA, NA), label = "Current legal status of firm", class = c("labelled", 
    "numeric")), c203a = structure(c(100, 100, 100, 100, 0, 100, 
    0, 100, 0, 0), label = "Percentage of firm owned by domestic private sector", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203b = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 
    100, 100), label = "Percentage of firm owned by foreign private sector", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203c = structure(c(0, 0, 0, 0, 100, 0, 0, 0, 
    0, 0), label = "Percentage of firm owned by government/state", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203d = structure(c(NA, 0, 0, NA, NA, 0, 100, 
    0, 0, 0), label = "Percentage of firm owned by other types of owner", format.stata = "%8.2f", class = c("labelled", 
    "numeric")), c2041 = structure(c(2, 2, 2, NA, NA, 2, 2, 2, 
    2, 2), label = "Firm previously owned by government?", class = c("labelled", 
    "numeric")), c2042 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Year of privatization", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c205a = structure(c(NA, 25, 100, NA, 100, 100, 
    NA, NA, 100, 100), label = "Percentage owned by largest shareholder", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c205b1 = structure(c(NA, NA, NA, NA, NA, NA, 
    NA, NA, 1, 1), label = "Largest shareholder: individual", class = c("labelled", 
    "numeric")), c205b2 = structure(c(NA, 1, 1, NA, NA, 1, NA, 
    NA, NA, NA), label = "Largest shareholder: family", class = c("labelled", 
    "numeric")), c205b3 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: domestic company", class = c("labelled", 
    "numeric")), c205b4 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: foreign company", class = c("labelled", 
    "numeric")), c205b5 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: bank", class = c("labelled", 
    "numeric")), c205b6 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: investment fund", class = c("labelled", 
    "numeric")), c205b7 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: firm managers", class = c("labelled", 
    "numeric")), c205b8 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: firm employees", class = c("labelled", 
    "numeric")), c205b9 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: government", class = c("labelled", 
    "numeric")), c205b10 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: other", class = c("labelled", 
    "numeric")), c205c = structure(c(1, 1, 1, NA, 2, 1, 1, 1, 
    1, 1), label = "Is principal shareholder also the manager/director?", class = c("labelled", 
    "numeric")), c205d = structure(c(1, 1, 1, NA, NA, 2, 1, 1, 
    1, 1), label = "Is the principal owner male?", class = c("labelled", 
    "numeric")), c206a = structure(c(1, 1, 1, NA, NA, 1, 0, 1, 
    2, 3), label = "Number of separate operating facilities in this country", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c206b = structure(c(NA, 2, 2, NA, NA, 2, 2, 
    2, 2, 1), label = "Operations in other countries?", class = c("labelled", 
    "numeric")), c2071 = structure(c(1, 5, 1, 1, 2, 1, 5, 1, 
    1, 3), label = "Location of establishment", class = c("labelled", 
    "numeric")), c2072 = structure(c(NA, NA, NA, NA, NA, 1, 5, 
    NA, 1, 3), label = "Location of headquarters", class = c("labelled", 
    "numeric")), c208 = structure(c(NA, 1723, NA, NA, NA, NA, 
    NA, NA, NA, 1810), label = "Main product line", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c209a = structure(c(NA, 1, 2, NA, NA, NA, NA, 
    NA, NA, NA), label = "Other income generating activities", class = c("labelled", 
    "numeric")), c209ba = structure(c(NA, 0, NA, NA, NA, 100, 
    NA, NA, NA, NA), label = "Percent workers' time: manufacturing", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bb = structure(c(NA, 0, NA, NA, NA, 0, NA, 
    NA, NA, NA), label = "Percent workers' time: services", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bc = structure(c(NA, 0, NA, NA, NA, NA, 
    NA, NA, NA, NA), label = "Percent workers' time: commerce (retail/wholesale trade)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bd = structure(c(NA, 0, NA, NA, NA, NA, 
    NA, NA, NA, NA), label = "Percent workers' time: construction", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209be = structure(c(NA, 1, NA, NA, NA, 0, NA, 
    NA, NA, NA), label = "Percent workers' time: other", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c210a = structure(c(NA, 50, 1, NA, NA, NA, NA, 
    NA, 65, 0), label = "Main product line: firm's share of local market", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c210b = structure(c(12, 25, 1, 6, 44.6399993896484, 
    50, NA, 5, 45, 0), label = "Main product line: firm's share of national market", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c211a1 = structure(c(100, 100, 100, 99, 100, 
    90, 100, 100, 95, 0), label = "Percent of sales sold domestically", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211a2 = structure(c(0, 0, 0, 0, 0, 10, 0, 0, 
    5, 100), label = "Percent of sales exported directly", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211a3 = structure(c(0, 0, 0, 1, 0, 0, NA, 0, 
    0, 0), label = "Percent of sales exported indirectly", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211b1 = structure(c(NA, 0, 0, 0, 0, 0, NA, 
    NA, NA, NA), label = "Percentage of domestic sales to government", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c282a2y = structure(c(451645, NA, NA, 49609, 
    1061449, 21, 38966.6171875, 43.0904998779297, NA, NA), label = "Total liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282b2y = structure(c(NA, NA, NA, 393, 81012, 
    1, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282c2y = structure(c(NA, NA, NA, 55193, 980437, 
    20, 7687.90576171875, NA, NA, NA), label = "Short-term liabilities (<1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282d2y = structure(c(NA, NA, 5500, 55193, 19194, 
    NA, 0, NA, NA, NA), label = "Payable short-term liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282e2y = structure(c(10000, NA, NA, 5000, 20329, 
    12, 29826.701171875, 40, NA, NA), label = "Equity–share capital 2 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282f2y = structure(c(305436, NA, 5500, 916, 
    NA, NA, 1452.00903320312, 6.09049987792969, NA, NA), label = "Retained earnings 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282a3y = structure(c(456618, NA, NA, NA, 1063217, 
    16, 39676.3984375, 43.7501029968262, NA, NA), label = "Total liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282b3y = structure(c(NA, NA, NA, NA, 152156, 
    5, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282c3y = structure(c(NA, NA, NA, NA, 911061, 
    11, 6299.255859375, NA, NA, NA), label = "Short-term liabilities (<1 year) 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282d3y = structure(c(NA, NA, NA, NA, 20964, 
    NA, 0, NA, NA, NA), label = "Payable short-term liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282e3y = structure(c(10000, NA, NA, NA, 124531, 
    9, 32368.564453125, 40, NA, NA), label = "Equity–share capital 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282f3y = structure(c(282840, NA, NA, NA, NA, 
    NA, 1008.58001708984, 6.75010013580322, NA, NA), label = "Retained earnings 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), gni = structure(c(370, 2760, 300, 970, 1100, 
    1830, 150, 100, 1910, 960), label = "Gross National Income per capita, Atlas Method (current ), World Development Ind", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), pop = structure(c(135683664, 176596256, 13403644, 
    1280400000, 1288400000, 13007942, 4296700, 67217840, 12307091, 
    6968512), label = "Population, Total, in 2005 (World Development Indicators)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), country_proper = structure(c("Bangladesh", "Brazil", 
    "Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia", 
    "Guatemala", "Honduras"), format.stata = "%22s", label = "", class = c("labelled", 
    "character")), c_abbr = structure(c("BGD", "BRA", "KHM", 
    "CHN", "CHN", "ECU", "ERI", "ETH", "GTM", "HND"), format.stata = "%9s", label = "", class = c("labelled", 
    "character")), countryyear = structure(c("Bangladesh2002", 
    "Brazil2003", "Cambodia2003", "China2002", "China2003", "Ecuador2003", 
    "Eritrea2002", "Ethiopia2002", "Guatemala2003", "Honduras2003"
    ), label = "Country", format.stata = "%21s", class = c("labelled", 
    "character")), iso3c = structure(c("BGD", "BRA", "KHM", "CHN", 
    "CHN", "ECU", "ERI", "ETH", "GTM", "HND"), label = "", class = c("labelled", 
    "character")), cname = structure(c("Bangladesh", "Brazil", 
    "Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia", 
    "Guatemala", "Honduras"), label = "", class = c("labelled", 
    "character")), cyear = structure(c("2002", "2003", "2003", 
    "2002", "2003", "2003", "2002", "2002", "2003", "2003"), label = "", class = c("labelled", 
    "character"))), .Names = c("matchcode", "idstd", "year", 
"country", "wt", "region", "income", "industry", "sector", "ownership", 
"exporter", "c201", "c202", "c203a", "c203b", "c203c", "c203d", 
"c2041", "c2042", "c205a", "c205b1", "c205b2", "c205b3", "c205b4", 
"c205b5", "c205b6", "c205b7", "c205b8", "c205b9", "c205b10", 
"c205c", "c205d", "c206a", "c206b", "c2071", "c2072", "c208", 
"c209a", "c209ba", "c209bb", "c209bc", "c209bd", "c209be", "c210a", 
"c210b", "c211a1", "c211a2", "c211a3", "c211b1", "c282a2y", "c282b2y", 
"c282c2y", "c282d2y", "c282e2y", "c282f2y", "c282a3y", "c282b3y", 
"c282c3y", "c282d3y", "c282e3y", "c282f3y", "gni", "pop", "country_proper", 
"c_abbr", "countryyear", "iso3c", "cname", "cyear"), class = c("data.table", 
"data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x0000000002570788>)

1 Ответ

0 голосов
/ 05 сентября 2018

Как я понимаю, вы пытались сделать выбор функции, но вы выбрали сложный метод, и он сделал вывод NULL. Я согласен, что у вас так много функций, и вам нужно сделать выбор функций до регрессии.

Есть очень выдающиеся методы выбора функций , такие как Случайный лес . Случайный лес поможет вам определить лучших предикторов.

Учитывая, что мне интересно прогнозировать Виды растений, но я не знаю, какая особенность может предсказать его лучше всего (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width). Поэтому в приведенном ниже коде указаны лучшие предикторы:

library(party)
colnames(iris)
cf1 <- cforest(Species ~ . , data= iris, control=cforest_unbiased(mtry=2,ntree=50))

varimp(cf1)

Результат varimp() дает вам:

Вектор значимости снижения значимости означает среднее значение. Другими словами, чем выше оценка, тем лучше прогноз.

В примере:

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
 0.047636364  0.002909091  0.354181818  0.227636364
Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...