Difference in linear regression with Statsmodels between the Patsy version and the dummy-variable version
18 February 2019

I get different coefficient values and coefficient standard errors from the smf.ols and sm.OLS functions of statsmodels, even though mathematically they should describe the same regression formula and give the same results.

I have made a 100% reproducible example of my question. The dataframe df can be downloaded here: https://drive.google.com/drive/folders/1i67wztkrAeEZH2tv2hyOlgxG7N80V3pI?usp=sharing

Case 1: Linear model using Patsy from Statsmodels

# First we load the libraries:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import random
import pandas as pd
# We define a specific seed to have the same results:
random.seed(1234)
# Now we read the data that can be downloaded from Google Drive link provided above:
df = pd.read_csv("/Users/user/Documents/example/cars.csv", sep = "|")
# We create the linear regression:
lm1 = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
# We see the results:
lm1.fit().summary()

Output of lm1:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Mon, 18 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        17:19:14   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
make[T.audi]           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make[T.bmw]            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make[T.chevrolet]      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make[T.dodge]         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make[T.honda]          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make[T.isuzu]          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make[T.jaguar]         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make[T.mazda]           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make[T.mercedes-benz]  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make[T.mercury]        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make[T.mitsubishi]    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make[T.nissan]        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make[T.peugot]         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make[T.plymouth]       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make[T.porsche]        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make[T.renault]       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make[T.saab]           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make[T.subaru]        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make[T.toyota]         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make[T.volkswagen]      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make[T.volvo]          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system[T.2bbl]    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system[T.4bbl]     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system[T.idi]     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system[T.mfi]     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system[T.mpfi]    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system[T.spdi]    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system[T.spfi]     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type[T.dohcv]  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type[T.l]      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type[T.ohc]    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type[T.ohcf]    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type[T.ohcv]    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type[T.rotor]   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors[T.two]    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
bore                   3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio     -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height                  -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm                 -0.5903      0.790     -0.747      0.456      -2.150       0.970
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Case 2: Linear model using dummy variables, also with Statsmodels

# We define a specific seed to have the same results:
random.seed(1234)
# First we check what `object` type variables we have in our dataset:
df.dtypes
# We create a list where we save the `object` type variables names:
categorical_cols = ['make',
                    'fuel_system',
                    'engine_type',
                    'num_of_doors'
                    ]
# Now we convert those object variables to numeric with get_dummies function to have 1 unique numeric dataframe:
df_num = pd.get_dummies(df, columns = categorical_cols)
# We ensure the dataframe is numeric casting all values to float64:
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
# We define the predictive variables dataset:
X = df_num.drop('price', axis = 1)
# We define the response variable values:
y = df_num.price.values
# We add a constant as we did in the previous example (adding "+1" to Patsy):
Xc = sm.add_constant(X) # Adds a constant to the model
# We create the linear model and obtain results:
lm2 = sm.OLS(y, Xc)
lm2.fit().summary()

Output of lm2:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Mon, 18 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        17:28:16   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.205e+04   6811.094      1.769      0.079   -1398.490    2.55e+04
bore                3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio  -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height               -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm              -0.5903      0.790     -0.747      0.456      -2.150       0.970
make_alfa-romero   -2273.9631   1865.185     -1.219      0.225   -5956.669    1408.743
make_audi           4245.7414   1324.140      3.206      0.002    1631.299    6860.184
make_bmw            1.199e+04   1232.635      9.730      0.000    9559.555    1.44e+04
make_chevrolet     -2845.7867   1976.730     -1.440      0.152   -6748.733    1057.160
make_dodge         -3460.3061   1170.966     -2.955      0.004   -5772.315   -1148.297
make_honda           505.6865   2049.865      0.247      0.805   -3541.661    4553.034
make_isuzu           825.0045   1706.160      0.484      0.629   -2543.716    4193.725
make_jaguar         1.525e+04   1903.813      8.010      0.000    1.15e+04     1.9e+04
make_mazda         -1967.3063    982.179     -2.003      0.047   -3906.564     -28.048
make_mercedes-benz  1.471e+04   1423.004     10.338      0.000    1.19e+04    1.75e+04
make_mercury         684.1370   2913.361      0.235      0.815   -5068.136    6436.410
make_mitsubishi    -3462.7968   1221.018     -2.836      0.005   -5873.631   -1051.963
make_nissan        -3485.5094    946.316     -3.683      0.000   -5353.958   -1617.060
make_peugot          783.0586   3513.296      0.223      0.824   -6153.754    7719.871
make_plymouth      -3168.5552   1293.376     -2.450      0.015   -5722.256    -614.854
make_porsche        7284.9115   2853.174      2.553      0.012    1651.475    1.29e+04
make_renault       -4398.9354   2037.945     -2.159      0.032   -8422.747    -375.124
make_saab           1216.5702   1487.192      0.818      0.415   -1719.810    4152.950
make_subaru        -1.863e+04   3263.524     -5.710      0.000   -2.51e+04   -1.22e+04
make_toyota        -3044.9308    776.059     -3.924      0.000   -4577.218   -1512.644
make_volkswagen    -1867.0452   1170.975     -1.594      0.113   -4179.072     444.981
make_volvo          3159.7498   1327.405      2.380      0.018     538.862    5780.638
fuel_system_1bbl   -2790.4092   2230.161     -1.251      0.213   -7193.740    1612.922
fuel_system_2bbl    -648.2498   1094.525     -0.592      0.554   -2809.330    1512.830
fuel_system_4bbl   -2326.2983   3094.703     -0.752      0.453   -8436.621    3784.024
fuel_system_idi     1.712e+04   6154.806      2.782      0.006    4971.083    2.93e+04
fuel_system_mfi      926.1109   3063.134      0.302      0.763   -5121.881    6974.102
fuel_system_mpfi    1173.7017   1186.125      0.990      0.324   -1168.238    3515.642
fuel_system_spdi     449.5911   1827.318      0.246      0.806   -3158.349    4057.531
fuel_system_spfi   -1858.2133   3111.596     -0.597      0.551   -8001.891    4285.464
engine_type_dohc    2703.6445   1803.080      1.499      0.136    -856.440    6263.729
engine_type_dohcv  -9374.0342   3504.717     -2.675      0.008   -1.63e+04   -2454.161
engine_type_l      -2130.3416   3357.283     -0.635      0.527   -8759.115    4498.431
engine_type_ohc    -1335.2404   1454.047     -0.918      0.360   -4206.177    1535.696
engine_type_ohcf    1.232e+04   2850.883      4.322      0.000    6693.659     1.8e+04
engine_type_ohcv    5755.4074   1669.627      3.447      0.001    2458.820    9051.995
engine_type_rotor   4107.6373   3032.223      1.355      0.177   -1879.323    1.01e+04
num_of_doors_four   6234.8048   3491.722      1.786      0.076    -659.410    1.31e+04
num_of_doors_two    5814.8408   3337.588      1.742      0.083    -775.045    1.24e+04
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     1.01e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.38e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

As we can see, some variables, such as height, have the same coefficient in both outputs. Others do not (the isuzu level of make, the ohc level of engine_type, the intercept, etc.). Shouldn't both outputs give the same result? What am I missing or doing wrong?

Thanks in advance for your help.

P.S. As @sukhbinder pointed out, even when I use the Patsy formula without the intercept (putting "-1" in the formula, since Patsy includes it by default) and drop the constant from the dummy-variable formulation, I still get different results.

1 Answer

21 February 2019

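The results do not match because the Statsmodels formula interface (through Patsy) pre-selects the predictor columns to avoid perfect multicollinearity: for each categorical variable it leaves out one reference level, whereas the dummy dataframe built with pd.get_dummies keeps every level.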
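The collinearity behind this can be checked on a toy example. The sketch below uses a small made-up dataframe (the column names `fuel`, `doors` and `bore` and all values are assumptions, not the cars data from the question) and shows that a constant plus the full set of dummies from pd.get_dummies has fewer linearly independent columns than total columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the cars dataset from the question).
toy = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "doors": ["two", "four", "four", "two", "two", "four"],
    "bore": [3.1, 3.4, 2.9, 3.6, 3.2, 3.0],
})

# Full dummy coding: one column per level, the default of pd.get_dummies.
full = pd.get_dummies(toy, columns=["fuel", "doors"]).astype(float)
full.insert(0, "const", 1.0)

# Each group of dummies sums to the constant column, so the columns are
# linearly dependent and the design matrix is rank deficient.
n_cols_full = full.shape[1]
rank_full = np.linalg.matrix_rank(full.to_numpy())
print(n_cols_full, rank_full)
```

Here the matrix has 6 columns but rank 4: one column per categorical variable is redundant. This is the same situation that the lm2 summary flags with Cond. No. 1.01e+16 and the warning about the smallest eigenvalue (5.38e-23).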

Exactly the same results are obtained by inspecting the summary of the formula-based regression, identifying which dummy variables are missing from it, and dropping them from the dummy dataframe:

deletex = [
        'make_alfa-romero',
        'fuel_system_1bbl',
        'engine_type_dohc',
        'num_of_doors_four'
        ]
df_num.drop( deletex, axis = 1, inplace = True) 
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
X = df_num.drop('price', axis = 1)
y = df_num.price.values
Xc = sm.add_constant(X) # Adds a constant to the model
random.seed(1234)
linear_regression = sm.OLS(y, Xc)
linear_regression.fit().summary()

which prints the following result:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:16:08   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
bore                3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio  -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height               -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm              -0.5903      0.790     -0.747      0.456      -2.150       0.970
make_audi           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make_bmw            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make_chevrolet      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make_dodge         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make_honda          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make_isuzu          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make_jaguar         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make_mazda           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make_mercedes-benz  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make_mercury        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make_mitsubishi    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make_nissan        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make_peugot         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make_plymouth       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make_porsche        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make_renault       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make_saab           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make_subaru        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make_toyota         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make_volkswagen      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make_volvo          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system_2bbl    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system_4bbl     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system_idi     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system_mfi     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system_mpfi    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system_spdi    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system_spfi     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type_dohcv  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type_l      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type_ohc    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type_ohcf    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type_ohcv    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type_rotor   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors_two    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

These results match the first call through the Statsmodels formula interface exactly:

random.seed(1234)
lm_python = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
lm_python.fit().summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:17:37   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
make[T.audi]           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make[T.bmw]            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make[T.chevrolet]      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make[T.dodge]         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make[T.honda]          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make[T.isuzu]          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make[T.jaguar]         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make[T.mazda]           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make[T.mercedes-benz]  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make[T.mercury]        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make[T.mitsubishi]    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make[T.nissan]        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make[T.peugot]         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make[T.plymouth]       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make[T.porsche]        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make[T.renault]       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make[T.saab]           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make[T.subaru]        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make[T.toyota]         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make[T.volkswagen]      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make[T.volvo]          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system[T.2bbl]    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system[T.4bbl]     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system[T.idi]     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system[T.mfi]     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system[T.mpfi]    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system[T.spdi]    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system[T.spfi]     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type[T.dohcv]  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type[T.l]      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type[T.ohc]    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type[T.ohcf]    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type[T.ohcv]    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type[T.rotor]   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors[T.two]    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
bore                   3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio     -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height                  -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm                 -0.5903      0.790     -0.747      0.456      -2.150       0.970
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

So the predictor variables need to be checked for consistency between the two approaches: pd.get_dummies generates a dummy variable for every level, whereas Statsmodels uses N-1 levels for each categorical variable.
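Instead of listing the reference levels by hand, pd.get_dummies can drop the first level of each variable with drop_first=True. In this dataset the first levels in sorted order (alfa-romero, 1bbl, dohc, four) happen to be the ones missing from the formula output, but that should be verified rather than assumed for other data. The sketch below is a minimal illustration on made-up data (all column names and values are assumptions) and uses np.linalg.lstsq in place of sm.OLS, which, as far as I know, also solves through a pseudoinverse by default.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the cars dataset from the question).
toy = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "doors": ["two", "four", "four", "two", "two", "four"],
    "bore": [3.1, 3.4, 2.9, 3.6, 3.2, 3.0],
    "price": [10.0, 14.0, 9.0, 12.5, 13.0, 9.5],
})
y = toy["price"].to_numpy()


def design(drop_first):
    """Build a design matrix with a constant and dummy-coded categoricals."""
    X = pd.get_dummies(
        toy.drop(columns="price"),
        columns=["fuel", "doors"],
        drop_first=drop_first,
    ).astype(float)
    X.insert(0, "const", 1.0)
    return X


X_full = design(drop_first=False)  # all levels: rank deficient
X_ref = design(drop_first=True)    # N-1 levels per variable: full rank

# lstsq returns the minimum-norm solution when the matrix is rank deficient.
beta_full, *_ = np.linalg.lstsq(X_full.to_numpy(), y, rcond=None)
beta_ref, *_ = np.linalg.lstsq(X_ref.to_numpy(), y, rcond=None)

# Both codings span the same column space, so the fitted values agree.
fitted_full = X_full.to_numpy() @ beta_full
fitted_ref = X_ref.to_numpy() @ beta_ref
same_fit = np.allclose(fitted_full, fitted_ref)

# The coefficient of the continuous predictor is the same in both fits;
# only the constant and the dummy coefficients are parameterised differently.
bore_full = beta_full[list(X_full.columns).index("bore")]
bore_ref = beta_ref[list(X_ref.columns).index("bore")]
print(same_fit, np.isclose(bore_full, bore_ref))
```

This mirrors what the summaries above show: R-squared, log-likelihood and the coefficients of bore, compression_ratio, height and peak_rpm are identical in lm1 and lm2, while the constant and the dummy coefficients differ until the reference levels are removed.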
