I'm trying to build a regression with a categorical variable.
I start by creating all the dummy variables, then drop every column I don't need to get x:
d1 = pd.get_dummies(df2015["CBSA Office"])
df2015_new = pd.concat([df2015, d1], axis=1)
d2 = pd.get_dummies(df2016["CBSA Office"])
df2016_new = pd.concat([df2016, d2], axis=1)
trainset = pd.concat([df2015_new, df2016_new], axis=0)
trainset = trainset.dropna()
x_train = trainset.drop(['CBSA Office', 'Location', 'Updated', 'Commercial Flow', 'Travellers Flow'], axis="columns")
y_train = trainset["Travellers Flow"]
Now I run the regression with the OLS function:
x_train = x_train.iloc[:100].values.reshape(-1,1)
y_train = y_train.iloc[:100].values.reshape(-1,1)
modelx = sm.OLS(y_train.astype(float), x_train.astype(float)).fit()
modelx.summary()
Then I get an error message that says:
endog and exog matrices are different sizes
But I thought I had already made them the same size.
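For reference, a minimal sketch of how the two reshape calls change the shapes. The frames x_demo and y_demo are hypothetical stand-ins (100 rows, 3 dummy columns), not the real data. Because x has several columns, reshape(-1, 1) turns every cell into its own row, so x ends up with more rows than y:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for x_train / y_train: 100 rows, 3 dummy columns.
x_demo = pd.DataFrame(np.zeros((100, 3)), columns=["A", "B", "C"])
y_demo = pd.Series(np.arange(100, dtype=float))

# Same calls as in the question.
x_flat = x_demo.iloc[:100].values.reshape(-1, 1)  # every cell becomes its own row
y_col = y_demo.iloc[:100].values.reshape(-1, 1)

print(x_flat.shape)  # (300, 1)
print(y_col.shape)   # (100, 1)

# Without the reshape, x keeps one row per observation.
x_2d = x_demo.iloc[:100].values
print(x_2d.shape)    # (100, 3)
```

If that is what happens here, leaving x two-dimensional (one row per observation, one column per dummy) would keep the row counts equal.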
If I don't reshape them, I get this result:
C:\Users\CiCi\Anaconda3-1\lib\site-packages\statsmodels\regression\linear_model.py:1554: RuntimeWarning: invalid value encountered in double_scalars
return self.ess/self.df_model
C:\Users\CiCi\Anaconda3-1\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
C:\Users\CiCi\Anaconda3-1\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
C:\Users\CiCi\Anaconda3-1\lib\site-packages\scipy\stats\_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= self.a)
C:\Users\CiCi\Anaconda3-1\lib\site-packages\statsmodels\base\model.py:1100: RuntimeWarning: invalid value encountered in true_divide
return self.params / self.bse
                                 OLS Regression Results
========================================================================================
Dep. Variable:          Travellers Flow    R-squared:                     0.000
Model:                              OLS    Adj. R-squared:                0.000
Method:                   Least Squares    F-statistic:                     nan
Date:                  Sun, 09 Dec 2018    Prob (F-statistic):              nan
Time:                          00:34:01    Log-Likelihood:              -429.08
No. Observations:                   100    AIC:                           860.2
Df Residuals:                        99    BIC:                           862.8
Df Model:                             0
Covariance Type:              nonrobust
========================================================================================
                                  coef   std err         t     P>|t|    [0.025    0.975]
----------------------------------------------------------------------------------------
Abbotsford-Huntingdon           8.5000     1.776     4.786     0.000     4.976    12.024
Aldergrove                           0         0       nan       nan         0         0
Ambassador Bridge                    0         0       nan       nan         0         0
Blue Water Bridge                    0         0       nan       nan         0         0
Boundary Bay                         0         0       nan       nan         0         0
Cornwall                             0         0       nan       nan         0         0
Coutts                               0         0       nan       nan         0         0
Douglas (Peace Arch)                 0         0       nan       nan         0         0
Edmundston                           0         0       nan       nan         0         0
Emerson                              0         0       nan       nan         0         0
Fort Frances Bridge                  0         0       nan       nan         0         0
North Portal                         0         0       nan       nan         0         0
Pacific Highway                      0         0       nan       nan         0         0
Peace Bridge                         0         0       nan       nan         0         0
Prescott                             0         0       nan       nan         0         0
Queenston-Lewiston Bridge            0         0       nan       nan         0         0
Rainbow Bridge                       0         0       nan       nan         0         0
Sault Ste. Marie                     0         0       nan       nan         0         0
St-Armand/Philipsburg                0         0       nan       nan         0         0
St-Bernard-de-Lacolle                0         0       nan       nan         0         0
St. Stephen                          0         0       nan       nan         0         0
St. Stephen 3rd Bridge               0         0       nan       nan         0         0
Stanstead                            0         0       nan       nan         0         0
Thousand Islands Bridge              0         0       nan       nan         0         0
Windsor and Detroit Tunnel           0         0       nan       nan         0         0
Woodstock Road                       0         0       nan       nan         0         0
========================================================================================
Omnibus:                 81.245    Durbin-Watson:            0.324
Prob(Omnibus):            0.000    Jarque-Bera (JB):       453.220
Skew:                     2.832    Prob(JB):              3.84e-99
Kurtosis:                11.757    Cond. No.              1.00e+16
========================================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.98e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
This is the format I want, with all the dummy variables included, but it produces many warnings, R^2 is 0, and I certainly can't make predictions from it.
I want a summary that reports every dummy variable.
I also tried this:
x_train = np.array(x_train).reshape(1,-1)
y_train = np.array(y_train).reshape(1,-1)
modelx = sm.OLS(y_train.astype(float), x_train.astype(float)).fit()
modelx.summary()
and I get:
MemoryError Traceback (most recent call last)
<ipython-input-668-312de7f7e808> in <module>()
1 x_train = np.array(x_train).reshape(1,-1)
2 y_train = np.array(y_train).reshape(1,-1)
----> 3 modelx = sm.OLS(y_train.astype(float), x_train.astype(float)).fit()
4 modelx.summary()
~\Anaconda3-1\lib\site-packages\statsmodels\regression\linear_model.py in fit(self, method, cov_type, cov_kwds, use_t, **kwargs)
273 self.pinv_wexog, singular_values = pinv_extended(self.wexog)
274 self.normalized_cov_params = np.dot(
--> 275 self.pinv_wexog, np.transpose(self.pinv_wexog))
276
277 # Cache these singular values for use later.
MemoryError:
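The traceback shows the failure inside np.dot(self.pinv_wexog, np.transpose(self.pinv_wexog)). A minimal sketch, using a hypothetical small array, of why reshape(1, -1) can make that product huge: the data becomes a single observation with N columns, so the pseudo-inverse is N x 1 and the product is N x N. The size n_cells and the helper cov_bytes are illustrative only.

```python
import numpy as np

n_cells = 500  # hypothetical; the real array presumably has far more cells

# reshape(1, -1) puts every value into one row: 1 observation, n_cells columns.
x_row = np.ones(n_cells).reshape(1, -1)
print(x_row.shape)  # (1, 500)

# Same product as in the traceback: pinv(X) times its transpose.
pinv = np.linalg.pinv(x_row)
cov = np.dot(pinv, np.transpose(pinv))
print(pinv.shape)  # (500, 1)
print(cov.shape)   # (500, 500)


def cov_bytes(n_columns):
    """Bytes needed for an n_columns x n_columns float64 matrix."""
    return n_columns * n_columns * 8


# With one million columns the product alone would need about 8000 GB.
print(cov_bytes(1_000_000) / 1e9)  # 8000.0
```

The memory needed grows with the square of the number of columns, so flattening a large table into one row can exhaust memory quickly.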
I'm new to Python and would really appreciate any help. Thanks!