ValueError when fitting a model even after imputation
0 votes
/ September 23, 2019

I am using the Melbourne Housing dataset from Kaggle to fit a regression model, with Price as the target. You can find the dataset here

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
from sklearn.preprocessing import Imputer

cols_to_use = ['Distance', 'Landsize', 'BuildingArea']
data = pd.read_csv('data/melb_house_pricing.csv')
# drop rows where target is NaN
data = data.loc[~(data['Price'].isna())]
y = data.Price
X = data[cols_to_use]
my_imputer = Imputer()
imputed_X = my_imputer.fit_transform(X)

print(f"Contains NaNs in training data: {np.isnan(imputed_X).sum()}")
print(f"Contains NaNs in target data: {np.isnan(y).sum()}")
print(f"Contains Infinity: {np.isinf(imputed_X).sum()}")
print(f"Contains Infinity: {np.isinf(y).sum()}")

my_model = GradientBoostingRegressor()
my_model.fit(imputed_X, y)

# Here we make the plot
my_plots = plot_partial_dependence(my_model,       
                                   features=[0, 2], # column numbers of plots we want to show
                                   X=X,            # raw predictors data.
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'], # labels on graphs
                                   grid_resolution=10) # number of values to plot on x axis

Even after using sklearn's Imputer, I get the following error:

Contains NaNs in training data: 0
Contains NaNs in target data: 0
Contains Infinity: 0
Contains Infinity: 0
/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/deprecation.py:85: DeprecationWarning: Function plot_partial_dependence is deprecated; The function ensemble.plot_partial_dependence has been deprecated in favour of sklearn.inspection.plot_partial_dependence in  0.21 and will be removed in 0.23.
  warnings.warn(msg, category=DeprecationWarning)
Traceback (most recent call last):
  File "partial_dependency_plots.py", line 29, in <module>
    grid_resolution=10) # number of values to plot on x axis
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/deprecation.py", line 86, in wrapped
    return fun(*args, **kwargs)
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/ensemble/partial_dependence.py", line 286, in plot_partial_dependence
    X = check_array(X, dtype=DTYPE, order='C')
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

As you can see, when I print the number of NaNs in imputed_X, I get 0. So why am I still getting the ValueError? Any help?

1 Answer

0 votes
/ September 23, 2019

Just change the call to plot_partial_dependence so that it receives the imputed data:

my_plots = plot_partial_dependence(my_model,
                                   features=[0, 2], # column numbers of plots we want to show
                                   X=imputed_X,     # imputed predictors (no NaNs), not the raw X
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'], # labels on graphs
                                   grid_resolution=10) # number of values to plot on x axis

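This will work. The model was fitted on imputed_X, but the original call passed the raw X to plot_partial_dependence. Imputer.fit_transform returns a new array and leaves X unchanged, so X still contains NaNs, and the check_array(X) call shown in the traceback rejects it.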
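For anyone on a newer scikit-learn, where Imputer and sklearn.ensemble.partial_dependence no longer exist, here is a minimal sketch of the same check and fix. It is not the asker's exact script: the data is a small synthetic stand-in (made-up values, not the Kaggle CSV), and it assumes a scikit-learn version that provides sklearn.impute.SimpleImputer and the `kind` argument of sklearn.inspection.partial_dependence. It computes the partial dependence values without plotting them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import partial_dependence

# Synthetic stand-in for the Melbourne data (hypothetical values, not the real CSV).
rng = np.random.default_rng(0)
cols_to_use = ['Distance', 'Landsize', 'BuildingArea']
X = pd.DataFrame(rng.uniform(1, 100, size=(200, 3)), columns=cols_to_use)
y = 1000 * X['Distance'] + 50 * X['BuildingArea'] + rng.normal(0, 10, size=200)

# Blank out every 7th BuildingArea value to mimic missing predictor data.
X.iloc[::7, 2] = np.nan

# fit_transform returns a NEW NumPy array; the DataFrame X is left unchanged.
imputed_X = SimpleImputer().fit_transform(X)

raw_nans = int(X.isna().sum().sum())
imputed_nans = int(np.isnan(imputed_X).sum())
print(f"NaNs in raw X: {raw_nans}")          # still non-zero
print(f"NaNs in imputed_X: {imputed_nans}")  # zero after mean imputation

my_model = GradientBoostingRegressor(random_state=0)
my_model.fit(imputed_X, y)

# Pass the same imputed array the model was trained on, not the raw X.
pd_result = partial_dependence(my_model, imputed_X, features=[0],
                               grid_resolution=10, kind='average')
averaged = pd_result['average']  # shape: (n_outputs, n_grid_points)
print(averaged.shape)
```

Printing both NaN counts side by side makes the cause visible: the check in the question only inspected imputed_X, while the failing call received X. To draw the plots in scikit-learn 1.0 or later, PartialDependenceDisplay.from_estimator(my_model, imputed_X, features=[0, 2], feature_names=cols_to_use, grid_resolution=10) from sklearn.inspection should be the counterpart of the deprecated plotting function; it requires matplotlib.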

...