Problems creating test/training features for minority oversampling
October 20, 2019

I am trying to recreate a tutorial written by Nick Becker. It is located at https://beckernick.github.io/oversampling-modeling/

The code he published works when you copy and paste it into a Jupyter Notebook.

I am trying to recreate it with a different dataset that is also highly imbalanced. It is the Airbnb dataset provided by Inside Airbnb, which I manipulated and re-uploaded here: https://drive.google.com/file/d/0B4EEyCnbIf1fLTd2UU5SWVNxV29oNHVkc3ZyY2JId3UyRWtv/view?usp=drivesdk

I created a notebook in which I dropped rows with null values, averaged the review score, and mapped scores of 1, 2, 3 to 1 (negative) and scores of 4, 5 to 0 (positive).
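The mapping described above can also be written as a single vectorized comparison that yields integer labels. This is only a sketch on made-up scores: the `scores` series and the `negative` name are hypothetical, and it assumes the scores have already been halved and rounded to whole numbers from 1 to 5.

```python
import pandas as pd

# Hypothetical rounded scores standing in for dfclean['average_review_score'].
scores = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# 1 = negative (score of 3 or lower), 0 = positive (score of 4 or 5).
# The result is an integer column, not a column of the strings '1' and '0'.
negative = scores.le(3).astype(int)

print(negative.tolist())
```

Keeping the labels numeric matters later, because pd.get_dummies treats string columns differently from numeric ones.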

I then followed the exact steps in Nick Becker's model, and when I got to "Creating the training and test sets" I got an error.

**** I added a follow-up question at the end, because the error was resolved in the comments ****

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-1c632a59b870> in <module>
      1 training_features, test_features, \
----> 2 training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1)

KeyError: "['average_review_score'] not found in axis"

The above is a shortened version of the full error message. I noticed that in Nick's code, although he includes "bad_loans" in his model_variables and then creates dummies from them, no dummies are actually created for "bad_loans" when you look at the resulting data frame. My equivalent of "bad_loans" is "average_review_score", and in my "price_relevant_enconded" data frame dummies are created for it. I believe this is my problem. The bad part is that I do not know how to work around it. My end goal is a more realistic prediction model for ratings based on property type and room type.
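A likely explanation for the difference is the column dtype: by default pd.get_dummies only encodes object, string, and category columns and passes numeric columns through unchanged. The sketch below uses a made-up three-row frame (the `label_str` and `label_int` column names are hypothetical) to show the same label stored both ways.

```python
import pandas as pd

# Made-up frame: the same binary label stored as strings and as integers.
df = pd.DataFrame({
    "room_type": ["Private room", "Shared room", "Private room"],
    "label_str": ["1", "0", "0"],   # object dtype, like assigning '1' / '0'
    "label_int": [1, 0, 0],         # integer dtype
})

encoded = pd.get_dummies(df)

# The string label is split into label_str_0 / label_str_1,
# while the integer label survives as a single column.
print(sorted(encoded.columns))
```

If this is what happens here, assigning the strings '1' and '0' to average_review_score would be enough to turn it into average_review_score_0 and average_review_score_1.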

Here is my code:

%matplotlib inline
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import warnings
import tensorflow as tf
import tensorflow_hub as hub
import bert
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from scipy import stats
plt.style.use('seaborn')
warnings.filterwarnings(action='ignore')
output_dir = 'modelOutput'

airbnbdata = pd.read_excel('Z:\\Business\\AA Project\\listings_cleaned_v1.xlsm')

dfclean = airbnbdata
dfclean.iloc[0]

#drop rows with nulls in any of the required columns
required_columns = ['id', 'listing_url', 'name', 'summary', 'space', 'description',
                    'host_id', 'host_name', 'host_listings_count',
                    'neighbourhood_cleansed', 'city', 'state', 'zipcode', 'country',
                    'latitude', 'longitude', 'property_type', 'room_type', 'price',
                    'number_of_reviews', 'review_scores_rating',
                    'average_review_score', 'reviews_per_month']
dfclean = dfclean.dropna(subset=required_columns)
#round score rating
dfclean['average_review_score'] = dfclean['average_review_score']/2
dfclean.average_review_score = dfclean.average_review_score.round()

dfclean.neighbourhood_cleansed=dfclean.neighbourhood_cleansed.replace(' ', '_', regex=True)
#pd.Series(' '.join(dfclean.neighbourhood_cleansed).split()).value_counts()[:20]

dfclean.average_review_score[dfclean['average_review_score']== 1] = '1'
dfclean.average_review_score[dfclean['average_review_score']== 2] = '1'
dfclean.average_review_score[dfclean['average_review_score']== 3] = '1'
dfclean.average_review_score[dfclean['average_review_score']== 4] = '0'
dfclean.average_review_score[dfclean['average_review_score']== 5] = '0'

dfclean['average_review_score'].value_counts()/dfclean['average_review_score'].count()

dfclean.average_review_score.value_counts()

model_variables = ['neighbourhood_cleansed', 'property_type','room_type','average_review_score']
price_data_relevent = dfclean[model_variables]

price_relevant_enconded = pd.get_dummies(price_data_relevent)

training_features, test_features, \
training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1),
                                               price_relevant_enconded['average_review_score'],
                                               test_size = .15,
                                               random_state=12)


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-1c632a59b870> in <module>
      1 training_features, test_features, \
----> 2 training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1),
      3                                                price_relevant_enconded['average_review_score'],
      4                                                test_size = .15,
      5                                                random_state=12)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4115             level=level,
   4116             inplace=inplace,
-> 4117             errors=errors,
   4118         )
   4119 

~\Anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3912         for axis, labels in axes.items():
   3913             if labels is not None:
-> 3914                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   3915 
   3916         if inplace:

~\Anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
   3944                 new_axis = axis.drop(labels, level=level, errors=errors)
   3945             else:
-> 3946                 new_axis = axis.drop(labels, errors=errors)
   3947             result = self.reindex(**{axis_name: new_axis})
   3948 

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
   5338         if mask.any():
   5339             if errors != "ignore":
-> 5340                 raise KeyError("{} not found in axis".format(labels[mask]))
   5341             indexer = indexer[~mask]
   5342         return self.delete(indexer)

KeyError: "['average_review_score'] not found in axis"

Output of:

for col in price_relevant_enconded.columns:
    print(col)

neighbourhood_cleansed_Acton
neighbourhood_cleansed_Adams-Normandie
neighbourhood_cleansed_Agoura_Hills
neighbourhood_cleansed_Agua_Dulce
neighbourhood_cleansed_Alhambra
neighbourhood_cleansed_Alondra_Park
neighbourhood_cleansed_Altadena
neighbourhood_cleansed_Angeles_Crest
neighbourhood_cleansed_Arcadia
neighbourhood_cleansed_Arleta
neighbourhood_cleansed_Arlington_Heights
neighbourhood_cleansed_Artesia
neighbourhood_cleansed_Athens
neighbourhood_cleansed_Atwater_Village
neighbourhood_cleansed_Avalon
neighbourhood_cleansed_Avocado_Heights
neighbourhood_cleansed_Azusa
neighbourhood_cleansed_Baldwin_Hills/Crenshaw
neighbourhood_cleansed_Baldwin_Park
neighbourhood_cleansed_Bel-Air
neighbourhood_cleansed_Bell
neighbourhood_cleansed_Bell_Gardens
neighbourhood_cleansed_Bellflower
neighbourhood_cleansed_Beverly_Crest
neighbourhood_cleansed_Beverly_Grove
neighbourhood_cleansed_Beverly_Hills
neighbourhood_cleansed_Beverlywood
neighbourhood_cleansed_Boyle_Heights
neighbourhood_cleansed_Bradbury
neighbourhood_cleansed_Brentwood
neighbourhood_cleansed_Broadway-Manchester
neighbourhood_cleansed_Burbank
neighbourhood_cleansed_Calabasas
neighbourhood_cleansed_Canoga_Park
neighbourhood_cleansed_Carson
neighbourhood_cleansed_Carthay
neighbourhood_cleansed_Castaic
neighbourhood_cleansed_Castaic_Canyons
neighbourhood_cleansed_Central-Alameda
neighbourhood_cleansed_Century_City
neighbourhood_cleansed_Cerritos
neighbourhood_cleansed_Charter_Oak
neighbourhood_cleansed_Chatsworth
neighbourhood_cleansed_Chesterfield_Square
neighbourhood_cleansed_Cheviot_Hills
neighbourhood_cleansed_Chinatown
neighbourhood_cleansed_Citrus
neighbourhood_cleansed_Claremont
neighbourhood_cleansed_Commerce
neighbourhood_cleansed_Compton
neighbourhood_cleansed_Covina
neighbourhood_cleansed_Culver_City
neighbourhood_cleansed_Cypress_Park
neighbourhood_cleansed_Del_Aire
neighbourhood_cleansed_Del_Rey
neighbourhood_cleansed_Desert_View_Highlands
neighbourhood_cleansed_Diamond_Bar
neighbourhood_cleansed_Downey
neighbourhood_cleansed_Downtown
neighbourhood_cleansed_Duarte
neighbourhood_cleansed_Eagle_Rock
neighbourhood_cleansed_East_Hollywood
neighbourhood_cleansed_East_La_Mirada
neighbourhood_cleansed_East_Los_Angeles
neighbourhood_cleansed_East_Pasadena
neighbourhood_cleansed_East_San_Gabriel
neighbourhood_cleansed_Echo_Park
neighbourhood_cleansed_El_Monte
neighbourhood_cleansed_El_Segundo
neighbourhood_cleansed_El_Sereno
neighbourhood_cleansed_Elysian_Park
neighbourhood_cleansed_Elysian_Valley
neighbourhood_cleansed_Encino
neighbourhood_cleansed_Exposition_Park
neighbourhood_cleansed_Fairfax
neighbourhood_cleansed_Florence
neighbourhood_cleansed_Florence-Firestone
neighbourhood_cleansed_Gardena
neighbourhood_cleansed_Glassell_Park
neighbourhood_cleansed_Glendale
neighbourhood_cleansed_Glendora
neighbourhood_cleansed_Gramercy_Park
neighbourhood_cleansed_Granada_Hills
neighbourhood_cleansed_Green_Meadows
neighbourhood_cleansed_Green_Valley
neighbourhood_cleansed_Griffith_Park
neighbourhood_cleansed_Hacienda_Heights
neighbourhood_cleansed_Hancock_Park
neighbourhood_cleansed_Harbor_City
neighbourhood_cleansed_Harbor_Gateway
neighbourhood_cleansed_Harvard_Heights
neighbourhood_cleansed_Harvard_Park
neighbourhood_cleansed_Hasley_Canyon
neighbourhood_cleansed_Hawaiian_Gardens
neighbourhood_cleansed_Hawthorne
neighbourhood_cleansed_Hermosa_Beach
neighbourhood_cleansed_Highland_Park
neighbourhood_cleansed_Historic_South-Central
neighbourhood_cleansed_Hollywood
neighbourhood_cleansed_Hollywood_Hills
neighbourhood_cleansed_Hollywood_Hills_West
neighbourhood_cleansed_Huntington_Park
neighbourhood_cleansed_Hyde_Park
neighbourhood_cleansed_Industry
neighbourhood_cleansed_Inglewood
neighbourhood_cleansed_Irwindale
neighbourhood_cleansed_Jefferson_Park
neighbourhood_cleansed_Koreatown
neighbourhood_cleansed_La_Cañada_Flintridge
neighbourhood_cleansed_La_Crescenta-Montrose
neighbourhood_cleansed_La_Habra_Heights
neighbourhood_cleansed_La_Mirada
neighbourhood_cleansed_La_Puente
neighbourhood_cleansed_La_Verne
neighbourhood_cleansed_Ladera_Heights
neighbourhood_cleansed_Lake_Balboa
neighbourhood_cleansed_Lake_Hughes
neighbourhood_cleansed_Lake_Los_Angeles
neighbourhood_cleansed_Lake_View_Terrace
neighbourhood_cleansed_Lakewood
neighbourhood_cleansed_Lancaster
neighbourhood_cleansed_Larchmont
neighbourhood_cleansed_Lawndale
neighbourhood_cleansed_Leimert_Park
neighbourhood_cleansed_Lennox
neighbourhood_cleansed_Leona_Valley
neighbourhood_cleansed_Lincoln_Heights
neighbourhood_cleansed_Lomita
neighbourhood_cleansed_Long_Beach
neighbourhood_cleansed_Lopez/Kagel_Canyons
neighbourhood_cleansed_Los_Feliz
neighbourhood_cleansed_Lynwood
neighbourhood_cleansed_Malibu
neighbourhood_cleansed_Manchester_Square
neighbourhood_cleansed_Manhattan_Beach
neighbourhood_cleansed_Mar_Vista
neighbourhood_cleansed_Marina_del_Rey
neighbourhood_cleansed_Mayflower_Village
neighbourhood_cleansed_Maywood
neighbourhood_cleansed_Mid-City
neighbourhood_cleansed_Mid-Wilshire
neighbourhood_cleansed_Mission_Hills
neighbourhood_cleansed_Monrovia
neighbourhood_cleansed_Montebello
neighbourhood_cleansed_Montecito_Heights
neighbourhood_cleansed_Monterey_Park
neighbourhood_cleansed_Mount_Washington
neighbourhood_cleansed_North_El_Monte
neighbourhood_cleansed_North_Hills
neighbourhood_cleansed_North_Hollywood
neighbourhood_cleansed_North_Whittier
neighbourhood_cleansed_Northeast_Antelope_Valley
neighbourhood_cleansed_Northridge
neighbourhood_cleansed_Northwest_Antelope_Valley
neighbourhood_cleansed_Northwest_Palmdale
neighbourhood_cleansed_Norwalk
neighbourhood_cleansed_Pacific_Palisades
neighbourhood_cleansed_Pacoima
neighbourhood_cleansed_Palmdale
neighbourhood_cleansed_Palms
neighbourhood_cleansed_Palos_Verdes_Estates
neighbourhood_cleansed_Panorama_City
neighbourhood_cleansed_Paramount
neighbourhood_cleansed_Pasadena
neighbourhood_cleansed_Pico-Robertson
neighbourhood_cleansed_Pico-Union
neighbourhood_cleansed_Pico_Rivera
neighbourhood_cleansed_Playa_Vista
neighbourhood_cleansed_Playa_del_Rey
neighbourhood_cleansed_Pomona
neighbourhood_cleansed_Porter_Ranch
neighbourhood_cleansed_Quartz_Hill
neighbourhood_cleansed_Ramona
neighbourhood_cleansed_Rancho_Dominguez
neighbourhood_cleansed_Rancho_Palos_Verdes
neighbourhood_cleansed_Rancho_Park
neighbourhood_cleansed_Redondo_Beach
neighbourhood_cleansed_Reseda
neighbourhood_cleansed_Ridge_Route
neighbourhood_cleansed_Rolling_Hills
neighbourhood_cleansed_Rolling_Hills_Estates
neighbourhood_cleansed_Rosemead
neighbourhood_cleansed_Rowland_Heights
neighbourhood_cleansed_San_Dimas
neighbourhood_cleansed_San_Fernando
neighbourhood_cleansed_San_Gabriel
neighbourhood_cleansed_San_Marino
neighbourhood_cleansed_San_Pasqual
neighbourhood_cleansed_San_Pedro
neighbourhood_cleansed_Santa_Clarita
neighbourhood_cleansed_Santa_Fe_Springs
neighbourhood_cleansed_Santa_Monica
neighbourhood_cleansed_Sawtelle
neighbourhood_cleansed_Sepulveda_Basin
neighbourhood_cleansed_Shadow_Hills
neighbourhood_cleansed_Sherman_Oaks
neighbourhood_cleansed_Sierra_Madre
neighbourhood_cleansed_Signal_Hill
neighbourhood_cleansed_Silver_Lake
neighbourhood_cleansed_South_El_Monte
neighbourhood_cleansed_South_Gate
neighbourhood_cleansed_South_Park
neighbourhood_cleansed_South_Pasadena
neighbourhood_cleansed_South_San_Gabriel
neighbourhood_cleansed_South_San_Jose_Hills
neighbourhood_cleansed_South_Whittier
neighbourhood_cleansed_Southeast_Antelope_Valley
neighbourhood_cleansed_Stevenson_Ranch
neighbourhood_cleansed_Studio_City
neighbourhood_cleansed_Sun_Valley
neighbourhood_cleansed_Sun_Village
neighbourhood_cleansed_Sunland
neighbourhood_cleansed_Sylmar
neighbourhood_cleansed_Tarzana
neighbourhood_cleansed_Temple_City
neighbourhood_cleansed_Toluca_Lake
neighbourhood_cleansed_Topanga
neighbourhood_cleansed_Torrance
neighbourhood_cleansed_Tujunga
neighbourhood_cleansed_Tujunga_Canyons
neighbourhood_cleansed_Unincorporated_Catalina_Island
neighbourhood_cleansed_Unincorporated_Santa_Monica_Mountains
neighbourhood_cleansed_Unincorporated_Santa_Susana_Mountains
neighbourhood_cleansed_Universal_City
neighbourhood_cleansed_University_Park
neighbourhood_cleansed_Val_Verde
neighbourhood_cleansed_Valinda
neighbourhood_cleansed_Valley_Glen
neighbourhood_cleansed_Valley_Village
neighbourhood_cleansed_Van_Nuys
neighbourhood_cleansed_Venice
neighbourhood_cleansed_Vermont-Slauson
neighbourhood_cleansed_Vermont_Knolls
neighbourhood_cleansed_Vermont_Square
neighbourhood_cleansed_Vermont_Vista
neighbourhood_cleansed_Vernon
neighbourhood_cleansed_Veterans_Administration
neighbourhood_cleansed_View_Park-Windsor_Hills
neighbourhood_cleansed_Vincent
neighbourhood_cleansed_Walnut
neighbourhood_cleansed_Watts
neighbourhood_cleansed_West_Adams
neighbourhood_cleansed_West_Carson
neighbourhood_cleansed_West_Covina
neighbourhood_cleansed_West_Hills
neighbourhood_cleansed_West_Hollywood
neighbourhood_cleansed_West_Los_Angeles
neighbourhood_cleansed_West_Puente_Valley
neighbourhood_cleansed_West_Whittier-Los_Nietos
neighbourhood_cleansed_Westchester
neighbourhood_cleansed_Westlake
neighbourhood_cleansed_Westlake_Village
neighbourhood_cleansed_Westmont
neighbourhood_cleansed_Westwood
neighbourhood_cleansed_Whittier
neighbourhood_cleansed_Willowbrook
neighbourhood_cleansed_Wilmington
neighbourhood_cleansed_Windsor_Square
neighbourhood_cleansed_Winnetka
neighbourhood_cleansed_Woodland_Hills
property_type_Aparthotel
property_type_Apartment
property_type_Barn
property_type_Bed and breakfast
property_type_Boat
property_type_Boutique hotel
property_type_Bungalow
property_type_Bus
property_type_Cabin
property_type_Camper/RV
property_type_Campsite
property_type_Casa particular (Cuba)
property_type_Castle
property_type_Chalet
property_type_Condominium
property_type_Cottage
property_type_Dome house
property_type_Dorm
property_type_Earth house
property_type_Farm stay
property_type_Guest suite
property_type_Guesthouse
property_type_Hostel
property_type_Hotel
property_type_House
property_type_Houseboat
property_type_Hut
property_type_Island
property_type_Loft
property_type_Other
property_type_Resort
property_type_Serviced apartment
property_type_Tent
property_type_Tiny house
property_type_Tipi
property_type_Townhouse
property_type_Train
property_type_Treehouse
property_type_Villa
property_type_Yurt
room_type_Entire home/apt
room_type_Hotel room
room_type_Private room
room_type_Shared room
average_review_score_0
average_review_score_1

Output of:

price_relevant_enconded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27557 entries, 1 to 35953
Columns: 306 entries, neighbourhood_cleansed_Acton to average_review_score_1
dtypes: uint8(306)
memory usage: 8.3 MB
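The column listing above ends in average_review_score_0 and average_review_score_1, so a column named average_review_score no longer exists to drop. One possible workaround, sketched here on a made-up eight-row frame rather than the real listings, is to set the target aside before encoding and to one-hot encode only the features. The `listings`, `target`, and `features` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned listings (values are invented).
listings = pd.DataFrame({
    "property_type": ["House", "Apartment", "House", "Loft",
                      "Apartment", "House", "Loft", "Apartment"],
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room",
                  "Entire home/apt", "Private room", "Shared room", "Entire home/apt"],
    "average_review_score": ["0", "0", "1", "0", "0", "1", "0", "0"],
})

# Pull the target out first and make it numeric.
target = listings["average_review_score"].astype(int)

# Encode only the feature columns, so no target dummies are created.
features = pd.get_dummies(listings.drop(columns=["average_review_score"]))

training_features, test_features, training_target, test_target = train_test_split(
    features, target, test_size=0.25, random_state=12)

print(features.columns.tolist())
```

An alternative under the same assumption would be to keep the existing encoded frame and use average_review_score_1 as the target while dropping both average_review_score_0 and average_review_score_1 from the features.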

I continued with the code as follows:

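from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

#Create Training and Test Sets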
training_features, test_features, \
training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1),
                                               price_relevant_enconded['average_review_score'],
                                               test_size = .15,
                                               random_state=12)

#Oversample minority class on training data.
x_train, x_val, y_train, y_val = train_test_split(training_features, training_target,
                                                  test_size = .1,
                                                  random_state=12)
sm = SMOTE(random_state=12, ratio = 1.0)
x_train_res, y_train_res = sm.fit_sample(x_train, y_train)

clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(x_train_res, y_train_res)

print('Validation Results')
print('Mean Accuracy:',clf_rf.score(x_val, y_val))
print('Recall:',recall_score(y_val, clf_rf.predict(x_val))) 
print('\nTest Results') 
print('Mean Accuracy:',clf_rf.score(test_features, test_target))
print('Recall:',recall_score(test_target, clf_rf.predict(test_features))) 

Validation Results
Mean Accuracy: 0.9709773794280837
Recall: 0.0625

Test Results
Mean Accuracy: 0.9775036284470247
Recall: 0.03225806451612903
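
The gap between the accuracy and recall figures above is the pattern a heavily imbalanced target tends to produce. As a minimal illustration with invented labels (not the Airbnb data), a classifier that always predicts the majority class reaches high accuracy while finding none of the positives:

```python
from sklearn.metrics import accuracy_score, recall_score

# Invented labels: 97 majority-class rows and 3 minority-class rows.
y_true = [0] * 97 + [1] * 3

# A "model" that never predicts the minority class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print('Mean Accuracy:', accuracy)
print('Recall:', recall)
```

For this reason accuracy alone says little here, and recall (or precision and recall together) on the minority class is the more informative number to track.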

Does anyone have any ideas on how I could better optimize my model, or change it so that it makes more accurate predictions from this data?
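
Two adjustments that are sometimes tried in this situation are class weighting and a lower decision threshold. The sketch below is not tuned for the Airbnb data: it uses a synthetic dataset from make_classification, and the 0.2 threshold, the class balance, and all variable names are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives), standing in for the real features.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=12)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12)

# class_weight='balanced' reweights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=25, class_weight='balanced',
                             random_state=12)
clf.fit(X_train, y_train)

# Predicted probability of the positive class for each validation row.
positive_proba = clf.predict_proba(X_val)[:, 1]

# Compare the usual 0.5 cut-off with a lower, arbitrarily chosen one.
recall_default = recall_score(y_val, (positive_proba >= 0.5).astype(int))
recall_low = recall_score(y_val, (positive_proba >= 0.2).astype(int))

print('Recall at 0.5:', recall_default)
print('Recall at 0.2:', recall_low)
```

Lowering the threshold can only keep or raise recall, but it usually costs precision, so the cut-off would need to be chosen on the validation set against whatever trade-off matters for the ratings use case.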

...