So, I pieced together some code from the Internet for my research work and for practice. I'm working on the Denver crime dataset, which looks roughly like this:
INCIDENT_ID 446399 non-null int64
OFFENSE_ID 446399 non-null int64
OFFENSE_CODE 446399 non-null int64
OFFENSE_CODE_EXTENSION 446399 non-null int64
OFFENSE_TYPE_ID 446399 non-null object
OFFENSE_CATEGORY_ID 446399 non-null object
FIRST_OCCURRENCE_DATE 446399 non-null object
LAST_OCCURRENCE_DATE 149714 non-null object
REPORTED_DATE 446399 non-null object
INCIDENT_ADDRESS 400668 non-null object
GEO_X 442927 non-null float64
GEO_Y 442927 non-null float64
GEO_LON 442927 non-null float64
GEO_LAT 442927 non-null float64
DISTRICT_ID 446399 non-null int64
PRECINCT_ID 446399 non-null int64
NEIGHBORHOOD_ID 446399 non-null object
IS_CRIME 446399 non-null int64
IS_TRAFFIC 446399 non-null int64
dtypes: float64(4), int64(8), object(7)
[image: first records of crime.csv]
I applied this code to it:
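import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

def normalize(data):  # feature normalization: mean-centred, scaled by the range
    data = (data - data.mean()) / (data.max() - data.min())
    return data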
num2month= {1:'jan',2:'feb',3:'mar',4:'apr',5:'may',6:'jun',7:'jul',8:'aug',9:'sep',10:'oct',11:'nov',12:'dec'}
crime = pd.read_csv('crime.csv')
train, test = train_test_split(crime, test_size=0.2)
test.to_csv('test.csv')
train.to_csv('train.csv')
train=pd.read_csv('train.csv', parse_dates = ['FIRST_OCCURRENCE_DATE'])
test=pd.read_csv('test.csv', parse_dates = ['FIRST_OCCURRENCE_DATE'])
#for training data
le_crime = preprocessing.LabelEncoder()
crime = le_crime.fit_transform(train.OFFENSE_CATEGORY_ID)
train['FIRST_OCCURRENCE_DATE'] = pd.to_datetime(train['FIRST_OCCURRENCE_DATE'])
train['FIRST_OCCURRENCE_DATE(DAYOFWEEK)'] = train['FIRST_OCCURRENCE_DATE'].dt.day_name()  # .dt.weekday_name was removed in pandas 1.0
train['FIRST_OCCURRENCE_DATE(YEAR)'] = train['FIRST_OCCURRENCE_DATE'].dt.year
train['FIRST_OCCURRENCE_DATE(MONTH)'] = train['FIRST_OCCURRENCE_DATE'].dt.month
train['FIRST_OCCURRENCE_DATE(DAY)'] = train['FIRST_OCCURRENCE_DATE'].dt.day
train['Year'] = train['FIRST_OCCURRENCE_DATE'].dt.year
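train['PdDistrict'] = train['DISTRICT_ID']  # assumed: OFFENSE_CATEGORY_ID here would leak the target and would not yield the columns 1, 2 used in `features`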
# Get binarized weekdays, districts, and months; the hour stays numeric.
train['Days'] = train['FIRST_OCCURRENCE_DATE(DAYOFWEEK)']
days = pd.get_dummies(train.Days)
district = pd.get_dummies(train.PdDistrict)
month = pd.get_dummies(train.FIRST_OCCURRENCE_DATE.dt.month.map(num2month))
hour = train.FIRST_OCCURRENCE_DATE.dt.hour
submit = pd.read_csv('submit.csv')
#Build new array
new_datatr = pd.concat([hour, month, days, district], axis=1)
new_datatr['X']=normalize(train.GEO_LON)
new_datatr['Y']=normalize(train.GEO_LAT)
new_datatr['hour']=normalize(train.FIRST_OCCURRENCE_DATE.dt.hour)
new_datatr['crime']=crime
new_datatr['dark'] = train.FIRST_OCCURRENCE_DATE.dt.hour.apply(lambda x: 1 if (x >= 18 or x < 6) else 0)
train_proc = new_datatr
# The same code is then repeated for the test data set (omitted here)
test_proc = new_datatr
features = [1,2,
'jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec',
'Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday',
#'X','Y'
]
training, validation = train_test_split(train_proc, train_size=.67)
model = BernoulliNB()
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted)
model = BernoulliNB()
model.fit(train_proc[features], train_proc['crime'])
predicted = model.predict_proba(test_proc[features])
le_crime = preprocessing.LabelEncoder()
crime = le_crime.fit_transform(train.OFFENSE_CATEGORY_ID)
result=pd.DataFrame(predicted, columns=le_crime.classes_)
result.to_csv('submit.csv', index = True, index_label = 'Id' )
In the end, when I open the submission file, I find class-membership percentages (probabilities) for each instance, one column per category.
I want to get a file that predicts the exact OFFENSE_CATEGORY_ID for each record, not the class-membership probabilities.
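To illustrate the kind of output I am after, here is a minimal sketch on made-up toy data. It assumes that `model.predict` combined with `LabelEncoder.inverse_transform` is a suitable way to turn the model output back into category names. The columns `is_dark` and `is_weekend`, the two category values, and the file name `submit_labels.csv` are hypothetical and are not taken from crime.csv.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data (hypothetical values, not from crime.csv).
toy = pd.DataFrame({
    "is_dark":    [1, 1, 0, 0, 1, 0, 1, 0],
    "is_weekend": [1, 0, 1, 0, 1, 0, 0, 1],
    "OFFENSE_CATEGORY_ID": ["burglary", "burglary", "larceny", "larceny",
                            "burglary", "larceny", "burglary", "larceny"],
})
features = ["is_dark", "is_weekend"]

# Encode the category names as integers, as in the code above.
le_crime = LabelEncoder()
y = le_crime.fit_transform(toy["OFFENSE_CATEGORY_ID"])

model = BernoulliNB()
model.fit(toy[features], y)

# predict() returns one encoded class per row instead of one probability per class.
predicted_codes = model.predict(toy[features])

# Map the integer codes back to the original category names.
predicted_labels = le_crime.inverse_transform(predicted_codes)

# Alternative route starting from the probabilities: take the most likely column.
proba = model.predict_proba(toy[features])
labels_from_proba = le_crime.classes_[np.argmax(proba, axis=1)]

# One predicted category per row, written in the same style as submit.csv.
result = pd.DataFrame({"OFFENSE_CATEGORY_ID": predicted_labels})
result.to_csv("submit_labels.csv", index=True, index_label="Id")
```

In my real code this would presumably mean calling `model.predict(test_proc[features])` instead of `predict_proba`, and passing the result through `le_crime.inverse_transform` before writing the file.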