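from pprint import pprint
from time import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MLPClassifier(hidden_layer_sizes=(1, 1, 1), shuffle=False)),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
}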
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(train_data['body'].values.tolist(),
                    train_data['category'].values.tolist())
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters_a = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters_a[param_name]))
    y_true = test_data['category'].values.tolist()
    y_pred = grid_search.predict(test_data['body'].values.tolist())
    print(classification_report(y_true, y_pred))
So far, my output looks like this:
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'vect__max_df': (0.5, 0.75, 1), 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 29.4min finished
done in 1779.666s
Best score: 0.352
Best parameters set:
    vect__max_df: 0.75
    vect__ngram_range: (1, 1)
              precision    recall  f1-score   support

    anorexia       0.33      1.00      0.50      5000
     neither       0.00      0.00      0.00      5000
     obesity       0.00      0.00      0.00      5000

    accuracy                           0.33     15000
   macro avg       0.11      0.33      0.17     15000
weighted avg       0.11      0.33      0.17     15000
c:\users\haeir\appdata\local\programs\python\python36\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
When I print(y_true), I get exactly 5000 anorexia values, 5000 neither values, and 5000 obesity values, but print(y_pred) shows only anorexia. My training and test datasets each contain 15000 rows, with 5000 rows per category in both.
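For reference, the check described above can be written more compactly than printing the full lists. This is only a minimal sketch: `label_counts`, `y_true_demo`, and `y_pred_demo` are hypothetical names, and the demo lists are stand-ins that mirror the counts I reported, not my actual data.

```python
from collections import Counter


def label_counts(labels):
    """Return a dict mapping each label to the number of times it occurs."""
    return dict(Counter(labels))


# Stand-in lists that mirror the reported distribution (assumption: not the real data).
y_true_demo = ["anorexia"] * 5000 + ["neither"] * 5000 + ["obesity"] * 5000
y_pred_demo = ["anorexia"] * 15000

# A balanced y_true next to a single-class y_pred shows the model has
# collapsed to predicting one label for every sample.
print("true:", label_counts(y_true_demo))
print("pred:", label_counts(y_pred_demo))
```

Passing the real `y_true` and `y_pred` lists to the same helper should show the same pattern if the predictions really are all one class.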