I am trying to classify data at the token level using scikit-learn. I already have a train and test split. The data is in the following tab-separated (\t) format:
-----------------
token label
-----------------
way 6
to 6
reduce 6
the 6
amount 6
of 6
traffic 6
....
public 2
transport 5
is 5
a 5
key 5
factor 5
to 5
minimize 5
....
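One thing that may be worth checking when loading token-level data like this: with its default settings, pandas interprets tokens such as "null" or "NA" as missing values and treats a double quote as the start of a quoted field. The sketch below uses a small hypothetical sample (not taken from the real files) and shows one way to keep every token as a literal string. Whether the real data contains such tokens is an assumption.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same token<TAB>label layout as above.
sample = 'way\t6\nnull\t6\n"\t2\nNA\t5\n'

df = pd.read_csv(
    io.StringIO(sample),
    names=['token', 'label'],
    sep='\t',
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary token text
    keep_default_na=False,   # keep tokens such as "null" or "NA" as strings
)
print(df)
```

If the token column loads cleanly this way, no rows should be turned into NaN during parsing.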
The data is distributed as follows:
             Training Data   Test Data
# Total:     119490          29699
# Class 0:   52631           13490
# Class 1:   35116           8625
# Class 2:   17968           4161
# Class 3:   8658            2088
# Class 4:   3002            800
# Class 5:   1201            302
# Class 6:   592             153
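Because the classes are heavily imbalanced, a majority-class baseline gives a useful reference point for any score. The sketch below computes it from the counts in the table alone. It assumes the "# Total" row is the number of test tokens and that a trivial classifier always predicts the most frequent training class.

```python
# Counts copied from the table above.
train_counts = {0: 52631, 1: 35116, 2: 17968, 3: 8658, 4: 3002, 5: 1201, 6: 592}
test_counts = {0: 13490, 1: 8625, 2: 4161, 3: 2088, 4: 800, 5: 302, 6: 153}
test_total = 29699  # "# Total" row, assumed to be the number of test tokens

# A trivial classifier that always predicts the most frequent training class.
majority_class = max(train_counts, key=train_counts.get)

# Accuracy equals the share of test tokens that belong to that class.
baseline_accuracy = test_counts[majority_class] / test_total

# For the majority class, recall is 1 and precision equals the accuracy;
# every other class is never predicted, so its F1 is 0.
precision, recall = baseline_accuracy, 1.0
majority_f1 = 2 * precision * recall / (precision + recall)
baseline_macro_f1 = majority_f1 / len(test_counts)

print(f"majority class:    {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline macro-F1: {baseline_macro_f1:.4f}")
```

Under these assumptions the baseline accuracy is about 0.45, so an accuracy near 0.49 would be only slightly better than always predicting class 0.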
I am trying an SVM, but the F1-score is quite poor.
Code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

if __name__ == '__main__':
    # Read the tab-separated files (the path constants are not shown here)
    train_df = pd.read_csv(TRAINING_DATA_PATH, names=['token', 'label'], sep='\t')
    test_df = pd.read_csv(TEST_DATA_PATH, names=['token', 'label'], sep='\t')

    # Split into features (tokens as unicode strings) and labels
    train_X = train_df['token'].astype('U')
    test_X = test_df['token'].astype('U')
    train_y = train_df['label']
    test_y = test_df['label']

    # Linear SVM: bag-of-words counts -> TF-IDF -> hinge-loss SGD
    sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              max_iter=100, tol=None)),
    ])

    f1_list = []
    acc_list = []

    # 5-fold cross-validation on the training data
    cv = KFold(n_splits=5)
    for train_index, val_index in cv.split(train_X):
        # KFold yields positional indices, so select rows with .iloc
        X_train, X_val = train_X.iloc[train_index], train_X.iloc[val_index]
        y_train, y_val = train_y.iloc[train_index], train_y.iloc[val_index]

        sgd.fit(X_train, y_train)
        predicted = sgd.predict(X_val)

        f1_list.append(f1_score(y_val, predicted, average='macro'))
        acc_list.append(accuracy_score(y_val, predicted))

    print(f1_list)
    print(acc_list)

    # Evaluate on the test set (sgd is still the model fitted on the last fold).
    # scikit-learn metrics take y_true first, then y_pred.
    sgd_pred = sgd.predict(test_X)
    print('SVM accuracy: %s' % accuracy_score(test_y, sgd_pred))
    print('SVM F1-macro: %s' % f1_score(test_y, sgd_pred, average='macro'))
    print('SVM F1-weighted: %s' % f1_score(test_y, sgd_pred, average='weighted'))
The results for the linear SVM are as follows:
SVM accuracy: 0.49493248930940437
SVM F1-macro: 0.2677988484198396
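Macro-F1 averages the per-class F1 scores with equal weight, so classes that are never predicted correctly pull it well below accuracy. A per-class breakdown can show whether that is happening. The sketch below uses hypothetical toy labels, not the real predictions.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels (not from the dataset): a model that mostly predicts class 0.
y_true = [0, 0, 0, 0, 1, 1, 2, 6]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0]

labels = sorted(set(y_true))

# average=None returns one F1 score per label instead of a single number.
per_class = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
for label, score in zip(labels, per_class):
    print(f"class {label}: F1 = {score:.3f}")

macro = f1_score(y_true, y_pred, labels=labels, average='macro', zero_division=0)
accuracy = accuracy_score(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, macro-F1 = {macro:.3f}")
```

On the real data, test_y and sgd_pred from the code above would take the place of the toy lists.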
How can I improve the performance?