I want to use Isolation Forest to find outliers in textual data. I am using Python's Scikit-Learn library. I managed to find outliers in my data frame, but I always get this error when I transform new input in order to make a prediction on it:
ValueError: Expected 2D array, got 1D array instead:
array=['good work'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
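For reference, the two reshape calls named in the message behave as follows on a plain NumPy array. This is a minimal sketch with made-up numeric values (the names `values`, `column` and `row` are illustrative only), separate from the code in question:

```python
import numpy as np

# A 1D array with three entries, shape (3,)
values = np.array([1.0, 2.0, 3.0])

# reshape(-1, 1): three samples, one feature each -> shape (3, 1)
column = values.reshape(-1, 1)

# reshape(1, -1): one sample with three features -> shape (1, 3)
row = values.reshape(1, -1)

print(values.shape, column.shape, row.shape)
```

Both results are 2D, which is the layout scikit-learn estimators and transformers expect (rows are samples, columns are features).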
What am I doing wrong? Also, should I use OneHotEncoder, LabelEncoder, or CountVectorizer?
Here is my code:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
textual_data = ['i love you', 'Ilove your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'such a nice day', 'very nice', 'bad', 'beautiful', 'i like you',
'thats nice', 'terrible work', 'you are ugly', 'wow thats so cool', 'thats cool', 'you are funny', 'great job', 'good job', 'wrong aswer', 'lol',
'thats right', 'you have done amazing work', 'your job is good', 'you have nice eyes', 'this was wrong', 'thats terrible', 'ugly', 'i do not like that',
'we had great time on vecation', 'watter is so dirty', 'this smells very bad', 'nice smell', 'smells nice', 'have a great day', 'enjoy your summer']
df = pd.DataFrame({'my text': textual_data})
x = df
# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])])
x = encoder.fit_transform(x)
isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)
#the error is in this part of code
list_of_val = ['good work', 'you are wrong', 'this was amazing', 'great work', 'terrible work from you']
for val in list_of_val:
    input_par = encoder.transform([val])  # ERROR
    outlier = model.predict(input_par)
    # print(outlier)
    if outlier[0] == -1:
        print('Values', val, 'are outliers')
    else:
        print('Values', val, 'are not outliers')
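
For comparison, here is a minimal sketch of the CountVectorizer route, assuming a bag-of-words representation is acceptable for this data. It is not a verified answer to the question. The names `train_texts`, `new_texts` and `vectorizer` are illustrative, the training list is a shortened subset of the one above, `random_state=0` is added only for reproducibility, and the `behaviour` argument is left out because recent scikit-learn releases no longer accept it:

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative subset of the training phrases
train_texts = ['i love you', 'i like that', 'thats good', 'amazing', 'wrong',
               'very nice', 'bad', 'beautiful', 'terrible work', 'great job',
               'good job', 'you have done amazing work', 'this was wrong']

# CountVectorizer takes a 1D iterable of strings and returns a 2D sparse matrix
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

forest = IsolationForest(contamination='auto', random_state=0)
forest.fit(X_train)

new_texts = ['good work', 'you are wrong', 'this was amazing']

# Reuse the fitted vocabulary; words unseen during fit are simply ignored
X_new = vectorizer.transform(new_texts)
predictions = forest.predict(X_new)  # -1 marks an outlier, 1 an inlier

for text, label in zip(new_texts, predictions):
    status = 'an outlier' if label == -1 else 'not an outlier'
    print(repr(text), 'is', status)
```

With this layout each phrase becomes one row of word counts, so a single new string is passed as a one-element list and no manual reshape is involved.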