I'm working on text classification with scikit-learn's TfidfVectorizer, and my documents contain whitespace. For my classification task the space character is part of my vocabulary. The question: how do I include the space character in my vocabulary?
Example code:
from sklearn.feature_extraction.text import TfidfVectorizer

vocab = [' ', '<', '>', '"', '#', '\'']
vectorizer = TfidfVectorizer(analyzer='char', vocabulary=set(vocab))
X = vectorizer.fit_transform(df['x'])
y = df['y']
print(vectorizer.vocabulary_)
It raises this error:
Traceback (most recent call last):
File "/empty/path/script.py", line 158, in <module>
tf_idf_analysis(http_df);
File "/empty/path/script.py", line 96, in tf_idf_analysis
X = vectorizer.fit_transform(df['x']);
File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
self.fixed_vocabulary_)
File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
for feature in analyze(doc):
File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 255, in <lambda>
return lambda doc: self._char_ngrams(preprocess(self.decode(doc)))
File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 158, in _char_ngrams
text_document = self._white_spaces.sub(" ", text_document)
TypeError: expected string or bytes-like object
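For what it's worth, the TypeError in the traceback is raised inside the char analyzer's whitespace normalization, which suggests some entries of df['x'] are not strings (e.g. NaN or None), rather than a problem with the space character itself. A minimal sketch of what I'd expect to work, using a hypothetical document list with a None standing in for a missing value:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; the None simulates a missing value like NaN in a DataFrame column
docs = ["<a href='#'> link </a>", None, '"quoted" text']

# Coerce every entry to a string first; with pandas this would be
# df['x'].fillna('').astype(str)
clean_docs = ["" if d is None else str(d) for d in docs]

# Pass the vocabulary as a list for a deterministic feature order
# (a set also works, but the column order is then not under your control)
vocab = [' ', '<', '>', '"', '#', "'"]
vectorizer = TfidfVectorizer(analyzer='char', vocabulary=vocab)
X = vectorizer.fit_transform(clean_docs)

# The space character is a perfectly valid feature for analyzer='char'
print(vectorizer.vocabulary_)  # {' ': 0, '<': 1, '>': 2, '"': 3, '#': 4, "'": 5}
print(X.shape)                 # (3, 6)
```

Note that `analyzer='char'` collapses runs of whitespace into a single space before counting, so `' '` in the vocabulary matches any whitespace run as one occurrence.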