How do I use sklearn's CountVectorizer?
0 votes
/ 09 September 2018

I tried to use sklearn's CountVectorizer, but it didn't work. I used a corpus with two sample lines (I'm going to import data from Wikipedia later, but for now I just need the system to work):

__label__1 Buyer beware: This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a "worst book" contest.
__label__2 Glorious story: I loved Whisper of the wicked saints. The story was amazing and I was pleasantly surprised at the changes in the book. I am not normaly someone who is into romance novels, but the world was raving about this book and so I bought it. I loved it !

This is my code for creating the word-level vectorizer:

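import pandas
from sklearn import model_selection, preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# load the dataset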
data = open('corpus.txt').read()
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(content[1:])

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

I get the following error:

AttributeError: 'list' object has no attribute 'lower'

Even when I convert trainDF['text'] to a string, I get another error:

ValueError: Iterable over raw text documents expected, string object received.

What should I do?

1 Answer

0 votes
/ 23 January 2019

Pass lowercase=False as an argument to CountVectorizer().
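Both errors in the question appear to come from the shape of the input rather than from the vectorizer settings: line.split() turns every document into a list of tokens, while CountVectorizer with its default analyzer expects an iterable of plain strings (and a single string is rejected as well, which would explain the second error). A minimal sketch of one way around this, assuming the goal is ordinary word counts; the two raw_lines below are shortened stand-ins for corpus.txt, not the real file:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened stand-in lines in the same "__label__N text" layout as the
# question's corpus (an assumption; the real data lives in corpus.txt).
raw_lines = [
    "__label__1 Buyer beware: this is a self-published book",
    "__label__2 Glorious story: I loved this book",
]

labels, texts = [], []
for line in raw_lines:
    if not line.strip():
        continue  # skip blank lines, e.g. a trailing newline at end of file
    # Split off the label only; the rest of the line stays one string.
    label, _, text = line.partition(" ")
    labels.append(label)
    texts.append(text)

# Same vectorizer settings as in the question.
count_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")

# fit_transform takes a list of strings and returns a sparse matrix
# with one row per document and one column per vocabulary entry.
counts = count_vect.fit_transform(texts)

print(counts.shape[0])                  # number of documents
print(count_vect.vocabulary_["book"])   # column index assigned to "book"
```

If the documents have to stay as pre-tokenized lists, another option is to pass a callable analyzer such as analyzer=lambda tokens: tokens, so that CountVectorizer skips its own preprocessing and tokenization. As far as I can tell, lowercase=False on its own only suppresses the .lower() call, and the default tokenizer still expects a string.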

...