Is there a way to tokenize strings with an n-gram range, like the features you get from CountVectorizer? For example, with ngram_range=(1,2):
strings = ['this is the first sentence','this is the second sentence']
to
[['this','this is','is','is the','the','the first','first','first sentence','sentence'],['this','this is','is','is the','the','the second','second','second sentence','sentence']]
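For reference, a minimal sketch of how CountVectorizer itself produces unigram+bigram features for these strings (note that its analyzer lists all unigrams first and then all bigrams, not the interleaved order shown above); the variable names are only illustrative:

from sklearn.feature_extraction.text import CountVectorizer

strings = ['this is the first sentence', 'this is the second sentence']

# build_analyzer() returns the callable that CountVectorizer applies to each document
analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
for s in strings:
    print(analyzer(s))

# ['this', 'is', 'the', 'first', 'sentence', 'this is', 'is the', 'the first', 'first sentence']
# ['this', 'is', 'the', 'second', 'sentence', 'this is', 'is the', 'the second', 'second sentence']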
Update: iterating over n, I get:
from nltk.util import ngrams

sentence = 'this is the first sentence'

nrange_array = []
for n in range(1, 3):
    # collect the unigram and bigram generators separately
    nrange = ngrams(sentence.split(), n)
    nrange_array.append(nrange)

for nrange in nrange_array:
    for grams in nrange:
        print(grams)
output:
('this',)
('is',)
('the',)
('first',)
('sentence',)
('this', 'is')
('is', 'the')
('the', 'first')
('first', 'sentence')
and I want:
('this','this is','is','is the','the','the first','first','first sentence','sentence')
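A minimal sketch of one way to produce that interleaved order, assuming simple whitespace tokenization (the helper name ngram_range_tokens is just illustrative):

def ngram_range_tokens(sentence, min_n=1, max_n=2):
    # For each start position, emit every n-gram of length min_n..max_n,
    # which interleaves unigrams and bigrams instead of grouping them by length.
    tokens = sentence.split()
    result = []
    for i in range(len(tokens)):
        for n in range(min_n, max_n + 1):
            if i + n <= len(tokens):
                result.append(' '.join(tokens[i:i + n]))
    return result

print(ngram_range_tokens('this is the first sentence'))
# ['this', 'this is', 'is', 'is the', 'the', 'the first', 'first', 'first sentence', 'sentence']

nltk.util.everygrams(tokens, 1, 2) yields the same set of n-grams, but whether its output comes out interleaved by position or grouped by length depends on the NLTK version, so the explicit loop above is the safer illustration.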