I am trying to remove (English) stop words from two columns in a dataframe; see the screenshot. However, I found that after applying this process the meaning of a review changed. For example, "Not recommended" was changed to "Recommended". What is the best way to remove stop words while keeping the meaning of the original text unchanged? This is my code and the results:
from nltk import word_tokenize
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
df['Text_after_removed_stopwords'] = df['Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
print()
print('###Text after removed stopwords###'+'\n'+df['Text_after_removed_stopwords'][1])
print()
print('###Text before removed stopwords###'+'\n'+ df['Text'][1])
print()
df['Summary_after_removed_stopwords'] = df['Summary'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
print('###Summary after removed stopwords###'+'\n'+df['Summary_after_removed_stopwords'][1])
print()
print('###Summary before removed stopwords###'+'\n'+df['Summary'][1])
###Text after removed stopwords###
product arrived labeled jumbo salted peanutsthe peanuts actually
small sized unsalted sure error vendor intended represent product
jumbo
###Text before removed stopwords###
product arrived labeled as jumbo salted peanutsthe peanuts were
actually small sized unsalted not sure if this was an error or if
the vendor intended to represent the product as jumbo
###Summary after removed stopwords###
advertised
###Summary before removed stopwords###
not as advertised
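One way to keep the polarity of phrases like "not as advertised" is to exclude negation words from the stop list before filtering. The sketch below uses a small hand-picked stop set for illustration (an assumption; NLTK's English list is much larger but contains the same negation words such as "not" and "no"):

```python
# Minimal sketch: remove stop words but keep negations, so the sentiment
# of the original text is preserved.
stop = {'as', 'the', 'a', 'an', 'is', 'was', 'not', 'no'}  # illustrative subset
negations = {'not', 'no', 'nor'}
stop -= negations  # keep negation words in the vocabulary

def remove_stopwords(text):
    # Drop any token that is still in the reduced stop set.
    return ' '.join(w for w in text.split() if w not in stop)

print(remove_stopwords('not as advertised'))  # -> not advertised
```

With NLTK you would build `stop` from `set(stopwords.words('english'))` and subtract the same `negations` set before applying the lambda, so "not as advertised" becomes "not advertised" instead of "advertised".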