NLTK Tokenizer encoding problem
0 votes
/ 23 November 2018

After tokenization, my sentences contain a lot of strange characters. How can I remove them? This is my code:

import glob
import io

import nltk

# preprocess_data and the generate_summary_* functions are not shown in the question.


def summary(filename, method):
    list_names = glob.glob(filename)
    original_data = []
    topic_data = []
    print(list_names)
    # Load the Punkt sentence tokenizer once instead of once per line.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    for file_name in list_names:
        article = []
        # utf-8-sig drops a leading byte order mark if the file has one.
        with io.open(file_name, "r", encoding="utf-8-sig") as f:
            article_temp = f.readlines()
        for line in article_temp:
            print(line)
            if line.strip():
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article.extend(sentences)
        original_data.append(article)
        topic_data.append(preprocess_data(article))
    if method == "orig":
        result = generate_summary_origin(topic_data, 100, original_data)
    elif method == "best-avg":
        result = generate_summary_best_avg(topic_data, 100, original_data)
    else:
        result = generate_summary_simplified(topic_data, 100, original_data)
    return result

print(line) prints a line of text, and print(sentences) prints the tokenized sentences from that line.

But sometimes the sentences contain strange characters after NLTK has processed them.

Assaly, who is a fan of both Pusha T and Drake, said he and his friends 
wondered if people in the crowd might boo Pusha T during the show, but 
said he never imagined actual violence would take place.

[u'Assaly, who is a fan of both Pusha T and Drake, said he and his 
friends wondered if people in\xa0the crowd might boo Pusha\xa0T during 
the show, but said he never imagined actual violence would take 
place.']

As in the example above: where do \xa0 and \xa0T come from?

1 Answer

0 votes
/ 23 November 2018
x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'

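# method 1: replace the non-breaking space (U+00A0) with an ordinary space
# (str.replace returns a new string, so the result has to be assigned)
x = x.replace(u'\xa0', u' ')

# method 2: NFKD normalization also maps U+00A0 to an ordinary space
import unicodedata
x = unicodedata.normalize('NFKD', x)

print(x)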

Output:

Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.
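
As for where \xa0 comes from: it is the escape for U+00A0 (NO-BREAK SPACE), which is most likely already present in the input file rather than added by NLTK. Printing a string writes the character itself, which looks like a normal space, while printing a list shows the repr() of each element, which escapes it. The short sketch below illustrates this, assuming Python 3; the sample string is a shortened copy of the sentence from the question.

```python
import unicodedata

# Shortened copy of the sentence from the question, with the same U+00A0 characters.
line = u'people in\xa0the crowd might boo Pusha\xa0T'

# U+00A0 is a distinct character that merely renders like a space.
nbsp_name = unicodedata.name(u'\xa0')
print(nbsp_name)

# str() of a list uses repr() for each element, so the character is escaped.
# This is the same text that print([line]) would display.
shown_in_list = str([line])
print(shown_in_list)

# The character is in the string itself, before any tokenization happens.
nbsp_count = line.count(u'\xa0')
print(nbsp_count)
```

So the tokenizer output is not corrupted; the list display only makes an existing character visible.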

Reference: unicodedata.normalize()
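
To remove the characters throughout the pipeline rather than from a single string, the normalization can be applied to each line before it is passed to the tokenizer. The following is only a sketch under stated assumptions: `clean_line` is a hypothetical helper name, and it uses the NFKC form instead of the NFKD form shown above. Both forms map U+00A0 to an ordinary space, but NFKD also splits accented letters into a base letter plus a combining mark, which may not be wanted in text that is later displayed.

```python
import unicodedata


def clean_line(text):
    """Hypothetical helper: fold compatibility characters such as U+00A0 into plain ones."""
    # NFKC is a swapped-in choice here; the answer above uses NFKD.
    return unicodedata.normalize('NFKC', text)


# Made-up sample containing a no-break space and a precomposed accented letter.
raw = u'Pusha\xa0T at the caf\xe9'

cleaned = clean_line(raw)
decomposed = unicodedata.normalize('NFKD', raw)

# ascii() escapes non-ASCII characters, so the difference is easy to see.
print(ascii(cleaned))
print(ascii(decomposed))
```

In the question's loop, this would mean calling something like `tokenizer.tokenize(clean_line(line))` in place of `tokenizer.tokenize(line)`.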

...