I am trying to implement spaCy tokenization through Cython to optimize for speed. The input is a list of texts, and the expected output is a nested 2-D list, where each sub-list is the list of tokens for one text. I am following the post at https://github.com/huggingface/100-times-faster-nlp/blob/master/100-times-faster-nlp-in-python.ipynb for the implementation.
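For concreteness, a toy version of the intended input and output (hypothetical data, not from the post) might look like this: punctuation and number-like tokens are filtered out, and everything is lowercased first:

```python
texts = ["The quick brown fox!", "Hello, world 42"]

# One sub-list of kept tokens per input text; "!", "," and "42" are
# dropped (IS_PUNCT / LIKE_NUM), and the text is lowercased.
expected = [["the", "quick", "brown", "fox"], ["hello", "world"]]

assert len(expected) == len(texts)
```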
# cython: infer_types=True
import re
cimport numpy
from cpython cimport *
from cymem.cymem cimport Pool
from spacy.tokens.doc cimport Doc
from spacy.lexeme cimport Lexeme
from spacy.structs cimport TokenC
from spacy.typedefs cimport hash_t
from spacy.attrs cimport IS_SPACE, IS_PUNCT, LIKE_NUM
cdef struct DocElement:
    TokenC* c
    int length

cdef list tokenize(DocElement* docs, int n_docs):
    cdef list out = []
    cdef int i
    cdef list sub_out
    for doc in docs[:n_docs]:
        sub_out = []
        for c in doc.c[:doc.length]:
            if (not Lexeme.c_check_flag(c.lex, IS_SPACE) and not Lexeme.c_check_flag(c.lex, IS_PUNCT)
                    and not Lexeme.c_check_flag(c.lex, LIKE_NUM)):
                sub_out.append(c.lex.lower)
        out.append(sub_out)
    return out
def tokenization(text_ls, stopwords, tokenizer):
    cdef int i, n_out, n_docs = len(text_ls)
    cdef Pool mem = Pool()
    cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))
    cdef Doc doc
    text_ls = [re.sub('[^\\w ]+', ' ', text).lower() if text else '' for text in text_ls]
    text_ls = tokenizer.tokenizer.pipe(text_ls)
    for i, doc in enumerate(text_ls):
        docs[i].c = doc.c
        docs[i].length = (<Doc>doc).length
    stops = set()
    for word in stopwords:
        stops.add(tokenizer.vocab.strings[word])
    out = tokenize(docs, n_docs)
    out = [[tokenizer.vocab.strings[item] for item in subset] for subset in out]
    out = [[item for item in subset if item not in stops] for subset in out]
    return out
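For reference, I build it with a setup.py along these lines (a sketch; the file name spacy_test.pyx is an assumption based on the module name used in the test call below, and spaCy's pxd files require building as C++):

```python
# setup.py (sketch; paths and module name are assumptions)
import numpy
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "spacy_test",
        ["spacy_test.pyx"],
        language="c++",                      # spaCy's cimports need C++
        include_dirs=[numpy.get_include()],  # for `cimport numpy`
    )
]

setup(ext_modules=cythonize(extensions))
```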
I use setup.py to build this package with the C++ language. What concerns me about this design is that the tokenize function requires C-struct input, yet its declared return value is a Python list. I tested it with a single text string and it works:
import spacy
import spacy_test  # the compiled Cython module
text = 'SINGAPORE - Protecting the jobs of Singaporeans and ensuring the survival of businesses will be the Government\'s primary focus, said Trade and Industry Minister Chan Chun Sing, as the country hunkers down for what could be a protracted battle with the novel coronavirus.\n\n"I would like to reassure Singaporean businesses and workers that we stand together with them. We do have the means to help them tide over this difficult moment but we must do this with a long-term perspective," said Mr Chan.\n\nThe impact of the Wuhan virus could be "wider, deeper and longer" than that of the severe acute respiratory syndrome (Sars) epidemic in 2003, and Singaporeans need to be mentally prepared for this, he said, adding that measures put in place must be sustainable.\n\n\nHe was speaking to reporters after visiting Oasia Hotel Downtown with Manpower Minister Josephine Teo, where they inspected precautionary measures put in place by the hotel after a hotel guest was found to have come down with the virus.\n\nSingapore\'s 13th case of the novel coronavirus, a 73-year-old female Chinese national, had stayed there.\n\nMr Chan\'s comments echoed those he made earlier in the day at a Chinese New Year lunch for residents of Tanjong Pagar GRC and Radin Mas constituency.\n\n\nIn that speech, Mr Chan called on Singaporeans to gird themselves "psychologically, emotionally, economically and socially" as the battle with the virus could be one for the long haul.\n\nGet Wuhan virus alerts\nReceive e-mail updates and top stories from The Straits Times.\n\nEnter your e-mail\n Sign up\nBy signing up, you agree to our Privacy Policy and Terms and Conditions.\n\nPrevious epidemics have lasted from a few months to a year, but they have had wide implications, disrupting global supply chains and affecting industries from tourism to manufacturing.\n\n"Because we don\'t know how long this situation will last, all the measures we take, be it in health, or economics and jobs... 
must be sustainable. We cannot just be taking measures for the short haul, thinking that it will blow over," said Mr Chan.\n\nThe novel coronavirus, which first emerged in the Chinese city of Wuhan in December last year, has so far proved to be more infectious than Sars.\n\nIt seems, however, to be less deadly, with a fatality rate of 2 to 3 per cent in China, said Mr Chan. On the other hand, Sars had a fatality rate of about 9.6 per cent.\n\nRelated Story\nWuhan virus: Get latest updates\nRelated Story\nInteractive: What we know so far about the Wuhan virus\nRelated Story\nWuhan virus: 15 people refused entry into Singapore following new travel restrictions\nChina has been grappling with containing the infectious virus, which has sickened thousands and killed over 300 people. So far 18 people, including two Singaporeans, have been found infected by the virus here.\n\nLater at the hotel, while Mr Chan said it was still too early to to put a number to the economic hit from the outbreak, he said the Government would be taking several measures with immediate effect to help tourism businesses mitigate the impact.\n\nIt will waive licence fees for hotels, travel agents and tourist guides, as well as defray the cleaning and disinfection costs of hotels that had confirmed and suspected cases of the novel coronavirus.\n\nThis initial package is part of a full raft of measures that will be detailed by Finance Minister Heng Swee Keat at the upcoming Budget speech on Feb 18.\n\nHotel operators typically have to pay between $300 and $500 to renew their licences yearly, depending on the number of rooms each hotel has.\n\nThose hotels where suspected and confirmed cases of the virus had been found, have also had to do enhanced environmental cleaning and disinfection - the Singapore Tourism Board will bear up to half the cost of such cleaning fees.'
tokenizer = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner', 'textcat'])
spacy_test.tokenization([text], {}, tokenizer)
However, when I use a long list of texts (for this I repeat the same text 50 times), it produces a segmentation fault. If I make the list even longer, it throws "KeyError: [E018] Can't retrieve string for hash '10283392'." I tried declaring the list inside the tokenize function instead, but that did not solve it. I am new to Cython, and before this I had only written plain C extensions. What would be the right way to write this kind of function?
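A likely cause of both the segmentation fault and the E018 hash error is object lifetime rather than the list return type: `tokenizer.tokenizer.pipe(text_ls)` returns a generator, so each `Doc` can be garbage-collected as soon as the loop in `tokenization` moves past it, leaving `docs[i].c` pointing at freed memory. The pure-Python sketch below (no spaCy required; `FakeDoc` is a stand-in for `Doc`) shows the difference between consuming a generator directly and materialising it into a list first:

```python
import gc
import weakref

class FakeDoc:
    """Stand-in for spaCy's Doc: in the real code, doc.c points into
    memory owned by the Doc object, so the pointer is only valid while
    the Doc itself is alive."""
    pass

def pipe(n):
    # Mimics tokenizer.pipe(), which yields Docs lazily.
    for _ in range(n):
        yield FakeDoc()

# Case 1: consume the generator directly. Each object is dropped as
# soon as the loop moves on, so a raw pointer copied out of it dangles.
gen_refs = []
for doc in pipe(3):
    gen_refs.append(weakref.ref(doc))
del doc
gc.collect()
print(all(r() is None for r in gen_refs))       # True: the objects are gone

# Case 2: materialise the pipe into a list first. The list keeps every
# object alive, so pointers stored in DocElement remain valid as long
# as the list is in scope.
docs = list(pipe(3))
kept_refs = [weakref.ref(d) for d in docs]
gc.collect()
print(all(r() is not None for r in kept_refs))  # True: still alive
```

Under that assumption, the fix in `tokenization` would be to replace `text_ls = tokenizer.tokenizer.pipe(text_ls)` with `doc_list = list(tokenizer.tokenizer.pipe(text_ls))` and keep `doc_list` in scope until after `tokenize(docs, n_docs)` returns, which matches what the huggingface notebook does by passing in a pre-built list of Doc objects.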