Preparing data for NER in CoNLL-2003 BIO format
17 March 2019

To train my own NER model on custom entities, I need my dataset preprocessed into the CoNLL-2003 format, as described at https://github.com/yongyuwen/sequence-tagging-ner.

How do I convert my plain-text documents (.txt) into that CoNLL-2003 format, i.e. one token per line with the columns [Word POS CHUNK NER]?

Note: the text documents already carry my custom NER tags.

Sample data (training_data.txt):

(Sample 1)
This Agreement of Work is made pursuant to the Global Developer Master Services Agreement effective as  of May 24, 2018, as amended on March 28, 2016, between MA[CUSTOM_ENTITY], lnc.[CUSTOM_ENTITY] whose registered office or principal place of  business is at 520 Madison Avenue, Ahmedabad, India, whose registered  office or principal place of business is at Building A, Atlantis de la,  Switzerland, collectively and ABC[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY] a wholly owned subsidiary of  Amazon Services Ltd and having its registered office at 113 Red Avenue, 10th Floor, New York, NY 13027.

(Sample 2)
This Agreement of Work is subject to the terms and conditions of the Master Agreement for Technology  Consulting Services between Vignesh[CUSTOM_ENTITY] Services[CUSTOM_ENTITY] Limited[CUSTOM_ENTITY] and ABD[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY], an  entity wholly owned by ABC[CUSTOM_ENTITY] Holdings[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY].

(Sample 3)
This Agreement of Work dated October 22, 2013 between Google[CUSTOM_ENTITY] Services[CUSTOM_ENTITY] Limited[CUSTOM_ENTITY]  and Avaya[CUSTOM_ENTITY] Communications[CUSTOM_ENTITY] Management[CUSTOM_ENTITY], LLC[CUSTOM_ENTITY] and any of its operating subsidiaries and  affiliates which receive Services from Vendor incorporates and is governed by the terms and  conditions contained in the Master Services Agreement Services, by and between Avaya and Vendor.

Here [CUSTOM_ENTITY] is the tag marking the new entity type that the NER model should learn.
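For illustration, here is a minimal sketch of one way to do the conversion. It rests on several assumptions. Tokens are separated by whitespace. A tag is attached directly to the end of a word, as in the samples. Trailing punctuation is split off into its own token. The POS and CHUNK columns are filled with a placeholder "_", because the source text carries no such annotations. The function names (tokenize, to_bio, to_conll) are hypothetical and do not come from the linked repository.

```python
import re

# A whitespace-delimited chunk such as "LLC[CUSTOM_ENTITY],":
# group 1 = word, group 2 = tag name, group 3 = trailing punctuation.
TAGGED = re.compile(r"^(.+?)\[([A-Z_]+)\]([,.;:!?]*)$")
# An untagged chunk such as "2018,": group 1 = word, group 2 = punctuation.
UNTAGGED = re.compile(r"^(.*?)([,.;:!?]*)$")


def tokenize(text):
    """Yield (token, label) pairs; label is None for untagged tokens."""
    for chunk in text.split():
        match = TAGGED.match(chunk)
        if match:
            word, label, rest = match.groups()
            yield word, label
        else:
            word, rest = UNTAGGED.match(chunk).groups()
            if word:
                yield word, None
        # Each trailing punctuation character becomes its own untagged token.
        for char in rest:
            yield char, None


def to_bio(pairs):
    """Turn (token, label) pairs into (token, BIO tag) pairs."""
    previous = None
    for token, label in pairs:
        if label is None:
            yield token, "O"
        else:
            # Continue an entity only if the previous token had the same label.
            prefix = "I" if label == previous else "B"
            yield token, f"{prefix}-{label}"
        previous = label


def to_conll(text, pos="_", chunk="_"):
    """Render one sample as 'Word POS CHUNK NER' lines (placeholder POS/CHUNK)."""
    return "\n".join(
        f"{token} {pos} {chunk} {tag}" for token, tag in to_bio(tokenize(text))
    )


print(to_conll("between ABD[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY], an entity"))
```

For the fragment above this prints "ABD _ _ B-CUSTOM_ENTITY" followed by "LLC _ _ I-CUSTOM_ENTITY", with every other token tagged O. Note that in Sample 1 the comma in "MA[CUSTOM_ENTITY], lnc.[CUSTOM_ENTITY]" sits between the two tagged words, so this sketch starts a new B- entity at "lnc."; whether that is desired depends on how the entities were annotated. When writing the output file, samples (sentences) are conventionally separated by a blank line. If real POS tags are needed, a tagger such as nltk.pos_tag could fill the second column in place of the placeholder.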

...