To train my own NER model on custom entities, I need my dataset prepared in CoNLL-2003 format, as described at https://github.com/yongyuwen/sequence-tagging-ner.
How do I convert my text documents (.txt files) into that CoNLL-2003 format, i.e. one token per line with the columns [Word POS CHUNK NER]?
Note: the text documents already contain my custom NER tags.
Sample data (training_data.txt):
(Sample 1)
This Agreement of Work is made pursuant to the Global Developer Master Services Agreement effective as of May 24, 2018, as amended on March 28, 2016, between MA[CUSTOM_ENTITY], lnc.[CUSTOM_ENTITY] whose registered office or principal place of business is at 520 Madison Avenue, Ahmedabad, India, whose registered office or principal place of business is at Building A, Atlantis de la, Switzerland, collectively and ABC[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY] a wholly owned subsidiary of Amazon Services Ltd and having its registered office at 113 Red Avenue, 10th Floor, New York, NY 13027.
(Sample 2)
This Agreement of Work is subject to the terms and conditions of the Master Agreement for Technology Consulting Services between Vignesh[CUSTOM_ENTITY] Services[CUSTOM_ENTITY] Limited[CUSTOM_ENTITY] and ABD[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY], an entity wholly owned by ABC[CUSTOM_ENTITY] Holdings[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY].
(Sample 3)
This Agreement of Work dated October 22, 2013 between Google[CUSTOM_ENTITY] Services[CUSTOM_ENTITY] Limited[CUSTOM_ENTITY] and Avaya[CUSTOM_ENTITY] Communications[CUSTOM_ENTITY] Management[CUSTOM_ENTITY], LLC[CUSTOM_ENTITY] and any of its operating subsidiaries and affiliates which receive Services from Vendor incorporates and is governed by the terms and conditions contained in the Master Services Agreement Services, by and between Avaya and Vendor.
Here [CUSTOM_ENTITY] is the tag that marks a token of the new entity type the NER model should learn.
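To make the target layout concrete, here is a minimal sketch of the conversion I have in mind, using only the Python standard library. Everything in it is an assumption on my side rather than something the linked repository prescribes: the helper names `to_conll_rows` and `to_conll_text` are made up, tokens are split on whitespace, consecutive tagged tokens are merged into one entity with B-/I- prefixes, and the POS and CHUNK columns are filled with a placeholder `_` because they would have to come from a separate tagger.

```python
import re

# A token followed by an inline tag such as [CUSTOM_ENTITY], optionally
# followed by punctuation (e.g. "LLC[CUSTOM_ENTITY],").
TAGGED = re.compile(r"^(.+?)\[([A-Z_]+)\](\W*)$")


def to_conll_rows(text, placeholder="_"):
    """Return (word, pos, chunk, ner) tuples for inline-tagged text.

    Assumption: POS and CHUNK are not available, so both columns get
    `placeholder`; a real pipeline would fill them with a tagger.
    """
    rows = []
    prev_label = None  # label of the immediately preceding token, if tagged
    for chunk in text.split():
        match = TAGGED.match(chunk)
        if match:
            word, label, trailing = match.groups()
            # Continue the entity if the previous token had the same label.
            prefix = "I" if prev_label == label else "B"
            rows.append((word, placeholder, placeholder, f"{prefix}-{label}"))
            prev_label = label
            pieces = list(trailing)  # punctuation after the tag, if any
        else:
            # Untagged chunk: separate words from punctuation marks.
            pieces = re.findall(r"\w+|[^\w\s]", chunk)
        for piece in pieces:
            rows.append((piece, placeholder, placeholder, "O"))
            prev_label = None  # an untagged token ends the current entity
    return rows


def to_conll_text(text):
    """Join the rows into space-separated lines, one token per line."""
    return "\n".join(" ".join(row) for row in to_conll_rows(text))


if __name__ == "__main__":
    sample = "between ABD[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY], an entity"
    print(to_conll_text(sample))
```

With this sketch, `ABD` would get `B-CUSTOM_ENTITY` and `LLC` would get `I-CUSTOM_ENTITY`, while the comma and all untagged words get `O`. Whether the BIO prefixes are wanted, and how sentences should be separated by blank lines, depends on what the training code expects.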