I wrote a Python script that uses the Stanford NER tagger. To save space, I will show only part of the code.
First, read the file:
lines = []
with open('dialogue.txt', encoding='utf-8-sig') as outfile:
    for line in outfile:
        line = line.strip()
        lines.append(line)
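As a small self-contained illustration of the reading step, the sketch below writes a throwaway file that starts with a UTF-8 byte order mark and reads it back. The helper name `read_lines` and the sample text are made up for the example. It only shows that `utf-8-sig` drops the BOM, while plain `utf-8` leaves it at the start of the first line.

```python
import os
import tempfile

def read_lines(path, encoding='utf-8-sig'):
    """Return the stripped lines of a text file (hypothetical helper)."""
    with open(path, encoding=encoding) as infile:
        return [line.strip() for line in infile]

# Create a temporary file that begins with a UTF-8 BOM (sample data, not the real dialogue).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfHello Jack\nWelcome Bob\n')
    tmp_path = tmp.name

with_sig = read_lines(tmp_path)               # BOM removed by the codec
without_sig = read_lines(tmp_path, 'utf-8')   # BOM kept as '\ufeff'
os.remove(tmp_path)

print(with_sig)
print(repr(without_sig[0]))
```

Note that `str.strip()` does not remove `'\ufeff'`, so without `utf-8-sig` the first token of the file would carry an invisible extra character into the tagger.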
Next, expand contractions and remove special characters:
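import re

def remove_special_characters(text, remove_digits=False):
    # Keep letters, (optionally) digits and whitespace; note A-Z, since A-z would also keep [ \ ] ^ _ `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text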
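A detail worth checking in character classes of this kind: the range `A-z` spans every ASCII code from `A` (65) to `z` (122), which includes `[`, `\`, `]`, `^`, `_` and the backtick, so those characters are not removed. The sketch below compares the two ranges on a made-up sample string; the function names exist only for this illustration.

```python
import re

def strip_with_wide_range(text):
    # 'A-z' covers ASCII 65..122, so [ \ ] ^ _ ` count as allowed characters.
    return re.sub(r'[^a-zA-z0-9\s]', '', text)

def strip_with_letter_range(text):
    # 'A-Z' covers only the uppercase letters.
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = "['Hello, Jack!', 'Welcome_Bob']"
wide = strip_with_wide_range(sample)
strict = strip_with_letter_range(sample)
print(wide)    # brackets and underscore survive
print(strict)  # only letters, digits and whitespace remain
```

A range written as `A-z` would be consistent with the square brackets that survive at both ends of the output.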
My output so far:
'[If we are all here let us get started First of all I would like you to please join me in welcoming Jack Peterson our Southwest Area Sales Vice President Thank you for having me I am looking forward to todays meeting I would also like to introduce Margaret Simmons who recently joined our team May I also introduce my assistant Bob Hamp Welcome Bob I am afraid our national sales director Anne Trusting cannot be with us today She is in Kobe at the moment developing our Far East sales force Let us get started We are here today to discuss ways of improving sales in rural market areas First let us go over the report from the last meeting which was held on June 24th Right Tom over to you Thank you Mark Let me just summarize the main points of the last meeting We began the meeting by approving the changes in our sales reporting system discussed on May 30th After briefly revising the changes that will take place we moved on to a brainstorming session concerning after customer support improvements You will find a copy of the main ideas developed and discussed in these sessions in the photocopies posted in front of you The meeting was declared closed at 1130 Petors is not coming todayprivate reasons Thank you Tom So if there is nothing else we need to discuss let us move on to todays agenda Have you all received a copy of todays agenda If you do not mind I would like to skip item 1 and move on to item 2 Sales improvement in rural market areas Jack has kindly agreed to give us a report on this matter Jack ]'
Now we use
nltk.tag.stanford.StanfordNERTagger
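from itertools import groupby

# st: a StanfordNERTagger instance, doc: the cleaned text (setup not shown)
r = st.tag(doc.split())
for tag, chunk in groupby(r, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))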
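To see what the grouping step does without running the tagger, here is a minimal sketch that applies the same `groupby` idea to hand-written `(token, tag)` pairs. The sample pairs and the helper name `merge_entities` are assumptions for illustration, not real tagger output.

```python
from itertools import groupby

# Hand-written (token, tag) pairs standing in for the tagger's output.
tagged = [
    ("welcoming", "O"), ("Jack", "PERSON"), ("Peterson", "PERSON"),
    ("our", "O"), ("Far", "LOCATION"), ("East", "LOCATION"),
    ("sales", "O"), ("Kobe", "LOCATION"),
]

def merge_entities(pairs):
    """Join consecutive tokens that share a non-"O" tag into one entity."""
    entities = []
    for tag, chunk in groupby(pairs, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((tag, " ".join(word for word, _ in chunk)))
    return entities

entities = merge_entities(tagged)
for tag, text in entities:
    print("%-12s" % tag, text)
```

Because `groupby` only looks at runs of equal tags, two different entities of the same type that happen to be adjacent would be merged into a single chunk.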
Final output:
PERSON Jack
LOCATION Southwest
PERSON Margaret Simmons
LOCATION Kobe
LOCATION Far East
PERSON Jack
Not very good: names such as Anne, Tom, or Petros were not recognized. OK, Petros is a Greek name, but I cannot find an explanation for the first two.