I am using SpaCy to extract named entities. However, it always mislabels newline characters as named entities.
Below is the input text.
mytxt = """<?xml version="1.0"?>
<nitf>
<head>
<title>KNOW YOUR ROLE ON SUPER BOWL LIII.</title>
</head>
<body>
<body.head>
<hedline>
<hl1>KNOW YOUR ROLE ON SUPER BOWL LIII.</hl1>
</hedline>
<distributor>Gale Group</distributor>
</body.head>
<body.content>
<p>Montpelier: <org>Department of Motor Vehicles</org>, has issued the following
news release:</p>
<p>Be a designated sober driver, help save lives. Remember these tips
on game night:</p>
<p>Know your State's laws: refusing to take a breath test in many
jurisdictions could result in arrest, loss of your driver's
license, and impoundment of your vehicle. Not to mention the
embarrassment in explaining your situation to family, friends, and
employers.</p>
<p>In case of any query regarding this article or other content needs
please contact: <a href="mailto:editorial@plusmediasolutions.com">editorial@plusmediasolutions.com</a></p>
</body.content>
</body>
</nitf>
"""
Below is my code:
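from collections import Counter

import spacy
from bs4 import BeautifulSoup

# Tags whose text is treated as article content.
CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')
soup = BeautifulSoup(mytxt, 'xml')
spacy_model = spacy.load('en_core_web_sm')
# Join the text of each content tag with newlines.
content = "\n".join([p.get_text() for p in soup.find('body.content').findAll(CONTENT_XML_TAG)])
print(content)
section_spacy = spacy_model(content)
# Collect the sentence spans, then count entity labels per sentence.
tokenized_sentences = []
for sent in section_spacy.sents:
    tokenized_sentences.append(sent)
for s in tokenized_sentences:
    labels = [(ent.text, ent.label_) for ent in s.ents]
    print(Counter(labels))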
Output:
Counter({('\n', 'GPE'): 2, ('Department of Motor Vehicles', 'ORG'): 1})
Counter({('\n', 'GPE'): 1})
Counter({('\n', 'GPE'): 2, ('State', 'ORG'): 1})
Counter({('\n', 'GPE'): 3})
Counter({('\n', 'GPE'): 1})
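One possible stopgap, sketched under the assumption that the entities have already been collected as (text, label) tuples as in the code above, is to drop whitespace-only entries before counting. The name `drop_whitespace_entities` is a hypothetical helper, not part of SpaCy, and this only hides the symptom rather than explaining why the newlines get labeled.

```python
from collections import Counter


def drop_whitespace_entities(labels):
    """Return only the (text, label) pairs whose text is not pure whitespace.

    Hypothetical helper for illustration; not a SpaCy API.
    """
    return [(text, label) for text, label in labels if text.strip()]


# Sample pairs copied from the first line of the output above.
labels = [
    ("\n", "GPE"),
    ("Department of Motor Vehicles", "ORG"),
    ("\n", "GPE"),
]
print(Counter(drop_whitespace_entities(labels)))
```

Filtering after the fact leaves the SpaCy pipeline itself untouched, so it does not answer whether the model or the input preparation is at fault.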
I can't believe SpaCy would misclassify like this. Am I missing something?