I am using NLTK's MWETokenizer to tag multi-word expressions. Here is my sample code:
import nltk
import pickle
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import MWETokenizer
# initializing Wordnet Lemmatizer
lmtzr = WordNetLemmatizer()
# values to tag/extract
values = ["net income","net income (loss)" \
,"net income (loss) attributable to 'company'","net income (loss) attributable to bank","net income (loss) attributable to bank and noncontrolling (minority) interests","net income (loss) attributable to bank and noncontrolling interests","net income (loss) attributable to bank and noncontrolling minority interests","net income (loss) attributable to noncontrolling interests","net income after tax","net income associated to minority interests","net income associated to partners","net income attributable to \"company name\"","accumulated distributions in excess of net income","antidilutive securities excluded from computation of net income, per outstanding unit, amount","cash from net income","consolidated net income attributable to foreign offices","consolidated net income in foreign offices","consolidated net income of foreign offices","decrease in net income","diluted net income","diluted net income attributable to common shareholders","diluted net income per share","eliminations of net income to foreign offices","foreign offices consolidated net income","foreign offices net income","foreign offices net income before internal allocation of income and expense","income excluded from net income","increase in net income","less net income attributable to noncontrolling interests", \
"less: net income (loss) attributable income taxes to noncontrolling (minority) interests","net income attributable to bank","net income attributable to bank and minority interests","net income attributable to class a and class b common stockholders","net income attributable to common shareholders (in dollars per share) diluted","net income attributable to company", \
"net income attributable to foreign offices","net income attributable to income taxes","net income attributable to noncontrolling interests","net income attributable to noncontrolling parties","net income attributable to participating securities","net income attributed to bank","net income basic earnings","net income before given to non-controlling interests","net income before minority controlling interests","net income before non-controlling interests","net income diluted to common shareholders","net income from cash flows","net income from operations in cash flow","net income generated from investment","net income generated from joint venture partnerships","net income in foreign offices","net income including noncontrolling interests","net income loss attributable to bank and noncontrolling minority interests", \
"net income of interest", \
"net income of foreign offices before allocations","net income of interest","net income or loss attributable to bank and noncontrolling (minority) interests","net income or loss attributable to bank and noncontrolling interests","net income or loss attributed to bank","net income per share - basic (in usd per share)","net income per share - diluted (in usd per share)","net income per share diluted","net income to all parties","net income to foreign offices before internal allocations of income and expense","net income/loss to 'company name'","net income/loss to noncontrolling interests","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, before tax (deprecated 2012-01-31)", \
"other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, before tax (deprecated 2013-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, net of tax (deprecated 2013-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, tax (deprecated 2012-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, tax (deprecated 2013-01-31)","other than temporary impairment losses, investments, reclassification adjustment of noncredit portion included in net income, availabe-for-sale securities, before tax","other than temporary impairment losses, investments, reclassification adjustment of noncredit portion included in net income, held-to-maturity securities, before tax", \
"revision of net income to foreign offices","net income of foreign offices"
]
# Initializing MWETokenizer with a starter value
tokenizer = MWETokenizer([('total', 'expense')])
# Populating the tokenizer: each value is split into a list of words
for item in values:
    tokenizer.add_mwe(item.split())
# Sample input sentence
sentence = 'what is the net incomes of banks of america for q2 2014'
# Splitting into tokens for the lemmatizer
tokens = sentence.split()
# Changing plural nouns to singular (lemmatizing every token as a noun)
singles = [lmtzr.lemmatize(plural,'n') for plural in tokens]
# Running the tokenizer on the lemmatized tokens to extract/tag the expressions
result = tokenizer.tokenize(singles)
print(result)
# Result:
# Actual: ['what', 'is', 'the', 'net', 'income', 'of', 'bank', 'of', 'america', 'for', 'q2', '2014']
# Expected: ['what', 'is', 'the', 'net_income', 'of', 'bank', 'of', 'america', 'for', 'q2', '2014']
In my list of values, the first entry is "net income", and the entries that follow are longer phrases containing "net income". Contrary to my expectation, the tokenizer for some reason does not recognize the first entry.
Is there a limitation or something else I am not aware of? How can I debug this?
Also, if there is another way to do multi-word tagging, I would very much appreciate hearing about it.
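For reference, one other way to do multi-word tagging can be sketched in plain Python, without NLTK: a dictionary-based longest-match merge that remembers the last complete phrase seen while scanning ahead. The names `build_trie`, `merge_mwes`, and the `_END` marker are hypothetical, the phrase list is a shortened sample of the values above, and this is not claimed to be how MWETokenizer works internally.

```python
_END = object()  # hypothetical marker: a registered phrase ends at this trie node


def build_trie(phrases):
    """Build a nested-dict trie from whitespace-separated phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[_END] = True
    return root


def merge_mwes(tokens, trie, separator="_"):
    """Merge the longest registered phrase that starts at each position."""
    result = []
    i = 0
    while i < len(tokens):
        node = trie
        j = i
        last_match = -1  # end index of the longest complete phrase seen so far
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:
                last_match = j
        if last_match > -1:
            # Fall back to the longest complete phrase, even if a longer
            # registered phrase matched only partially.
            result.append(separator.join(tokens[i:last_match]))
            i = last_match
        else:
            result.append(tokens[i])
            i += 1
    return result


# Shortened sample of the values from the question
phrases = ["net income", "net income of interest", "net income of foreign offices"]
tokens = "what is the net income of bank of america for q2 2014".split()
merged = merge_mwes(tokens, build_trie(phrases))
print(merged)
```

Because the scan records `last_match`, a partial match on a longer phrase such as "net income of interest" falls back to the shorter "net income" instead of being dropped. Whether a missing fallback of this kind explains the behaviour in the question is only an assumption and would need to be checked against the installed NLTK version.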