Huggingface BERT tokenizer does not add the pad token
1 vote
/ April 26, 2020

The documentation is not entirely clear on this, but I can see that BertTokenizer is initialized with pad_token='[PAD]', so I assumed that encoding with add_special_tokens=True would pad the sequence automatically. However, given that pad_token_id=0, I do not see any 0s in token_ids:

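from transformers import BertTokenizer

# `text` holds the article printed under "Original:" in the output below.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text, add_special_tokens=True, max_length=2048)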

# Print the original sentence.
print('Original: ', text)

# Print the sentence split into tokens.
print('\nTokenized: ', tokens)

# Print the sentence mapped to token ids.
print('\nToken IDs: ', token_ids)

Output:

Original:  Toronto's key stock index ended higher in brisk trading on Thursday, extending Wednesday's rally despite being weighed down by losses on Wall Street.
The TSE 300 Composite Index rose 29.80 points to close at 5828.62, outperforming the Dow Jones Industrial Average which slumped 21.27 points to finish at 6658.60.
Toronto added to Wednesday's 55-point rally while investors took profits in New York after the Dow's 92-point gains, said MMS International analyst Katherine Beattie.
"That shows that the markets are very fragile," Beattie said. "They (investors) want to take advantage of any strength to sell," she said.
Toronto was also buoyed by its heavyweight gold group which jumped nearly 2.2 percent, aided by firmer COMEX gold prices. The key June contract rose $1.00 to $344.30.
Ten of Toronto's 14 sub-indices posted gains, led by golds, transportation, forestry products and consumer products.
The weak side included conglomerates, base metals and utilities.
Trading was heavy at 100 million shares worth C$1.54 billion ($1.1 billion).
Advancing stocks outnumbered declines 556 to 395, with 276 issues flat.
Among hot stocks, Bre-X Minerals Ltd. rose 0.13 to 2.30 on 5.0 million shares as investors continued to consider the viability of its Busang gold discovery in Indonesia.
Kenting Energy Services Inc. rose 0.25 to 9.05 after Precision Drilling Corp. amended its takeover offer
Bakery and foodstuffs maker George Weston Ltd. jumped 4.50 to close at 74.50, the TSE's top gainer.


Tokenized:  ['toronto', "'", 's', 'key', 'stock', 'index', 'ended', 'higher', 'in', 'brisk', 'trading', 'on', 'thursday', ',', 'extending', 'wednesday', "'", 's', 'rally', 'despite', 'being', 'weighed', 'down', 'by', 'losses', 'on', 'wall', 'street', '.', 'the', 'ts', '##e', '300', 'composite', 'index', 'rose', '29', '.', '80', 'points', 'to', 'close', 'at', '58', '##28', '.', '62', ',', 'out', '##per', '##form', '##ing', 'the', 'dow', 'jones', 'industrial', 'average', 'which', 'slumped', '21', '.', '27', 'points', 'to', 'finish', 'at', '66', '##58', '.', '60', '.', 'toronto', 'added', 'to', 'wednesday', "'", 's', '55', '-', 'point', 'rally', 'while', 'investors', 'took', 'profits', 'in', 'new', 'york', 'after', 'the', 'dow', "'", 's', '92', '-', 'point', 'gains', ',', 'said', 'mm', '##s', 'international', 'analyst', 'katherine', 'beat', '##tie', '.', '"', 'that', 'shows', 'that', 'the', 'markets', 'are', 'very', 'fragile', ',', '"', 'beat', '##tie', 'said', '.', '"', 'they', '(', 'investors', ')', 'want', 'to', 'take', 'advantage', 'of', 'any', 'strength', 'to', 'sell', ',', '"', 'she', 'said', '.', 'toronto', 'was', 'also', 'bu', '##oy', '##ed', 'by', 'its', 'heavyweight', 'gold', 'group', 'which', 'jumped', 'nearly', '2', '.', '2', 'percent', ',', 'aided', 'by', 'firm', '##er', 'come', '##x', 'gold', 'prices', '.', 'the', 'key', 'june', 'contract', 'rose', '$', '1', '.', '00', 'to', '$', '344', '.', '30', '.', 'ten', 'of', 'toronto', "'", 's', '14', 'sub', '-', 'indices', 'posted', 'gains', ',', 'led', 'by', 'gold', '##s', ',', 'transportation', ',', 'forestry', 'products', 'and', 'consumer', 'products', '.', 'the', 'weak', 'side', 'included', 'conglomerate', '##s', ',', 'base', 'metals', 'and', 'utilities', '.', 'trading', 'was', 'heavy', 'at', '100', 'million', 'shares', 'worth', 'c', '$', '1', '.', '54', 'billion', '(', '$', '1', '.', '1', 'billion', ')', '.', 'advancing', 'stocks', 'outnumbered', 'declines', '55', '##6', 'to', '395', ',', 'with', '276', 
'issues', 'flat', '.', 'among', 'hot', 'stocks', ',', 'br', '##e', '-', 'x', 'minerals', 'ltd', '.', 'rose', '0', '.', '13', 'to', '2', '.', '30', 'on', '5', '.', '0', 'million', 'shares', 'as', 'investors', 'continued', 'to', 'consider', 'the', 'via', '##bility', 'of', 'its', 'bus', '##ang', 'gold', 'discovery', 'in', 'indonesia', '.', 'kent', '##ing', 'energy', 'services', 'inc', '.', 'rose', '0', '.', '25', 'to', '9', '.', '05', 'after', 'precision', 'drilling', 'corp', '.', 'amended', 'its', 'takeover', 'offer', 'bakery', 'and', 'foods', '##tu', '##ffs', 'maker', 'george', 'weston', 'ltd', '.', 'jumped', '4', '.', '50', 'to', 'close', 'at', '74', '.', '50', ',', 'the', 'ts', '##e', "'", 's', 'top', 'gain', '##er', '.']

Token IDs:  [101, 4361, 1005, 1055, 3145, 4518, 5950, 3092, 3020, 1999, 28022, 6202, 2006, 9432, 1010, 8402, 9317, 1005, 1055, 8320, 2750, 2108, 12781, 2091, 2011, 6409, 2006, 2813, 2395, 1012, 1996, 24529, 2063, 3998, 12490, 5950, 3123, 2756, 1012, 3770, 2685, 2000, 2485, 2012, 5388, 22407, 1012, 5786, 1010, 2041, 4842, 14192, 2075, 1996, 23268, 3557, 3919, 2779, 2029, 14319, 2538, 1012, 2676, 2685, 2000, 3926, 2012, 5764, 27814, 1012, 3438, 1012, 4361, 2794, 2000, 9317, 1005, 1055, 4583, 1011, 2391, 8320, 2096, 9387, 2165, 11372, 1999, 2047, 2259, 2044, 1996, 23268, 1005, 1055, 6227, 1011, 2391, 12154, 1010, 2056, 3461, 2015, 2248, 12941, 9477, 3786, 9515, 1012, 1000, 2008, 3065, 2008, 1996, 6089, 2024, 2200, 13072, 1010, 1000, 3786, 9515, 2056, 1012, 1000, 2027, 1006, 9387, 1007, 2215, 2000, 2202, 5056, 1997, 2151, 3997, 2000, 5271, 1010, 1000, 2016, 2056, 1012, 4361, 2001, 2036, 20934, 6977, 2098, 2011, 2049, 8366, 2751, 2177, 2029, 5598, 3053, 1016, 1012, 1016, 3867, 1010, 11553, 2011, 3813, 2121, 2272, 2595, 2751, 7597, 1012, 1996, 3145, 2238, 3206, 3123, 1002, 1015, 1012, 4002, 2000, 1002, 29386, 1012, 2382, 1012, 2702, 1997, 4361, 1005, 1055, 2403, 4942, 1011, 29299, 6866, 12154, 1010, 2419, 2011, 2751, 2015, 1010, 5193, 1010, 13116, 3688, 1998, 7325, 3688, 1012, 1996, 5410, 2217, 2443, 22453, 2015, 1010, 2918, 11970, 1998, 16548, 1012, 6202, 2001, 3082, 2012, 2531, 2454, 6661, 4276, 1039, 1002, 1015, 1012, 5139, 4551, 1006, 1002, 1015, 1012, 1015, 4551, 1007, 1012, 10787, 15768, 21943, 26451, 4583, 2575, 2000, 24673, 1010, 2007, 25113, 3314, 4257, 1012, 2426, 2980, 15768, 1010, 7987, 2063, 1011, 1060, 13246, 5183, 1012, 3123, 1014, 1012, 2410, 2000, 1016, 1012, 2382, 2006, 1019, 1012, 1014, 2454, 6661, 2004, 9387, 2506, 2000, 5136, 1996, 3081, 8553, 1997, 2049, 3902, 5654, 2751, 5456, 1999, 6239, 1012, 5982, 2075, 2943, 2578, 4297, 1012, 3123, 1014, 1012, 2423, 2000, 1023, 1012, 5709, 2044, 11718, 15827, 13058, 1012, 13266, 2049, 15336, 3749, 18112, 1998, 
9440, 8525, 21807, 9338, 2577, 12755, 5183, 1012, 5598, 1018, 1012, 2753, 2000, 2485, 2012, 6356, 1012, 2753, 1010, 1996, 24529, 2063, 1005, 1055, 2327, 5114, 2121, 1012, 102]

1 Answer

2 votes
/ April 27, 2020

No, it will not. A separate parameter, pad_to_max_length, must be set to True for padding tokens to be added (up to max_length). add_special_tokens only adds the [CLS] and [SEP] tokens (ids 101 and 102, respectively).
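To make the distinction concrete, here is a minimal plain-Python sketch of the two steps. It is not the transformers implementation: the helper names add_special_tokens and pad_to_length are hypothetical and exist only for illustration, and the ids 101, 102 and 0 are the ones quoted above for bert-base-uncased. In the library itself the padding step is requested with pad_to_max_length=True as described; as far as I know, later transformers releases express the same thing with padding='max_length'.

```python
# Illustrative only: a plain-Python sketch, not the transformers code.
# Ids quoted in the question and answer for bert-base-uncased.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def add_special_tokens(ids):
    """Hypothetical helper: wrap raw ids as [CLS] ... [SEP]. No padding here."""
    return [CLS_ID] + list(ids) + [SEP_ID]


def pad_to_length(ids, max_length, pad_id=PAD_ID):
    """Hypothetical helper: right-pad ids with pad_id up to max_length."""
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate first")
    return list(ids) + [pad_id] * (max_length - len(ids))


# 'toronto', "'", 's' taken from the start of the output in the question.
raw_ids = [4361, 1005, 1055]

with_special = add_special_tokens(raw_ids)           # special tokens only
padded = pad_to_length(with_special, max_length=8)   # zeros appear only here

print(with_special)
print(padded)
```

The zeros the question was looking for show up only after the second step, which is why the encoded output above starts with 101 and ends with 102 but contains no trailing 0 entries.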

...