Fastbert: BertDataBunch error for multi-label text classification - PullRequest
2 votes
/ April 15, 2020

I am following the FastBert tutorial from huggingface: https://medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models-89ff763ad384

The problem is that this code is not quite reproducible. The main issue I am running into is preparing the dataset. The tutorial uses this dataset: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

But if I set up the folder structure as described in the tutorial and put the dataset files into the folders, I get errors about the dataset.

databunch = BertDataBunch(args['data_dir'], LABEL_PATH, args.model_name, train_file='train.csv', val_file='val.csv',
                          test_data='test.csv',
                          text_col="comment_text", label_col=label_cols,
                          batch_size_per_gpu=args['train_batch_size'], max_seq_length=args['max_seq_length'], 
                          multi_gpu=args.multi_gpu, multi_label=True, model_type=args.model_type)

It complains about an incorrect file format. How should I format the dataset and its labels for use with fastbert?

1 Answer

1 vote
/ April 15, 2020
  1. First of all, you can use the FastBert notebook from GitHub.

https://github.com/kaushaltrivedi/fast-bert/blob/master/sample_notebooks/new-toxic-multilabel.ipynb

The FastBert README has a short guide on how to prepare the dataset before use.

Create a DataBunch object



The databunch object takes training, validation and test csv files and converts the data into internal representation for BERT, RoBERTa, DistilBERT or XLNet. The object also instantiates the correct data-loaders based on device profile and batch_size and max_sequence_length.

from fast_bert.data_cls import BertDataBunch

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')

File format for train.csv and val.csv
index   text    label
0   Looking through the other comments, I'm amazed that there aren't any warnings to potential viewers of what they have to look forward to when renting this garbage. First off, I rented this thing with the understanding that it was a competently rendered Indiana Jones knock-off.    neg
1   I've watched the first 17 episodes and this series is simply amazing! I haven't been this interested in an anime series since Neon Genesis Evangelion. This series is actually based off an h-game, which I'm not sure if it's been done before or not, I haven't played the game, but from what I've heard it follows it very well     pos
2   his movie is nothing short of a dark, gritty masterpiece. I may be bias, as the Apartheid era is an area I've always felt for.  pos

In case the column names are different than the usual text and labels, you will have to provide those names in the databunch text_col and label_col parameters.

labels.csv will contain a list of all unique labels. In this case the file will contain:

pos
neg
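
As a rough illustration of the single-label case, labels.csv could be generated from the training file instead of being typed by hand. This is only a sketch using the standard library: the helper name `write_unique_labels` is hypothetical (it is not part of fast-bert), and it assumes a comma-separated train.csv with a header row and a `label` column.

```python
import csv
from pathlib import Path


def write_unique_labels(train_csv, labels_csv, label_col="label"):
    """Collect the distinct values of label_col and write them one per line.

    Hypothetical helper, not part of fast-bert.
    """
    labels = []
    with open(train_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            value = row[label_col]
            if value not in labels:  # keep first-seen order, skip repeats
                labels.append(value)

    # Create the target folder if needed, then write one label per line.
    out_path = Path(labels_csv)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(labels) + "\n", encoding="utf-8")
    return labels
```

Calling `write_unique_labels("data/train.csv", "labels/labels.csv")` on the sample rows above should produce a file with the two lines `neg` and `pos`; the paths are placeholders for your own folders.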

For multi-label classification, labels.csv will contain all possible labels:

toxic
severe_toxic
obscene
threat
insult
identity_hate

The file train.csv will then contain one column for each label, with each column value being either 0 or 1. Don't forget to change multi_label=True for multi-label classification in BertDataBunch.
id  text    toxic   severe_toxic    obscene     threat  insult  identity_hate
0   Why the edits made under my username Hardcore Metallica Fan were reverted?  0   0   0   0   0   0
0   I will mess you up  1   0   0   1   0   0

label_col will be a list of label column names. In this case it will be:

['toxic','severe_toxic','obscene','threat','insult','identity_hate']
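
Since the error in the question is about file format, it may help to check a CSV against the multi-label layout described above before building the databunch. The following is a minimal sketch, not fast-bert functionality: `check_multilabel_csv` is a hypothetical helper that assumes a comma-separated file with a header row, and it only verifies that the text and label columns exist and that every label value is 0 or 1.

```python
import csv


def check_multilabel_csv(csv_path, text_col, label_cols):
    """Return a list of problems found in a multi-label CSV (empty if none).

    Hypothetical helper, not part of fast-bert.
    """
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []

        # Every expected column must be present in the header row.
        missing = [c for c in [text_col, *label_cols] if c not in header]
        if missing:
            return [f"missing columns: {missing}"]

        # Records are counted from 1, not including the header.
        for record_number, row in enumerate(reader, start=1):
            for col in label_cols:
                if row[col] not in ("0", "1"):
                    problems.append(
                        f"record {record_number}: {col}={row[col]!r} is not 0 or 1"
                    )
    return problems
```

For the Kaggle files the text column is `comment_text` (as in the question's code), so a call might look like `check_multilabel_csv("data/train.csv", "comment_text", ['toxic','severe_toxic','obscene','threat','insult','identity_hate'])`. An empty list means none of these particular checks failed; it does not guarantee that BertDataBunch will accept the file.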

So, just save train.csv, val.csv (simply make a copy of train.csv) and test.csv inside data/

In the labels folder, save a labels.csv file with the following contents.

toxic
severe_toxic
obscene
threat
insult
identity_hate
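
The folder setup can also be scripted. The sketch below is one possible way to do it, under stated assumptions: `prepare_folders` is a hypothetical helper, the folder names `data` and `labels` stand in for whatever DATA_PATH and LABEL_PATH point to, and the label list is the six-column `label_col` list from the README excerpt. Note that it holds out a random 10% of the rows as val.csv rather than copying train.csv as suggested above; that split is a swapped-in alternative, not the answer's method.

```python
import csv
import random
from pathlib import Path

# The six label columns of the Jigsaw training file (same list as label_col).
LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def prepare_folders(source_csv, data_dir="data", label_dir="labels",
                    val_fraction=0.1, seed=42):
    """Split source_csv into train.csv / val.csv and write labels.csv.

    Hypothetical helper, not part of fast-bert.
    """
    data_dir, label_dir = Path(data_dir), Path(label_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    label_dir.mkdir(parents=True, exist_ok=True)

    # Read everything first, so source_csv may itself live inside data_dir.
    with open(source_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows = list(reader)

    # Shuffle reproducibly, then hold out a fraction for validation.
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    splits = {"val.csv": rows[:n_val], "train.csv": rows[n_val:]}

    for name, subset in splits.items():
        with open(data_dir / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=header)
            writer.writeheader()
            writer.writerows(subset)

    # One label per line, no header row.
    (label_dir / "labels.csv").write_text(
        "\n".join(LABEL_COLS) + "\n", encoding="utf-8"
    )
    return len(splits["train.csv"]), len(splits["val.csv"])
```

A call such as `prepare_folders("jigsaw/train.csv")` (the source path is a placeholder) would leave data/train.csv, data/val.csv and labels/labels.csv in place; test.csv still has to be copied into data/ separately.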
