Releases: machamp-nlp/machamp
v 0.4.1
This release has some new functionality:
- Diverse batching: allow multiple datasets in a single batch: https://github.com/machamp-nlp/machamp/blob/master/docs/diverse.md
- Accordingly, MachampSampler has been redone
- Fixed --raw_text
- Reset the weights of the language model
- Set the threshold for multi* tasks in the predict.py function
- More detailed metrics reporting: https://github.com/machamp-nlp/machamp/blob/master/docs/metrics.md
- Automatically skips huge inputs during training (they made memory use explode); the maximum length is now set to batch_size*max_words_in_batch
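As an illustration of the length limit in the last item, here is a minimal sketch. The function name `should_skip`, counting length in words, and the strict `>` comparison are assumptions made for this example and are not taken from the MaChAmp code.

```python
def should_skip(num_words: int, batch_size: int, max_words_in_batch: int) -> bool:
    """Return True when an instance exceeds the assumed length limit."""
    # Limit as described in the release note: batch_size * max_words_in_batch.
    limit = batch_size * max_words_in_batch
    # Assumption: instances strictly longer than the limit are skipped.
    return num_words > limit


# Hypothetical instance lengths (in words).
lengths = [12, 40_000, 87]
kept = [n for n in lengths if not should_skip(n, batch_size=32, max_words_in_batch=1024)]
```

With these made-up values the limit is 32 * 1024 words, so only the second instance is dropped.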
Note that this release is not fully backwards compatible, since the parameters configuration now expects reset_transformer_model and batching/diverse.
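To check whether an older parameters configuration is affected, something like the sketch below could be used. It assumes that the configuration is loaded as a nested dictionary and that batching/diverse denotes a `diverse` key inside a `batching` block; the helper `missing_keys` is hypothetical and not part of MaChAmp.

```python
# Keys named in the compatibility note, written as paths into a nested dict.
# Assumption: "batching/diverse" means params["batching"]["diverse"].
REQUIRED = [("reset_transformer_model",), ("batching", "diverse")]


def missing_keys(params: dict) -> list:
    """Return the required key paths that are absent from params."""
    missing = []
    for path in REQUIRED:
        node = params
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append("/".join(path))
                break
            node = node[key]
    return missing


# Hypothetical pre-0.4.1 configuration without the new keys.
old_params = {"batching": {"batch_size": 32}}
print(missing_keys(old_params))
```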
v 0.4.2
- For MLM: the data is divided over the number of epochs. Every instance of the train data is now seen just once, and the number of instances per epoch = total_instances/num_epochs.
- Added pearson metric
- Fixed sentence counts for sentence level datasets
- Log the stderr
- Don't report 0.0 scores for out-of-dataset metrics
- Multi-seq fixed for CPU use
- Progress bar is now correct
- Updated some documentation
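The MLM change in the first item can be pictured with the sketch below, which reads the item as giving each epoch total_instances/num_epochs instances. The function name `split_over_epochs`, the use of integer division, and dropping the remainder are assumptions for this example only.

```python
def split_over_epochs(instances: list, num_epochs: int) -> list:
    """Split instances into num_epochs disjoint, equally sized chunks."""
    # Assumption: integer division; leftover instances are left out here.
    per_epoch = len(instances) // num_epochs
    return [
        instances[i * per_epoch:(i + 1) * per_epoch]
        for i in range(num_epochs)
    ]


# Ten hypothetical instances over three epochs: each instance appears at most once.
chunks = split_over_epochs(list(range(10)), num_epochs=3)
```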
Known issues:
- for generative models it should probably use the last subword for classification instead of the first
- freezing the language model seems to be broken
V 0.4 beta 2
- Much lower memory by having a maximum number of tokens per batch
- support lower torch versions (and updated requirements.txt accordingly)
- fixed output predictions
- added multiseq and multiclas task type (and multi-accuracy)
- log the losses of each task
- Also supports language models that have no start/end token. Tested with: ['facebook/nllb-200-distilled-600M', 'google/mt5-base', 't5-base', 'google/byt5-base', 'Helsinki-NLP/opus-mt-mul-en', 'google/canine-s', 'google/canine-c', 'facebook/xglm-564M', 'facebook/mgenre-wiki', 'setu4993/LaBSE', 'bigscience/bloom-560m', 'facebook/mbart-large-50', 'microsoft/mdeberta-v3-base', 'studio-ousia/mluke-large', 'google/rembert', 'cardiffnlp/twitter-xlm-roberta-base', 'xlm-roberta-large', 'bert-base-multilingual-cased', 'xlm-roberta-base', 'distilbert-base-multilingual-cased', 'microsoft/infoxlm-large', 'bert-base-multilingual-uncased', 'Peltarion/xlm-roberta-longformer-base-4096', 'studio-ousia/mluke-base', 'xlm-mlm-100-1280']
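The idea behind a maximum number of tokens per batch (first item of this release) can be sketched as follows. This is a generic greedy grouping under assumed names (`batch_by_token_budget`, `max_tokens`) and counts unpadded tokens; it is not the actual MaChAmp batching code.

```python
def batch_by_token_budget(lengths: list, max_tokens: int) -> list:
    """Greedily group instance indices so each batch stays within max_tokens."""
    batches, current, current_tokens = [], [], 0
    for idx, n_tokens in enumerate(lengths):
        # Start a new batch when adding this instance would exceed the budget.
        if current and current_tokens + n_tokens > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical token counts for five instances, with a budget of 10 tokens.
batches = batch_by_token_budget([5, 7, 3, 9, 2], max_tokens=10)
```

Because the number of instances per batch shrinks when the instances are long, peak memory is bounded by the token budget rather than by the longest sentence times a fixed batch size.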
v 0.4 beta
Note that this is a major update; almost all code has been rewritten. However, due to the number of changes, we have been unable to add all functionality from the previous versions. The main things that are now missing are:
- multiseq task type
- seq2seq task type
- pearson correlation metric
- specify layers
- predict-more.py to load the model once and predict on multiple files
- --resume to resume training if it is interrupted
- --raw to run a model on raw text
- label balancing
Some new functionality:
- Much easier debugging and adding of functionality
- Regression task type
- Better topn output support
- No need to install AllenNLP
- Print graphs of scores after each epoch
- Renamed validation_data_set to dev_data_set (the only difference in usage)
- Counts the number of UNKS and prints dataset statistics
- Can now also use autoregressive language models (at least the ones with a special token in position 0)
- Automatically detects size of language model
- Print machamp ASCII art only once
- Fixed bug with macro-f1, which used a score of 0 for the padding label before.
- Almost all code is now documented
It should be noted that this version is less thoroughly tested than our previous versions, which were mostly incremental to each other and used in countless experiments.
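The effect of the macro-f1 fix listed above can be illustrated with a small sketch. The label names, the `<PAD>` placeholder, and the function `macro_f1` are made up for this example; it only shows why averaging in a padding label with a score of 0 lowers the result.

```python
def macro_f1(gold: list, pred: list, labels: list) -> float:
    """Unweighted mean of per-label F1 over the given label set."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A label that never occurs (such as padding) gets a score of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["A", "B", "A", "B"]
pred = ["A", "B", "B", "B"]
with_padding = macro_f1(gold, pred, ["A", "B", "<PAD>"])  # padding label averaged in
without_padding = macro_f1(gold, pred, ["A", "B"])        # padding label excluded
```

Here `with_padding` is lower than `without_padding`, since the same two per-label scores are divided by three instead of two.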
v0.4
This is a major update, there is no backwards compatibility, and performance is known to be different. This version is written from scratch and reduces the dependence on other python packages. New features include:
- multiclas task type
- regression task type
- Log the losses of each task
- Support a larger variety of language models (autoregressive models, models without special tokens)
- Layer attention per task (and logging of its weights)
- Plot the scores each epoch
- Report dataset statistics
- Better topn output support
- Automatically detects size of language model
- Code easier to debug
The main difference in normal usage is that validation_data_set is renamed to dev_data_set
Missing features (Some of these might be included in updates):
- seq2seq task type
- pearson correlation metric
- dataset embeddings
- --raw
- label balancing
Please note that this version is less thoroughly tested than the previous version, which had already been used for thousands of experiments. Please let us know if you find any bugs.
V 0.3 beta 2
Fixed filelock version
Switched learning rate back (results in better performance for most datasets)
V 0.3 (Beta)
New features:
- Updated to AllenNLP 2.8.0 (can now use RemBERT)
- Added option to skip the first line of a dataset (skip_first_line)
- Added probdistr tasktype
- Added regression tasktype
- Fixed bug so that all training data is used (previously one sample was lost for every batch)
- Added functionality to balance labels
- Fixed --raw_text
- Can now predict on data without annotation
- Switched to | for splitting labels in multiseq, and
- Support accuracy metric for multiseq
- Redid tuning on xtreme (with mBERT and RemBERT!), details will be published later
- Completely reimplemented dataset readers, should be easier to maintain in the future
- Removed option to lowercase data, as it is done automatically
- Added encoder and decoder embeddings
- Removed hack when some, but not all sentences in a batch are > max_len, as it is resolved in the underlying libraries
- Use segment IDs like 000011110000 for a three-sentence input (these were all 0s before)
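The segment ID pattern in the last item can be reproduced with the sketch below. It assumes that the IDs simply alternate between 0 and 1 per sentence, which is inferred from the 000011110000 example; the function name `segment_ids` is hypothetical.

```python
def segment_ids(sentence_lengths: list) -> list:
    """Assign alternating 0/1 segment IDs to consecutive sentences."""
    ids = []
    for i, length in enumerate(sentence_lengths):
        # Assumption: even-numbered sentences get 0, odd-numbered ones get 1.
        ids.extend([i % 2] * length)
    return ids


# Three sentences of four subwords each.
pattern = "".join(str(i) for i in segment_ids([4, 4, 4]))
```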
the following issues are known:
V 0.2
Version described in arXiv v3: https://arxiv.org/abs/2005.14672v3
Many new options:
- mlm as tasktype
- seq2seq as tasktype (NMT)
- Rewritten README
- Enable use of multiple dataset_configs
- Use --sequential for sequential training (can also be done afterwards with --finetuning)
- Dataset smoothing
- Updated to AllenNLP 1.3, allowing for all huggingface embeddings to be used
- Dataset embeddings can be used
- Last but not least: ASCII art after loading the model successfully
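For the dataset smoothing item, a commonly used formulation is multinomial smoothing, where dataset sizes are raised to a power alpha before normalising. The sketch below shows that general technique; these notes do not state the exact formula used, so the function `smoothed_probs` and the parameter `alpha` are assumptions.

```python
def smoothed_probs(sizes: list, alpha: float) -> list:
    """Sampling probability per dataset, proportional to size ** alpha."""
    weights = [n ** alpha for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical datasets with 100 and 900 instances.
proportional = smoothed_probs([100, 900], alpha=1.0)  # follows the dataset sizes
smoothed = smoothed_probs([100, 900], alpha=0.5)      # smaller dataset is upsampled
uniform = smoothed_probs([100, 900], alpha=0.0)       # sizes are ignored
```

Lower values of alpha move the sampling distribution toward uniform, so small datasets are seen relatively more often.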
V 0.1
This version corresponds to the first arXiv paper: https://arxiv.org/abs/2005.14672v1.
It contains support for 4 task-types: sequence labeling, string2string, dependency parsing, and text classification.
This release is based on AllenNLP 0.9