Adding a callback is probably the cleanest method. Implementing it in other ways may break if the library version is updated and the behavior changes.
```
from transformers import TrainerCallback

class MyLogger(TrainerCallback):
    def __init__(self):
        ...

    def on_log(self, args, state, control, logs=None, **kwargs):
        ...

trainer = CustomTrainer(
    ...,
    callbacks=[MyLogger()],
)
```
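For example, the `on_log` body could be as simple as printing whatever the Trainer logs on the main process (just a minimal sketch, nothing library-specific beyond the callback API):

```
from transformers import TrainerCallback

class MyLogger(TrainerCallback):
    # Called every time the Trainer logs something (loss, learning rate, eval metrics, ...).
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and state.is_world_process_zero:
            print(f"step {state.global_step}: {logs}")
```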
If you don’t use gradient accumulation, then I usually just hack it by overriding `Trainer.compute_loss` and tucking in one line of `self.log(compute_my_metric(output))`.
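Roughly like this (a sketch only: the exact `compute_loss` signature varies across transformers versions, and `compute_my_metric` here is just a placeholder that assumes the model returns a `loss` in its output):

```
from transformers import Trainer

def compute_my_metric(outputs):
    # Placeholder: return a dict of floats, which is what Trainer.log expects.
    return {"my_train_metric": outputs.loss.detach().item()}

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        loss = outputs.loss
        # The one extra line: log a custom metric computed from the forward output.
        self.log(compute_my_metric(outputs))
        return (loss, outputs) if return_outputs else loss
```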
If you use gradient accumulation, one alternative is to trigger a `CustomCallback`, as in Metrics for Training Set in Trainer - #7 by Kaveri. For example, you can do one forward pass over the entire train set in `on_epoch_end` or `on_evaluate` (see the sketch below). It repeats work, so it is slow and coarse.
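Something along these lines (a rough sketch: it assumes the callback handler passes `model` and `train_dataloader` as keyword arguments, that the batches can be fed straight to the model, and the "metric" is reduced to the average loss here):

```
import torch
from transformers import TrainerCallback

class TrainSetMetrics(TrainerCallback):
    # Sketch: one extra pass over the whole train set after each evaluation (slow!).
    def on_evaluate(self, args, state, control, model=None, train_dataloader=None, **kwargs):
        if model is None or train_dataloader is None:
            return
        model.eval()
        total_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for batch in train_dataloader:
                batch = {k: v.to(model.device) for k, v in batch.items() if hasattr(v, "to")}
                outputs = model(**batch)
                total_loss += outputs.loss.item()
                n_batches += 1
        model.train()
        print({"train_set_loss": total_loss / max(n_batches, 1)})
```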
And let me know if you figured out an easy way to log custom …
## Environment info
- `transformers` version: 4.4.2
- Platform: Darwin-20.3.0-x86_64-i386-64bit
- Python version: 3.7.4
- PyTorch version (GPU?): 1.3.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
### Who can help
@sgugger
## Information
Model I am using (Bert, XLNet ...): Bert
The problem arises when using:
* [x] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)
The task I am working on is:
* [x] an official GLUE/SQuAD task: NER
* [ ] my own task or dataset: (give details below)
## To reproduce
The bug concerns PR #8016.
Steps to reproduce the behavior:
1. MLflow is installed and the following env variables are exported:
```
export HF_MLFLOW_LOG_ARTIFACTS=TRUE
export MLFLOW_S3_ENDPOINT_URL=<custom endpont>
export MLFLOW_TRACKING_URI=<custom uri>
export MLFLOW_TRACKING_TOKEN=<custom token>
```
2. Run the token classification example with the following command
```
python run_ner.py \
--model_name_or_path bert-base-uncased \
--dataset_name conll2003 \
--output_dir /tmp/test-ner \
--do_train \
--do_eval
```
## Expected behavior
When training finishes, before the evaluation is performed, `integrations.MLflowCallback` executes its `on_train_end` method, which logs the model artifacts to MLflow if the env variable `HF_MLFLOW_LOG_ARTIFACTS` is set to `TRUE`.
The problem is that when `on_train_end` is called and the line `self._ml_flow.log_artifacts(args.output_dir)` is executed, the model is not yet stored in `args.output_dir`. The model artifacts are only stored once `trainer.save_model()` is called, which happens after training ends. There is no callback hook in `trainer.save_model()` that a `TrainerCallback` could use to log the model. There is a `TrainerCallback.on_save()` method, which is called from `trainer._maybe_log_save_evaluate()`, but even then the model is not available in the `output_dir`.
A possible solution would be to extend `TrainerCallback` with an `on_model_save()` callback method and invoke it from `trainer.save_model()`.
Alternatively, the workaround I have now is to replace `on_train_end` with `on_evaluate` in `integrations.MLflowCallback`, which is called after the model is saved in the example script. However, this is not the right solution, since it depends on the `do_eval` parameter being set and is not semantically correct.
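For what it's worth, the workaround described there would look roughly like this as a user-side callback (a sketch only; `_initialized`, `_log_artifacts`, and `_ml_flow` are internal attributes of `MLflowCallback` and may differ between transformers versions):

```
from transformers.integrations import MLflowCallback

class MLflowLogOnEvaluate(MLflowCallback):
    # Sketch of the workaround: log artifacts after evaluation, i.e. after the
    # example script has saved the model to output_dir, instead of on_train_end.
    def on_evaluate(self, args, state, control, **kwargs):
        if self._initialized and self._log_artifacts and state.is_world_process_zero:
            self._ml_flow.log_artifacts(args.output_dir)
```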