wikimedia/wikipedia
How to use AntonXue/BERT-DLM with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="AntonXue/BERT-DLM")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("AntonXue/BERT-DLM")
model = AutoModelForMaskedLM.from_pretrained("AntonXue/BERT-DLM")

BERT-base (110M params) trained from scratch with a modern diffusion language model (DLM) objective: absorbing-state diffusion with a uniform noise schedule.
This model is part of a paired experiment comparing classic BERT MLM training against modern DLM training. See AntonXue/BERT-MLM for the counterpart.
Absorbing-state diffusion with uniform schedule: sample t ~ U(0,1), mask each token independently with probability t (replacing with [MASK]), then predict original tokens at masked positions. Cross-entropy loss on masked positions with uniform time weighting (time_weight = 1).
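The corruption and loss described above can be sketched in plain Python. This is a minimal illustration, not the training code: the helper names `corrupt` and `dlm_loss` are hypothetical, `MASK_ID = 103` assumes the standard BERT uncased vocabulary, and averaging over masked positions is an assumed normalization that the description does not specify.

```python
import math
import random

MASK_ID = 103  # assumption: [MASK] id in the standard BERT uncased vocab


def corrupt(token_ids, t, rng, mask_id=MASK_ID):
    """Mask each token independently with probability t (absorbing state)."""
    is_masked = [rng.random() < t for _ in token_ids]
    noisy = [mask_id if m else tok for tok, m in zip(token_ids, is_masked)]
    return noisy, is_masked


def dlm_loss(log_probs, token_ids, is_masked, time_weight=1.0):
    """Cross-entropy on masked positions only, scaled by time_weight.

    log_probs[i][v] is the model's log-probability of vocab id v at position i.
    The mean over masked positions is an assumed normalization.
    """
    n_masked = sum(is_masked)
    if n_masked == 0:
        return 0.0
    nll = 0.0
    for i, (tok, masked) in enumerate(zip(token_ids, is_masked)):
        if masked:
            nll -= log_probs[i][tok]
    return time_weight * nll / n_masked


# Toy example: 5 tokens from a hypothetical 8-word vocabulary.
rng = random.Random(0)
tokens = [3, 1, 4, 1, 5]
t = rng.random()  # t ~ U(0, 1)
noisy, is_masked = corrupt(tokens, t, rng)

# Stand-in for model output: a uniform distribution at every position.
vocab_size = 8
uniform = [[math.log(1.0 / vocab_size)] * vocab_size for _ in tokens]
loss = dlm_loss(uniform, tokens, is_masked)  # time_weight = 1
```

In real training the same steps would run on batched tensors, with `log_probs` coming from the model's forward pass on `noisy`.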
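Key differences from classic BERT MLM: the masking rate t is sampled from U(0,1) instead of being fixed at 15%, and every selected token is replaced with [MASK], whereas classic BERT replaces selected tokens with [MASK], a random token, or the original token in an 80/10/10 split.

Training configuration: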
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
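As a sanity check on the table, the token count and learning-rate schedule can be reproduced with a few lines of arithmetic. This sketch assumes every sequence is filled to the full 512 tokens and that warmup is linear from 0; the table only says "Constant with warmup", so the warmup shape and the helper name `lr_at` are assumptions.

```python
SEQ_LEN = 512
GLOBAL_BATCH = 128 * 2  # per-GPU batch size x number of GPUs
STEPS = 100_000
BASE_LR = 1e-4
WARMUP_STEPS = 500

# Tokens seen, assuming every sequence is filled to SEQ_LEN tokens.
tokens_seen = STEPS * GLOBAL_BATCH * SEQ_LEN
print(f"{tokens_seen / 1e9:.1f}B tokens")


def lr_at(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to base_lr (assumed shape), then constant."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

The product 100,000 x 256 x 512 is 13,107,200,000, which matches the ~13.1B figure in the table.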
Training code: github.com/AntonXue/dBERT