Short version: your trainer evaluation is not using the same generation setup as your manual model.generate() loop, and there is a small risk that compute_metrics is decoding the wrong tensor shape. Both will destroy an alignment-based metric like identity similarity.
I will walk through:
- How `Seq2SeqTrainer` evaluates a T5 model.
- What is probably different from your manual evaluation.
- Concrete causes for the 85–100% vs 14% gap.
- Exact configuration and code changes to fix it.
- A small debugging checklist.
1. What Seq2SeqTrainer.evaluate() actually does
For seq2seq tasks, the intended pattern is:
- Use `Seq2SeqTrainer`.
- Set `predict_with_generate=True`.
- The trainer then calls `model.generate()` inside `prediction_step`.
- The generated token IDs and labels are passed to `compute_metrics`, where you decode and compute ROUGE, BLEU, etc. (Hugging Face)
The Hugging Face docs show exactly this pattern with T5 summarization:
- `Seq2SeqTrainer` + `predict_with_generate=True`.
- `compute_metrics(eval_preds)` gets `(preds, labels)` as integer token IDs.
- It replaces `-100` with `pad_token_id`, decodes both, then runs a text metric. (Hugging Face)
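For orientation, here is a minimal sketch of that wiring; `model`, `tokenizer`, and `eval_ds` are placeholders and the metric body is elided, so treat it as an outline rather than your exact setup:

```python
import numpy as np
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

def compute_metrics(eval_pred):
    preds, labels = eval_pred                     # generated IDs and label IDs
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_txt = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_txt = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ... compute your text metric on (label_txt, pred_txt) here ...
    return {}

args = Seq2SeqTrainingArguments(
    output_dir="./tmp",
    predict_with_generate=True,   # makes prediction_step call model.generate()
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
)
metrics = trainer.evaluate()      # runs generate, then compute_metrics
```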
So in the “happy path” your compute_metrics should be seeing generated IDs from model.generate, not logits.
Where generation parameters come from
In evaluation, Seq2SeqTrainer builds gen_kwargs for model.generate:
- If `Seq2SeqTrainingArguments.generation_max_length` is set, it becomes `max_length` for `generate`.
- If not, it falls back to `model.generation_config.max_length`.
- Similarly, `generation_num_beams` overrides `num_beams`; otherwise it uses `model.generation_config.num_beams`. (GitHub)
The official summarization script clarifies that generation_max_length is used specifically to override model.generate(..., max_length=...) during evaluate and predict. (GitHub)
So: trainer.evaluate() does not magically reuse whatever arguments you used in your own manual model.generate calls. It only uses what you encode into Seq2SeqTrainingArguments or the model’s generation_config.
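You can check these fallbacks from your own script before calling `trainer.evaluate()`; a quick probe, assuming `training_args` and `model` are already built:

```python
# If the first two are None, Seq2SeqTrainer falls back to the model's generation_config.
print("generation_max_length:", training_args.generation_max_length)
print("generation_num_beams: ", training_args.generation_num_beams)
print("generation_config.max_length:", model.generation_config.max_length)  # often a small default (e.g. 20)
print("generation_config.num_beams: ", model.generation_config.num_beams)   # often 1, i.e. greedy
```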
2. What is different from your manual evaluation
Your current setup:
```python
training_args = Seq2SeqTrainingArguments(
    output_dir=f"./finetuning/temp",
    predict_with_generate=True,
    per_device_eval_batch_size=args.batch_size,
    report_to='tensorboard',
    logging_dir='./finetuning/logs/eval/' + args.ext,
    fp16=True,
    greater_is_better=True,
    # no generation_max_length
    # no generation_num_beams
)
```
In your manual evaluation you do something like (pseudo):
```python
outputs = model.generate(
    input_ids,
    attention_mask=attn_mask,
    max_length=MANUAL_MAX,
    # or max_new_tokens=...
    num_beams=MANUAL_BEAMS,
    # maybe other options
)
```
But in the trainer:
- `generation_max_length` is not set.
- `generation_num_beams` is not set.
So during evaluate():
- `max_length` will come from `model.generation_config.max_length` (which is often a small default).
- `num_beams` will come from `model.generation_config.num_beams` (often 1 if you never changed it).
- If you used sampling or different beams manually, none of that is reflected in trainer evaluation. (Hugging Face Forums)
This is exactly the class of mismatch described in an HF forum thread where evaluation metrics during training and final `trainer.evaluate()` differed because `generation_num_beams` and `generation_max_length` were not aligned with the `num_beams` and `max_length` used elsewhere. (Hugging Face Forums)
Your identity similarity metric is a global alignment score. It is very sensitive to:
- Truncated outputs.
- Different decoding search (greedy vs multi-beam).
- Outputs that diverge after a few tokens.
So if trainer.evaluate() is generating shorter or greedier outputs than your manual loop, you can easily drop from ~90% per sample to something like 10–20%.
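To get a feel for how punishing truncation alone is, here is a toy illustration that uses `difflib.SequenceMatcher` as a rough stand-in for an alignment-based identity score; the sequence below is made up and this is not your actual metric:

```python
from difflib import SequenceMatcher

reference = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"  # made-up target
full_pred = reference            # what a well-configured manual generate() might return
truncated = reference[:10]       # what a small default max_length can produce

print(SequenceMatcher(None, reference, full_pred).ratio())   # 1.0
print(SequenceMatcher(None, reference, truncated).ratio())   # ~0.3
```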
3. Other likely issues in compute_metrics
3.1 Decoding logits instead of token IDs
Your metric function:
```python
predictions, label_ids = eval_pred
if isinstance(predictions, (tuple, list)):
    predictions = predictions[0]
predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
pred_seq = tokenizer.batch_decode(predictions, skip_special_tokens=True)
```
This assumes predictions is an integer array of shape (batch, seq_len).
However, in general Trainer semantics, EvalPrediction.predictions is “everything your model returns apart from loss”. If the model returns logits, past key values, etc., predictions can be a tuple with multiple arrays, often including (batch, seq_len, vocab_size) logits. This is documented in the forums: Trainer packs into predictions all non-loss outputs from the forward pass. (Hugging Face Forums)
- If you decode logits as IDs, you get garbage text and metrics close to random.
- With seq2seq and `predict_with_generate=True` you should get generated IDs, but a misconfiguration, a different trainer, or a custom `prediction_step` can still leave you with logits in `predictions`.
A robust pattern is:
```python
if isinstance(predictions, (tuple, list)):
    predictions = predictions[0]
if predictions.ndim == 3:
    # (batch, seq_len, vocab_size) logits
    predictions = predictions.argmax(-1)
```
There is a very similar bug report in the Unsloth repo: the model generates coherent text in normal inference, but compute_metrics receives jumbled text during evaluation. The root cause there is also about how predictions are prepared and decoded in compute_metrics. (GitHub)
3.2 Multi-GPU and _state.is_main_process
You do:
```python
if not _state.is_main_process:
    return {'GAS':0,'Levensgtein_score':0,'Identity_Similarity_Score':0}
```
Using accelerate.PartialState().is_main_process is correct: Accelerate gathers all predictions and labels across ranks, and then compute_metrics is called on each process. The usual pattern is to compute metrics only on rank 0 and return {} on others. (Hugging Face)
Your zeros from non-main ranks normally do not get logged, so this guard is not the cause of the 14% score. It is fine to change it to return {} for clarity.
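For completeness, the guard above assumes something like the following at module level; this is a sketch of where `_state` comes from, not your exact code:

```python
from accelerate import PartialState

# PartialState attaches to the already-initialized distributed state, so
# is_main_process is True only on rank 0.
_state = PartialState()
```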
4. Concrete causes for your “85–100 vs 14” gap
Putting this together:
- Generation config mismatch (most likely):
  - Trainer uses `generation_max_length` / `generation_num_beams` (or model defaults) during evaluation. (GitHub)
  - Your manual loop uses different `max_length` / `max_new_tokens` / `num_beams`.
  - This yields different output texts, especially truncation and beam differences, which destroy a global identity similarity metric.
- Potential decoding of logits:
  - If for any reason `predictions` is `(batch, seq_len, vocab_size)` and you decode without `argmax`, you get nonsense text.
  - That would turn a good model into low identity similarity numbers inside `compute_metrics`. HF docs explicitly show that in non-generation setups, `predictions` is logits and you must `argmax` before decoding. (Hugging Face)
- Dataset / preprocessing mismatch:
  - Manual evaluation might run on a different split or with extra cleaning steps applied; the trainer uses `eval_dataset` as passed.
  - This could explain smaller deviations, but usually nothing as drastic as 85–100 vs 14 by itself.
The first point closely matches the T5 summarization thread where ROUGE scores differed between training-time and final evaluation because `generation_num_beams` and `generation_max_length` did not match the `num_beams` and `max_length` used elsewhere. (Hugging Face Forums)
5. How to fix it: code changes
5.1 Make Trainer use the same generation config as your manual code
Take whatever you use in manual model.generate and set the equivalents in Seq2SeqTrainingArguments:
```python
training_args = Seq2SeqTrainingArguments(
    output_dir="./finetuning/temp",
    predict_with_generate=True,
    per_device_eval_batch_size=args.batch_size,
    report_to="tensorboard",
    logging_dir="./finetuning/logs/eval/" + args.ext,
    fp16=True,
    greater_is_better=True,
    # match your manual generate()
    generation_max_length=MANUAL_MAX_LENGTH,  # e.g. 128 or 256
    generation_num_beams=MANUAL_NUM_BEAMS,    # e.g. 4
)
```
If you rely on max_new_tokens instead of max_length, set it in the model’s generation config:
```python
model.generation_config.max_new_tokens = MANUAL_MAX_NEW_TOKENS
```
Then either:

- leave `generation_max_length=None` and rely on the model's generation config, or
- set `generation_max_length` explicitly to the equivalent value; for an encoder-decoder like T5, `generate`'s `max_length` counts only the generated (decoder) tokens, so the equivalent is roughly `max_new_tokens + 1` (for the decoder start token), not input length plus new tokens.
The summarization example and the forum thread both confirm this is the correct way to keep evaluation behavior consistent. (GitHub)
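As a sketch of the generation-config route, where `MANUAL_*` are placeholders for whatever your manual loop actually uses:

```python
# Keep both code paths in sync by putting the decoding settings on the model's
# generation_config, which manual generate() calls and Seq2SeqTrainer (with
# generation_max_length / generation_num_beams left unset) both fall back to.
model.generation_config.max_new_tokens = MANUAL_MAX_NEW_TOKENS  # placeholder
model.generation_config.num_beams = MANUAL_NUM_BEAMS            # placeholder
```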
5.2 Make compute_metrics robust to logits
Update the metric function:
```python
def compute_metrics(eval_pred):
    if not _state.is_main_process:
        return {}

    predictions, label_ids = eval_pred
    if isinstance(predictions, (tuple, list)):
        predictions = predictions[0]
    # If we ever get logits, collapse to ids
    if predictions.ndim == 3:
        predictions = predictions.argmax(-1)

    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)
    pred_seq = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    label_seq = tokenizer.batch_decode(labels, skip_special_tokens=True)

    gas = []
    i_sim_score = []
    lev_score = []
    for t, p in zip(label_seq, pred_seq):
        global_alignment_score, identity_similarity_score = get_global_alignment_score(t, p, aligner)
        gas.append(global_alignment_score)
        i_sim_score.append(identity_similarity_score)
        lev_score.append(get_levenshtein_score(t, p))

    avg_gas = sum(gas) / len(gas)
    avg_lev = sum(lev_score) / len(lev_score)
    avg_i_sim_score = sum(i_sim_score) / len(i_sim_score)

    return {
        "GAS": avg_gas,
        "Levensgtein_score": avg_lev,
        "Identity_Similarity_Score": avg_i_sim_score,
    }
```
This preserves your logic but guards against “logits instead of IDs” issues that are documented in Trainer discussions and real bug reports. (Hugging Face Forums)
5.3 Verify with trainer.predict
Run once:
```python
predictions, label_ids, _ = trainer.predict(trainer.eval_dataset)

pred_ids = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

pred_seq = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_seq = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
```
Then compute identity similarity manually on these pred_seq and label_seq.
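For example, reusing the same helpers as your metric function (`get_global_alignment_score` and `aligner` come from your own code):

```python
# Score the trainer's own predictions with the manual-loop helpers; the second
# return value of get_global_alignment_score is the identity similarity.
identity_scores = [
    get_global_alignment_score(t, p, aligner)[1]
    for t, p in zip(label_seq, pred_seq)
]
print("mean identity similarity:", sum(identity_scores) / len(identity_scores))
```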
- If you still get ~14%, the difference is entirely from generation config or dataset.
- If you get ~85–100%, then `compute_metrics` has an implementation error and you should focus there.
6. Minimal debugging checklist
- Print shapes inside `compute_metrics` on the main process: `print(predictions.shape, label_ids.shape)`.
  - If `predictions.ndim == 3`, you are seeing logits and must `argmax`.
- Inspect a few decoded pairs:
  - Print the first 3 `(label_seq[i], pred_seq[i])`.
  - Compare to what you see in the manual loop for the same examples.
- Set generation parameters explicitly:
  - Add `generation_max_length` and `generation_num_beams` that match your manual `generate`.
  - Or configure `model.generation_config` and let Trainer reuse it.
- Ensure the same dataset:
  - Confirm that manual evaluation uses the same `eval_dataset` passed into the trainer.
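The first two checklist items can be covered by a temporary probe at the top of `compute_metrics`; a sketch, where `_state`, `tokenizer`, and `np` come from your existing script:

```python
def compute_metrics(eval_pred):
    predictions, label_ids = eval_pred
    if isinstance(predictions, (tuple, list)):
        predictions = predictions[0]

    if _state.is_main_process:
        # 2-D (batch, seq_len) = generated IDs; 3-D = logits that still need argmax
        print("predictions:", predictions.shape, "labels:", label_ids.shape)

    if predictions.ndim == 3:
        predictions = predictions.argmax(-1)

    preds = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)
    pred_seq = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_seq = tokenizer.batch_decode(labels, skip_special_tokens=True)

    if _state.is_main_process:
        for t, p in list(zip(label_seq, pred_seq))[:3]:
            print("label:", t)
            print("pred :", p)

    # ... then compute GAS / Levenshtein / identity similarity as in section 5.2 ...
```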
Summary
- `Seq2SeqTrainer` with `predict_with_generate=True` uses `model.generate()` during evaluation, but the exact behavior depends on `generation_max_length`, `generation_num_beams`, and the model's `generation_config`, not on your separate manual `generate` calls. (Hugging Face)
- If you do not set those explicitly, trainer evaluation often produces shorter or different outputs than your manual evaluation, which is enough to drop a sensitive alignment-based metric like identity similarity from ~90% to ~14%. (Hugging Face Forums)
- There is an additional risk of decoding logits or complex tuples as IDs inside `compute_metrics`; guarding with `if predictions.ndim == 3: predictions = predictions.argmax(-1)` prevents this and matches documented Trainer behavior. (Hugging Face Forums)
- Align generation parameters, add the logits guard, and sanity-check `trainer.predict` output. After that, `trainer.evaluate()` should report identity similarity close to what you observe with your manual `model.generate()` loop.