Query re: Llama-3.1-8B SFT/GRPO Model Versions (Scalpel vs. Hammer)
Great work on the "Scalpel vs. Hammer" (GRPO vs. SFT) models. I'm trying to use them for my research.I see a few different versions for each experiment, could you please tell me the difference between each version, and which version is the final or most stable one to use?
Thanks!
Hi there, thank you for your kind words and your interest in our work! Indeed, the HuggingFace checkpoints are not well sign-posted or well maintained. The main results were taken from Neelectric/OLMo-2-1124-7B-Instruct_GRPOv01.14 for GRPO and Neelectric/OLMo-2-1124-7B-Instruct_SFTv02.00 for SFT, and should replicate using the https://github.com/huggingface/open-r1 repository, with all the details in our paper's reproducibility section. The other model versions stem from separate experiments with different hyperparameters. While those two models performed best in our testing, we would caution that they are by no means well-polished or suitable for any real use; they are mainly artefacts of research into reasoning training. Note also that all the Llama-3.1 checkpoints diverge from the Scalpel vs. Hammer methodology; the paper focused exclusively on OLMo-2.
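For convenience, here's a minimal sketch of loading one of those checkpoints with the standard `transformers` API (my own illustration, not from the paper; it assumes a recent `transformers` with OLMo-2 support plus `torch` and `accelerate`, and the prompt and generation settings are placeholders):

```python
# Minimal sketch: load one of the checkpoints named above and generate.
# Assumes `transformers` (recent enough for OLMo-2), `torch`, and `accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Neelectric/OLMo-2-1124-7B-Instruct_GRPOv01.14"  # or ..._SFTv02.00 for SFT

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so the 7B model fits on one GPU
    device_map="auto",           # requires `accelerate`
)

# Placeholder math prompt; the chat template comes from the checkpoint itself.
messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```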
Do feel free to reach out with any further questions :)
Thank you so much, this is a critical clarification! We understand now that the paper's core results used OLMo-2, and the Llama-3.1 models were separate experiments. We are still very interested in those Llama-3.1 models (like ...GRPO_MoT_mathv00.19). Did you also happen to benchmark the math reasoning performance of these Llama-3.1 SFT and GRPO (RL) models? Thanks again!
Indeed yes, though all of this is under active development, with different research questions and circumstances, and many of the training runs failed or used hyperparameter choices that turned out to be more harmful than beneficial. Below are some benchmark results for the most recent SFT runs, i.e. Neelectric/Llama-3.1-8B-Instruct_SFT_Math-220kv00.07 - v00.10.
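If you would like to re-run or extend such evaluations yourself, here is a rough, hypothetical sketch using EleutherAI's lm-evaluation-harness (`pip install lm-eval`); this is not the exact harness or task set behind our numbers, and the gsm8k task and few-shot count are placeholder choices:

```python
# Hypothetical evaluation sketch; NOT the authors' exact benchmark setup.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=Neelectric/Llama-3.1-8B-Instruct_SFT_Math-220kv00.07,dtype=bfloat16",
    tasks=["gsm8k"],  # standard grade-school math benchmark; swap in your target task
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```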