What’s the Pre-training Data Strategy Behind Nemotron-H?
by Zieksy
I understand that the Nemotron-H-56B-Base model was pre-trained on 20T tokens, while the Nemotron-H-8B-Base model was trained on 15T tokens. However, the publicly released Nemotron-CC dataset contains only 6.3T tokens.
Could anyone share insight into how this 6.3T dataset is expanded or supplemented to reach the 15T and 20T training budgets? Were additional datasets used, or are there specific methods for augmenting the existing Nemotron-CC data?
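For context, here is the back-of-envelope arithmetic behind my question. This is just my own sketch (the figures for Nemotron-CC and the two training budgets come from the numbers above, nothing here is from the Nemotron-H report): if Nemotron-CC alone had to supply the full budget, it would imply multiple passes over the corpus, or several trillion tokens from other sources.

```python
# My own back-of-envelope illustration, not anything from the Nemotron-H paper:
# how many passes over Nemotron-CC's 6.3T unique tokens would each training
# budget imply, or how many extra tokens would be needed if it is used once?

NEMOTRON_CC_TOKENS = 6.3e12  # publicly released Nemotron-CC size (tokens)
TRAIN_BUDGETS = {
    "Nemotron-H-8B-Base": 15e12,
    "Nemotron-H-56B-Base": 20e12,
}

for model, budget in TRAIN_BUDGETS.items():
    passes = budget / NEMOTRON_CC_TOKENS          # implied epochs over Nemotron-CC
    shortfall = budget - NEMOTRON_CC_TOKENS       # tokens needed from elsewhere
    print(
        f"{model}: {budget / 1e12:.0f}T budget is about {passes:.1f} passes over "
        f"Nemotron-CC, or {shortfall / 1e12:.1f}T tokens from other sources if it is seen once"
    )
```

So the 8B budget is roughly 2.4x the unique Nemotron-CC tokens and the 56B budget roughly 3.2x, which is why I am asking whether the gap is closed by repetition, by additional datasets, or both.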
Looking forward to any information or guidance.
Thank you very much!