What’s the Pre-training Data Strategy Behind Nemotron-H?
by Zieksy
I understand that the Nemotron-H-56B-Base model was pre-trained on 20T tokens, while the Nemotron-H-8B-Base model was trained on 15T tokens. However, the publicly released Nemotron-CC dataset contains only 6.3T tokens.
Could anyone share insight into how this 6.3T dataset is expanded or supplemented to reach the 15T and 20T training budgets? Were additional datasets used, or are there specific methods for augmenting the existing Nemotron-CC data?
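For context, here is the back-of-envelope arithmetic behind my question. This is just my own sketch (the figures for Nemotron-CC and the two training budgets come from the numbers above, nothing here is from the Nemotron-H report): if Nemotron-CC alone had to supply the full budget, it would imply multiple passes over the corpus, or several trillion tokens from other sources.

```python
# My own back-of-envelope illustration, not anything from the Nemotron-H paper:
# how many passes over Nemotron-CC's 6.3T unique tokens would each training
# budget imply, or how many extra tokens would be needed if it is used once?

NEMOTRON_CC_TOKENS = 6.3e12  # publicly released Nemotron-CC size (tokens)
TRAIN_BUDGETS = {
    "Nemotron-H-8B-Base": 15e12,
    "Nemotron-H-56B-Base": 20e12,
}

for model, budget in TRAIN_BUDGETS.items():
    passes = budget / NEMOTRON_CC_TOKENS          # implied epochs over Nemotron-CC
    shortfall = budget - NEMOTRON_CC_TOKENS       # tokens needed from elsewhere
    print(
        f"{model}: {budget / 1e12:.0f}T budget is about {passes:.1f} passes over "
        f"Nemotron-CC, or {shortfall / 1e12:.1f}T tokens from other sources if it is seen once"
    )
```

So the 8B budget is roughly 2.4x the unique Nemotron-CC tokens and the 56B budget roughly 3.2x, which is why I am asking whether the gap is closed by repetition, by additional datasets, or both.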
Looking forward to any information or guidance.
Thank you very much!