Base LLM 400M

Self-mined multi-modal pretraining corpus targeting 200B tokens of high-quality text, video, audio, and image data.

Structure

  • data/shards/shard_{idx:08d}.bin: raw uint32 token IDs, ~128 MB per shard
  • data/shards/shard_{idx:08d}.meta.json: 18-field sidecar with quality stats
  • state/: runtime checkpoints (.state extension)
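
As a minimal sketch of how one shard and its sidecar could be read from a local copy of this layout: it assumes each .bin file is a headerless array of little-endian uint32 values (the layout says "raw uint32" but does not state the byte order) and that the sidecar is plain UTF-8 JSON. The helper names shard_paths and read_shard are hypothetical and are not shipped with the repository.

```python
import json
from pathlib import Path

import numpy as np


def shard_paths(root, idx):
    """Build the .bin and .meta.json paths for one shard index (hypothetical helper)."""
    shard_dir = Path(root) / "data" / "shards"
    name = f"shard_{idx:08d}"  # zero-padded to 8 digits, as in the layout above
    return shard_dir / f"{name}.bin", shard_dir / f"{name}.meta.json"


def read_shard(root, idx):
    """Return (token_ids, sidecar_dict) for one shard.

    Assumption: the .bin file has no header and stores little-endian uint32.
    """
    bin_path, meta_path = shard_paths(root, idx)
    tokens = np.fromfile(bin_path, dtype="<u4")
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return tokens, meta
```

For shards near the stated ~128 MB size, np.memmap(bin_path, dtype="<u4", mode="r") may be preferable to np.fromfile when only a slice of the tokens is needed, since it avoids reading the whole file into memory.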

18-field sidecar schema

Field          Type   Description
shard_idx      int    Shard number
filename       str    Filename
num_tokens     int    Token count
dtype          str    uint32
size_bytes     int    File size
created_at     str    ISO 8601 timestamp
tokens         int    Token count (duplicate for compatibility)
avg_score      float  Mean quality score
min_score      float  Min quality score
max_score      float  Max quality score
std_score      float  Std dev of quality scores
n_above_3      int    Chunks with score > 3.0
n_above_5      int    Chunks with score > 5.0
score_hist     dict   21-bin histogram (0.0 to 10.0, step 0.5)
modality_comp  dict   Modality composition counts
pillar_comp    dict   Pillar composition counts
ts             float  Unix timestamp
modality       str    Primary modality
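
The schema implies a few internal consistency checks, sketched below. The field names come from the table. The function name check_sidecar is hypothetical. The size check assumes a headerless file at 4 bytes per uint32 token, and the histogram check assumes score_hist holds one entry per bin. Neither assumption is confirmed by this card.

```python
EXPECTED_FIELDS = {
    "shard_idx", "filename", "num_tokens", "dtype", "size_bytes", "created_at",
    "tokens", "avg_score", "min_score", "max_score", "std_score",
    "n_above_3", "n_above_5", "score_hist", "modality_comp", "pillar_comp",
    "ts", "modality",
}

# Scores from 0.0 to 10.0 in steps of 0.5 give 10.0 / 0.5 + 1 = 21 bins.
N_BINS = int(10.0 / 0.5) + 1


def check_sidecar(meta):
    """Return a list of problems found in one sidecar dict (empty if none)."""
    missing = EXPECTED_FIELDS - set(meta)
    if missing:
        return [f"missing fields: {sorted(missing)}"]

    problems = []
    # "tokens" is documented as a duplicate of "num_tokens".
    if meta["tokens"] != meta["num_tokens"]:
        problems.append("tokens and num_tokens disagree")
    # Assumption: no file header, so size is exactly 4 bytes per token.
    if meta["size_bytes"] != 4 * meta["num_tokens"]:
        problems.append("size_bytes is not 4 * num_tokens")
    if not meta["min_score"] <= meta["avg_score"] <= meta["max_score"]:
        problems.append("avg_score lies outside [min_score, max_score]")
    # Every chunk scoring above 5.0 also scores above 3.0.
    if meta["n_above_5"] > meta["n_above_3"]:
        problems.append("n_above_5 exceeds n_above_3")
    # Assumption: one dict entry per histogram bin.
    if len(meta["score_hist"]) != N_BINS:
        problems.append(f"score_hist does not have {N_BINS} bins")
    return problems
```

A sidecar that passes returns an empty list, so a loader could skip or log any shard for which check_sidecar(meta) is non-empty.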

Loading (Python)

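from datasets import load_dataset

# Stream the train split instead of downloading every shard up front.
ds = load_dataset("morningstarxcdcode/base-llm-400m", split="train", streaming=True)
for example in ds:
    # Token IDs are read from the "tokens" key of each example.
    print(example["tokens"])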