The current logic seems to work like this. I was able to reproduce it in code execution, so the explanation below should be accurate.
The validation split is inferred as JSON because, inside `validation/`, the Hugging Face `datasets` library sees more `.json` files than `.arrow` files and naively concludes "this split is a JSON dataset".
Concretely, for your layout:
```
dataset_path/
    dataset_dict.json
    train/
        data-00000-of-00010.arrow
        ...
        data-00009-of-00010.arrow
        dataset_info.json
        state.json
    validation/
        data-00000-of-00001.arrow
        dataset_info.json
        state.json
```
- `train/` has:
  - 10× `*.arrow` files
  - 2× JSON files: `dataset_info.json`, `state.json`
- `validation/` has:
  - 1× `*.arrow` file
  - 2× JSON files: `dataset_info.json`, `state.json`
So:
- In `train/`, Arrow wins (10 vs 2) → split inferred as Arrow.
- In `validation/`, JSON wins (2 vs 1) → split inferred as JSON.
Then the library checks that all splits use the same format. It finds:
```
NamedSplit('train') → ('arrow', {})
NamedSplit('validation') → ('json', {})
```

and raises a `FileFormatMismatchBetweenSplitsError`:

```
Couldn't infer the same data file format for all splits.
Got {NamedSplit('train'): ('arrow', {}), NamedSplit('validation'): ('json', {})}
```
This is not a configuration error in dataset_dict.json. It is a consequence of how load_dataset guesses the file format when you point it at a directory.
## Background: what kind of directory you have
Your directory shape is exactly what `DatasetDict.save_to_disk("dataset_path")` produces. (Hugging Face)
So this is:

- An Arrow-backed dataset stored on disk, with small JSON metadata files next to the Arrow data.
- Not meant to be treated as a folder of raw JSON data files.
The intended loader for such a directory is:

```python
from datasets import load_from_disk

ds = load_from_disk("dataset_path")
```

not `load_dataset("dataset_path")`. (Hugging Face)
When you pass this directory to load_dataset, the library assumes “this is a raw data folder, let me infer the file format”, and that is where the inference bug bites.
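If you don't always know how a local path was produced, one defensive pattern is to check for the marker files that `save_to_disk` writes before choosing a loader. This helper is a sketch under that assumption, not a `datasets` API:

```python
import os

from datasets import load_dataset, load_from_disk

def load_local(path: str):
    # Sketch: a saved DatasetDict has dataset_dict.json at its root,
    # and a saved single Dataset has state.json. Anything else is
    # treated as a raw data folder and handed to load_dataset.
    markers = ("dataset_dict.json", "state.json")
    if any(os.path.isfile(os.path.join(path, m)) for m in markers):
        return load_from_disk(path)
    return load_dataset(path)
```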
## Background: how `load_dataset` decides "Arrow vs JSON"
When `load_dataset` is called on a path without a Python builder script, it uses `HubDatasetModuleFactoryWithoutScript` / `LocalDatasetModuleFactoryWithoutScript` internally. These factories:

- Build a `DataFilesDict` describing which files belong to each split.
- For each split, call `infer_module_for_data_files(...)` to decide which format module to use (JSON, CSV, Parquet, Arrow, etc.).
- Check that all splits yield the same format; if not, they raise `FileFormatMismatchBetweenSplitsError` (sketched below). (Hugging Face)
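That last step behaves roughly like this (a paraphrase of the described check, not the library's actual source):

```python
# Per-split inference results, as in the error message above:
split_formats = {"train": ("arrow", {}), "validation": ("json", {})}

# If the splits disagree on format, loading aborts:
if len(set(split_formats.values())) > 1:
    raise ValueError(
        "Couldn't infer the same data file format for all splits. "
        f"Got {split_formats}"
    )
```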
The key function, `infer_module_for_data_files`, looks at file extensions and counts them:

- It collects the suffixes of all files in each split.
- It counts how many times each extension appears.
- It picks the most common extension.
- It maps that extension to a module: `"json"` → Json, `"csv"` → Csv, `"arrow"` → Arrow, etc. (Hugging Face)
That means:

- The logic is purely frequency-based: "which extension appears the most in this split's files?"
- It does not distinguish between "real data" JSON and "metadata" JSON like `dataset_info.json` or `state.json` (see the sketch below).
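A minimal re-implementation of that heuristic (simplified; the real `infer_module_for_data_files` handles many more extensions and edge cases) reproduces the misclassification:

```python
from collections import Counter
from pathlib import Path

# Simplified extension-to-module table; illustrative only.
EXTENSION_TO_MODULE = {".arrow": "arrow", ".csv": "csv",
                       ".json": "json", ".parquet": "parquet"}

def infer_module(filenames):
    # Count every extension -- metadata files vote like data files.
    counts = Counter(Path(name).suffix for name in filenames)
    most_common_ext, _ = counts.most_common(1)[0]
    return EXTENSION_TO_MODULE.get(most_common_ext)

print(infer_module([f"data-{i:05d}-of-00010.arrow" for i in range(10)]
                   + ["dataset_info.json", "state.json"]))   # -> 'arrow'
print(infer_module(["data-00000-of-00001.arrow",
                    "dataset_info.json", "state.json"]))     # -> 'json'
```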
## How this produces exactly your error
GitHub issue #7018 shows the same error and explains the root cause in almost identical terms. The reported code is:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
dataset = load_dataset("yelp_review_full")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.save_to_disk("dataset")
tokenized_datasets = load_dataset("dataset/")  # raises
```
The error:
```
ValueError: Couldn't infer the same data file format for all splits.
Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}
```
In that issue, the reporter inspects the `test` split:
```shell
$ ls -l dataset/test
-rw-r--r-- 1 ... data-00000-of-00001.arrow
-rw-r--r-- 1 ... dataset_info.json
-rw-r--r-- 1 ... state.json
```

and explicitly writes:
> I believe this bug is caused by the logic that tries to infer dataset format. It counts the most common file extension. However, a small dataset can fit in a single `.arrow` file and have two JSON metadata files, causing the format to be inferred as JSON. (GitHub)
This is exactly your situation:
- Your train split has many Arrow shards → Arrow extension dominates → train is Arrow.
- Your validation split has just one Arrow shard but two JSON metadata files → JSON extension dominates → validation is JSON.
So for your directory:

- `train` → inferred as Arrow (correct).
- `validation` → inferred as JSON (incorrect, but consistent with the counting heuristic).
- The formats differ → `FileFormatMismatchBetweenSplitsError`.
The core reason the validation split is inferred as JSON is:

The validation split directory has more `.json` files than `.arrow` files, and the format inference counts every extension, including metadata files, and blindly chooses the majority extension.

That is why `dataset_info.json` and `state.json` "outvote" your single Arrow file in `validation/`.
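If you do need `load_dataset` for some reason, one workaround (my suggestion, not something from the issue) is to skip inference entirely by naming the builder and pointing it only at the Arrow shards, so the metadata JSONs never get a vote:

```python
from datasets import load_dataset

# Explicit builder + glob patterns: dataset_info.json and state.json
# are excluded because only *.arrow matches.
ds = load_dataset(
    "arrow",
    data_files={
        "train": "dataset_path/train/*.arrow",
        "validation": "dataset_path/validation/*.arrow",
    },
)
```

That said, `load_from_disk("dataset_path")` remains the intended and simplest fix.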
## Relation to your `dataset_dict.json`
Your `dataset_dict.json`:

```json
{"splits": ["train", "validation"]}
```

is fine. It is a simple file telling `load_from_disk` which splits exist in the saved dataset. It does not control, or override, the format detection in `load_dataset`.
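You can confirm that yourself; the file really is just the split list (illustrative snippet):

```python
import json

# Inspect the saved split manifest directly:
with open("dataset_path/dataset_dict.json") as f:
    print(json.load(f))  # {'splits': ['train', 'validation']}
```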
The mismatch arises before `dataset_dict.json` is even used:

- `load_dataset("dataset_path")` sees a local path.
- It tries to infer data files and formats from the directory.
- Only after a consistent format is found would it proceed; in your case it never reaches that stage, because the split formats disagree.
So the reason validation is classified as JSON is not a problem with the dataset_dict.json content; it is purely the extension-counting heuristic applied to your files.
## Conceptual summary
Restating the key point in simple terms:

- You have an Arrow dataset saved to disk (`save_to_disk` structure: Arrow shards + metadata JSON).
- `load_dataset` is meant for raw data folders and tries to guess the file format based on which extension appears most often per split.
- In `train/` you have "many Arrow + 2 JSON" → looks like Arrow.
- In `validation/` you have "1 Arrow + 2 JSON" → looks like JSON.
- The library refuses to mix Arrow and JSON across splits and throws `FileFormatMismatchBetweenSplitsError`.
Thus, the validation split is inferred as JSON only because its metadata JSONs outnumber its Arrow shards, and the current inference logic treats metadata files the same as real data files when counting extensions.
That is the precise, mechanical reason behind your error.