Arrow dataset inferred as json dataset

Error details:

```
Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('validation'): ('json', {})}
Error code: FileFormatMismatchBetweenSplitsError
```

My data structure is as follows:

```
dataset_path
– README.md
– dataset_dict.json
– train
  – data-00000-of-00010.arrow
  – data-00001-of-00010.arrow
  – data-00002-of-00010.arrow
  – data-00003-of-00010.arrow
  – data-00004-of-00010.arrow
  – data-00005-of-00010.arrow
  – data-00006-of-00010.arrow
  – data-00007-of-00010.arrow
  – data-00008-of-00010.arrow
  – data-00009-of-00010.arrow
  – dataset_info.json
  – state.json
– validation
  – data-00000-of-00001.arrow
  – dataset_info.json
  – state.json
```

dataset_dict.json:

```
{"splits": ["train", "validation"]}
```

What could be the reason that the validation split is inferred as JSON format?


Surprisingly, load_from_disk works, while load_dataset from the HF URL fails with the above error.

The current logic seems to be like this. I was able to reproduce it in code execution, so it’s probably correct…


The validation split is inferred as JSON because, inside validation/, the Hugging Face datasets library sees more .json files than .arrow files and naively concludes “this split is a JSON dataset”.

Concretely, for your layout:

```
dataset_path/
  dataset_dict.json
  train/
    data-00000-of-00010.arrow
    ...
    data-00009-of-00010.arrow
    dataset_info.json
    state.json
  validation/
    data-00000-of-00001.arrow
    dataset_info.json
    state.json
```
  • train/ has:
    • 10× *.arrow files
    • 2× JSON files: dataset_info.json, state.json
  • validation/ has:
    • 1× *.arrow file
    • 2× JSON files: dataset_info.json, state.json

So:

  • In train/, Arrow wins (10 vs 2) → split inferred as Arrow.
  • In validation/, JSON wins (2 vs 1) → split inferred as JSON.

Then the library checks that all splits use the same format. It finds:

```
NamedSplit('train')      → ('arrow', {})
NamedSplit('validation') → ('json', {})
```

and raises:

```
Couldn't infer the same data file format for all splits.
Got {NamedSplit('train'): ('arrow', {}), NamedSplit('validation'): ('json', {})}
FileFormatMismatchBetweenSplitsError
```

This is not a configuration error in dataset_dict.json. It is a consequence of how load_dataset guesses the file format when you point it at a directory.


Background: what kind of directory you have

Your directory shape:

  • dataset_dict.json at the root

  • one subdirectory per split (train/, validation/)

  • under each split:

    • Arrow shards: data-00000-of-0000X.arrow
    • metadata: dataset_info.json, state.json

is exactly what DatasetDict.save_to_disk("dataset_path") produces. (Hugging Face)

So this is:

  • An Arrow-backed dataset stored on disk, with small JSON metadata files next to the Arrow data.
  • Not a folder of raw data files that load_dataset should parse; the JSONs are metadata, not data.

The intended loader for such a directory is:

```python
from datasets import load_from_disk

ds = load_from_disk("dataset_path")
```

not load_dataset("dataset_path"). (Hugging Face)

When you pass this directory to load_dataset, the library assumes “this is a raw data folder, let me infer the file format”, and that is where the inference bug bites.
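
For reference, a minimal sketch of that contrast, run against a directory with the layout above (the path name is illustrative):

```python
from datasets import load_dataset, load_from_disk

# Intended loader: reads the save_to_disk() layout directly, metadata and all.
ds = load_from_disk("dataset_path")
print(ds)  # DatasetDict with "train" and "validation" splits

# Treating the same directory as a raw data folder triggers format inference,
# which trips over the metadata JSONs and surfaces locally as a ValueError.
try:
    load_dataset("dataset_path")
except ValueError as err:
    print(err)  # Couldn't infer the same data file format for all splits. ...
```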


Background: how load_dataset decides “Arrow vs JSON”

When load_dataset is called on a path without a Python builder script, it uses HubDatasetModuleFactoryWithoutScript / LocalDatasetModuleFactoryWithoutScript internally. These factories:

  1. Build a DataFilesDict describing which files belong to each split.
  2. For each split, call infer_module_for_data_files(...) to decide which format module to use (JSON, CSV, Parquet, Arrow, etc.).
  3. Check that all splits yield the same format. If not, they throw FileFormatMismatchBetweenSplitsError. (Hugging Face)

The key function, infer_module_for_data_files, looks at file extensions and counts them:

  • It collects all suffixes for all files in each split.
  • It counts how many times each extension appears.
  • It picks the most common extension.
  • It maps that extension to a module: "json" → Json, "csv" → Csv, "arrow" → Arrow, etc. (Hugging Face)

That means:

  • The logic is purely frequency-based: “which extension appears the most in this split’s files?”
  • It does not distinguish between “real data” JSON and “metadata” JSON like dataset_info.json or state.json.
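
A rough, self-contained sketch of that counting logic plus the cross-split check (the helper name infer_format and the hard-coded file lists are illustrative, not the library's actual code):

```python
from collections import Counter
from pathlib import Path

def infer_format(files: list[str]) -> str:
    """Stand-in for the heuristic: the most common file extension
    in the split wins, metadata files included."""
    extensions = [Path(name).suffix.lstrip(".") for name in files]
    return Counter(extensions).most_common(1)[0][0]

splits = {
    "train": [f"data-{i:05d}-of-00010.arrow" for i in range(10)]
             + ["dataset_info.json", "state.json"],
    "validation": ["data-00000-of-00001.arrow", "dataset_info.json", "state.json"],
}

formats = {name: infer_format(files) for name, files in splits.items()}
print(formats)  # {'train': 'arrow', 'validation': 'json'}

if len(set(formats.values())) > 1:
    # The real check fails here with the mismatch error shown above.
    raise ValueError(f"Couldn't infer the same data file format for all splits. Got {formats}")
```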

How this produces exactly your error

GitHub issue #7018 shows the same error and explains the root cause in almost identical terms. The reported code is:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

dataset = load_dataset("yelp_review_full")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.save_to_disk("dataset")

tokenized_datasets = load_dataset("dataset/")  # raises
```

The error:

```
ValueError: Couldn't infer the same data file format for all splits.
Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}
```

In that issue, the reporter inspects the `test` split:

```shell
$ ls -l dataset/test
-rw-r--r-- 1 ... data-00000-of-00001.arrow
-rw-r--r-- 1 ... dataset_info.json
-rw-r--r-- 1 ... state.json
```

and explicitly writes:

> I believe this bug is caused by the logic that tries to infer dataset format. It counts the most common file extension. However, a small dataset can fit in a single .arrow file and have two JSON metadata files, causing the format to be inferred as JSON. (GitHub)

This is exactly your situation:

  • Your train split has many Arrow shards → Arrow extension dominates → train is Arrow.
  • Your validation split has just one Arrow shard but two JSON metadata files → JSON extension dominates → validation is JSON.

So for your directory:

  • train → inferred as Arrow (correct).
  • validation → inferred as JSON (incorrect, but consistent with the counting heuristic).
  • The formats differ → FileFormatMismatchBetweenSplitsError.

The core reason the validation split is inferred as JSON is:

The validation split directory has more .json files than .arrow files, and the format inference counts every extension, metadata files included, and blindly chooses the majority extension.

That is why dataset_info.json and state.json “outvote” your single Arrow file in validation/.
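
Because that heuristic only runs when load_dataset has to guess, one way to sidestep it (if you really need load_dataset on this directory rather than load_from_disk) is to name the packaged "arrow" builder and point it only at the Arrow shards. A sketch with illustrative glob patterns:

```python
from datasets import load_dataset

# Only the Arrow shards are listed, so dataset_info.json and state.json
# never enter the extension count and no format inference is needed.
ds = load_dataset(
    "arrow",
    data_files={
        "train": "dataset_path/train/*.arrow",
        "validation": "dataset_path/validation/*.arrow",
    },
)
```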


Relation to your dataset_dict.json

Your dataset_dict.json:

{"splits": ["train", "validation"]}

is fine. It is a simple file telling load_from_disk which splits exist in the saved dataset. It does not control, or override, the format detection in load_dataset.

The mismatch arises before dataset_dict.json is even used:

  1. load_dataset("dataset_path") sees a local path.
  2. It tries to infer data files and formats from the directory.
  3. Only after a consistent format is found would it proceed; in your case it never reaches that stage because the split formats disagree.

So the reason validation is classified as JSON is not a problem with the dataset_dict.json content; it is purely the extension-counting heuristic applied to your files.


Conceptual summary

Restating the key point in simple terms:

  • You have an Arrow dataset saved to disk (save_to_disk structure: Arrow + metadata JSON).
  • load_dataset is meant for raw data folders and tries to guess the file format based on which extension appears most often per split.
  • In train/ you have “many Arrow + 2 JSON” → looks like Arrow.
  • In validation/ you have “1 Arrow + 2 JSON” → looks like JSON.
  • The library refuses to mix Arrow and JSON across splits and throws FileFormatMismatchBetweenSplitsError.

Thus, the validation split is inferred as JSON only because its metadata JSONs outnumber its Arrow shards, and the current inference logic treats metadata files the same as real data files when counting extensions.

That is the precise, mechanical reason behind your error.

Thanks @John6666, I can confirm that with more Arrow shards in the validation split, load_dataset correctly infers the data format as an Arrow dataset.


This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.