Document Type Detection (Swin Transformer)

This model classifies document images by type, such as invoices, letters, resumes, memos, and forms.
It is fine-tuned from a pretrained Swin Transformer using PyTorch.

Model Description

The model automatically classifies scanned or digital document images into predefined document categories.
It is useful for document management systems, OCR pipelines, and enterprise automation workflows.

Input: Document image (RGB)
Output: Document type label
Model: Swin Transformer (pretrained)
Framework: PyTorch

Dataset

The dataset consists of document images grouped by document type.
Images were split into training and validation (dev) sets with an 80:20 ratio.
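
The 80:20 split described above could be reproduced along these lines. This is a minimal sketch using only the standard library, not the author's actual script: the function name split_dataset, the fixed seed, and the assumption that the raw images sit in one folder per class are all hypothetical.

```python
import random
import shutil
from pathlib import Path


def split_dataset(source_dir, output_dir, train_ratio=0.8, seed=42):
    """Copy source_dir/<class>/* into output_dir/{train,dev}/<class>/.

    The split is done per class, so each class keeps roughly the same
    train/dev ratio. Returns {class_name: (n_train, n_dev)}.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    rng = random.Random(seed)  # fixed seed (assumed) for a repeatable split
    counts = {}
    for class_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)
        n_train = int(len(files) * train_ratio)
        for split, subset in (("train", files[:n_train]), ("dev", files[n_train:])):
            target = output_dir / split / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in subset:
                shutil.copy2(f, target / f.name)
        counts[class_dir.name] = (n_train, len(files) - n_train)
    return counts
```

Calling split_dataset("raw", "dataset") on a folder with one subdirectory per class would produce the train/ and dev/ layout used by this model card.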

Dataset Structure

dataset/
├── train/
│   ├── invoice/
│   ├── letter/
│   ├── memo/
│   ├── resume/
│   └── form/
└── dev/
    ├── invoice/
    ├── letter/
    ├── memo/
    ├── resume/
    └── form/

Document Classes

Invoice
Letter
Memo
Resume
Form
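
The model outputs a class index, not a name, so a mapping between the two is needed at inference time. The sketch below assumes the indices follow the alphabetical order of the folder names, which is what torchvision's ImageFolder does; the card does not state the order used in training, so it should be checked against the training code before relying on it.

```python
# Folder names from the dataset structure, sorted alphabetically.
# ASSUMPTION: training used an alphabetical class order (as ImageFolder would).
CLASS_NAMES = sorted(["invoice", "letter", "memo", "resume", "form"])

CLASS_TO_IDX = {name: idx for idx, name in enumerate(CLASS_NAMES)}
IDX_TO_CLASS = {idx: name for name, idx in CLASS_TO_IDX.items()}

print(CLASS_TO_IDX)
# {'form': 0, 'invoice': 1, 'letter': 2, 'memo': 3, 'resume': 4}
```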

Model Architecture

Backbone: Swin Transformer
Pretraining: ImageNet
Loss Function: Cross-Entropy Loss
Optimizer: AdamW
Fine-Tuning Strategy:
- Pretrained weights loaded
- Partial unfreezing of final layers
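
Partial unfreezing can be sketched as freezing every parameter and then re-enabling gradients for a chosen set of name prefixes. The helper unfreeze_only and the stand-in model below are hypothetical; which layers were actually unfrozen for this model is not stated in the card. For a timm Swin model, prefixes such as "layers.3.", "norm." and "head." would likely select the last stage and classifier, but the names should be confirmed with model.named_parameters().

```python
import torch.nn as nn


def unfreeze_only(model: nn.Module, trainable_prefixes):
    """Freeze all parameters except those whose name starts with a given prefix.

    Returns the names of the parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Tiny stand-in model (not a Swin Transformer) to show the effect.
demo = nn.Sequential()
demo.add_module("layers", nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4)))
demo.add_module("head", nn.Linear(4, 5))

trainable = unfreeze_only(demo, ("layers.1.", "head."))
print(trainable)
# ['layers.1.weight', 'layers.1.bias', 'head.weight', 'head.bias']
```

Only the parameters that remain trainable then need to be passed to AdamW, for example by filtering model.parameters() on requires_grad.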

Training Details

Image Size: 224 × 224
Batch Size: 16
Epochs: Not fixed in advance (training ran until convergence)
Data Augmentation:
- Random resize & crop
- Horizontal flip
- Normalization

Evaluation

Model performance was evaluated on the validation (dev) set using:

Overall Accuracy
Confusion Matrix

The model shows strong performance across visually distinct document types.
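
Both metrics can be computed from lists of true and predicted class indices. The sketch below uses plain Python and toy labels for illustration only; the numbers are not results from this model.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy example with three classes (made-up labels, not model output).
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                    # [[1, 1, 0], [0, 2, 0], [0, 0, 1]]
print(overall_accuracy(cm))  # 0.8
```

Off-diagonal cells show which document types are confused with each other, which overall accuracy alone does not reveal.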

Inference

Load Model

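import timm
import torch

# Load the checkpoint on CPU; it stores the architecture name, class count, and weights.
checkpoint = torch.load("swin_doc_classifier.pth", map_location="cpu")

# Rebuild the same architecture without downloading ImageNet weights.
model = timm.create_model(
    checkpoint["model_name"],
    pretrained=False,
    num_classes=checkpoint["num_classes"],
)

# Restore the fine-tuned weights and switch to inference mode.
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()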

Predict on an Image

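# image_tensor: preprocessed float tensor of shape (N, 3, 224, 224)
with torch.no_grad():
    outputs = model(image_tensor)        # raw logits, shape (N, num_classes)
    prediction = outputs.argmax(dim=1)   # predicted class index per image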

Model Files

swin_doc_classifier.pth – Trained model checkpoint (contains model_name, num_classes, and model_state_dict)

Hardware

NVIDIA T4 / RTX GPUs
Apple Silicon (MPS)
CPU supported
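
Since all three backends are listed, the device can be chosen at runtime. This is a small sketch with a hypothetical helper name, pick_device; it prefers CUDA, then Apple's MPS backend, then CPU.

```python
import torch


def pick_device() -> torch.device:
    """Return the best available device: CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
print(device)
```

The model and every input batch then need to be moved to the same device, for example with model.to(device) and image_tensor.to(device).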

Limitations

Single-page document classification
Performance depends on image quality
No OCR/text-based features used

Future Work

Multi-page document support
OCR + Vision hybrid model
Larger document class coverage
Deployment as Hugging Face Space

Citation

If you use this model in your work, please cite:

@misc{document_type_detection,
  title={Document Type Detection using Swin Transformer},
  author={Bharath},
  year={2026}
}

License

MIT License
