Document Type Detection (Swin Transformer)
This model performs document type classification from document images such as invoices, letters, resumes, memos, and forms.
It is fine-tuned from a pretrained Swin Transformer using PyTorch.
Model Description
The goal of this model is to automatically classify scanned or digital document images into predefined document categories.
It is useful for document management systems, OCR pipelines, and enterprise automation workflows.
- Input: Document image (RGB)
- Output: Document type label
- Model: Swin Transformer (pretrained)
- Framework: PyTorch
Dataset
The dataset consists of document images grouped by document type.
Images were split into training and validation (dev) sets with an 80:20 ratio.
Dataset Structure
dataset/
├── train/
│   ├── invoice/
│   ├── letter/
│   ├── memo/
│   ├── resume/
│   └── form/
└── dev/
    ├── invoice/
    ├── letter/
    ├── memo/
    ├── resume/
    └── form/
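A layout like this can be produced from a flat folder of per-class images. The sketch below is one way to do an 80:20 split per class; it is not the script used for this model, and the function name `split_dataset`, the `seed` parameter, and the source folder layout (`src/<class>/<files>`) are assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src: Path, dst: Path, train_ratio: float = 0.8, seed: int = 0) -> dict:
    """Copy src/<class>/* into dst/train/<class>/ and dst/dev/<class>/.

    The split is done per class so each class keeps roughly the same ratio.
    Returns {class_name: (n_train, n_dev)}.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    counts = {}
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_train = int(len(files) * train_ratio)
        for split, subset in (("train", files[:n_train]), ("dev", files[n_train:])):
            out_dir = dst / split / class_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in subset:
                shutil.copy2(f, out_dir / f.name)
        counts[class_dir.name] = (n_train, len(files) - n_train)
    return counts
```

Calling `split_dataset(Path("raw"), Path("dataset"))` would fill `dataset/train/` and `dataset/dev/` with one subfolder per class.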
Document Classes
- Invoice
- Letter
- Memo
- Resume
- Form
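The model outputs a class index, not a name, so the index order matters. This card does not state how labels were indexed; the sketch below assumes a folder-based loader such as torchvision's `ImageFolder`, which sorts folder names alphabetically. The names `CLASS_NAMES`, `IDX_TO_CLASS`, and `CLASS_TO_IDX` are illustrative.

```python
# Class names as listed in this card, lowercased to match the folder names.
CLASS_NAMES = ["invoice", "letter", "memo", "resume", "form"]

# ImageFolder-style loaders sort folder names alphabetically before
# assigning indices, so index 0 is "form", not "invoice".
IDX_TO_CLASS = dict(enumerate(sorted(CLASS_NAMES)))
CLASS_TO_IDX = {name: idx for idx, name in IDX_TO_CLASS.items()}

print(IDX_TO_CLASS)
# {0: 'form', 1: 'invoice', 2: 'letter', 3: 'memo', 4: 'resume'}
```

If training used a different ordering, the mapping should be taken from the training code instead.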
Model Architecture
- Backbone: Swin Transformer
- Pretraining: ImageNet
- Loss Function: Cross-Entropy Loss
- Optimizer: AdamW
- Fine-Tuning Strategy:
  - Pretrained weights loaded
  - Partial unfreezing of final layers
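Partial unfreezing can be sketched as follows. This is a minimal illustration, not the training code for this model: which layers were unfrozen is not stated here, so the prefixes (`layers.3.`, `norm`, `head`), the learning rate, and the helper `unfreeze_only` are assumptions. `TinyStandIn` is a hypothetical toy module that only mimics Swin-like attribute names so the example runs without downloading weights; for a real timm model, check the names with `model.named_parameters()` first.

```python
import torch
from torch import nn


def unfreeze_only(model: nn.Module, trainable_prefixes: tuple) -> list:
    """Freeze every parameter except those whose name starts with a given prefix.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


class TinyStandIn(nn.Module):
    """Hypothetical stand-in with Swin-like attribute names (layers, norm, head)."""

    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.layers = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])
        self.norm = nn.LayerNorm(8)
        self.head = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.head(self.norm(self.layers(x)))


model = TinyStandIn()
trainable_names = unfreeze_only(model, ("layers.3.", "norm", "head"))

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # assumed value, not taken from this card
)
criterion = nn.CrossEntropyLoss()
```

Freezing the early stages keeps the generic ImageNet features intact and reduces the number of parameters updated on a small document dataset.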
Training Details
- Image Size: 224 × 224
- Batch Size: 16
- Epochs: Not fixed; training ran until convergence
- Data Augmentation:
  - Random resize & crop
  - Horizontal flip
  - Normalization
Evaluation
Model performance was evaluated on the validation dataset using:
- Overall Accuracy
- Confusion Matrix
The model shows strong performance across visually distinct document types.
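Both metrics can be computed from lists of true and predicted class indices. The sketch below uses plain Python and made-up labels for three classes; it does not reproduce results for this model, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels for three classes, purely to show the shapes involved.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                    # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(overall_accuracy(cm))  # 4 of 6 predictions are correct
```

Off-diagonal cells show which document types are confused with each other, which overall accuracy alone hides.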
Inference
Load Model
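import timm
import torch

# The checkpoint stores the timm model name, the class count, and the weights.
checkpoint = torch.load("swin_doc_classifier.pth", map_location="cpu")

# Rebuild the architecture without downloading pretrained weights.
model = timm.create_model(
    checkpoint["model_name"],
    pretrained=False,
    num_classes=checkpoint["num_classes"],
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # switch off dropout and other training-only behavior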
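The loading code reads three keys from the checkpoint: `model_name`, `num_classes`, and `model_state_dict`. A checkpoint with that layout could be written as sketched below. The helper `save_checkpoint`, the stand-in `nn.Linear` model, and the name `"stand_in_model"` are hypothetical and chosen so the example runs without timm.

```python
import io

import torch
from torch import nn


def save_checkpoint(model: nn.Module, model_name: str, num_classes: int, target) -> None:
    """Write a checkpoint with the three keys the loading code reads."""
    torch.save(
        {
            "model_name": model_name,  # passed to timm.create_model on load
            "num_classes": num_classes,
            "model_state_dict": model.state_dict(),
        },
        target,
    )


# Hypothetical stand-in model and name; the real checkpoint holds a timm Swin model.
stand_in = nn.Linear(4, 5)
buffer = io.BytesIO()  # a file path such as "swin_doc_classifier.pth" works too
save_checkpoint(stand_in, "stand_in_model", 5, buffer)

buffer.seek(0)
restored = torch.load(buffer, map_location="cpu")
print(sorted(restored))  # ['model_name', 'model_state_dict', 'num_classes']
```

Storing the model name and class count next to the weights lets the loader rebuild the architecture without hard-coding either value.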
Predict on an Image
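# image_tensor: preprocessed float tensor of shape (batch, 3, 224, 224)
with torch.no_grad():
    outputs = model(image_tensor)       # logits, shape (batch, num_classes)
    prediction = outputs.argmax(dim=1)  # index of the predicted class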
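The prediction is a class index. To report a label and a confidence, the logits can be passed through a softmax and mapped to class names, as in the sketch below. The helper `decode_prediction` is illustrative, the logits are made up, and the class order is assumed to be alphabetical by folder name; verify it against the training code.

```python
import torch


def decode_prediction(logits: torch.Tensor, class_names: list) -> list:
    """Turn a (batch, num_classes) logit tensor into (label, confidence) pairs."""
    probs = torch.softmax(logits, dim=1)
    confidences, indices = probs.max(dim=1)
    return [
        (class_names[i], c)
        for i, c in zip(indices.tolist(), confidences.tolist())
    ]


# Assumed class order (alphabetical folder names); check it against training.
class_names = ["form", "invoice", "letter", "memo", "resume"]

# Made-up logits standing in for model(image_tensor).
fake_logits = torch.tensor([[0.1, 2.5, 0.3, 0.2, 0.1]])
label, confidence = decode_prediction(fake_logits, class_names)[0]
print(label)  # invoice
```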
Model Files
swin_doc_classifier.pth: Trained model checkpoint
Hardware
- NVIDIA T4 / RTX GPUs
- Apple Silicon (MPS)
- CPU supported
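To run on whichever of these backends is present, the device can be chosen at runtime. The sketch below is one common pattern; the helper name `pick_device` and the preference order (CUDA, then MPS, then CPU) are assumptions.

```python
import torch


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # getattr guards against older PyTorch builds without the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
print(device.type)
```

The model and the input must live on the same device, for example `model.to(device)` and `image_tensor.to(device)`.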
Limitations
- Single-page document classification
- Performance depends on image quality
- No OCR/text-based features used
Future Work
- Multi-page document support
- OCR + Vision hybrid model
- Larger document class coverage
- Deployment as Hugging Face Space
Citation
If you use this model in your work, please cite:
@misc{document_type_detection,
title={Document Type Detection using Swin Transformer},
author={Bharath},
year={2026}
}
License
MIT License