Document Type Detection (Swin Transformer)

This model performs document type classification from document images such as invoices, letters, resumes, memos, and forms.
It is fine-tuned from a pretrained Swin Transformer using PyTorch.


Model Description

The goal of this model is to automatically classify scanned or digital document images into predefined document categories.
It is useful for document management systems, OCR pipelines, and enterprise automation workflows.

  • Input: Document image (RGB)
  • Output: Document type label
  • Model: Swin Transformer (pretrained)
  • Framework: PyTorch

Dataset

The dataset consists of document images grouped by document type.
Images were split into training and validation (dev) sets with an 80:20 ratio.
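The 80:20 split described above could be reproduced along the following lines. This is a minimal sketch, not the original preprocessing script: the function name `split_per_class`, the per-class (stratified) splitting, and the fixed seed are all assumptions made for illustration.

```python
import random


def split_per_class(files_by_class, dev_fraction=0.2, seed=42):
    """Split each class's file list into train and dev parts.

    Splitting per class keeps the class balance similar in both sets.
    Whether the original split was stratified is not stated, so this
    is an assumption.
    """
    rng = random.Random(seed)
    train, dev = {}, {}
    for label, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the result is reproducible
        rng.shuffle(shuffled)
        n_dev = round(len(shuffled) * dev_fraction)
        dev[label] = shuffled[:n_dev]
        train[label] = shuffled[n_dev:]
    return train, dev


# Hypothetical file names, used only to demonstrate the call.
example_files = {
    "invoice": [f"invoice_{i}.png" for i in range(10)],
    "letter": [f"letter_{i}.png" for i in range(10)],
}
train_files, dev_files = split_per_class(example_files)
```

With ten files per class, each class contributes eight training files and two dev files.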

Dataset Structure

dataset/
├── train/
│   ├── invoice/
│   ├── letter/
│   ├── memo/
│   ├── resume/
│   └── form/
└── dev/
    ├── invoice/
    ├── letter/
    ├── memo/
    ├── resume/
    └── form/

Document Classes

  • Invoice
  • Letter
  • Memo
  • Resume
  • Form
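To turn a predicted class index back into one of the labels above, an index-to-name mapping is needed. The sketch below assumes the indices follow the alphabetical order of the class folder names, which is how torchvision's `ImageFolder` assigns them. The ordering actually stored with this checkpoint is not documented here, so verify it before relying on the mapping.

```python
# Folder names from the dataset layout, sorted alphabetically.
# Alphabetical indexing is an assumption, not a documented property of this model.
CLASS_NAMES = sorted(["invoice", "letter", "memo", "resume", "form"])

CLASS_TO_INDEX = {name: index for index, name in enumerate(CLASS_NAMES)}
INDEX_TO_CLASS = {index: name for name, index in CLASS_TO_INDEX.items()}
```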

Model Architecture

  • Backbone: Swin Transformer
  • Pretraining: ImageNet
  • Loss Function: Cross-Entropy Loss
  • Optimizer: AdamW
  • Fine-Tuning Strategy:
    • Pretrained weights loaded
    • Partial unfreezing of final layers
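Partial unfreezing, as listed above, can be sketched as follows. The card does not say which layers were unfrozen or which learning rate was used, so the helper `freeze_all_but`, the parameter-name prefixes, and `lr=1e-4` are illustrative assumptions. A tiny stand-in network replaces the Swin backbone so the example runs without downloading weights.

```python
import torch
from torch import nn


def freeze_all_but(model: nn.Module, trainable_prefixes: tuple) -> list:
    """Freeze every parameter except those whose name starts with a given prefix."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in model; the module names here are hypothetical.
# For a real backbone, inspect model.named_parameters() to pick the prefixes.
toy = nn.Sequential()
toy.add_module("stage1", nn.Linear(8, 8))
toy.add_module("stage2", nn.Linear(8, 8))
toy.add_module("head", nn.Linear(8, 5))

trainable_names = freeze_all_but(toy, ("stage2", "head"))

# Only parameters that still require gradients are passed to AdamW.
optimizer = torch.optim.AdamW(
    (p for p in toy.parameters() if p.requires_grad),
    lr=1e-4,  # assumed value; not stated in this card
)
criterion = nn.CrossEntropyLoss()
```

Freezing the early stages keeps the pretrained low-level features intact and reduces the number of parameters updated per step.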

Training Details

  • Image Size: 224 × 224
  • Batch Size: 16
  • Epochs: Not fixed (chosen based on convergence)
  • Data Augmentation:
    • Random resize & crop
    • Horizontal flip
    • Normalization

Evaluation

Model performance was evaluated on the validation dataset using:

  • Overall Accuracy
  • Confusion Matrix

The model shows strong performance across visually distinct document types.
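Both metrics listed above can be computed from the true and predicted class indices. The sketch below uses plain Python. The label lists are made-up values for illustration only and are not results from this model.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_index, pred_index in zip(y_true, y_pred):
        matrix[true_index][pred_index] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Illustrative labels only (five classes, eight samples).
y_true = [0, 1, 1, 2, 3, 4, 4, 2]
y_pred = [0, 1, 2, 2, 3, 4, 0, 2]

matrix = confusion_matrix(y_true, y_pred, num_classes=5)
accuracy = overall_accuracy(matrix)
```

Off-diagonal cells show which document types are confused with each other, which overall accuracy alone does not reveal.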


Inference

Load Model

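import timm
import torch

# Load the checkpoint on CPU; move the model to another device afterwards if needed.
checkpoint = torch.load("swin_doc_classifier.pth", map_location="cpu")

# Rebuild the architecture from the metadata stored in the checkpoint.
model = timm.create_model(
    checkpoint["model_name"],
    pretrained=False,  # weights come from the checkpoint, not from ImageNet
    num_classes=checkpoint["num_classes"],
)

model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # disable dropout and other training-only behavior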
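The loading code reads three keys from the checkpoint: `model_name`, `num_classes`, and `model_state_dict`. A checkpoint with that layout could be written as sketched below. The helper `make_checkpoint` and the placeholder name `"toy_linear"` are hypothetical, and a small linear layer stands in for the trained Swin model.

```python
import io

import torch
from torch import nn


def make_checkpoint(model: nn.Module, model_name: str, num_classes: int) -> dict:
    """Bundle the weights with the metadata needed to rebuild the model."""
    return {
        "model_name": model_name,
        "num_classes": num_classes,
        "model_state_dict": model.state_dict(),
    }


toy_model = nn.Linear(4, 5)  # stand-in for the trained model
toy_checkpoint = make_checkpoint(toy_model, "toy_linear", 5)  # placeholder name, not a timm model

# An in-memory buffer replaces the .pth file so nothing is written to disk.
buffer = io.BytesIO()
torch.save(toy_checkpoint, buffer)
buffer.seek(0)
restored = torch.load(buffer, map_location="cpu")
```

Storing the model name and class count next to the weights is what lets the loader rebuild the architecture without hard-coding it.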

Predict on an Image

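# image_tensor: float tensor of shape (1, 3, 224, 224), preprocessed as during training
with torch.no_grad():
    outputs = model(image_tensor)       # raw logits, shape (1, num_classes)
    prediction = outputs.argmax(dim=1)  # index of the predicted class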

Model Files

  • swin_doc_classifier.pth – Trained model checkpoint

Hardware

  • NVIDIA T4 / RTX GPUs
  • Apple Silicon (MPS)
  • CPU supported
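A simple way to cover all three hardware options is to pick the best available device at runtime. This is a sketch, and `pick_device` is an illustrative helper name rather than part of this repository.

```python
import torch


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
# The model and every input tensor must be moved to the same device,
# e.g. model.to(device) and image_tensor.to(device).
```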

Limitations

  • Single-page document classification
  • Performance depends on image quality
  • No OCR/text-based features used

Future Work

  • Multi-page document support
  • OCR + Vision hybrid model
  • Larger document class coverage
  • Deployment as Hugging Face Space

Citation

If you use this model in your work, please cite:

@misc{document_type_detection,
  title={Document Type Detection using Swin Transformer},
  author={Bharath},
  year={2026}
}

License

MIT License

