| Detail | Information |
| --- | --- |
| Company Name | Samsung R&D Institute |
| Role/Position | GenAI Intern |
| Duration | Sept 2024 - May 2025 |
| Location | Bangalore, India |
| Certificate | ‣ |
| Tech Stack | Python, Jupyter Notebooks (ipynb), PyTorch, TensorFlow, Transformers, Vision Transformers (ViTs), Attention Mechanisms, Adaptive Token Sampling (ATS), Patch Tokenization (PT), MLP Heads, Token Scoring |
| Company Website | https://research.samsung.com/sri-b |
| Final Presentation | ‣ |
Efficient ViT model architecture for improved performance
Problem Statement
Traditional Vision Transformers (ViTs) split images into fixed-size patches regardless of content, leading to:
- Redundant tokens in uninformative areas (e.g., sky).
- Under-representation in detailed areas (e.g., petal veins).
- Inefficient computation due to quadratic self-attention cost.
Processing every token equally is therefore expensive for high-resolution images and wasteful in semantically sparse scenes.
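To make the cost concrete, here is a small back-of-the-envelope calculation (illustrative numbers, not from the project) showing how fixed 16×16 patching interacts with the quadratic self-attention cost:

```python
# Illustrative arithmetic: token count and relative self-attention cost
# for a standard ViT with fixed 16x16 patches (not project-specific numbers).
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image (excluding [CLS])."""
    return (image_size // patch_size) ** 2

tokens_224 = vit_token_count(224, 16)   # 14 x 14 = 196 tokens
tokens_448 = vit_token_count(448, 16)   # 28 x 28 = 784 tokens

# Self-attention scales as O(N^2) in the token count N, so doubling the
# resolution quadruples the tokens and multiplies attention cost by ~16.
cost_ratio = (tokens_448 ** 2) / (tokens_224 ** 2)
print(tokens_224, tokens_448, cost_ratio)  # 196 784 16.0
```

Every background token in those 196 (or 784) pays the same attention cost as a foreground token, which is the inefficiency the adaptive approach targets.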
Hypothesis
Using a content-adaptive token strategy and adaptive pruning mechanisms can:
- Reduce the number of tokens processed.
- Improve model efficiency and accuracy.
- Preserve or even enhance semantic detail by prioritizing important regions.
Ideas
- Content-Adaptive Patching: Start with the idea of using variable patch sizes instead of uniform grids.
- Token Scoring Module: Consider ways to assign importance scores to tokens using MLPs or attention.
- Two-stage Adaptive Pruning: Think about pruning tokens in stages based on their scores (ATS1, ATS2).
- GC-ViT Hybrid Attention: Explore combining local attention (for details) with global attention (for context).
- Saliency-Guided Selection: Use saliency maps to retain patches that attract human or model attention.
- Dynamic Token Merging: Merge similar tokens mid-inference to reduce sequence length.
- Knowledge Distillation: Use outputs from a high-capacity model to guide pruning and patching decisions.
- RL-based Patch Selection: Brainstorm using reinforcement learning to dynamically select or resize patches.
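As a sense of how one of these brainstormed ideas could look in code, here is a minimal ToMe-style sketch of Dynamic Token Merging (purely illustrative — this idea was ultimately not adopted, and the function name and matching scheme are assumptions, not the project's code):

```python
import torch
import torch.nn.functional as F

def bipartite_merge(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r pairs of similar tokens by averaging (illustrative sketch).
    tokens: (N, D) with N even -> returns (N - r, D).
    Note: the output does not preserve the original token order, which is
    one reason merging can conflict with downstream attention consistency."""
    a, b = tokens[0::2], tokens[1::2]                        # split into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # cosine similarities
    scores, match = sim.max(dim=-1)          # best partner in b for each a-token
    merge_idx = scores.topk(r).indices       # the r most similar a-tokens
    keep = torch.ones(a.size(0), dtype=torch.bool)
    keep[merge_idx] = False
    b = b.clone()
    # average each merged a-token into its matched b-token (collisions
    # between a-tokens sharing a partner are handled naively here)
    b[match[merge_idx]] = (b[match[merge_idx]] + a[merge_idx]) / 2
    return torch.cat([a[keep], b], dim=0)
```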
Evaluation of Existing Transformer Architectures
| Feature/Idea | Strengths | Weaknesses |
| --- | --- | --- |
| Vision Transformer (ViT) | Simple, scalable transformer architecture that performs well on large datasets. | Computationally expensive due to fixed-size tokenization and quadratic attention. |
| Swin Transformer | Efficient hierarchical attention structure that reduces compute while preserving accuracy. | Struggles with capturing long-range dependencies due to localized window attention. |
| BEiT | Achieves strong results using self-supervised masked image modeling. | Complex pretraining and reliance on uniform tokenization reduce efficiency. |
| DeiT | Performs well on limited data using knowledge distillation, making it data-efficient. | Lacks CNN-like inductive biases, which limits performance on fine-grained tasks. |
Idea Effectiveness & Design Outcomes
<aside>
✅
Content-Adaptive Patching (Implemented): Helped reduce redundant tokens by focusing on high-information regions using variable patch sizes.
</aside>
<aside>
✅
GC-ViT Hybrid Attention (Implemented): Combined local and global attention to capture both fine details and global structure, improving model robustness.
</aside>
<aside>
❌
Dynamic Token Merging (Considered): Explored merging similar tokens, but it conflicted with token ordering and downstream attention consistency.
</aside>
<aside>
✅
Token Scoring Module (Implemented): Enabled effective pruning by assigning importance scores to tokens via MLPs or attention.
</aside>
<aside>
❌
Knowledge Distillation (Considered): Considered guiding pruning via a full-token teacher model, but it added training overhead and complexity.
</aside>
<aside>
✅
Two-stage Adaptive Pruning (Implemented): Reduced token count progressively after key stages while preserving classification accuracy.
</aside>
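The implemented Token Scoring idea can be sketched as a small module: an MLP head scores each patch token, and only the top-K tokens (plus [CLS]) survive. This is a hedged illustration of the mechanism — the class name, layer sizes, and `keep_ratio` default are assumptions, not the project's exact module:

```python
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Hypothetical MLP-based token scoring + top-K pruning, in the spirit
    of the Adaptive Token Sampler (ATS) used in this project."""
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score_head = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + N, D) with the [CLS] token first
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.score_head(patches).squeeze(-1)        # (B, N)
        k = max(1, int(patches.size(1) * self.keep_ratio))
        top = scores.topk(k, dim=1).indices                  # (B, k)
        top, _ = top.sort(dim=1)                             # keep spatial order
        idx = top.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        kept = patches.gather(1, idx)                        # (B, k, D)
        return torch.cat([cls_tok, kept], dim=1)             # (B, 1 + k, D)
```

Sorting the surviving indices preserves the spatial order of the kept tokens, so positional relationships remain consistent for the next stage.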
Approach
- Patch Embedding + Positional Encoding:
- Input image is divided into patches and converted into tokens.
    - A [CLS] token is prepended and positional embeddings are added.
- Stage 1 – GC-ViT (Global-Context ViT):
- Combines local attention (captures textures and fine details) and global attention (captures long-range dependencies).
- Efficient due to spatially constrained attention and subsampled global tokens.
- Adaptive Token Sampler 1 (ATS1):
- Scores each token using lightweight modules (MLPs or attention-based heads).
- Keeps only top-K important tokens + CLS token.
- Discards low-information/background tokens early.
- Stage 2 – GC-ViT:
- Operates on reduced token set from ATS1.
- Further refines features with both local and global understanding.
- Adaptive Token Sampler 2 (ATS2):
- Repeats pruning on the already reduced token set.
- Further minimizes token redundancy while preserving critical information.
- Stage 3 – Final GC-ViT Layers + Classification:
- Operates on minimal token set.
    - Only the [CLS] token is used for final classification.
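The staged pipeline above can be sketched end-to-end. In this minimal illustration, plain `nn.TransformerEncoderLayer` blocks stand in for the GC-ViT stages and a simple score-and-top-K step stands in for each ATS module; all dimensions, the `keep_ratio`, and the class count (102, inferred from the per-class results below) are assumptions rather than the report's configuration:

```python
import torch
import torch.nn as nn

class AdaptiveViTSketch(nn.Module):
    """Illustrative sketch of the staged ViT + adaptive-pruning pipeline."""
    def __init__(self, dim=64, patch=16, img=224, n_classes=102,
                 keep_ratio=0.5):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # positional enc.
        layer = lambda: nn.TransformerEncoderLayer(dim, 4, dim * 2,
                                                   batch_first=True)
        self.stage1, self.stage2, self.stage3 = layer(), layer(), layer()
        self.score1, self.score2 = nn.Linear(dim, 1), nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio
        self.head = nn.Linear(dim, n_classes)

    def _prune(self, x, scorer):
        # ATS stand-in: keep [CLS] plus the top-K scored patch tokens
        cls_tok, patches = x[:, :1], x[:, 1:]
        k = max(1, int(patches.size(1) * self.keep_ratio))
        idx = scorer(patches).squeeze(-1).topk(k, dim=1).indices.sort(dim=1).values
        kept = patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return torch.cat([cls_tok, kept], dim=1)

    def forward(self, img):
        x = self.embed(img).flatten(2).transpose(1, 2)        # (B, N, D)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self._prune(self.stage1(x), self.score1)          # Stage 1 + ATS1
        x = self._prune(self.stage2(x), self.score2)          # Stage 2 + ATS2
        x = self.stage3(x)                                    # Stage 3
        return self.head(x[:, 0])                             # classify on [CLS]
```

With `keep_ratio=0.5`, the two pruning stages cut 196 patch tokens to roughly 49 before Stage 3, so the final layers run on a fraction of the original sequence length.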
Results
- Validation Accuracy: 85.98%
- Weighted F1 Score: 85.82%
- Reduction in Tokens: Up to 8×
- Training Time Reduction: ~25%
- Accuracy Gain: ~15% improvement over baseline in certain adaptive configurations.


| Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| 0 | 0.8000 | 0.8000 | 0.8000 | 10 |
| 1 | 1.0000 | 1.0000 | 1.0000 | 10 |
| ... | ... | ... | ... | ... |
| 98 | 0.9091 | 1.0000 | 0.9524 | 10 |
| 99 | 1.0000 | 1.0000 | 1.0000 | 10 |
| 100 | 0.6667 | 0.8000 | 0.7273 | 10 |
| 101 | 1.0000 | 0.8000 | 0.8889 | 10 |
Learnings & Outcome
- Combining Local + Global Attention allows for better contextual and detailed representation.
- Adaptive Token Sampling can significantly reduce computational cost without compromising accuracy.
- Content-aware methods are model-agnostic, meaning they can be added to any ViT backbone.
- Fine-grained classification tasks benefit immensely from region-focused attention strategies.