| Detail | Information |
| --- | --- |
| Company Name | Samsung R&D Institute |
| Role/Position | GenAI Intern |
| Duration | Sept 2024 - May 2025 |
| Location | Bangalore, India |
| Certificate | ‣ |
| Tech Stack | Python, Jupyter Notebooks (ipynb), PyTorch, TensorFlow, Transformers, Vision Transformers (ViTs), Attention Mechanisms, Adaptive Token Sampling (ATS), Patch Tokenization (PT), MLP Heads, Token Scoring |
| Company Website | https://research.samsung.com/sri-b |
| Final Presentation | ‣ |
Efficient ViT model architecture for improved performance
Problem Statement
Traditional Vision Transformers (ViTs) split images into fixed-size patches regardless of content, leading to:
- Redundant tokens in uninformative areas (e.g., sky).
- Under-representation in detailed areas (e.g., petal veins).
- Inefficient computation due to quadratic self-attention cost.
Processing every token equally is therefore expensive for high-resolution images and wasteful in semantically sparse scenes.
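To make the cost concrete, here is a small back-of-the-envelope calculation (illustrative numbers, not from the project) showing how fixed 16×16 patching interacts with the quadratic self-attention cost:

```python
# Illustrative arithmetic: token count and relative self-attention cost
# for a standard ViT with fixed 16x16 patches (not project-specific numbers).
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image (excluding [CLS])."""
    return (image_size // patch_size) ** 2

tokens_224 = vit_token_count(224, 16)   # 14 x 14 = 196 tokens
tokens_448 = vit_token_count(448, 16)   # 28 x 28 = 784 tokens

# Self-attention scales as O(N^2) in the token count N, so doubling the
# resolution quadruples the tokens and multiplies attention cost by ~16.
cost_ratio = (tokens_448 ** 2) / (tokens_224 ** 2)
print(tokens_224, tokens_448, cost_ratio)  # 196 784 16.0
```

Every background token in those 196 (or 784) pays the same attention cost as a foreground token, which is the inefficiency the adaptive approach targets.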
Hypothesis
Using a content-adaptive token strategy and adaptive pruning mechanisms can:
- Reduce the number of tokens processed.
- Improve model efficiency and accuracy.
- Preserve or even enhance semantic detail by prioritizing important regions.
Ideas
- Content-Adaptive Patching: Start with the idea of using variable patch sizes instead of uniform grids.
- Token Scoring Module: Consider ways to assign importance scores to tokens using MLPs or attention.
- Two-stage Adaptive Pruning: Think about pruning tokens in stages based on their scores (ATS1, ATS2).
- GC-ViT Hybrid Attention: Explore combining local attention (for details) with global attention (for context).
- Saliency-Guided Selection: Use saliency maps to retain patches that attract human or model attention.
- Dynamic Token Merging: Merge similar tokens mid-inference to reduce sequence length.
- Knowledge Distillation: Use outputs from a high-capacity model to guide pruning and patching decisions.
- RL-based Patch Selection: Brainstorm using reinforcement learning to dynamically select or resize patches.
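As a sense of how one of these brainstormed ideas could look in code, here is a minimal ToMe-style sketch of Dynamic Token Merging (purely illustrative — this idea was ultimately not adopted, and the function name and matching scheme are assumptions, not the project's code):

```python
import torch
import torch.nn.functional as F

def bipartite_merge(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r pairs of similar tokens by averaging (illustrative sketch).
    tokens: (N, D) with N even -> returns (N - r, D).
    Note: the output does not preserve the original token order, which is
    one reason merging can conflict with downstream attention consistency."""
    a, b = tokens[0::2], tokens[1::2]                        # split into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # cosine similarities
    scores, match = sim.max(dim=-1)          # best partner in b for each a-token
    merge_idx = scores.topk(r).indices       # the r most similar a-tokens
    keep = torch.ones(a.size(0), dtype=torch.bool)
    keep[merge_idx] = False
    b = b.clone()
    # average each merged a-token into its matched b-token (collisions
    # between a-tokens sharing a partner are handled naively here)
    b[match[merge_idx]] = (b[match[merge_idx]] + a[merge_idx]) / 2
    return torch.cat([a[keep], b], dim=0)
```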
Evaluation of Existing Transformer Architectures
| Feature/Idea | Strengths | Weaknesses |
| --- | --- | --- |
| Vision Transformer (ViT) | Simple, scalable transformer architecture that performs well on large datasets. | Computationally expensive due to fixed-size tokenization and quadratic attention. |
| Swin Transformer | Efficient hierarchical attention structure that reduces compute while preserving accuracy. | Struggles with capturing long-range dependencies due to localized window attention. |
| BEiT | Achieves strong results using self-supervised masked image modeling. | Complex pretraining and reliance on uniform tokenization reduce efficiency. |
| DeiT | Performs well on limited data using knowledge distillation, making it data-efficient. | Lacks CNN-like inductive biases, which limits performance on fine-grained tasks. |
Idea Effectiveness & Design Outcomes
<aside>
✅
Content-Adaptive Patching (Implemented): Helped reduce redundant tokens by focusing on high-information regions using variable patch sizes.
</aside>
<aside>
✅
GC-ViT Hybrid Attention (Implemented): Combined local and global attention to capture both fine details and global structure, improving model robustness.
</aside>
<aside>
❌
Dynamic Token Merging (Considered): Explored merging similar tokens, but it conflicted with token ordering and downstream attention consistency.
</aside>
<aside>
✅
Token Scoring Module (Implemented): Enabled effective pruning by assigning importance scores to tokens via MLPs or attention.
</aside>
<aside>
❌
Knowledge Distillation (Considered): Considered guiding pruning via a full-token teacher model, but it added training overhead and complexity.
</aside>
<aside>
✅
Two-stage Adaptive Pruning (Implemented): Reduced token count progressively after key stages while preserving classification accuracy.
</aside>
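The implemented Token Scoring idea can be sketched as a small module: an MLP head scores each patch token, and only the top-K tokens (plus [CLS]) survive. This is a hedged illustration of the mechanism — the class name, layer sizes, and `keep_ratio` default are assumptions, not the project's exact module:

```python
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Hypothetical MLP-based token scoring + top-K pruning, in the spirit
    of the Adaptive Token Sampler (ATS) used in this project."""
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score_head = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + N, D) with the [CLS] token first
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.score_head(patches).squeeze(-1)        # (B, N)
        k = max(1, int(patches.size(1) * self.keep_ratio))
        top = scores.topk(k, dim=1).indices                  # (B, k)
        top, _ = top.sort(dim=1)                             # keep spatial order
        idx = top.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        kept = patches.gather(1, idx)                        # (B, k, D)
        return torch.cat([cls_tok, kept], dim=1)             # (B, 1 + k, D)
```

Sorting the surviving indices preserves the spatial order of the kept tokens, so positional relationships remain consistent for the next stage.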
Approach
- Patch Embedding + Positional Encoding:
- Input image is divided into patches and converted into tokens.
    - A [CLS] token is prepended and positional embeddings are added.
- Stage 1 – GC-ViT (Global-Context ViT):
- Combines local attention (captures textures and fine details) and global attention (captures long-range dependencies).
- Efficient due to spatially constrained attention and subsampled global tokens.
- Adaptive Token Sampler 1 (ATS1):
- Scores each token using lightweight modules (MLPs or attention-based heads).
- Keeps only top-K important tokens + CLS token.
- Discards low-information/background tokens early.
- Stage 2 – GC-ViT:
- Operates on reduced token set from ATS1.
- Further refines features with both local and global understanding.
- Adaptive Token Sampler 2 (ATS2):
- Repeats pruning on the already reduced token set.
- Further minimizes token redundancy while preserving critical information.
- Stage 3 – Final GC-ViT Layers + Classification:
- Operates on minimal token set.
    - Only the [CLS] token is used for final classification.
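The staged pipeline above can be sketched end-to-end. In this minimal illustration, plain `nn.TransformerEncoderLayer` blocks stand in for the GC-ViT stages and a simple score-and-top-K step stands in for each ATS module; all dimensions, the `keep_ratio`, and the class count (102, inferred from the per-class results below) are assumptions rather than the report's configuration:

```python
import torch
import torch.nn as nn

class AdaptiveViTSketch(nn.Module):
    """Illustrative sketch of the staged ViT + adaptive-pruning pipeline."""
    def __init__(self, dim=64, patch=16, img=224, n_classes=102,
                 keep_ratio=0.5):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # positional enc.
        layer = lambda: nn.TransformerEncoderLayer(dim, 4, dim * 2,
                                                   batch_first=True)
        self.stage1, self.stage2, self.stage3 = layer(), layer(), layer()
        self.score1, self.score2 = nn.Linear(dim, 1), nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio
        self.head = nn.Linear(dim, n_classes)

    def _prune(self, x, scorer):
        # ATS stand-in: keep [CLS] plus the top-K scored patch tokens
        cls_tok, patches = x[:, :1], x[:, 1:]
        k = max(1, int(patches.size(1) * self.keep_ratio))
        idx = scorer(patches).squeeze(-1).topk(k, dim=1).indices.sort(dim=1).values
        kept = patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return torch.cat([cls_tok, kept], dim=1)

    def forward(self, img):
        x = self.embed(img).flatten(2).transpose(1, 2)        # (B, N, D)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self._prune(self.stage1(x), self.score1)          # Stage 1 + ATS1
        x = self._prune(self.stage2(x), self.score2)          # Stage 2 + ATS2
        x = self.stage3(x)                                    # Stage 3
        return self.head(x[:, 0])                             # classify on [CLS]
```

With `keep_ratio=0.5`, the two pruning stages cut 196 patch tokens to roughly 49 before Stage 3, so the final layers run on a fraction of the original sequence length.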
Results
- Validation Accuracy: 85.98%
- Weighted F1 Score: 85.82%
- Reduction in Tokens: Up to 8×
- Training Time Reduction: ~25%
- Accuracy Gain: ~15% improvement over baseline in certain adaptive configurations.


| Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| 0 | 0.8000 | 0.8000 | 0.8000 | 10 |
| 1 | 1.0000 | 1.0000 | 1.0000 | 10 |
| ... | ... | ... | ... | ... |
| 98 | 0.9091 | 1.0000 | 0.9524 | 10 |
| 99 | 1.0000 | 1.0000 | 1.0000 | 10 |
| 100 | 0.6667 | 0.8000 | 0.7273 | 10 |
| 101 | 1.0000 | 0.8000 | 0.8889 | 10 |
Learnings & Outcome
- Combining Local + Global Attention allows for better contextual and detailed representation.
- Adaptive Token Sampling can significantly reduce computational cost without compromising accuracy.
- Content-aware methods are model-agnostic, meaning they can be added to any ViT backbone.
- Fine-grained classification tasks benefit immensely from region-focused attention strategies.