| Detail | Information |
| --- | --- |
| Company Name | Samsung R&D Institute |
| Role/Position | GenAI Intern |
| Duration | Sept 2024 - May 2025 |
| Location | Bangalore, India |
| Certificate | |
| Tech Stack | Python, Jupyter Notebooks (ipynb), PyTorch, TensorFlow, Transformers, Vision Transformers (ViTs), Attention Mechanisms, Adaptive Token Sampling (ATS), Patch Tokenization (PT), MLP Heads, Token Scoring |
| Company Website | https://research.samsung.com/sri-b |
| Final Presentation | |

Efficient ViT model architecture for improved performance

Problem Statement

Traditional Vision Transformers (ViTs) split images into fixed-size patches regardless of content, so every token is processed with the same cost. This becomes expensive for high-resolution images and wasteful in semantically sparse scenes, where many patches carry little information.
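To make the cost concrete, here is a rough back-of-the-envelope sketch (illustrative, not taken from the project): with a fixed 16x16 patch size the token count grows with image area, and self-attention work grows with the square of the token count.

```python
# Rough cost sketch for fixed-size patch tokenization (illustrative only).
PATCH = 16  # fixed patch size used by a vanilla ViT

for side in (224, 384, 512):
    num_tokens = (side // PATCH) ** 2    # tokens = (H / P) * (W / P)
    attn_pairs = num_tokens ** 2         # self-attention scales as O(N^2) in tokens
    print(f"{side}x{side}: {num_tokens} tokens, {attn_pairs} attention pairs")

# At 224x224 a ViT sees 196 tokens; at 512x512 it sees 1024 tokens,
# i.e. roughly 27x more pairwise attention work for the same patch size.
```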

Hypothesis

Using a content-adaptive token strategy together with adaptive pruning mechanisms can reduce redundant tokens and overall compute while preserving classification accuracy.

Ideas

Evaluation of Existing Transformer Architectures

| Feature/Idea | Strengths | Weaknesses |
| --- | --- | --- |
| Vision Transformer | Simple, scalable transformer architecture that performs well on large datasets. | Computationally expensive due to fixed-size tokenization and quadratic attention. |
| Swin Transformer | Efficient hierarchical attention structure that reduces compute while preserving accuracy. | Struggles to capture long-range dependencies due to localized window attention. |
| BEiT | Achieves strong results using self-supervised masked image modeling. | Complex pretraining and reliance on uniform tokenization reduce efficiency. |
| DeiT | Performs well on limited data using knowledge distillation, making it data-efficient. | Lacks CNN-like inductive biases, which limits performance on fine-grained tasks. |

Idea Effectiveness & Design Outcomes

<aside> ✅

Content-Adaptive Patching (Implemented): Helped reduce redundant tokens by focusing on high-information regions using variable patch sizes.

</aside>
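As a rough illustration of the idea (not the project's actual module), the sketch below decides per region whether to keep a coarse patch or split it into finer patches, using local pixel variance as a stand-in for "information content". The 32/16 patch sizes and the variance threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_patch_map(img: torch.Tensor, coarse: int = 32, fine: int = 16,
                       var_thresh: float = 0.02) -> torch.Tensor:
    """Per coarse cell, decide whether to keep one coarse patch or split it into
    finer patches, based on local pixel variance (a simple proxy for information
    content). Returns a boolean grid: True = split this cell into fine patches."""
    # img: (C, H, W), values in [0, 1]
    gray = img.mean(dim=0, keepdim=True).unsqueeze(0)            # (1, 1, H, W)
    cells = F.unfold(gray, kernel_size=coarse, stride=coarse)    # (1, coarse*coarse, n_cells)
    cell_var = cells.var(dim=1).squeeze(0)                       # variance per coarse cell
    h_cells, w_cells = img.shape[1] // coarse, img.shape[2] // coarse
    return (cell_var > var_thresh).reshape(h_cells, w_cells)

# Usage: flat regions stay as single 32x32 patches (1 token each), textured
# regions are split into four 16x16 patches, so tokens concentrate where
# there is detail.
img = torch.rand(3, 224, 224)
split_grid = adaptive_patch_map(img)
n_tokens = split_grid.numel() + 3 * int(split_grid.sum())  # each split cell adds 3 extra tokens
print(split_grid.shape, n_tokens)
```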

<aside> ✅

GC-ViT Hybrid Attention (Implemented): Combined local and global attention to capture both fine details and global structure, improving model robustness.

</aside>
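The sketch below is a heavily simplified version of the local-plus-global idea, not the actual GC-ViT block: window-local self-attention captures fine detail, while attention over a small pooled token grid supplies global context. The dimensions, window size, and residual combination are illustrative choices.

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Toy hybrid block: self-attention inside local windows for fine detail,
    plus attention over a pooled "global" token grid for long-range context.
    A simplification of the local/global split used in GC-ViT-style models."""

    def __init__(self, dim: int, heads: int = 4, window: int = 7, pooled: int = 7):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool2d(pooled)  # builds the small global token grid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map; H and W must be divisible by the window size
        B, H, W, C = x.shape
        w = self.window

        # Local branch: self-attention inside non-overlapping windows.
        xw = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, C)                       # (B * n_windows, w*w, C)
        local, _ = self.local_attn(xw, xw, xw)
        local = local.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)

        # Global branch: every token attends to a pooled token grid.
        g = self.pool(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)  # (B, pooled^2, C)
        q = x.reshape(B, H * W, C)
        global_out, _ = self.global_attn(q, g, g)
        global_out = global_out.reshape(B, H, W, C)

        return x + local + global_out  # residual combination of both branches

# Usage on a 14x14 feature map with 2x2 windows:
# blk = LocalGlobalAttention(dim=64, window=2); y = blk(torch.rand(1, 14, 14, 64))
```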

<aside> ❌

Dynamic Token Merging (Considered): Explored merging similar tokens, but it conflicted with token ordering and downstream attention consistency.

</aside>

<aside> ✅

Token Scoring Module (Implemented): Enabled effective pruning by assigning importance scores to tokens via MLPs or attention.

</aside>
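A minimal sketch of an MLP-based token scorer with top-k pruning; the dimensions and the `keep` budget are hypothetical, not the project's exact configuration.

```python
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Minimal token-scoring sketch: a small MLP assigns an importance score
    to every patch token, and only the top-k tokens (plus the CLS token) are
    kept for the later, more expensive transformer stages."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.score_mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens: torch.Tensor, keep: int):
        # tokens: (B, 1 + N, C) with the CLS token at index 0
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.score_mlp(patches).squeeze(-1)   # (B, N) importance scores
        top_idx = scores.topk(keep, dim=1).indices     # indices of the k best tokens
        gathered = torch.gather(
            patches, 1, top_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        )
        return torch.cat([cls_tok, gathered], dim=1), scores

# Usage: keep 98 of 196 patch tokens before the later blocks.
scorer = TokenScorer(dim=192)
pruned, scores = scorer(torch.rand(2, 197, 192), keep=98)
print(pruned.shape)  # torch.Size([2, 99, 192])
```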

<aside> ❌

Knowledge Distillation (Considered): Considered guiding pruning via a full-token teacher model, but it added training overhead and complexity.

</aside>

<aside> ✅

Two-stage Adaptive Pruning (Implemented): Reduced token count progressively after key stages while preserving classification accuracy.

</aside>
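A compact sketch of how two-stage pruning can be wired into a ViT-style encoder. The stage depths, the 0.7/0.5 keep ratios, and the 102-way classification head are illustrative assumptions (102 matches the class range in the results table below), not the project's exact settings.

```python
import torch
import torch.nn as nn

class TwoStagePrunedViT(nn.Module):
    """Sketch of two-stage adaptive pruning: run some transformer blocks on all
    tokens, drop the lowest-scoring tokens, run more blocks, drop again, then
    classify from the CLS token. Depths and keep ratios are illustrative."""

    def __init__(self, dim: int = 192, depth=(4, 4, 4), keep_ratios=(0.7, 0.5),
                 heads: int = 3, num_classes: int = 102):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.stages = nn.ModuleList(
            [nn.ModuleList([layer() for _ in range(d)]) for d in depth]
        )
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in keep_ratios])
        self.keep_ratios = keep_ratios
        self.head = nn.Linear(dim, num_classes)

    def prune(self, x: torch.Tensor, scorer: nn.Linear, ratio: float) -> torch.Tensor:
        cls_tok, patches = x[:, :1], x[:, 1:]
        keep = max(1, int(patches.size(1) * ratio))
        idx = scorer(patches).squeeze(-1).topk(keep, dim=1).indices
        patches = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return torch.cat([cls_tok, patches], dim=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = tokens                                   # (B, 1 + N, C), CLS at index 0
        for i, blocks in enumerate(self.stages):
            for blk in blocks:
                x = blk(x)
            if i < len(self.keep_ratios):            # prune after stage 0 and stage 1
                x = self.prune(x, self.scorers[i], self.keep_ratios[i])
        return self.head(x[:, 0])                    # classify from the CLS token

# Usage: token count drops from 197 to ~138 to ~69 across the three stages.
model = TwoStagePrunedViT()
logits = model(torch.rand(2, 197, 192))
print(logits.shape)  # torch.Size([2, 102])
```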

Approach

Results

Figure: confusion matrix (confusion_matrix.png)

Figure: CNN Grad-CAM result (cnngradcam_result.png)

| Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| 0 | 0.8000 | 0.8000 | 0.8000 | 10 |
| 1 | 1.0000 | 1.0000 | 1.0000 | 10 |
| ... | ... | ... | ... | ... |
| 98 | 0.9091 | 1.0000 | 0.9524 | 10 |
| 99 | 1.0000 | 1.0000 | 1.0000 | 10 |
| 100 | 0.6667 | 0.8000 | 0.7273 | 10 |
| 101 | 1.0000 | 0.8000 | 0.8889 | 10 |
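A per-class report in this format can be produced with scikit-learn's `classification_report` (an assumption about how these numbers were generated). The sketch below uses randomly generated stand-in labels purely to show the calls, not the project's actual predictions.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Dummy stand-ins for an evaluation split: 102 classes, 10 samples per class.
rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(102), 10)
y_pred = np.where(rng.random(y_true.size) < 0.9, y_true, rng.integers(0, 102, y_true.size))

print(classification_report(y_true, y_pred, digits=4))  # per-class precision/recall/F1/support
cm = confusion_matrix(y_true, y_pred)                    # matrix behind a confusion-matrix plot
```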

Learnings & Outcome