Detail | Information |
---|---|
Company Name | Samsung R&D Institute |
Role/Position | GenAI Intern |
Duration | Sept 2024 - May 2025 |
Location | Bangalore, India |
Certificate | ‣ |
Tech Stack | Python, Jupyter Notebooks (ipynb), PyTorch, TensorFlow, Transformers, Vision Transformers (ViTs), Attention Mechanisms, Adaptive Token Sampling (ATS), Patch Tokenization (PT), MLP Heads, Token Scoring |
Company Website | https://research.samsung.com/sri-b |
Final Presentation | ‣ |
Traditional Vision Transformers (ViTs) split images into fixed-size patches regardless of content. As a result, every token gets the same amount of compute, which becomes expensive at high resolutions and wasteful in semantically sparse scenes, as the sketch below illustrates.
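To make the cost concern concrete, here is a minimal, illustrative sketch (not the internship code) of standard ViT-style fixed-size patch tokenization. The patch size and embedding dimension are arbitrary choices; the point is that the token count grows quadratically with resolution.

```python
import torch
import torch.nn as nn

class FixedPatchEmbed(nn.Module):
    """Standard ViT-style tokenizer: every image is cut into fixed-size patches,
    regardless of how much information each region actually carries."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        self.patch_size = patch_size
        # A strided convolution is the usual way to turn each patch into one token.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        tokens = self.proj(x)                      # (B, D, H/ps, W/ps)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D) with N = (H/ps) * (W/ps)

embed = FixedPatchEmbed()
for res in (224, 384, 512):
    n_tokens = embed(torch.randn(1, 3, res, res)).shape[1]
    print(f"{res}x{res} -> {n_tokens} tokens")     # 196, 576, 1024: quadratic growth
```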
Hypothesis
Using a content-adaptive token strategy together with adaptive pruning can cut redundant token computation while preserving classification accuracy.
Ideas
Feature/Idea | Strengths | Weaknesses |
---|---|---|
Vision Transformer | Simple, scalable transformer architecture that performs well on large datasets. | Computationally expensive due to fixed-size tokenization and quadratic attention. |
Swin Transformer | Efficient hierarchical attention structure that reduces compute while preserving accuracy. | Struggles with capturing long-range dependencies due to localized window attention. |
BEiT | Achieves strong results using self-supervised masked image modeling. | Complex pretraining and reliance on uniform tokenization reduce efficiency. |
DeiT | Performs well on limited data using knowledge distillation, making it data-efficient. | Lacks CNN-like inductive biases, which limits performance on fine-grained tasks. |
<aside> ✅
Content-Adaptive Patching (Implemented): Helped reduce redundant tokens by focusing on high-information regions using variable patch sizes.
</aside>
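As a rough illustration of the content-adaptive idea, the toy sketch below simplifies it to dropping low-information fixed-size patches, using per-patch pixel variance as a stand-in importance measure. The variance criterion and keep ratio are assumptions for illustration, not the implemented variable-patch-size scheme.

```python
import torch

def drop_low_information_patches(patches, keep_ratio=0.5):
    """Toy content-adaptive selection: rank flattened raw patches by pixel variance
    (a crude 'information' proxy) and keep only the most informative fraction.
    patches: (B, N, P*P*C)."""
    var = patches.var(dim=-1)                          # (B, N) per-patch variance
    k = max(1, int(keep_ratio * patches.shape[1]))
    idx = var.topk(k, dim=1).indices                   # indices of high-variance patches
    batch = torch.arange(patches.shape[0]).unsqueeze(1)
    return patches[batch, idx], idx                    # kept patches and their positions

# Example: 196 patches of 16x16x3 pixels, keep the 98 most textured ones.
raw = torch.randn(2, 196, 16 * 16 * 3)
kept, positions = drop_low_information_patches(raw)
print(kept.shape)   # torch.Size([2, 98, 768])
```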
<aside> ✅
GC-ViT Hybrid Attention (Implemented): Combined local and global attention to capture both fine details and global structure, improving model robustness.
</aside>
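The sketch below shows one simplified way to mix local and global attention: each window of tokens attends to itself plus a few pooled global summary tokens. It is a stand-in for the GC-ViT-style hybrid, not a reproduction of it; the window size, pooling scheme, and head count are assumptions.

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Simplified hybrid attention: each token attends to its local window
    plus a small set of pooled global summary tokens."""
    def __init__(self, dim=192, num_heads=4, window=49, num_global=4):
        super().__init__()
        self.window = window
        self.pool = nn.AdaptiveAvgPool1d(num_global)     # builds global summary tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                # x: (B, N, D), N divisible by window
        B, N, D = x.shape
        global_tokens = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, num_global, D)
        out = []
        for start in range(0, N, self.window):
            local = x[:, start:start + self.window]                    # local window
            kv = torch.cat([local, global_tokens], dim=1)              # local + global context
            out.append(self.attn(local, kv, kv, need_weights=False)[0])
        return torch.cat(out, dim=1)                                   # (B, N, D)

attn = LocalGlobalAttention()
print(attn(torch.randn(2, 196, 192)).shape)   # torch.Size([2, 196, 192])
```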
<aside> ❌
Dynamic Token Merging (Considered): Explored merging similar tokens, but it conflicted with token ordering and downstream attention consistency.
</aside>
<aside> ✅
Token Scoring Module (Implemented): Enabled effective pruning by assigning importance scores to tokens via MLPs or attention.
</aside>
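A minimal sketch of the MLP variant of the scoring idea: a small per-token MLP maps each token embedding to an importance score that downstream pruning can consume. The hidden width and sigmoid output range are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Assigns an importance score in [0, 1] to every token via a small MLP.
    Scores can then drive pruning (keep high-scoring tokens, drop the rest)."""
    def __init__(self, dim=192, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens):                          # tokens: (B, N, D)
        return self.mlp(tokens).squeeze(-1).sigmoid()   # (B, N) per-token importance

scorer = TokenScorer()
scores = scorer(torch.randn(2, 197, 192))
print(scores.shape)   # torch.Size([2, 197])
```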
<aside> ❌
Knowledge Distillation (Considered): Explored guiding pruning with a full-token teacher model, but it added training overhead and complexity.
</aside>
<aside> ✅
Two-stage Adaptive Pruning (Implemented): Reduced token count progressively after key stages while preserving classification accuracy.
</aside>
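A hedged sketch of how progressive pruning could look: after each chosen stage, keep the [CLS] token plus the top-scoring patch tokens and discard the rest. The stage positions and keep ratios below are illustrative, not the tuned values from the project; the scores would come from a module like the TokenScorer above.

```python
import torch

def prune_tokens(x, scores, keep_ratio):
    """Keep the [CLS] token (index 0) plus the top-scoring patch tokens.
    x: (B, N, D) with CLS first, scores: (B, N-1) for the patch tokens only."""
    B, N, D = x.shape
    k = max(1, int(keep_ratio * (N - 1)))
    idx = scores.topk(k, dim=1).indices + 1              # +1 to skip the CLS slot
    cls_idx = torch.zeros(B, 1, dtype=torch.long, device=x.device)
    keep = torch.cat([cls_idx, idx], dim=1)              # (B, 1 + k)
    return x.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))

# Two-stage schedule (ratios are illustrative): 197 -> 99 -> 50 tokens.
x = torch.randn(2, 197, 192)
for keep_ratio in (0.5, 0.5):
    scores = torch.rand(x.shape[0], x.shape[1] - 1)      # stand-in for TokenScorer output
    x = prune_tokens(x, scores, keep_ratio)
    print(x.shape)
```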
A [CLS] token is prepended and positional embeddings are added; the [CLS] token is used for final classification.

Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
0 | 0.8000 | 0.8000 | 0.8000 | 10 |
1 | 1.0000 | 1.0000 | 1.0000 | 10 |
... | ... | ... | ... | ... |
98 | 0.9091 | 1.0000 | 0.9524 | 10 |
99 | 1.0000 | 1.0000 | 1.0000 | 10 |
100 | 0.6667 | 0.8000 | 0.7273 | 10 |
101 | 1.0000 | 0.8000 | 0.8889 | 10 |
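For reference, a per-class table like the one above is typically produced with scikit-learn's classification_report. The snippet below is a minimal sketch on dummy labels; scikit-learn is an assumption here, as it is not listed in the tech stack, and the labels are placeholders rather than project data.

```python
import numpy as np
from sklearn.metrics import classification_report

# Dummy stand-ins for the real test labels and model predictions.
rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(5), 10)          # 5 classes x 10 samples each
y_pred = y_true.copy()
y_pred[rng.choice(len(y_pred), 6, replace=False)] = rng.integers(0, 5, 6)  # inject some errors

# digits=4 matches the four-decimal precision/recall/F1 formatting in the table above.
print(classification_report(y_true, y_pred, digits=4))
```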