Imagine trying to describe a photograph to someone over the phone without being able to see it yourself. That’s essentially the challenge computer scientists faced in the 1950s when they first attempted to teach machines to ‘see’.
One of my favorite xkcd comics
If you’re new to the field, you might assume it’s just deep learning (turtles?) all the way down — and you wouldn’t be alone. These days, deep learning is everywhere — plastered across research papers, dominating Twitter hot takes (I refuse to call it X), and clogging up your LinkedIn feed thanks to self-declared "AI thought leaders" who discovered convolutional layers last Tuesday. It’s been hyped as the second coming of intelligence, capable of solving everything from image recognition to world peace.
But here’s the thing: computer vision wasn’t always about tossing giant neural networks at images and hoping for the best. Back before GPUs became altars for worship (and NVIDIA’s revenue rivaled the GDP of entire countries), the field was built on clever algorithms and hand-crafted features — designed by researchers who actually had to think outside the box, not just feed another prompt into one.
In this post, we’re rewinding to the days before deep learning — before AlexNet, before pixels had “brains,” and before every vision problem was solved with a 100B-parameter model and a sprinkle of hype. Back then, progress came from math, and insights had to be earned. This is the story of how computers first learned to see — and how those early hacks paved the way for the AI overload we live with today.

Anatomical Drawing of the Human Eye - Leonardo da Vinci
Humans have been fascinated by vision for centuries, long before the rise of computer algorithms. Thinkers like Leonardo da Vinci and Isaac Newton explored the optics of the human eye and the nature of color, laying early foundations. But it was Hermann von Helmholtz in the 19th century who is often credited with launching the first modern study of visual perception. By examining the eye, he realized it couldn’t possibly deliver a high-resolution image on its own; the raw input was simply too limited. Helmholtz proposed that vision wasn’t just a matter of light hitting the retina, but a process of unconscious inference: the brain fills in the gaps, drawing on prior knowledge and experience to make educated guesses about the world.
The next major breakthrough in understanding vision came from the work of Hubel and Wiesel in the 1950s and ‘60s. They discovered that visual processing happens in layers, with each layer adding structure and meaning to the raw input from the eyes. Light enters through the cornea and lens and projects an image onto the retina (a thin sheet of neural tissue lined with photoreceptors), whose specialized cells translate light into electrical signals that travel down the optic nerve to the brain’s visual cortex. There, a hierarchical network of neurons decodes the scene, gradually extracting features like edges, textures, and shapes, turning photons into perception.
This brings up a question: is computer vision just digital image processing?
Not quite. Digital Image Processing (DIP) operates at a lower level, manipulating pixel data to enhance or transform images without interpreting their content. Noise reduction, edge detection, contrast adjustment, sharpening: all of these are techniques that improve visual quality and prepare images for further analysis. For instance, medical imaging uses DIP to clarify MRI and CT scans, while Photoshop applies filters to adjust brightness and color, or to create posters for the Titanic movie sequel (r/photoshopbattles is a fun place). Crucially, DIP outputs modified images, not insights.
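For a concrete sense of what “modified images, not insights” means, here is a minimal sketch of those classic DIP operations using OpenCV. The image path is a placeholder, and the kernel and threshold values are just illustrative defaults, not tuned settings.

```python
import cv2
import numpy as np

# Load an image in grayscale (path is a placeholder).
img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

# Noise reduction: smooth with a 5x5 Gaussian kernel.
denoised = cv2.GaussianBlur(img, (5, 5), 0)

# Contrast adjustment: stretch intensities to the full 0-255 range.
stretched = cv2.normalize(denoised, None, 0, 255, cv2.NORM_MINMAX)

# Edge detection: Canny with hysteresis thresholds of 100 and 200.
edges = cv2.Canny(stretched, 100, 200)

# Sharpening: a simple unsharp-mask style kernel.
kernel = np.array([[0, -1, 0],
                   [-1, 5, -1],
                   [0, -1, 0]], dtype=np.float32)
sharpened = cv2.filter2D(img, -1, kernel)

# Every output here is just another image, saved back to disk.
cv2.imwrite("edges.jpg", edges)
```

Every result of this pipeline is another array of pixels; nothing in it knows whether the photo contains a cat or a stop sign.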

Computer Vision (CV) focuses on enabling machines to interpret and understand visual content at a semantic level, by aiming to replicate human visual cognition. Extracting meaningful information from images or videos — such as identifying objects, recognizing faces, or analyzing scenes — and using that understanding to make decisions is the key tenet of computer vision. Level-4 autonomous vehicles like Waymo and Zoox use CV combined with other systems to detect pedestrians, interpret traffic signs and drive around city blocks with little to no human supervision. Deep learning-based CV systems learn patterns and context from data, moving beyond raw pixels to high-level comprehension.
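To illustrate that semantic jump, here is a small sketch using a classic, pre-deep-learning detector: OpenCV’s Viola–Jones Haar cascade, which ships with the library. This is not how Waymo or Zoox do it; it is just the simplest way to show a program producing “there is a face here” rather than another filtered image. The image path is a placeholder.

```python
import cv2

# Load OpenCV's bundled Haar cascade for frontal faces.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("group_photo.jpg")  # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# The detector returns semantic information -- (x, y, width, height)
# boxes meaning "there is a face here" -- not just another image.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

print(f"Found {len(faces)} face(s)")
cv2.imwrite("detected.jpg", img)
```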
While distinct, DIP can be considered a subset of CV, and the two often work together. Image processing frequently serves as a preprocessing step for computer vision: enhancing the input data (e.g., removing water from underwater images) can significantly improve task accuracy. Conversely, CV can guide DIP, such as using object detection to apply selective enhancements to specific regions of an image.
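A rough sketch of that two-way street, again with OpenCV: CLAHE contrast enhancement helps the detector cope with a dim photo (DIP serving CV), and the resulting boxes then tell us where to apply sharpening selectively (CV guiding DIP). File names and parameter values are placeholders chosen for illustration.

```python
import cv2
import numpy as np

img = cv2.imread("dim_photo.jpg")  # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# DIP serving CV: boost local contrast (CLAHE) so the detector
# has an easier time with a poorly lit image.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = detector.detectMultiScale(enhanced, 1.1, 5)

# CV guiding DIP: sharpen only the regions the detector flagged,
# leaving the rest of the image untouched.
kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
for (x, y, w, h) in faces:
    roi = img[y:y + h, x:x + w]
    img[y:y + h, x:x + w] = cv2.filter2D(roi, -1, kernel)

cv2.imwrite("selectively_sharpened.jpg", img)
```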
To put it in simple terms:
Digital Image Processing answers: “How can I improve this image?”