The short answer is: Mathematically, it is arbitrary. But physically and psychologically, "Descent" is much more intuitive.

There is no "deep mathematical truth" that makes Descent better than Ascent. If you multiplied every loss function in PyTorch by -1 and switched to Gradient Ascent, the AI would learn exactly the same way.
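This equivalence is easy to check numerically. Below is a minimal sketch (plain Python, no PyTorch required) using a toy loss L(w) = (w - 3)² chosen just for illustration: descending the gradient of L and ascending the gradient of -L produce identical trajectories.

```python
# Toy objective: L(w) = (w - 3)^2, so dL/dw = 2*(w - 3).
def grad_loss(w):
    return 2 * (w - 3)

lr = 0.1
w_descent = 0.0  # updated by gradient DESCENT on L
w_ascent = 0.0   # updated by gradient ASCENT on -L

for _ in range(50):
    # Descent: step against the gradient of L
    w_descent -= lr * grad_loss(w_descent)
    # Ascent: step along the gradient of -L, which is -dL/dw
    w_ascent += lr * (-grad_loss(w_ascent))

print(w_descent, w_ascent)  # both converge toward the minimum at w = 3
```

The two update rules are algebraically the same line of code, which is the whole point: the sign convention is a bookkeeping choice, not a learning mechanism.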

However, we chose Descent (Minimization) for three specific reasons rooted in Physics, Engineering history, and Human psychology.

1. The Physics Analogy: "Energy Landscapes"

This is the "deepest" reason. Deep Learning borrows heavily from physics (specifically statistical mechanics).

In the physical world, nature always tries to find the state of lowest energy.

When we visualize training an AI, we imagine the model as a "ball" navigating a landscape of possible errors. It is intuitive to imagine "gravity" pulling the model down into the valley of the correct answer.

If we used Gradient Ascent, we would have to imagine the model climbing a mountain toward a peak. While possible, this breaks the gravity analogy that helps researchers visualize "momentum" and "friction," which are real hyperparameters in optimizers such as SGD with momentum and Adam.
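The physics analogy is more than a metaphor: the classic "heavy ball" momentum update reads almost literally as a ball with velocity and friction. The sketch below is a simplified version of what `torch.optim.SGD(momentum=0.9)` computes, applied to a toy loss L(w) = w² chosen just for illustration.

```python
def grad(w):
    return 2 * w  # gradient of the toy loss L(w) = w^2

lr, momentum = 0.1, 0.9
w, v = 5.0, 0.0  # position (parameter) and velocity

for _ in range(200):
    # Velocity decays by the momentum factor (friction keeps 90% of it)
    # and gets a push from the current slope (gravity pulling downhill).
    v = momentum * v - lr * grad(w)
    # The ball rolls along its velocity.
    w = w + v

print(w)  # settles near the bottom of the valley at w = 0
```

With momentum near 1 the ball coasts through small bumps and oscillates before settling, exactly the behavior the landscape picture suggests.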

2. The "Cost" Mental Model

Machine Learning has roots in classical engineering and optimization. In these fields, we frame problems in terms of Cost or Error.

"Zero Error" is a very hard, solid "floor" to aim for. It feels stable.

In contrast, "Maximum Utility" or "Maximum Fitness" can feel abstract and unbounded. It is psychologically satisfying to say, "The error is 0," rather than "The log-likelihood is 0" (a log-likelihood is never positive, so maximizing it means approaching 0 from below).
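A quick numerical illustration of that asymmetry, using the probability p a model assigns to the correct label: the cross-entropy loss -log(p) is minimized toward a hard floor at 0, while the log-likelihood log(p) is maximized toward a ceiling at 0 from below.

```python
import math

for p in [0.5, 0.9, 0.99, 0.999999]:
    nll = -math.log(p)  # the loss we minimize: positive, falling toward 0
    ll = math.log(p)    # the objective we would maximize: negative, rising toward 0
    print(f"p={p}: loss={nll:.6f}, log-likelihood={ll:.6f}")
```

The numbers are mirror images, but "error shrinking to 0" reads as progress toward a solid target, while "log-likelihood rising to 0" reads as chasing a limit you never quite touch.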