J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone and J. C. Phillips, "GPU Computing," in Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, May 2008

paper_critique1, Yanggon Kim

I wrote comments on and a summary of this paper. They mainly cover the content that was unfamiliar to me; I especially focused on GPU hardware history and programming models.

The following characteristics of a program make it well suited to GPU computing: computational requirements are large, parallelism is substantial, and throughput is more important than latency. Among these, the emphasis on throughput over latency is the most interesting to me. The authors point out that the human visual system operates on millisecond time scales, while a processor running at gigahertz frequencies completes an operation in nanoseconds, even when that operation takes hundreds of cycles. The difference between the two time scales is six orders of magnitude.
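The six-order-of-magnitude claim can be sanity-checked with rough numbers. Both figures below are order-of-magnitude assumptions for illustration, not values quoted from the paper:

```python
import math

# Rough sanity check of the latency gap (both numbers are assumptions).
visual_latency_s = 1e-3   # human vision notices latency on millisecond scales
cycle_time_s = 1e-9       # one clock cycle on a ~1 GHz processor

gap = visual_latency_s / cycle_time_s
print(round(math.log10(gap)))  # 6 -> six orders of magnitude
```

This slack is what lets GPU designers trade the latency of any single operation for aggregate throughput.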

Originally, the GPU was a fixed-function processor responsible for tasks that are not well suited to the CPU. Over time, these fixed-function pipelines have been replaced by fully programmable parallel compute cores.

| Stage | Characteristics |
| --- | --- |
| Vertex operations | The basic input primitives are vertices. Vertices can be processed in parallel, so this stage is well suited to parallel hardware. |
| Primitive assembly | Vertices are assembled into triangles, the fundamental primitive of modern GPUs. |
| Rasterization | This stage determines which pixel locations each triangle covers. Each covered location generates a fragment, which carries a color, a z (depth) value, and coverage information. |
| Fragment operations | This stage computes the final color of each fragment. Fragments can be processed in parallel, and this is the most computationally demanding stage of the graphics pipeline. |
| Composition | This stage determines the final color of each pixel from its closest fragment. |

Historically, the operations in the vertex and fragment stages were considered configurable rather than programmable, and they were implemented with fixed-function pipelines.

Because fixed-function pipelines lacked the generality needed for more complicated shading and lighting operations, architects replaced the fixed-function operations with user-specified programs run on each vertex and fragment. As a result, the GPU evolved into a programmable engine surrounded by supporting fixed-function units.

There are multiple forms of parallelism that a graphics program can exploit.

| Name | Characteristics |
| --- | --- |
| Task parallelism | Data in different stages can be computed at the same time. |
| Data parallelism | Data within the same stage can be computed at the same time. |
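Data parallelism is easy to sketch: the same per-element program is applied independently to every element, so all elements could in principle run concurrently. The `shade` function and the fragment values below are made up for this illustration:

```python
# Toy illustration of data parallelism: the same per-fragment function is
# applied independently to each element, so the elements could run in
# parallel. The 'shade' function and fragment values are hypothetical.
def shade(fragment):
    return min(255, fragment * 2)  # made-up brightness operation

fragments = [10, 100, 200]
shaded = [shade(f) for f in fragments]  # each call is independent
print(shaded)  # [20, 200, 255]
```

On a GPU, each of these independent calls would be mapped to a separate thread rather than iterated sequentially.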

The pipeline stages also fall into two divisions with different hardware optimizations.

| Division | Stages | Characteristics |
| --- | --- | --- |
| Programmable stages | vertex stage, fragment stage | Multiple stages are mapped onto the hardware in a time-multiplexed manner. |
| Fixed-function stages | rasterization | Greater compute/area efficiency. |

These two divisions preserve task-level parallelism. Within the programmable stages, however, the vertex and fragment stages run on separate, time-multiplexed hardware, so load balancing becomes a problem. Specifically, if the vertex program takes longer than the fragment program, overall throughput is determined by the vertex program.
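The bottleneck effect can be sketched with made-up numbers (the stage costs and the "batch" unit of work are assumptions for illustration, not figures from the paper):

```python
# Toy model: with one dedicated unit per stage, a pipeline's steady-state
# throughput is set by its slowest stage. All numbers are hypothetical.
stage_times_ms = {"vertex": 4.0, "fragment": 1.0}  # assumed per-batch costs

bottleneck_ms = max(stage_times_ms.values())
throughput = 1000.0 / bottleneck_ms  # batches per second

# While the vertex unit works for 4 ms, the fragment unit finishes in 1 ms
# and then idles for 3 ms: the vertex program determines the throughput.
print(throughput)  # 250.0
```

A unified architecture avoids this idle time by letting every unit pick up whichever kind of work is pending.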

This trend yielded the unified shader architecture. Rather than exploiting task-level parallelism across multiple dedicated units, a single pool of programmable units divides its time among vertex, fragment, and geometry (DirectX 10's new shader stage) work.

With the unified shader architecture, programmers avoid the load-balancing problem, but the unified shader units became more complex.

A gather is an indexed read operation from memory; a scatter is an indexed write operation back to memory.
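The two access patterns can be sketched with plain Python lists (the data and index values are made up for illustration):

```python
# Minimal sketch of gather vs. scatter. Data and indices are hypothetical.
data = [10, 20, 30, 40]
idx = [3, 0, 2, 1]

# Gather: read from indexed locations -> out[i] = data[idx[i]]
gathered = [data[j] for j in idx]

# Scatter: write to indexed locations -> out[idx[i]] = data[i]
scattered = [0] * len(data)
for i, j in enumerate(idx):
    scattered[j] = data[i]

print(gathered)   # [40, 10, 30, 20]
print(scattered)  # [20, 40, 30, 10]
```

Gather maps naturally onto texture fetches in graphics hardware, whereas scatter (writing to a computed address) was historically much harder to express in the graphics pipeline.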