ATI had been experimenting with the concepts of a unified shader since the early 2000s. ATI, like Nvidia and Microsoft, knew it was extravagant to have a shader or, even worse, an array of shaders dedicated to one function that then sat idle when that function wasn’t called for. Making all shaders the same would not only make manufacturing a little easier but would also increase the functionality and efficiency of a GPU by at least 50%. The scheduling and routing were the tricky part: how is a shader supposed to know when it is a transformation (vertex) shader and when it is a lighting (pixel) shader? The answer, and everyone knew it, was that the shader doesn’t care. An instruction is an instruction, and data is data; let’s just get on with it. However, unified shading hardware needs some form of load balancing and dynamic scheduling that can ensure all the computational units (shaders) are kept working as much as possible.
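The idle-unit argument can be illustrated with a small back-of-the-envelope model. This is only a sketch: the workload sizes and the 8 + 24 split between vertex and pixel units are hypothetical values chosen for illustration, and the function names are not from any real driver or GPU.

```python
def dedicated_cycles(vertex_work, pixel_work, vertex_units, pixel_units):
    """Cycles needed when each pool of units can only run its own kind of work."""
    # Ceiling division: each pool drains its own queue independently,
    # and the frame is done when the slower pool finishes.
    v = -(-vertex_work // vertex_units)
    p = -(-pixel_work // pixel_units)
    return max(v, p)


def unified_cycles(vertex_work, pixel_work, total_units):
    """Cycles needed when any unit can take any instruction."""
    return -(-(vertex_work + pixel_work) // total_units)


# Hypothetical pixel-heavy frame on a hypothetical 8 + 24 unit split.
vertex_work, pixel_work = 800, 9600
fixed = dedicated_cycles(vertex_work, pixel_work, vertex_units=8, pixel_units=24)
pooled = unified_cycles(vertex_work, pixel_work, total_units=32)
print(fixed, pooled)
```

In this toy case the vertex units finish early and then wait, while the pooled design spreads the same work over all 32 units. Real hardware also pays for scheduling and data routing, which the model ignores.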
ATI’s erstwhile console iGPU group from the ArtX acquisition, a team of SGI veterans who had designed the first GPU, which showed up in the Nintendo 64, undertook the challenge and in 2004 delivered a unified-shader GPU for the Xbox 360 that they called Xenos.
The pipeline stages in Xenos were no different from those of other GPUs that didn’t have the unified shader model. That was because the instructions were handled at the register level and not through the API. That saved the developers from having to learn new techniques that would have, in those days, disrupted their traditional coding practices. But they quickly discovered that the basic design concepts would benefit them in terms of functionality and performance.
The Xenos designers also added a separate cached location in the GPU so that it could report its state to the CPU as quickly as possible.
Microsoft called that tail pointer write-back, and it kept both components in sync while the CPU updated the L2 cache and the GPU pulled data from it. According to Microsoft, that routine provided a theoretical bandwidth of 18 GB/sec.
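The general idea of a consumer publishing its read position so that the producer never overwrites unread data can be sketched as a ring buffer. This is a minimal software analogy, not the Xbox 360 mechanism itself: the class, its fields, and its method names are all hypothetical, and the real hardware works with cache lines rather than Python lists.

```python
class CommandRing:
    """Toy producer/consumer ring with a written-back tail pointer (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0            # count of items written by the producer (the "CPU")
        self.tail = 0            # count of items read by the consumer (the "GPU")
        self.tail_writeback = 0  # consumer's published copy of tail, read by the producer

    def push(self, item):
        # The producer looks only at the written-back tail, never at the
        # consumer's private state, to decide whether a slot is free.
        if self.head - self.tail_writeback >= self.size:
            return False  # ring is full from the producer's point of view
        self.slots[self.head % self.size] = item
        self.head += 1
        return True

    def consume(self):
        if self.tail == self.head:
            return None  # nothing to read
        item = self.slots[self.tail % self.size]
        self.tail += 1
        self.tail_writeback = self.tail  # the "write-back": publish progress
        return item


ring = CommandRing(2)
ring.push("a")
ring.push("b")
blocked = ring.push("c")   # no free slot until the consumer reports progress
first = ring.consume()
accepted = ring.push("c")  # succeeds once the tail has been written back
```

The design choice shown is that the producer never polls the consumer directly; it reads a single published value, which is what keeps the two sides in step.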
The GPU in the Xbox 360 was a customized version of ATI’s R520, a revolutionary design at the time. In keeping with the “X” prefix and echoing the IBM processor, ATI called this GPU the Xenos.
The ATI Xenos, code-named C1, had 10 MB of internal eDRAM and 512 MB of 700 MHz GDDR3 RAM. ATI’s R520 GPU used the R500 architecture and was made on a 90 nm production process at TSMC, with a die size of 288 mm² and a transistor count of 321 million. See Figure 2 for a block diagram of the system.
The GPU’s ALUs were 32-bit IEEE 754 floating-point compliant, with the typical graphics simplifications of rounding modes, denormalized numbers (flush to zero on reads), exception handling, and not-a-number handling. They were capable of vector (including dot product) and scalar operations with single-cycle throughput; that is, all operations issued every cycle. That gave the ALUs a peak processing rate of 96 shader calculations per cycle while fetching textures and vertices.
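To make "flush to zero on reads" concrete, the sketch below emulates it in software for 32-bit floats. It shows the general technique only and is not a model of the Xenos ALUs; the function name is hypothetical.

```python
import struct


def flush_denormal_f32(x):
    """Round x to float32 and replace a denormalized result with signed zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    # A float32 is denormalized when its exponent field is zero
    # and its mantissa field is not.
    if exponent == 0 and mantissa != 0:
        bits &= 0x80000000  # keep only the sign bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]


tiny = flush_denormal_f32(1e-40)             # below the smallest normal float32
smallest_normal = flush_denormal_f32(2.0 ** -126)
ordinary = flush_denormal_f32(1.5)
```

Treating these tiny values as zero gives up a sliver of range near zero, a common trade in graphics hardware, where the difference is rarely visible and the special-case logic is costly.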
The GPU had eight vertex shader units supporting the VS3.0 Shader Model of DirectX 9. Each could process one 128-bit vector instruction plus one 32-bit scalar instruction for each clock cycle. Combined, the eight vertex shader units could transform up to two vertices every clock cycle. The Xenos was the first GPU to process 10 billion vertex shader instructions per second. The vertex shader units supported dynamic flow control instructions such as branches, loops, and subroutines.
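As a rough consistency check, the figures quoted above can be tied together with simple arithmetic. The sketch assumes that every unit issues both its vector and its scalar instruction on every cycle; the clock rate is derived from the quoted numbers rather than taken from the text.

```python
VERTEX_UNITS = 8
INSTRUCTIONS_PER_UNIT_PER_CLOCK = 2   # one vector plus one scalar instruction
TARGET_INSTRUCTIONS_PER_SECOND = 10e9  # the "10 billion" figure quoted above

# Clock rate implied by the quoted peak, under the full-issue assumption.
implied_clock_hz = TARGET_INSTRUCTIONS_PER_SECOND / (
    VERTEX_UNITS * INSTRUCTIONS_PER_UNIT_PER_CLOCK
)

# At up to two vertices per clock, the matching peak transform rate.
peak_vertices_per_second = 2 * implied_clock_hz

print(implied_clock_hz / 1e6, "MHz")
print(peak_vertices_per_second / 1e9, "billion vertices/s")
```

Under that assumption the quoted peak corresponds to a clock of 625 MHz and a ceiling of 1.25 billion transformed vertices per second.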
One of the significant new features of DirectX 9.0 was the support for floating-point processing and data formats known as 4 × 32 float. Compared with the integer formats used in previous API versions, floating-point formats provided much higher precision, range, and flexibility.
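The difference in precision and range can be shown with a small round-trip experiment. The sketch assumes an 8-bit-per-channel normalized integer format as the point of comparison, which is an illustrative choice rather than something stated above; the sample values and helper names are hypothetical.

```python
import struct

# Hypothetical RGBA sample with one very small value and one value above 1.0.
rgba = (0.001, 0.5, 2.75, 1.0)

# Four 32-bit floats occupy 128 bits.
packed = struct.pack("<4f", *rgba)
roundtrip_float = struct.unpack("<4f", packed)


def to_unorm8(x):
    """Quantize to an 8-bit normalized integer, clamping to [0, 1]."""
    clamped = min(max(x, 0.0), 1.0)
    return round(clamped * 255)


def from_unorm8(q):
    return q / 255


roundtrip_int = tuple(from_unorm8(to_unorm8(c)) for c in rgba)

print(len(packed) * 8, "bits")
print(roundtrip_float)
print(roundtrip_int)
```

In the integer path the small value collapses to zero and the value above 1.0 is clamped, while the float path keeps both to within float32 rounding.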
The Xenos introduced the unified shader
The construction of the Xenos GPU showed Microsoft the benefits of unified shaders, which found their way into the proprietary version of Direct3D 9 used in the Xbox 360. The Xbox 360 thus ran a semi-customized version of Direct3D 9 that accommodated the additional functions the Xenos needed. As a result, Direct3D influenced the design of ATI’s TeraScale Xenos architecture, and Xenos influenced future revisions of Direct3D beginning with version 10.
The Xbox 360 was introduced in November 2005.
ATI (and subsequently AMD: same crew, just different company badges) had a close, symbiotic relationship with Microsoft. Nvidia did too, and still does; the GPU suppliers needed Microsoft. The GPUs depended on the APIs to expose their new features to the game developers, and the API builder, Microsoft, wanted the latest performance features the GPUs offered.
In November 2006, Nvidia launched its GeForce 8800 GTX, the first GPU available with unified shaders via a common API: DirectX 10 with Shader Model 4.0.
ATI’s unified shader for the PC, in contrast, used a very long instruction word (VLIW) architecture in which the core executes several operations in parallel.
In this architecture, a shader cluster is organized into five stream processing units. Each stream processing unit can retire one finished single-precision floating-point instruction per clock, a dot product (DP, handled as a special case by combining ALUs), or an integer ADD. The fifth unit is more complex and can also handle special transcendental functions such as sine and cosine. Each shader cluster can execute six instructions per clock cycle (peak), consisting of five shading instructions plus one branch.
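The five-unit cluster can be pictured as a bundle with four ordinary slots and one special slot. The sketch below greedily packs operations into such bundles. It is a toy model under loud assumptions: all operations are treated as independent (a real VLIW compiler must respect data dependencies), the fifth unit is assumed able to run ordinary operations as well as transcendentals, and the operation names are hypothetical.

```python
def pack_bundles(ops, simple_slots=4):
    """Greedily pack independent ops into bundles of 4 simple slots + 1 special slot.

    ops is a list of (name, kind) pairs, where kind is "simple" or "trans".
    """
    pending = list(ops)
    bundles = []
    while pending:
        bundle = {"simple": [], "special": None}
        leftover = []
        for name, kind in pending:
            if kind == "trans":
                # Transcendentals may only go to the special (fifth) unit.
                if bundle["special"] is None:
                    bundle["special"] = name
                else:
                    leftover.append((name, kind))
            elif len(bundle["simple"]) < simple_slots:
                bundle["simple"].append(name)
            elif bundle["special"] is None:
                # Assumption: the fifth unit can also run an ordinary op.
                bundle["special"] = name
            else:
                leftover.append((name, kind))
        bundles.append(bundle)
        pending = leftover
    return bundles


ops = [
    ("mul0", "simple"), ("add0", "simple"), ("mul1", "simple"),
    ("add1", "simple"), ("sin0", "trans"), ("add2", "simple"),
    ("cos0", "trans"),
]
bundles = pack_bundles(ops)
slots_used = sum(len(b["simple"]) + (b["special"] is not None) for b in bundles)
utilization = slots_used / (5 * len(bundles))
```

The example shows why peak figures are hard to reach in practice: the second bundle carries only two operations, so a good share of the available slots goes unused.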
In May 2007, AMD (the acquisition was less than a year old, and many, maybe most, people in the industry still referred to the team as ATI; that would change) introduced its TeraScale architecture in the PC. The R600 series TeraScale PC GPUs were ATI’s second-generation unified shader GPUs and were designed to be fully compatible with Shader Model 4.0 and Microsoft’s DirectX 10.0 API. The architecture was implemented on AMD’s Radeon HD 2000-series add-in boards (AIBs). TeraScale replaced ATI’s Xenos fixed-pipeline, hardware-scheduled unified shader and was designed to compete with Nvidia’s first unified shader microarchitecture, Tesla.
The R600 GPUs were manufactured on 80 nm and 65 nm processes. TeraScale was also used in AMD’s Brazos, Llano, Richland, and Trinity Accelerated Processing Units (APUs).
Unified shaders led to the third era of GPUs and established the GPU as a computing element. With non-specific shaders, which were (and are) 32-bit floating-point processors, the enormous processing power of large arrays of shaders in a single-instruction multiple-data (SIMD) organization would open up new areas of development and produce astonishing results in multiple fields of science, medicine, manufacturing, and finance.
If you liked this story, you might also like the three-volume series Dr. Peddie has written on The History of the GPU:
Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.