Microprocessor: Introduction to the Streaming SIMD Extensions in the Pentium III: Part I

Abstract

The launch of the Intel Pentium III processor makes a number of new features available to application developers, who can use them to create better content for end users.

In addition to being faster than the Pentium II and the Pentium II Xeon, the Pentium III and the Pentium III Xeon processors have many new features, including a unique processor ID and new processor instructions. These new instructions do for the Pentium II what MMX did for the Pentium.

In this article, we introduce the Pentium III and briefly describe its features, concentrating on the details of the new instructions.
1. Introduction to the Pentium III

In February 1999, Intel unveiled its latest processor, the Pentium III. As with each new processor, the Pentium III has a host of new features in addition to the increase in clock speed. Previous processor releases from Intel have adhered relatively closely to Moore's Law, which is popularly paraphrased as saying that processor performance doubles every 18 months (Moore's original observation concerned the number of transistors on a chip). The Pentium III does not double the speed of the Pentium II. It runs in the 450- to 550-MHz range, while the Pentium II and Pentium II Xeon ran at 333 MHz to 400 MHz. Though the increase in clock speed is more modest than previous gains, the Pentium III more than makes up for this with increased functionality.

A Pentium III is essentially a Pentium II running at higher speeds, with the addition of a new set of instructions: Streaming SIMD Extensions, or SSE. Although SSE adds new features, existing applications are not affected: the Pentium III remains compatible with the IA-32 architecture of the Pentium II.

If the Pentium III does not double the MHz provided by the Pentium II, why should we go for the Pentium III at all?
2. New Features of the Pentium III

The Pentium III adds two interesting and useful features: the processor serial number and SSE. The CPU serial number has been at the center of a debate about privacy. This article will abstain from the debate, and focus instead on the new SIMD instructions.

SSE has an acronym embedded in it: SIMD, which stands for Single Instruction Multiple Data. Usually, processors process one data element in one instruction, a processing style called Single Instruction Single Data, or SISD. In contrast, processors having the SIMD capability process more than one data element in one instruction.
3. MMX versus SSE

MMX and SSE, both of which are instruction sets that have been added to existing architectures, share the concept of SIMD, but they differ in the data types they handle, and in the way they are supported in the processor.

MMX instructions are SIMD for integers, while SSE instructions are SIMD for single-precision floating-point numbers. An MMX instruction operates on 64 bits of packed integer data at once (for example, two 32-bit integers), while an SSE instruction operates on four 32-bit floats at once.

A major difference between MMX and SSE is that no new registers were defined for MMX, while eight new registers have been defined for SSE. Each of the registers for SSE is 128 bits long and can hold four single-precision floating-point numbers (each being 32 bits long). The arrangement of the floating-point numbers in the new data type handled by SSE is illustrated in Figure 1.

Figure 1: Arrangement of numbers in the new data type.
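
The layout can also be sketched in C. The `sse_vec4` struct below is a hypothetical stand-in for the 128-bit data type, not a real compiler type, and the helper names are made up for illustration. It assumes 8-bit bytes and a 32-bit `float`, with element 0 as the least-significant element.

```c
#include <stddef.h>

/* Hypothetical model of the 128-bit SSE data type: four 32-bit floats.
   f[0] is the least-significant element, f[3] the most significant. */
typedef struct {
    float f[4];
} sse_vec4;

/* Size of the whole value in bits (assumes 8-bit bytes, 32-bit float). */
unsigned sse_vec4_bits(void) {
    return (unsigned)(sizeof(sse_vec4) * 8u);
}

/* Bit offset of element i (0..3) within the 128-bit value. */
unsigned sse_vec4_bit_offset(unsigned i) {
    return (unsigned)(offsetof(sse_vec4, f) * 8u) + i * 32u;
}
```

Under these assumptions the four elements start at bits 0, 32, 64, and 96 of the 128-bit value.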

The immediate question is: Where did the registers for MMX come from? The MMX registers were allocated out of the registers of the floating-point unit. A floating-point register is 80 bits long, of which 64 bits were used for an MMX register. A limitation of this architecture is that an application cannot execute MMX instructions and perform floating-point operations simultaneously. Additionally, a large number of processor clock cycles are needed to switch from executing MMX instructions to executing floating-point operations, and vice versa. SSE has no such restriction, because separate registers have been defined for it. Hence, applications can execute SIMD integer (MMX) and SIMD floating-point (SSE) instructions simultaneously. Applications can also execute non-SIMD floating-point and SIMD floating-point instructions simultaneously.

The arrangement of the registers in MMX and SSE is illustrated in Figure 2. Figure 2(a) illustrates the mutually exclusive floating-point and MMX registers, while Figure 2(b) illustrates the SSE registers.

Figure 2: Registers in MMX and SSE.

MMX and SSE have one more similarity: Both have eight registers. MMX registers are named mm0 through mm7, while SSE registers are named xmm0 through xmm7.
4. Application Areas

The Pentium III SSE instructions allow for SIMD operations on four single-precision floating-point numbers in one instruction. Therefore, applications that utilize floating-point calculations stand to benefit the most from the usage of SSE. Applications related to 3D graphics, in particular, should see a substantial benefit. In fact, SSE was created specifically for 3D. Games and other applications that use a 3D back-end to display 2D or 2.5D, as well as applications that use vector graphics at the back end, also stand to benefit.
4.1 The Case of 3D

3D graphics typically consists of manipulating large sets of floating-point numbers used to specify the position of a point in 3D space. Manipulation involves performing floating-point calculations on the data set. With the help of SSE instructions, the application will be able to process more data per instruction. The resulting speed increase translates into a better experience for the user.

Application developers can also add more data or more complex calculations to create better effects and generate a richer experience for the user.

Commonly used operations in 3D processing that stand to benefit from SSE instructions include matrix addition, subtraction, multiplication, and transposition; matrix-vector multiplication; vector normalization; vector dot products; and lighting calculations.
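
As one example of such an operation, a 4x4 matrix-vector multiplication maps naturally onto four-wide operations. The sketch below is plain C rather than SSE code. The names `vec4`, `vec4_scale_add`, `mat4_mul_vec4`, and `vec4_dot` are illustrative, and storing the matrix as four columns is an assumption made for this example. Each call to `vec4_scale_add` stands for work that a packed multiply and a packed add could do on four floats at once.

```c
/* Illustrative four-float vector; not a real SSE type. */
typedef struct { float f[4]; } vec4;

/* acc + col * s, element by element: the shape of a packed multiply
   followed by a packed add. */
vec4 vec4_scale_add(vec4 acc, vec4 col, float s) {
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = acc.f[i] + col.f[i] * s;
    return r;
}

/* y = M * v, with M given as four columns (an assumed layout). */
vec4 mat4_mul_vec4(const vec4 cols[4], vec4 v) {
    vec4 y = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (int c = 0; c < 4; ++c)
        y = vec4_scale_add(y, cols[c], v.f[c]);
    return y;
}

/* Dot product: one four-wide multiply, then a sum across the elements. */
float vec4_dot(vec4 a, vec4 b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += a.f[i] * b.f[i];
    return sum;
}
```

With this layout the matrix-vector product needs four multiply-add steps of four floats each, instead of sixteen separate multiplications.
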
5. Streaming SIMD Extensions

SSE adds 70 new instructions to the processor. Along with the new instructions, a new status/control word has been added. SSE requires support from the operating system, which must save and restore the new processor state as required. At the time of writing, the only operating systems that support SSE are Microsoft's Windows 98 and Windows 2000. SSE defines new instructions, new data types, and new instruction categories.

Not all of the new instructions are for SIMD floating point. Of the 70, 50 are SIMD for floating point, 12 are for SIMD integer, and the remaining eight are cacheability instructions. In this article, we will concentrate on the 50 instructions provided for SIMD on single-precision floating-point numbers.
5.1 Classification

The new SIMD floating-point instructions in the Pentium III can be classified in various ways, based on different criteria.

If we classify instructions based on the arrangement of the operands of the instructions, the classification will be as given in the "Data Packing" subsection.

If we classify instructions based on their behavioral characteristics, the classification will be as given in the "Instruction Categories" subsection.

If we classify instructions based on their computational characteristics, the classification will be as given in the "Instruction Groups" subsection.
5.1.1 Data Packing

If we classify the SIMD floating point instructions based on how the data to be manipulated is stored, we have two broad categories: instructions operating on packed data and instructions operating on scalar data. Hence, we have packed instructions and scalar instructions.

In the Pentium III instruction set, the packed and scalar instructions can be distinguished quite easily. Packed instructions have the "ps" suffix, while the scalar instructions have the "ss" suffix.

The new data type defined by SSE allows the storage of four single-precision floating-point numbers. A data element of the new data type can be represented as shown in Figure 3. An element "A" of the new data type can store four single-precision floating point numbers, here labeled a0, a1, a2 and a3.

Figure 3: New data type in SSE.

The arrangement in Figure 3 helps clarify the difference between packed and scalar instructions. SSE packed instructions operate on all four elements of the data type. Scalar instructions operate only on the least-significant element of the data type, leaving the remaining three elements unchanged.

A packed instruction is illustrated in Figure 4. As shown, there are two operands, A and B, each of the new data type defined by SSE. The result of the operation op will be stored in C. C has the same data type as A and B. The elements of A are a0, a1, a2 and a3, while the elements of B are b0, b1, b2 and b3. The result of executing the instruction op is shown in Figure 4.

Figure 4: Packed operation.

The instruction is applied to each of the elements of A and B, and the results are stored in corresponding elements of C. The instruction op is applied to all the elements at the same time, computing all four elements in the same unit of computation.
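
The packed behaviour can be modelled in plain C. This is a sketch of the semantics only: `vec4`, `binary_op`, `op_add`, `op_mul`, and `packed_op` are made-up names, and the loop stands in for hardware that handles all four elements together.

```c
/* Illustrative four-float value; not a real SSE type. */
typedef struct { float f[4]; } vec4;

/* Any two-operand floating-point operation ("op" in the text). */
typedef float (*binary_op)(float, float);

float op_add(float x, float y) { return x + y; }
float op_mul(float x, float y) { return x * y; }

/* C = A op B, applied to every element pair: c_i = a_i op b_i. */
vec4 packed_op(vec4 a, vec4 b, binary_op op) {
    vec4 c;
    for (int i = 0; i < 4; ++i)
        c.f[i] = op(a.f[i], b.f[i]);
    return c;
}
```

For A = (1, 2, 3, 4) and B = (10, 20, 30, 40), `packed_op` with `op_add` yields (11, 22, 33, 44).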

A scalar instruction using the same operands, A and B, is illustrated in Figure 5. The instruction op is applied only to the least-significant elements of A and B, and the result is stored in the least-significant element of C. The remaining three elements of C are copied unchanged from A. The scalar instruction still completes in a single unit of computation, but only one element is computed.

Figure 5: Scalar operation.
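
A plain-C model of the scalar case makes the contrast with the packed case concrete. As before, this is only a sketch of the semantics: `vec4`, `binary_op`, `op_add`, and `scalar_op` are hypothetical names. The upper three elements are taken from the first operand, which matches how the SSE scalar instructions treat their destination operand.

```c
/* Illustrative four-float value; not a real SSE type. */
typedef struct { float f[4]; } vec4;

/* Any two-operand floating-point operation ("op" in the text). */
typedef float (*binary_op)(float, float);

float op_add(float x, float y) { return x + y; }

/* c0 = a0 op b0; c1..c3 are copied from A unchanged. */
vec4 scalar_op(vec4 a, vec4 b, binary_op op) {
    vec4 c = a;                      /* start from A's four elements */
    c.f[0] = op(a.f[0], b.f[0]);     /* only element 0 is computed */
    return c;
}
```

For A = (1, 2, 3, 4) and B = (10, 20, 30, 40), `scalar_op` with `op_add` yields (11, 2, 3, 4).
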
5.1.2 Instruction Categories (collected according to behavioral characteristics)

Most of the packed instructions in SSE have a scalar equivalent. SSE instructions can also be classified, without respect to arrangement of data, into the following instruction categories:
computation instructions
branching instructions
cacheability instructions
data movement and ordering instructions
type conversion instructions, and
state management instructions
5.1.3 Instruction Groups (collected according to computational characteristics)

SSE instructions can also be classified into the following instruction groups:
arithmetic instructions
comparison instructions
logical instructions
shuffle instructions
conversion instructions
state management instructions
cacheability instructions
data management instructions, and
additional integer instructions
5.2 Benefits

The primary benefit of SSE is a reduction in the number of instructions executed for the given data set. Without SIMD and SSE, multiplying each of 400 floats by a number would require looping through the data set, executing the multiplication operator 400 times. With SIMD and SSE, 100 multiplications could perform the same task, as each multiplication can operate on four floats simultaneously.
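
The 400-float case can be sketched in C. The article itself does not show code, so everything here is an assumption: the function names are made up, and the SSE path uses the compiler intrinsics from `<xmmintrin.h>` (`_mm_set1_ps`, `_mm_loadu_ps`, `_mm_mul_ps`, `_mm_storeu_ps`) when the compiler defines `__SSE__`, falling back to a plain-C model otherwise. Each function returns how many multiply operations it issued.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE compiler intrinsics (assumed toolchain support) */
#endif

/* One multiplication per float; returns the number of multiply operations. */
size_t scale_one_at_a_time(float *data, size_t n, float k) {
    size_t mults = 0;
    for (size_t i = 0; i < n; ++i) {
        data[i] *= k;
        ++mults;
    }
    return mults;
}

/* One packed multiplication per four floats; n must be a multiple of 4. */
size_t scale_four_at_a_time(float *data, size_t n, float k) {
    size_t mults = 0;
    assert(n % 4 == 0);
#if defined(__SSE__)
    __m128 kv = _mm_set1_ps(k);                      /* k in all four elements */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);           /* load four floats */
        _mm_storeu_ps(data + i, _mm_mul_ps(v, kv));  /* multiply, store back */
        ++mults;
    }
#else
    /* Fallback without SSE: same result, modelled in plain C. */
    for (size_t i = 0; i < n; i += 4) {
        for (size_t j = 0; j < 4; ++j)
            data[i + j] *= k;
        ++mults;
    }
#endif
    return mults;
}
```

Called on an array of 400 floats, the first function reports 400 multiplications and the second reports 100, while both leave the same values in the array.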

The number of instructions when using SIMD and SSE is not exactly one-fourth of the non-SIMD case. Some extra instructions are required to rearrange the data into a format acceptable to the SIMD instructions. In practice, SIMD instructions should be able to roughly double the efficiency of such operations.
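
One common form of such rearrangement, offered here as an illustration rather than something the article prescribes, is splitting an array of (x, y, z) points into separate arrays of x, y, and z values so that four values of the same kind sit side by side. The names `point3` and `split_points` are hypothetical.

```c
#include <stddef.h>

/* A point as an application might naturally store it. */
typedef struct { float x, y, z; } point3;

/* Copy n points into three separate arrays so that consecutive x (or y,
   or z) values are contiguous and can be processed four at a time.
   These copies are extra work on top of the arithmetic itself. */
void split_points(const point3 *pts, size_t n,
                  float *xs, float *ys, float *zs) {
    for (size_t i = 0; i < n; ++i) {
        xs[i] = pts[i].x;
        ys[i] = pts[i].y;
        zs[i] = pts[i].z;
    }
}
```

The copying loop does no useful arithmetic, which is one reason the overall gain falls short of a factor of four.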

Microprocessor

Tuesday, March 11, 2008
