Microprocessor: Introduction to the Streaming SIMD Extensions in the Pentium III: Part II

1. Using SSE

Having described SSE, let us look at how to use it in applications.
1.1 Assembly Language

Traditionally, programmers wishing to exploit the advantages of a new processor have developed code in assembly language. This was necessary, as higher-level development tools supporting the new instructions only appeared well after the processor release.

This is also the case with the Pentium III. At present, only the Intel C/C++ compiler and the Microsoft Macro Assembler (Version 6.11d and above) are able to understand the new SSE instructions.

The usual trade-off applies here: programming a large and complex application in assembly language is difficult, but carefully written assembly code can be very fast.

With the SSE SDK (Software Development Kit), Intel provides two more programming mechanisms for using the SSE instructions: the intrinsics library, and a C++ class for the new data type defined by SSE. These mechanisms are easier to use than assembly language, particularly because programmers do not have to manage the SSE registers explicitly, and they make it much easier to develop large applications. This comes at the cost of application speed: code developed using these mechanisms is fast, but not as fast as corresponding code written in assembly language. Figure 1 shows the inverse relationship between application speed and ease of development among the three programming methods.

Figure 1: Application speed versus ease of development for different environments.
1.1.1 Example: Multiplication

Assume that the two 128-bit packed operands a and b, each holding four single-precision floating-point numbers, are stored in the registers xmm1 and xmm2 respectively. The result of the computation will be stored in the register xmm0. The code is embedded in a C program.
#include
...
_asm {
push esi;
push edi;
; a is loaded into xmm1
; b is loaded into xmm2
movaps xmm0, xmm1; copy a into xmm0 (mov cannot address the XMM registers)
mulps xmm0, xmm2;
; store result into c
pop edi;
pop esi;
}
...

Figure 2 presents a diagrammatic representation of the packed multiplication.

Figure 2: Packed Multiplication.
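
As a point of reference, the effect of the packed multiplication can be modelled in plain C. This is only a sketch of what mulps computes, not how the processor implements it; the function name mulps_reference is hypothetical and does not come from the SSE SDK.

```c
/* Scalar model of a packed multiplication: four independent
   single-precision products, one per element. The name
   mulps_reference is illustrative only. */
void mulps_reference(const float a[4], const float b[4], float c[4])
{
    for (int i = 0; i < 4; i++)
        c[i] = a[i] * b[i];   /* element i of a times element i of b */
}
```

The single mulps instruction replaces all four iterations of this loop.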
1.2 Intrinsics

The first additional development mechanism is the intrinsics library. The intrinsics library provides a C programming language interface to the SSE instructions. Each SSE instruction has a corresponding C function in the intrinsics library. For example, the assembly language mnemonic for the instruction to add two packed data types is addps. Correspondingly, the function in the intrinsics library to add two packed data types is _mm_add_ps. In addition to defining functions for the instructions, the intrinsics library also defines a data type (__m128) that is 128 bits long. It holds four single-precision floating-point numbers.

To use the intrinsics library, the file xmmintrin.h must be included.
1.2.1 Example: Multiplication

Assume that a and b are two 128-bit packed values. The result of the computation will be stored in a third 128-bit value, c. All three are of the data type __m128, the 128-bit type declared in the file xmmintrin.h. The function _mm_set_ps takes the most significant element as its first parameter and the least significant element as its last parameter.
#include <xmmintrin.h>
...
__m128 a, b, c;
a = _mm_set_ps(4, 3, 2, 1);
b = _mm_set_ps(4, 3, 2, 1);
c = _mm_set_ps(0, 0, 0, 0);
c = _mm_mul_ps(a, b);
...
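
To make the element order concrete, the fragment above can be wrapped in a small function that writes the result back to an ordinary float array. This is a minimal sketch: the function name packed_square_demo is hypothetical, and the use of _mm_storeu_ps (an unaligned store) is a choice made here so that the caller's array needs no special alignment.

```c
#include <xmmintrin.h>  /* __m128, _mm_set_ps, _mm_mul_ps, _mm_storeu_ps */

/* Multiplies (4, 3, 2, 1) by itself element by element and writes the
   four products to out[0..3], least significant element first.
   The name packed_square_demo is illustrative only. */
void packed_square_demo(float out[4])
{
    /* _mm_set_ps takes the most significant element first, so the
       least significant element of a and b is 1 and the most is 4. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 c = _mm_mul_ps(a, b);   /* packed multiply (mulps) */
    _mm_storeu_ps(out, c);         /* unaligned store (movups) */
}
```

After the call, out holds 1, 4, 9 and 16 in that order, because the last argument of _mm_set_ps lands in the lowest element.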
1.3 C++

The second additional development mechanism uses C++. The SSE SDK provides a C++ class, F32vec4, that defines an abstraction for the new 128-bit data type. All operations on the data type are encapsulated by the class. Internally, the class uses the functions of the intrinsics library.

To use the C++ class, the file fvec.h must be included.
1.3.1 Example: Multiplication

Again, assume a and b are two 128-bit numbers. The result of the computation will be stored in another 128-bit number, c. All the numbers are of the datatype F32vec4. The order of parameters for the constructor of F32vec4 is the same as for the function _mm_set_ps.
#include <fvec.h>
...
F32vec4 a(4, 3, 2, 1), b(4, 3, 2, 1), c(0, 0, 0, 0);
...
c = a * b;
...
1.4 Compiler Support

As mentioned earlier, only the Intel C/C++ compiler and the Microsoft Macro Assembler support the SSE instructions. The Intel compiler integrates into the Microsoft Visual Studio programming environment. The Visual Studio environment can be configured such that either the whole project or specific files from the project can be compiled using the Intel compiler.
2. SSE Instruction Details

Before we cover some simple examples to illustrate the usage of SSE instructions, we list the instructions defined in SSE, grouped by category.
Arithmetic Instructions
addps, addss
subps, subss
mulps, mulss
divps, divss
sqrtps, sqrtss
maxps, maxss
minps, minss

Logical Instructions
andps
andnps
orps
xorps

Compare Instructions
cmpps, cmpss
comiss
ucomiss

Shuffle Instructions
shufps
unpckhps
unpcklps

Conversion Instructions
cvtpi2ps, cvtsi2ss
cvtps2pi, cvtss2si

Data Movement Instructions
movaps
movups
movhps
movlps
movmskps
movss

State Management Instructions
ldmxcsr
fxsave
stmxcsr
fxrstor

Cacheability Control Instructions
maskmovq
movntq
movntps
prefetcht0, prefetcht1, prefetcht2, prefetchnta
sfence

Additional SIMD Integer Instructions
pextrw
pinsrw
pmaxub, pmaxsw
pminub, pminsw
pmovmskb
pmulhuw
pshufw
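
Many of the instructions above come in pairs: the ps form operates on all four packed elements, while the ss form operates only on the lowest element and passes the upper three elements of the first operand through unchanged. The following sketch illustrates the difference using the intrinsics _mm_add_ps and _mm_add_ss; the function name packed_vs_scalar_add and the input values are chosen here purely for illustration.

```c
#include <xmmintrin.h>  /* _mm_set_ps, _mm_add_ps, _mm_add_ss, _mm_storeu_ps */

/* Adds the same two operands with addps and with addss and stores
   both results, least significant element first.
   The name packed_vs_scalar_add is illustrative only. */
void packed_vs_scalar_add(float packed[4], float scalar[4])
{
    __m128 a = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f); /* 10, 20, 30, 40 */
    __m128 b = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     /*  1,  2,  3,  4 */

    /* addps: every element of a is added to the matching element of b. */
    _mm_storeu_ps(packed, _mm_add_ps(a, b));

    /* addss: only the lowest elements are added; the upper three
       elements of the result are copied from a. */
    _mm_storeu_ps(scalar, _mm_add_ss(a, b));
}
```

With these inputs, packed receives 11, 22, 33, 44 and scalar receives 11, 20, 30, 40.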
3. Examples

In this section, we present a few examples to illustrate the usage of the Pentium III SSE instructions. For each example, we will present three solutions, one each in assembly language, intrinsics and C++. Additional examples are presented in the additional examples section.
3.1 Packed Multiplication

We have already covered the multiplication of two packed numbers, while describing the different development mechanisms. The packed multiplication was illustrated using assembly language, the intrinsics library, and the C++ class.
3.2 Comparison Operation

Let us consider the case of comparison. Without SSE, we compare floating-point numbers one pair at a time using the less-than operator. Using SSE, we can compare four pairs of floating-point numbers in one instruction.

To compare 4 floats in C or C++, we would write a loop and put a comparison condition in the loop. We would then take action on the comparison result. The code for comparison is illustrated below.
float a[4], b[4];
int i, c[4];

// assume that a contains 4.5 6.7 2.3 and 1.2
// assume that b contains 4.3 6.9 2.0 and 1.5

for (i = 0; i < ...

Figure 3: Comparison operator.
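
A possible intrinsics version of the four-way comparison is sketched below. It is an illustration rather than the SDK's own example: the function name compare_less_than is hypothetical, and it assumes that the values listed above are stored in array order (a[0] = 4.5, b[0] = 4.3, and so on). The intrinsic _mm_cmplt_ps maps to cmpps and sets every bit of an element where the comparison is true; _mm_movemask_ps maps to movmskps and collects the four sign bits into an integer.

```c
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_cmplt_ps, _mm_movemask_ps */

/* Compares four pairs of floats at once. Bit i of the returned value
   is 1 when a[i] < b[i] and 0 otherwise.
   The name compare_less_than is illustrative only. */
int compare_less_than(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);        /* unaligned loads (movups) */
    __m128 vb = _mm_loadu_ps(b);
    __m128 lt = _mm_cmplt_ps(va, vb);   /* all ones where a < b, else all zeros */
    return _mm_movemask_ps(lt);         /* pack the four sign bits into bits 0..3 */
}
```

For the sample values, only the second and fourth pairs satisfy a < b, so the function returns binary 1010, that is, 10.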
3.3 Branch Removal

Usually, in applications, we have conditions like the one given below.
a = (a < ...
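
The full expression is not shown above, so the following sketch assumes a conditional of the common form r = (a < b) ? c : d, applied to four elements at once. The function name select_less_than and the operand names c and d are assumptions made for illustration. The idea is to turn the comparison result into a bit mask and combine the two candidate values with the logical instructions andps, andnps and orps, so that no branch is executed.

```c
#include <xmmintrin.h>  /* _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps, _mm_or_ps */

/* Branch-free form of: out[i] = (a[i] < b[i]) ? c[i] : d[i]
   The name select_less_than is illustrative only. */
void select_less_than(const float a[4], const float b[4],
                      const float c[4], const float d[4], float out[4])
{
    /* mask element is all ones where a < b, all zeros elsewhere */
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    __m128 take_c = _mm_and_ps(mask, _mm_loadu_ps(c));     /* mask & c    (andps)  */
    __m128 take_d = _mm_andnot_ps(mask, _mm_loadu_ps(d));  /* (~mask) & d (andnps) */

    _mm_storeu_ps(out, _mm_or_ps(take_c, take_d));         /* merge       (orps)   */
}
```

Each output element comes from c where the comparison holds and from d where it does not, and the same instruction sequence runs regardless of the data.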

Tuesday, March 11, 2008
