Tuesday, March 11, 2008

Introduction to the Streaming SIMD Extensions in the Pentium III: Part III


1. Data Swizzling

The speedup that the Pentium III SSE achieves on floating-point operations comes at a price. The data operated on by SSE instructions has to be stored in the new data type defined by SSE. If the application stores the data in its own format, the data has to be converted into the new data type before the SSE instructions can operate on it, and has to be converted back afterward.

This conversion of data from one format into another is termed "data swizzling."

This conversion takes time and machine cycles. If an application converts data from one format to another too often, the machine cycles saved by executing SSE instructions may well be lost. Hence, care is needed.
1.1 Data Organization

Usually, 3D applications store the coordinates of a point in one structure. When handling multiple points, applications use an array of structures, also called AoS. Typical geometric operations operate differently on the x, y and z coordinates of the point. The code given below lists the typical declaration used by applications processing 3D data. When handling large data sets, this structure amounts to an array of structures, as illustrated in Figure 1.
struct point {
float x, y, z;
};
...
point dataset[...];


Figure 1: Array of structures.

To exploit the advantages of SSE, it is better to operate on multiple points simultaneously. This is possible if we collect the x-, y- and z-coordinates of the points together, so that the application can process multiple x-, y- and z-coordinates separately. For this, the application must rearrange the data into either three separate arrays, or a structure of arrays with one array for each coordinate of the point. This arrangement is called the structure-of-arrays, or SoA, arrangement.

The code given below lists the declaration of the structure of arrays, while Figure 2 is the diagrammatic representation of the structure of arrays.
struct point {
float *x, *y, *z;
};



Figure 2: Structure of arrays.
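One possible swizzling routine, sketched below, copies an array-of-structures into a structure-of-arrays so that SSE can later load four consecutive x (or y, or z) values at a time. The names `points` and `aos_to_soa` are illustrative, not part of the SSE SDK.

```cpp
#include <cstddef>

struct point  { float x, y, z; };       // AoS element, as in the text
struct points { float *x, *y, *z; };    // SoA layout, as in the text

// Copy an array-of-structures into a structure-of-arrays.
void aos_to_soa(const point *src, points *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst->x[i] = src[i].x;
        dst->y[i] = src[i].y;
        dst->z[i] = src[i].z;
    }
}
```

Because this copy costs time, it pays off only when the swizzled data is reused across many SSE operations.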
2. Memory Issues
2.1 Alignment


Handling and manipulating simple variables of the new data type does not create problems. However, it is recommended that variables of the new data type be aligned to 16-byte boundaries. This alignment can be enforced either by setting the appropriate compiler flags or by explicitly using align commands in the program, during variable declaration.

A variable can be specified to be aligned to a 16-byte boundary using the __declspec compiler directive, as illustrated in the following example. The variable myVar will be aligned to a 16-byte boundary due to the align directive. It is not necessary to align the new data types to 16-byte boundaries, as the compiler aligns the data types when it comes across the new data type declarations. The alignment directive is issued as shown:
__declspec(align(16)) float myVar[4];
2.2 Dynamic Memory

The requirement that pointers to the new data types be aligned to 16-byte boundaries creates problems when allocating memory dynamically, or when accessing allocated arrays through a pointer.

When accessing arrays through pointers, we have to ensure that the pointer is aligned to a 16-byte boundary.

To allocate memory at run time we use either the malloc function or the new command. The default behaviour of both is that they do not align the pointer address to a 16-byte boundary. Hence, we have to either allocate memory and then adjust the pointer to a 16-byte boundary, or allocate the memory using the _mm_malloc function. The _mm_malloc function allocates a memory block that is aligned to a 16-byte boundary.

Just as malloc is paired with free, the _mm_malloc function is paired with _mm_free. Memory blocks allocated using _mm_malloc have to be freed using _mm_free.
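The "allocate and then adjust" approach can be sketched as follows. This is an illustration of what _mm_malloc and _mm_free do internally, not their actual implementation: over-allocate, round the pointer up to a 16-byte boundary, and stash the pointer malloc returned just below the aligned block so it can be freed later.

```cpp
#include <cstdlib>
#include <cstdint>

// Allocate n bytes, returning a pointer aligned to a 16-byte boundary.
void *aligned_malloc16(std::size_t n) {
    void *raw = std::malloc(n + 15 + sizeof(void *));
    if (!raw) return nullptr;
    std::uintptr_t p = (std::uintptr_t)raw + sizeof(void *);
    p = (p + 15) & ~(std::uintptr_t)15;   // round up to multiple of 16
    ((void **)p)[-1] = raw;               // remember what malloc returned
    return (void *)p;
}

// Free a block obtained from aligned_malloc16.
void aligned_free16(void *p) {
    if (p) std::free(((void **)p)[-1]);
}
```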
2.3 Custom Datatype

The restriction that pointers be aligned to 16-byte boundaries can be troublesome. It would be much better to be able to ignore the alignment of pointers.

When operating on 128-bit data types, it may be necessary to access the floats stored in the data type. In assembly language there is not much choice but to use assembly language constructs. Using C or C++ and the intrinsics library, however, the data will be stored in the data type __m128. In this data type, once the value is set, it is not possible to access the individual floating-point numbers directly. One way to access them is to transfer all floating-point numbers into an array of floats, change the values, and load the array of floats back into the data type. The second method is to cast the data type into a float array and then access the required element. The first method is time consuming, and the second method may cause problems if not used properly.

Defining a custom data type can overcome these problems. The custom data type is defined as a union of the data type (__m128) and an array of four floats. The declaration of the new data, called sse4 for now, is given below.
union sse4 {
__m128 m;
float f[4];
};

Using this data type, it is no longer necessary to align memory locations to 16-byte boundaries. When the compiler encounters the data type __m128, it aligns it to a 16-byte boundary. An added advantage of this data type is that the individual floating-point numbers stored in the 128-bit data can be accessed directly.
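A short usage sketch of the sse4 union: the same 16 bytes are viewed either as a __m128 or as four floats, so individual elements can be read and written directly.

```cpp
#include <xmmintrin.h>

// The sse4 union from the text: one __m128 overlaid with four floats.
union sse4 {
    __m128 m;
    float f[4];
};

// Set a value through the __m128 side, then tweak one float directly.
sse4 make_example() {
    sse4 v;
    v.m = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // f[0]=1 ... f[3]=4
    v.f[2] = 9.0f;                             // direct element access
    return v;
}
```

Note that _mm_set_ps takes the most-significant element first, so the value 1.0f lands in f[0].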
2.4 Detecting the CPU

As the usage of SSE depends on the presence of a Pentium III, it is important that applications be able to detect the Pentium III chip. This is done using the cpuid instruction.

For the cpuid instruction to work as desired, the eax register has to be set to the appropriate value. As we are interested in the processor feature flags, we need to set the eax register to 1 before invoking the cpuid instruction.

The source code to detect the presence of the Pentium III CPU is given below. To be able to compile the code, the file fvec.h has to be included.
BOOL CheckP3HW()
{
BOOL SSEHW = FALSE;
_asm {
// Move the number 1 into eax - this will move the
// feature bits into EDX when a CPUID is issued, that
// is, EDX will then hold the key to the cpuid
mov eax, 1

// Does this processor have SSE support?
cpuid

// Perform CPUID (puts processor feature info in EDX)
// Shift the bits in edx to the right by 26, thus bit 25
// (SSE bit) is now in CF bit in EFLAGS register.
shr edx,0x1A

// If CF is not set, jump over next instruction
jnc nocarryflag

// set the return value to 1 if the CF flag is set
mov [SSEHW], 1

nocarryflag:
}
return SSEHW;
}
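On modern GCC and Clang, the same CPUID-based feature test (bit 25 of EDX) is available without inline assembly, via a compiler builtin. This is compiler-specific and not part of the SSE SDK described here; it is shown only as a present-day alternative.

```cpp
// GCC/Clang builtin wrapper for the SSE feature check.
bool has_sse() {
    __builtin_cpu_init();                  // must run before the query
    return __builtin_cpu_supports("sse");  // CPUID feature bit, EDX bit 25
}
```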

The SSE SDK also has an SSE emulation mode that emulates the Pentium III and the SSE registers. The code given below can be used to detect this emulation. To be able to compile the code, the file fvec.h has to be included.
// Checking for SSE emulation support
BOOL CheckP3Emu()
{
BOOL SSEEmu = TRUE;
F32vec4 pNormal(1.0f, 2.0f, 3.0f, 4.0f);
F32vec4 pZero(0.0f);
// Checking for SSE HW emulation
__try {
_asm {
// Issue a move instruction that will cause exception
// w/out HW support emulation
movups xmm1, [pNormal]
// Issue a computational instruction that will cause
// exception w/out HW support emulation
divps xmm1, [pZero]
}
}
// If there's an exception, set emulation variable to false
__except(EXCEPTION_EXECUTE_HANDLER) {
SSEEmu = FALSE;
}
return SSEEmu;
}


3. Additional Examples

In this section, we present additional examples to illustrate the usage of the Streaming SIMD Extensions.
3.1 Array Manipulation

In this example, we take two arrays, each with 400 floats. Each pair of corresponding elements is multiplied, and the result of the multiplication is stored in a third array. The two arrays used as operands are named a and b, and the result is stored in array c. In all the sources given below, the following declaration is assumed
#include <xmmintrin.h>

#define ARRSIZE 400

__declspec(align(16)) float a[ARRSIZE], b[ARRSIZE], c[ARRSIZE];
3.1.1 Assembly Language
_asm {
push esi;
push edi;
lea edi, a;
lea esi, b;
lea edx, c;
mov ecx, 100;
mult_loop:
movaps xmm0, [edi];
movups xmm1, [esi];
mulps xmm0, xmm1;
movups [edx], xmm0;
add edi, 16;
add esi, 16;
add edx, 16;
dec ecx;
jnz mult_loop;
pop edi;
pop esi;
}
3.1.2 Intrinsics
__m128 m1, m2, m3;

for ( int i = 0; i < ARRSIZE; i += 4 ) {
m1 = _mm_loadu_ps(a+i);
m2 = _mm_loadu_ps(b+i);
m3 = _mm_mul_ps(m1, m2);
_mm_storeu_ps(c+i, m3);
}
3.1.3 C++
F32vec4 f1, f2, f3;

for ( int i = 0; i < ARRSIZE; i += 4 ) {
loadu(f1, a+i);
loadu(f2, b+i);
f3 = f1 * f2;
storeu(c+i, f3);
}
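Since the arrays are declared with 16-byte alignment, the aligned load/store intrinsics can be used throughout. The following is a self-contained sketch of the intrinsics variant (using C++11 alignas in place of the MSVC-specific __declspec):

```cpp
#include <xmmintrin.h>

#define ARRSIZE 400

// 16-byte-aligned arrays, so _mm_load_ps/_mm_store_ps are safe.
alignas(16) static float a[ARRSIZE], b[ARRSIZE], c[ARRSIZE];

// Multiply a and b element-wise into c, four floats per iteration.
void multiply_arrays() {
    for (int i = 0; i < ARRSIZE; i += 4) {
        __m128 m1 = _mm_load_ps(a + i);   // aligned load
        __m128 m2 = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_mul_ps(m1, m2));
    }
}
```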
3.2 Vector for 3D

This example presents a vector in 3D. The vector is implemented as a class. The functionality of the class is implemented using the intrinsics library.

The class declaration is given below.
union sse4 {
__m128 m;
float f[4];
};

class sVector3 {
protected:
sse4 val;
public:
sVector3(float, float, float);
float& operator [](int);
sVector3& operator +=(const sVector3&);
float length() const;
friend float dot(const sVector3&, const sVector3&);
};

The class implementation is given below.
sVector3::sVector3(float x, float y, float z) {
val.m = _mm_set_ps(0, z, y, x);
}
float& sVector3::operator [](int i) {
return val.f[i];
}
sVector3& sVector3::operator +=(const sVector3& v) {
val.m = _mm_add_ps(val.m, v.val.m);
return *this;
}
float sVector3::length() const {
sse4 m1;
m1.m = _mm_mul_ps(val.m, val.m);
return sqrtf(m1.f[0] + m1.f[1] + m1.f[2]);
}
float dot(const sVector3& v1, const sVector3& v2) {
sVector3 v(v1);
v.val.m = _mm_mul_ps(v.val.m, v2.val.m);
return v.val.f[0] + v.val.f[1] + v.val.f[2];
}
3.3 4x4 Matrix

This example presents a 4x4 matrix. The matrix is implemented as a class. The functionality of the class is implemented using the intrinsics library.

The class declaration is given below.
float const sEPSILON = 1.0e-10f;

union sse16 {
__m128 m[4];
float f[4][4];
};

class sMatrix4 {
protected:
sse16 val;
sse4 sFuzzy;
public:
sMatrix4(float*);
float& operator()(int, int);
sMatrix4& operator +=(const sMatrix4&);
bool operator ==(const sMatrix4&) const;
sVector4 operator *(const sVector4&) const;
private:
float RCD(const sMatrix4& B, int i, int j) const;
};

The class implementation is given below.
sMatrix4::sMatrix4(float* fv) {
val.m[0] = _mm_set_ps(fv[3], fv[2], fv[1], fv[0]);
val.m[1] = _mm_set_ps(fv[7], fv[6], fv[5], fv[4]);
val.m[2] = _mm_set_ps(fv[11], fv[10], fv[9], fv[8]);
val.m[3] = _mm_set_ps(fv[15], fv[14], fv[13], fv[12]);
float f = sEPSILON;
sFuzzy.m = _mm_set_ps(f, f, f, f);
}
float& sMatrix4::operator()(int i, int j) {
return val.f[i][j];
}
sMatrix4& sMatrix4::operator +=(const sMatrix4& M) {
val.m[0] = _mm_add_ps(val.m[0], M.val.m[0]);
val.m[1] = _mm_add_ps(val.m[1], M.val.m[1]);
val.m[2] = _mm_add_ps(val.m[2], M.val.m[2]);
val.m[3] = _mm_add_ps(val.m[3], M.val.m[3]);
return *this;
}
bool sMatrix4::operator ==(const sMatrix4& M) const {
int res[4];
res[0] = res[1] = res[2] = res[3] = 0;
res[0] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[0], M.val.m[0]),
_mm_min_ps(val.m[0], M.val.m[0])), sFuzzy.m));
res[1] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[1], M.val.m[1]),
_mm_min_ps(val.m[1], M.val.m[1])), sFuzzy.m));
res[2] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[2], M.val.m[2]),
_mm_min_ps(val.m[2], M.val.m[2])), sFuzzy.m));
res[3] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[3], M.val.m[3]),
_mm_min_ps(val.m[3], M.val.m[3])), sFuzzy.m));
return (15 == res[0]) && (15 == res[1])
&& (15 == res[2]) && (15 == res[3]);
}
sVector4 sMatrix4::operator *(const sVector4& v) const {
return sVector4(
val.f[0][0] * v[0] + val.f[0][1] * v[1]
+ val.f[0][2] * v[2] + val.f[0][3] * v[3],
val.f[1][0] * v[0] + val.f[1][1] * v[1]
+ val.f[1][2] * v[2] + val.f[1][3] * v[3],
val.f[2][0] * v[0] + val.f[2][1] * v[1]
+ val.f[2][2] * v[2] + val.f[2][3] * v[3],
val.f[3][0] * v[0] + val.f[3][1] * v[1]
+ val.f[3][2] * v[2] + val.f[3][3] * v[3]);
}
float sMatrix4::RCD(const sMatrix4& B, int i, int j) const {
return val.f[i][0] * B.val.f[0][j] + val.f[i][1] * B.val.f[1][j]
+ val.f[i][2] * B.val.f[2][j] + val.f[i][3] * B.val.f[3][j];
}



Introduction to the Streaming SIMD Extensions in the Pentium III: Part II




1. Using SSE

Having described SSE, let's look at how we can use it in applications.
1.1 Assembly Language

Traditionally, programmers wishing to exploit the advantages of a new processor have developed code in assembly language. This was necessary, as higher-level development tools supporting the new instructions only appeared well after the processor release.

This is also the case with the Pentium III. At present, only the Intel C/C++ compiler and the Microsoft Macro Assembler (Version 6.11d and above) are able to understand the new SSE instructions.

The usual trade-off applies here: Programming a large and complex application in assembly language is difficult, but the resulting application will be very fast.

With the SSE SDK (Software Developers Kit), Intel provides two more programming mechanisms for using the SSE instructions: the intrinsics library, and a C++ class for the new data type defined by SSE. These mechanisms are easier to use than assembly language, particularly because programmers do not have to explicitly manage the SSE registers, and they especially make it easier to develop large applications. This comes at the cost of application speed. Code developed using these mechanisms is fast, but not as fast as corresponding code written in assembly language. Figure 1 shows the inverse relationship between application speed and ease of development among the three programming methods.




Figure 1: Application speed versus ease of development for different environments.
1.1.1 Example: Multiplication

Assume that the two 128-bit numbers a and b are stored in the registers xmm1 and xmm2 respectively. The result of the computation will be stored in the register xmm0. The code is embedded in a C program.
#include <xmmintrin.h>
...
_asm {
push esi;
push edi;
; a is loaded into xmm1
; b is loaded into xmm2
movaps xmm0, xmm1;
mulps xmm0, xmm2;
; store result into c
pop edi;
pop esi;
}
...

Figure 2 presents a diagrammatic representation of the packed multiplication.




Figure 2: Packed Multiplication.
1.2 Intrinsics

The first additional development mechanism is the intrinsics library. The intrinsics library provides a C programming language interface to the SSE instructions. All the SSE instructions have a corresponding C function in the intrinsics library. For example, the assembly language mnemonic for the instruction to add two packed data types is addps. Correspondingly, the function in the intrinsics library to add two packed data types is _mm_add_ps. In addition to defining functions for all instructions, the intrinsics library also defines a data type (__m128) that is 128 bits long. It allows the storage of four floating-point numbers.

To use the intrinsics library, the file xmmintrin.h must be included.
1.2.1 Example: Multiplication

Assume that a and b are two 128-bit numbers. The result of the computation will be stored in another 128-bit number, c. All the numbers are of the data type __m128, which represents the 128-bit number and is declared in the file xmmintrin.h. The function _mm_set_ps takes the most-significant number as its first parameter and the least-significant number as its last parameter.
#include <xmmintrin.h>
...
__m128 a, b, c;
a = _mm_set_ps(4, 3, 2, 1);
b = _mm_set_ps(4, 3, 2, 1);
c = _mm_set_ps(0, 0, 0, 0);
c = _mm_mul_ps(a, b);
...
1.3 C++

The second additional development mechanism uses C++. The SSE SDK provides a C++ class, F32vec4, that defines an abstraction for the new 128-bit data type. All operations on the data type are encapsulated by the class. Internally, the class uses the functions of the intrinsics library.

To use the C++ class, the file fvec.h must be included.
1.3.1 Example: Multiplication


Again, assume a and b are two 128-bit numbers. The result of the computation will be stored in another 128-bit number, c. All the numbers are of the datatype F32vec4. The order of parameters for the constructor of F32vec4 is the same as for the function _mm_set_ps.
#include <fvec.h>
...
F32vec4 a(4, 3, 2, 1), b(4, 3, 2, 1), c(0, 0, 0, 0);
...
c = a * b;
...
1.4 Compiler Support

As mentioned earlier, only the Intel C/C++ compiler and the Microsoft Macro Assembler support the SSE instructions. The Intel compiler integrates into the Microsoft Visual Studio programming environment. The Visual Studio environment can be configured such that either the whole project or specific files from the project can be compiled using the Intel compiler.
2. SSE Instruction Details

Before we cover some simple examples to illustrate the usage of SSE instructions, we will list all instructions defined in SSE.
Arithmetic Instructions
addps, addss
subps, subss
mulps, mulss
divps, divss
sqrtps, sqrtss
maxps, maxss
minps, minss

Logical Instructions
andps
andnps
orps
xorps

Compare Instructions
cmpps, cmpss
comiss
ucomiss

Shuffle Instructions
shufps
unpckhps
unpcklps

Conversion Instructions
cvtpi2ps, cvtpi2ss
cvtps2pi, cvtss2si

Data Movement Instructions
movaps
movups
movhps
movlps
movmskps
movss

State Management Instructions
ldmxcsr
fxsave
stmxcsr
fxrstor

Cacheability Control Instructions
maskmovq
movntq
movntps
prefetch
sfence

Additional SIMD Integer Instructions
pextrw
pinsrw
pmaxub, pmaxsw
pminub, pminsw
pmovmskb
pmulhuw
pshufw
3. Examples

In this section, we present a few examples to illustrate the usage of the Pentium III SSE instructions. For each example, we will present three solutions, one each in assembly language, intrinsics and C++. Additional examples are presented in the additional examples section.
3.1 Packed Multiplication

We have already covered the multiplication of two packed numbers, while describing the different development mechanisms. The packed multiplication was illustrated using assembly language, the intrinsics library, and the C++ class.
3.2 Comparison Operation

Let us consider the case of comparison. Without SSE, we compare floating-point numbers using the less than operator. Using SSE, we can compare 4 floating point numbers in one instruction.

To compare 4 floats in C or C++, we would write a loop and put a comparison condition in the loop. We would then take action on the comparison result. The code for comparison is illustrated below.
float a[4], b[4];
int i, c[4];

// assume that a contains 4.5 6.7 2.3 and 1.2
// assume that b contains 4.3 6.9 2.0 and 1.5

for (i = 0; i < 4; i++)
c[i] = (a[i] < b[i]) ? 1 : 0;


Figure 3: Comparison operator.
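With the SSE intrinsics, all four comparisons happen in one instruction: _mm_cmplt_ps produces a per-element all-ones/all-zero mask, and _mm_movemask_ps packs the four sign bits into an integer, with bit i set when element i of a is less than element i of b. A small sketch:

```cpp
#include <xmmintrin.h>

// Compare four float pairs at once; returns a 4-bit result mask.
int compare_lt_mask(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}
```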
3.3 Branch Removal

Usually, in applications, we have conditions like the one given below.
a = (a < b) ? c : d;
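Conditional per-element assignments of this kind can be made branch-free with the SSE logical instructions: a compare builds a mask, andps and andnps pick from the two candidate values, and orps merges them. A sketch of the technique:

```cpp
#include <xmmintrin.h>

// Branch-free select: result[i] = (a[i] < b[i]) ? c[i] : d[i].
__m128 select_lt(__m128 a, __m128 b, __m128 c, __m128 d) {
    __m128 mask = _mm_cmplt_ps(a, b);          // all-ones where a < b
    return _mm_or_ps(_mm_and_ps(mask, c),      // take c where mask set
                     _mm_andnot_ps(mask, d));  // take d elsewhere
}
```

Removing the branch keeps all four elements in one register and avoids mispredicted jumps inside the loop.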

Introduction to the Streaming SIMD Extensions in the Pentium III: Part I





Abstract

With the launch of the Intel Pentium III processor, many new features are available for application developers. Using these features, application developers can create better content for the end-user.

In addition to being faster than the Pentium II and the Pentium II Xeon, the Pentium III and the Pentium III Xeon processors have many new features, including a unique processor ID and new processor instructions. These new instructions do for the Pentium II what MMX did for the Pentium.

In this article, we introduce you to the Pentium III and describe in brief its features, concentrating on the details of the new instructions.
1. Introduction to the Pentium III

In February 1999, Intel unveiled its latest processor, the Pentium III. As with each new processor, the Pentium III has a host of new features in addition to the increase in clock speed. Previous processor releases from Intel have adhered relatively closely to a popular corollary of Moore's Law, which holds that processor speed doubles every 18 months. The Pentium III does not double the speed of the Pentium II. It runs in the 450- to 550-MHz range, while the Pentium II and Pentium II Xeon ran at 333 MHz to 400 MHz. Though the increase in clock speed is more modest than previous gains, the Pentium III more than makes up for this with increased functionality.

A Pentium III is essentially a Pentium II running at higher speeds, with the addition of a new set of instructions: Streaming SIMD Extensions, or SSE. Though SSE adds new features, existing applications are not affected. The Pentium III architecture is compatible with the Pentium II's IA-32.

If the Pentium III does not double the MHz provided by the Pentium II, why should we go for the Pentium III at all?
2. New Features of the Pentium III

The Pentium III adds two interesting and useful features: the processor serial number and SSE. The CPU serial number has been at the center of a debate about privacy. This article will abstain from the debate, and focus instead on the new SIMD instructions.

SSE has an acronym embedded in it: SIMD, which stands for Single Instruction Multiple Data. Usually, processors process one data element in one instruction, a processing style called Single Instruction Single Data, or SISD. In contrast, processors having the SIMD capability process more than one data element in one instruction.
3. MMX versus SSE

MMX and SSE, both of which are instruction sets that have been added to existing architectures, share the concept of SIMD, but they differ in the data types they handle, and in the way they are supported in the processor.

MMX instructions are SIMD for integers, while SSE instructions are SIMD for single-precision floating-point numbers. MMX instructions operate on packed integers in 64-bit registers — for example, two 32-bit integers at a time — while SSE instructions operate on four 32-bit floats simultaneously.

A major difference between MMX and SSE is that no new registers were defined for MMX, while eight new registers have been defined for SSE. Each of the registers for SSE is 128 bits long and can hold four single-precision floating-point numbers (each being 32 bits long). The arrangement of the floating-point numbers in the new data type handled by SSE is illustrated in Figure 1.






Figure 1: Arrangement of numbers in the new data type.

The immediate question is: Where did the registers for MMX come from? The MMX registers were allocated out of the floating-point registers of the floating-point unit. A floating-point register is 80 bits long, of which 64 bits were used for an MMX register. A limitation of this architecture is that an application cannot execute MMX instructions and perform floating-point operations simultaneously. Additionally, a large number of processor clock cycles are needed to change the state of executing MMX instructions to the state of executing floating-point operations and vice versa. SSE does not have such a restriction. Separate registers have been defined for SSE. Hence, applications can execute SIMD integer (MMX) and SIMD floating-point (SSE) instructions simultaneously. Applications can also execute non-SIMD floating-point and SIMD floating-point instructions simultaneously.

The arrangement of the registers in MMX and SSE is illustrated in Figure 2. Figure 2(a) illustrates the mutually exclusive floating-point and MMX registers, while Figure 2(b) illustrates the SSE registers.






Figure 2: Registers in MMX and SSE.

MMX and SSE have one more similarity: Both have eight registers. MMX registers are named mm0 through mm7, while SSE registers are named xmm0 through xmm7.
4. Application Areas

The Pentium III SSE instructions allow for SIMD operations on four single-precision floating-point numbers in one instruction. Therefore, applications that utilize floating-point calculations stand to benefit the most from the usage of SSE. Applications related to 3D graphics, in particular, should see a substantial benefit. In fact, SSE was created specifically for 3D. Games and other applications that use a 3D back-end to display 2D or 2.5D, as well as applications that use vector graphics at the back end, also stand to benefit.
4.1 The Case of 3D

3D graphics typically consists of manipulating large sets of floating-point numbers used to specify the position of a point in 3D space. Manipulation involves performing floating-point calculations on the data set. With the help of SSE instructions, the application will be able to process more data per instruction. The resulting speed increase translates into a better experience for the user.

Application developers can also add more data or more complex calculations to create better effects and generate a richer experience for the user.

Some of the commonly used operations in 3D processing that stand to benefit from usage of SSE instructions are matrix multiplication, matrix transposition, matrix-matrix operations like addition, subtraction, and multiplication, matrix-vector multiplication, vector normalization, vector dot product, and lighting calculations.
5. Streaming SIMD Extensions

SSE adds 70 new instructions to the processor. With the new instructions, a new status/control word has been added. SSE requires support from the operating system, which can save and restore the processor state as required. At present, the only operating systems that support SSE are Microsoft's Windows 98 and Windows 2000 operating systems. SSE defines new instructions, new data types and instruction categories.

Not all of the new instructions are for SIMD floating point. Of the 70, 50 are SIMD for floating point, 12 are for SIMD integer, and the remaining eight are cacheability instructions. In this article, we will concentrate on the 50 instructions provided for SIMD on single-precision floating-point numbers.
5.1 Classification

The new SIMD floating-point instructions in the Pentium III can be classified in various ways, based on different criteria.

If we classify instructions based on the arrangement of the operands of the instructions, the classification will be as given in the "Data Packing" subsection.

If we classify instructions based on their behavioral characteristics, the classification will be as given in the "Instruction Categories" subsection.

If we classify instructions based on their computational characteristics, the classification will be as given in the "Instruction Groups" subsection.
5.1.1 Data Packing

If we classify the SIMD floating point instructions based on how the data to be manipulated is stored, we have two broad categories: instructions operating on packed data and instructions operating on scalar data. Hence, we have packed instructions and scalar instructions.

In the Pentium III instruction set, the packed and scalar instructions can be distinguished quite easily. Packed instructions have the "ps" suffix, while the scalar instructions have the "ss" suffix.

The new data type defined by SSE allows the storage of four single-precision floating-point numbers. A data element of the new data type can be represented as shown in Figure 3. An element "A" of the new data type can store four single-precision floating point numbers, here labeled a0, a1, a2 and a3.




Figure 3: New data type in SSE.

The arrangement of Figure 3 helps clarify the difference between packed and scalar instructions. SSE packed instructions operate on all four elements of the data type. Scalar instructions operate only on the least-significant element of the data type, leaving the remaining three elements unchanged.

A packed instruction is illustrated in Figure 4. As shown, there are two operands, A and B, each of the new data type defined by SSE. The result of the operation op will be stored in C. C has the same data type as A and B. The elements of A are a0, a1, a2 and a3, while the elements of B are b0, b1, b2 and b3. The result of executing the instruction op is shown in Figure 4.





Figure 4: Packed operation.

The instruction is applied to each of the elements of A and B, and the results are stored in corresponding elements of C. The instruction op is applied to all the elements at the same time, computing all four elements in the same unit of computation.

A scalar instruction using the same operands, A and B, is illustrated in Figure 5. The instruction is applied only to the least-significant elements of A and B, and the result is stored in the least-significant element of C. The remaining three elements of C are carried over unchanged from A.





Figure 5: Scalar operation.
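The packed/scalar difference is easy to see with the intrinsics: _mm_add_ps (addps) sums all four element pairs, while _mm_add_ss (addss) sums only the lowest pair and copies the upper three elements from the first operand. A small sketch:

```cpp
#include <xmmintrin.h>

union quad { __m128 m; float f[4]; };

// Packed add: all four element pairs are summed.
quad packed_add(__m128 a, __m128 b) {
    quad r; r.m = _mm_add_ps(a, b); return r;
}

// Scalar add: only the lowest elements are summed; the upper three
// elements of the result come from the first operand.
quad scalar_add(__m128 a, __m128 b) {
    quad r; r.m = _mm_add_ss(a, b); return r;
}
```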
5.1.2 Instruction Categories (collected according to behavioral characteristics)

Most of the packed instructions in SSE have a scalar equivalent. SSE instructions can also be classified, without respect to arrangement of data, into the following instruction categories:
computation instructions
branching instructions
cacheability instructions
data movement and ordering instructions
type conversion instructions, and
state management instructions
5.1.3 Instruction Groups (collected according to computational characteristics)

SSE instructions can also be classified into the following instruction groups:
arithmetic instructions
comparison instructions
logical instructions
shuffle instructions
conversion instructions
state management instructions
cacheability instructions
data management instructions, and
additional integer instructions
5.2 Benefits

The primary benefit of SSE is a reduction in the number of instructions executed for the given data set. Without SIMD and SSE, multiplying each of 400 floats by a number would require looping through the data set, executing the multiplication operator 400 times. With SIMD and SSE, 100 multiplications could perform the same task, as each multiplication can operate on four floats simultaneously.

The number of instructions when using SIMD and SSE is not an exact one-fourth of the non-SIMD case. Some instructions will be required to rearrange the data so that it is in a format acceptable to the SIMD instructions. In practice, SIMD instructions should be able to roughly double the efficiency of the operations.