1. Data Swizzling
The speedup that the Pentium III SSE achieves on floating-point operations comes at a price. The data operated on by SSE instructions has to be stored in the new data type defined by SSE. If the application stores the data in its own format, the data has to be converted into the new data type before the SSE instructions can operate on it, and has to be converted back afterward.
This conversion of data from one format into another is termed "data swizzling."
This conversion takes time and machine cycles. If an application converts data from one format to another too often, the machine cycles saved by executing SSE instructions may well be lost. Hence, care is needed.
1.1 Data Organization
Usually, 3D applications store the coordinates of a point in one structure. When handling multiple points, applications use an array of structures, also called AoS. Typical geometric operations treat the x, y and z coordinates of a point differently. The code given below lists the typical declaration used by applications processing 3D data. When handling large data sets, this structure amounts to an array of structures, as illustrated in figure 1.
struct point {
float x, y, z;
};
...
point dataset[...];
Figure 1: Array of structures.
To exploit the advantages of SSE, it would be better to operate on multiple points simultaneously. This is possible if we collect the x-, y- and z-coordinates of the points together, so that the application can process multiple x-, y- and z-coordinates separately. For this, the application must rearrange the data into either three separate arrays, or a structure of arrays with one array for each coordinate of the point. This arrangement is called the SoA arrangement.
The code given below lists the declaration of the structure of arrays, while figure 2 is the diagrammatic representation of the structure of arrays.
struct point {
float *x, *y, *z;
};
Figure 2: Structure of arrays.
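The rearrangement itself can be written as a plain loop. The following sketch (the function name swizzle is an illustrative assumption, not part of any SSE API) gathers the coordinates of an AoS data set into three separate arrays:

```cpp
#include <cstddef>

struct point { float x, y, z; };

// Illustrative AoS -> SoA swizzle: gather each coordinate of the
// points into its own contiguous array, so that SSE code can later
// process four x-, y- or z-values at a time.
void swizzle(const point* aos, float* xs, float* ys, float* zs, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        xs[i] = aos[i].x;   // all x-coordinates end up contiguous
        ys[i] = aos[i].y;   // likewise for y
        zs[i] = aos[i].z;   // and z
    }
}
```

Note that this loop is exactly the conversion cost discussed above: if it runs on every frame, it can eat into the cycles the SSE code saves.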
2. Memory Issues
2.1 Alignment
Handling and manipulating simple variables of the new data type does not create problems. However, it is recommended that variables of the new data type be aligned to 16-byte boundaries. This alignment can be enforced either by setting the appropriate compiler flags or by explicitly using align commands in the program, during variable declaration.
A variable can be aligned to a 16-byte boundary using the __declspec compiler directive, as illustrated in the following example. The array myVar will be aligned to a 16-byte boundary due to the align directive. Variables of the new data type itself need no explicit directive, as the compiler aligns them automatically when it comes across their declarations. The alignment directive is issued as shown:
__declspec(align(16)) float myVar[4];
2.2 Dynamic Memory
The requirement that pointers used to access the new data types be aligned to 16-byte boundaries creates problems when allocating memory dynamically, or when accessing allocated arrays through a pointer.
When accessing arrays through pointers, we have to ensure that the pointer is aligned to a 16-byte boundary.
To allocate memory at run time we use either the malloc function or the new operator. By default, neither aligns the returned address to a 16-byte boundary. Hence, we have to either allocate memory and then adjust the pointer to a 16-byte boundary ourselves, or allocate the memory using the _mm_malloc function, which allocates a memory block aligned to the boundary specified in its second argument.
Just as malloc has a free, the _mm_malloc function has the function _mm_free. Memory blocks allocated using _mm_malloc have to be freed using _mm_free.
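As a sketch, aligned allocation and release might be wrapped as follows (the helper names alloc_sse_floats and free_sse_floats are illustrative, not part of the SSE library):

```cpp
#include <xmmintrin.h>  // _mm_malloc, _mm_free
#include <cstddef>

// Allocate n floats on a 16-byte boundary. The second argument of
// _mm_malloc is the requested alignment in bytes.
float* alloc_sse_floats(std::size_t n)
{
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

// Blocks obtained from _mm_malloc must be released with _mm_free,
// never with free or delete.
void free_sse_floats(float* p)
{
    _mm_free(p);
}
```

The returned pointer can then be used directly with aligned loads such as movaps or _mm_load_ps.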
2.3 Custom Datatype
The restriction that pointers be aligned to 16-byte boundaries can be troublesome. It would be much better to be able to ignore the alignment of pointers.
When operating on the 128-bit data type, it may be necessary to access the individual floats stored in it. In assembly language there is little choice but to use assembly language constructs. Using C or C++ and the intrinsics library, however, the data is stored in the data type __m128. Once a value is set in this data type, it is not possible to access the individual floating-point numbers directly. One way to access them is to transfer all the floating-point numbers into an array of floats, change the values and load the array back into the data type. The second method is to cast the data type into a float array and then access the required element. The first method is time consuming, and the second may cause problems if not used properly.
Defining a custom data type can overcome these problems. The custom data type is defined as a union of the data type (__m128) and an array of four floats. The declaration of the new data, called sse4 for now, is given below.
union sse4 {
__m128 m;
float f[4];
};
Using this data type, it is no longer necessary to align memory locations to 16-byte boundaries explicitly. When the compiler encounters the data type __m128, it aligns it to a 16-byte boundary. An added advantage of this data type is that the individual floating-point numbers stored in the 128-bit data can be accessed directly.
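A short sketch of the union in use (the function name third_component is illustrative):

```cpp
#include <xmmintrin.h>

union sse4 {
    __m128 m;
    float  f[4];
};

// Set a value through the __m128 member, then read and modify the
// individual floats through the array member of the union.
float third_component()
{
    sse4 v;
    // _mm_set_ps takes its arguments high lane first,
    // so f[0]=1, f[1]=2, f[2]=3, f[3]=4.
    v.m = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    v.f[2] += 0.5f;   // direct access to one float
    return v.f[2];
}
```

Reading a union member other than the one last written is compiler-specific behaviour in strict C++, but it is the idiom this kind of SSE code relies on and mainstream compilers support it.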
2.4 Detecting the CPU
As the usage of SSE depends on the presence of a Pentium III, it is important that applications be able to detect the Pentium III chip. This is done using the cpuid instruction.
For the cpuid instruction to work as desired, the eax register has to be set to the appropriate value. As we are interested only in the CPU ID, we need to set the eax register to 1 before invoking the cpuid instruction.
The source code to detect the presence of the Pentium III CPU is given below. To be able to compile the code, a header defining the BOOL type (such as windows.h) has to be included.
BOOL CheckP3HW()
{
BOOL SSEHW = FALSE;
_asm {
// Move the number 1 into eax - this will move the
// feature bits into EDX when a CPUID is issued, that
// is, EDX will then hold the key to the cpuid
mov eax, 1
// Does this processor have SSE support?
cpuid
// Perform CPUID (puts processor feature info in EDX)
// Shift the bits in edx to the right by 26, thus bit 25
// (SSE bit) is now in CF bit in EFLAGS register.
shr edx,0x1A
// If CF is not set, jump over next instruction
jnc nocarryflag
// set the return value to 1 if the CF flag is set
mov [SSEHW], 1
nocarryflag:
}
return SSEHW;
}
The SSE SDK also has an SSE emulation mode that emulates the Pentium III and the SSE registers. The code given below can be used to detect this emulation. To be able to compile the code, the file fvec.h has to be included.
// Checking for SSE emulation support
BOOL CheckP3Emu()
{
BOOL SSEEmu = TRUE;
F32vec4 pNormal(1.0f, 2.0f, 3.0f, 4.0f);
F32vec4 pZero(0.0f);
// Checking for SSE HW emulation
__try {
_asm {
// Issue a move instruction that will cause exception
// w/out HW support emulation
movups xmm1, [pNormal]
// Issue a computational instruction that will cause
// exception w/out HW support emulation
divps xmm1, [pZero]
}
}
// If there's an exception, set emulation variable to false
__except(EXCEPTION_EXECUTE_HANDLER) {
SSEEmu = FALSE;
}
return SSEEmu;
}
3. Additional Examples
In this section, we present additional examples to illustrate the usage of the Streaming SIMD Extensions.
3.1 Array Manipulation
In this example, we take two arrays, each with 400 floats. A multiplication operation is performed on each pair of array elements, and the result is stored in a third array. The two operand arrays are named a and b, and the result is stored in array c. In all the sources given below, the following declaration is assumed:
#include <xmmintrin.h>
#define ARRSIZE 400
__declspec(align(16)) float a[ARRSIZE], b[ARRSIZE], c[ARRSIZE];
3.1.1 Assembly Language
_asm {
push esi;
push edi;
lea edi, a;
lea esi, b;
lea edx, c;
mov ecx, 100;      // 400 floats, 4 per iteration
mult_loop:
movaps xmm0, [edi];
movaps xmm1, [esi];
mulps xmm0, xmm1;
movaps [edx], xmm0;
add edi, 16;
add esi, 16;
add edx, 16;
dec ecx;
jnz mult_loop;
pop edi;
pop esi;
}
3.1.2 Intrinsics
__m128 m1, m2, m3;
for ( int i = 0; i < ARRSIZE; i += 4 ) {
m1 = _mm_loadu_ps(a+i);
m2 = _mm_loadu_ps(b+i);
m3 = _mm_mul_ps(m1, m2);
_mm_storeu_ps(c+i, m3);
}
3.1.3 C++
F32vec4 f1, f2, f3;
for ( int i = 0; i < ARRSIZE; i += 4 ) {
loadu(f1, a+i);
loadu(f2, b+i);
f3 = f1 * f2;
storeu(c+i, f3);
}
3.2 Vector for 3D
This example presents a vector in 3D. The vector is implemented as a class. The functionality of the class is implemented using the intrinsics library.
The class declaration is given below.
union sse4 {
__m128 m;
float f[4];
};
class sVector3 {
protected:
sse4 val;
public:
sVector3(float, float, float);
float& operator [](int);
sVector3& operator +=(const sVector3&);
float length() const;
friend float dot(const sVector3&, const sVector3&);
};
The class implementation is given below.
sVector3::sVector3(float x, float y, float z) {
val.m = _mm_set_ps(0, z, y, x);
}
float& sVector3::operator [](int i) {
return val.f[i];
}
sVector3& sVector3::operator +=(const sVector3& v) {
val.m = _mm_add_ps(val.m, v.val.m);
return *this;
}
float sVector3::length() const {
sse4 m1;
m1.m = _mm_mul_ps(val.m, val.m);
return sqrtf(m1.f[0] + m1.f[1] + m1.f[2]);
}
float dot(const sVector3& v1, const sVector3& v2) {
sVector3 v(v1);
v.val.m = _mm_mul_ps(v.val.m, v2.val.m);
return v.val.f[0] + v.val.f[1] + v.val.f[2];
}
3.3 4x4 Matrix
This example presents a 4x4 matrix. The matrix is implemented as a class. The functionality of the class is implemented using the intrinsics library.
The class declaration is given below.
float const sEPSILON = 1.0e-10f;
union sse16 {
__m128 m[4];
float f[4][4];
};
class sMatrix4 {
protected:
sse16 val;
sse4 sFuzzy;
public:
sMatrix4(float*);
float& operator()(int, int);
sMatrix4& operator +=(const sMatrix4&);
bool operator ==(const sMatrix4&) const;
sVector4 operator *(const sVector4&) const;
private:
float RCD(const sMatrix4& B, int i, int j) const;
};
The class implementation is given below.
sMatrix4::sMatrix4(float* fv) {
val.m[0] = _mm_set_ps(fv[3], fv[2], fv[1], fv[0]);
val.m[1] = _mm_set_ps(fv[7], fv[6], fv[5], fv[4]);
val.m[2] = _mm_set_ps(fv[11], fv[10], fv[9], fv[8]);
val.m[3] = _mm_set_ps(fv[15], fv[14], fv[13], fv[12]);
float f = sEPSILON;
sFuzzy.m = _mm_set_ps(f, f, f, f);
}
float& sMatrix4::operator()(int i, int j) {
return val.f[i][j];
}
sMatrix4& sMatrix4::operator +=(const sMatrix4& M) {
val.m[0] = _mm_add_ps(val.m[0], M.val.m[0]);
val.m[1] = _mm_add_ps(val.m[1], M.val.m[1]);
val.m[2] = _mm_add_ps(val.m[2], M.val.m[2]);
val.m[3] = _mm_add_ps(val.m[3], M.val.m[3]);
return *this;
}
bool sMatrix4::operator ==(const sMatrix4& M) const {
int res[4];
res[0] = res[1] = res[2] = res[3] = 0;
res[0] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[0], M.val.m[0]),
_mm_min_ps(val.m[0], M.val.m[0])), sFuzzy.m));
res[1] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[1], M.val.m[1]),
_mm_min_ps(val.m[1], M.val.m[1])), sFuzzy.m));
res[2] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[2], M.val.m[2]),
_mm_min_ps(val.m[2], M.val.m[2])), sFuzzy.m));
res[3] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[3], M.val.m[3]),
_mm_min_ps(val.m[3], M.val.m[3])), sFuzzy.m));
if ( (15 == res[0]) && (15 == res[1])
&& (15 == res[2]) && (15 == res[3]) )
return true;
return false;
}
sVector4 sMatrix4::operator *(const sVector4& v) const {
return sVector4(
val.f[0][0] * v[0] + val.f[0][1] * v[1]
+ val.f[0][2] * v[2] + val.f[0][3] * v[3],
val.f[1][0] * v[0] + val.f[1][1] * v[1]
+ val.f[1][2] * v[2] + val.f[1][3] * v[3],
val.f[2][0] * v[0] + val.f[2][1] * v[1]
+ val.f[2][2] * v[2] + val.f[2][3] * v[3],
val.f[3][0] * v[0] + val.f[3][1] * v[1]
+ val.f[3][2] * v[2] + val.f[3][3] * v[3]);
}
float sMatrix4::RCD(const sMatrix4& B, int i, int j) const {
return val.f[i][0] * B.val.f[0][j] + val.f[i][1] * B.val.f[1][j]
+ val.f[i][2] * B.val.f[2][j] + val.f[i][3] * B.val.f[3][j];
}