Microprocessor: 2008

Friday, July 4, 2008

Glossary of PC Processors by x86.org

Glossary of PC Processors by x86.org

CISC=Complex Instruction Set Computer. A CISC microprocessor is one in which the number of bytes needed to represent the opcode instruction is not a fixed, regular length (for example, 32-bits each).

Opcode=The data that represents a microprocessor instruction.

RISC=Reduced Instruction Set Computer. A RISC microprocessor that has fewer, simpler instructions than its CISC counterpart. RISC instructions perform simple, rudimentary functions. The simplicity of these instructions results in a very simple microprocessor design that can execute very fast. RISC instructions are typically characterized by fix length instruction sets (for example, all instructions are 32-bits each).

EPIC=Explicitely Parallel Instruction Computing. EPIC is a fancy acronym that Intel invented to obfuscate the fact that they do not want the public appearance that their Merced microprocessor is actually a VLIW design. After all, Intel didn't invent VLIW, therefore they don't want to be publicly associated with a VLIW design.

VLIW=Very Long Instruction Word. A microprocessor that packs many simple RISC-like instructions into a much longer internal instruction word format. A VLIW microprocessor will usually have execution units, capable of executing all of the instructions contained in the instruction word, in parallel.

MMX=Multi-Media eXtensions. Intel claims that "MMX" is not an acronym, meaning "Multi-Media eXtensions" because they have filed for a trademark under this name. In reality, MMX instructions are intended to enhance programs that have multi-media capabilities.

Address Lines=The number of address lines, or "address pins" on a microprocessor determine how much memory the chip can address. The amount of addressible memory can be calculated as 2^#address_lines (two raised to the power of the number of address lines). A microprocessor with 32 address lines can address 232 bytes of memory (4 G Bytes).

Cache=A cache is a bank of high speed memory that stores the most recently accessed code and data. When the microprocessor requests data that is in the cache, the amount of time to retrieve the data is many times less than the amount of time needed to access main memory. Many microprocessors have a cache inside of the chip itself. In some cases, there is a cache for the cache (known as a 2nd-level cache). A cache may hold code, data, or even recently accessed data on a hard disk. In general, a cache can be created for faster access to any slower device, beit main memory or hard disks.

PC Processors Guide by x86.org

PC Processors Guide by x86.org

"The Intel 8086, a new microcomputer, extends the midrange 8080 family into the 16-bit arena. The chip has attributes of both 8- and 16-bit processors. By executing the full set of 8080A/8085 8-bit instructions plus a powerful new set of 16-bit instructions, it enables a system designer familiar with existing 8080 devices to boost performance by a factor of as much as 10 while using essentially the same 8080 software package and development tools.

"The goals of the 8086 architectural design were to extend existing 8080 features symmetrically, across the board, and to add processing capabilities not to be found in the 8080. The added features include 16-bit arithmetic, signed 8- and 16-bit arithmetic (including multiply and divide), efficient interruptible byte-string operations, and improved bit manipulation. Significantly, they also include mechanisms for such minicomputer-type operations as reentrant code, position-independent code, and dynamically relocatable programs. In addition, the processor may directly address up to 1 megabyte of memory and has been designed to support multiple-processor configurations."

-- Intel Corporation, February, 1979

Introduction

In 1979, Intel introduced the 8086 and 8088 microprocessor extensions to the 8080 product line. Since that time, the x86 product line has gone through six generations and become the most successful microprocessor in history. Much of this success was due to the success of the IBM-PC and its clones. Therefore Intel was at the right place at the right time when IBM made their historic decision to use the 8088. Today, the x86 market is a multi-billion dollar industry, selling tens-of-millions of units per year.

The huge popularity of these x86 chips has lead to a prosperous x86 clone industry. AMD, Cyrix, IBM, TI, UMC, Siemens, NEC, Harris, and others have all dabbled in the x86 chip industry. Today, AMD, Cyrix, and Centaur are still actively competing. However, there are numerous stories that approximately one dozen other companies are actively working to create a x86 clone chip. Ultimately, the market will decide who survives and who perishes in this cutthroat chips business.
Merced

Merced is Intel's future-generation microprocessor architecture; Merced is not Intel's next-generation x86 microprocessor. Instead, Merced is a completely new microprocessor design and instruction set. Merced will have the means to run legacy x86 programs via a hardware translation mechanism.

The Merced design is a further departure from its x86 predecessors. Merced is not CISC (Complex Instruction Set Computer), RISC (Reduced Instruction Set Computer), but closely resembles a VLIW design (Very Long Instruction Word). Intel doesn't want to call this chip VLIW, ostensibly for political reasons (not invented here). Instead, Intel has coined the term EPIC -- Explicitely Parallel Instruction Computing. For all intents and purposes, EPIC is VLIW.

First versions of Merced will run between 600 MHz and 1000 MHz (1 GHz). x86 compatibility is acheived internally via a hardware translation mechanism. This translation mechanism enables Merced to run your existing Windows and DOS applications. Intel claims that Merced will never run x86 programs as fast as the current state-of-the-art x86 microprocessors. Therefore, Intel offers Merced into the server and workstation market, and will offer x86 compatibility as a matter of convenience.

Xeon

The Xeon is a Pentium II on steroids. The Pentium II contains a P6 (Pentium Pro) core with a 1/2-speed 2nd-level cache (L2 cache). Xeon differs from Pentium II in that the L2 cache runs at full processor speed. Pentium II connects to the motherboard in a slot named Slot-1. Xeon is not slot-compatible with Pentium II. Instead, Xeon uses a new slot named Slot-2.

Xeon is Intel's high-end microprocessor brand for the computer server market. As such, Xeon is priced much higher than Pentium II or Celeron.

Celeron

The Celeron microprocessor is a stripped down Pentium II. This chip is affectionately knows as "the Castrated One." The Celeron is offered without any 2nd-level cache, making its performance lackluster, and reportedly slower than a Pentium (with MMX) running at nearly one-half of its speed.

Intel was completely caught off-guard by the advent and popularity of the sub-$1000 PC. In fact, the sub-$1000 PC completely ruins Intel's marketing game plan. Therefore, Intel hastily created the Celeron in an attempt to regain some marketshare they lost to their competition.

Pentium II

The Pentium II is a glorified Pentium Pro with MMX extensions. The Pentium II has a different package than the Pentium Pro. This new package is called "Slot-1." Officially, Intel claimed technical reasons for needing Slot-1. Industry pundits claimed that slot-1 was devised to thwart industry competition with the Pentium Pro, and further the Intel monopoly. Strangely, the "technical reasons for needing slot-1" evaporated as soon as the Pentium Pro was dead; as Intel somehow made a "breakthrough" that they had previously claimed was impossible (thereby "needing" to invent the proprietary slot-1).

Slot-1 was also promised to be the upgrade path for consumers -- leading many years into the future. Unfortunately, every time Intel has made this promise, first with the 80486, then with the Pentium, the promise has been broken. As soon as Intel saturated the saturated the market with slot-1 computers, they announced the future high performance upgrade path would be "slot-2."

The Pentium II can address up to 64 GB of main memory, but has cache limitations preventing memory use above 512 MB.

The Pentium Pro

The Pentium Pro was introduced in November 1995 as Intel's 6th generation x86 design -- code-named the "P6." The Pentium Pro offered some minor programming enhancements, four more address lines, and a large 2nd-level cache. By this time, the entire "secret" programming features of the Pentium had been revealed .Therefore, Intel abandoned their attempts at keeping most of their new P6 programming features as a secret.the Pentium Pro could address up to 64 GB of main memory. The addition of the 2nd-level cache gave the Pentium Pro a nice performance boost, but was very expensive to manufacture.

Intel continued their attempts at closing the architecture to the exclusion and elimination of their competition. Intel managed to gain patent protection for some pins on the Pentium Pro socket; thus making this chip very difficult to clone without substantial legal liability. However, Intel wasn't paranoid enough. Intel introduced the Pentium II under the guise that the Pentium Pro could never achieve their performance goals. Intel alleged that the Pentium Pro 2nd-level cache could never run faster than 200 MHz, and therefore they must discontinue the development of this product line. The Pentium II abandoned the socket approach to microprocessors, and introduced the "slot" concept. The slot-1 of the Pentium II was further enshrouded in patent protection, thereby further raising the bar to the cost of competition. Now that the Pentium Pro is dead (along with the possibility that their competition will clone this product), Intel has ironically announced a 400 MHz Pentium II with a full-speed 2nd-level cache.

Regardless of Intel's continued monopolistic business practices, the P6 product line has diversified and flourished. The Pentium II added MMX enhancements and a variety of 2nd-level cache options. Intel has created the Celeron brand to compete in the sub-$1000 market. The Xeon was introduced to compete in the server market with a 100 MHz system bus. As time goes on, Intel will continue to diversify the P6 family product line, most likely with a 200 MHz system bus.

Intel's competitors have gone in a different direction than Intel. Intel's competitors have stayed with a Pentium-compatible pin-out. AMD has continued to develop the K6 processor and added MMX enhancements. Cyrix has added MMX enhancements to the MII product line. Centaur products always contained MMX enhancements. These three companies combined their abilities to create a common set MMX-3D instruction set extensions. All three companies have announced plans to create a 100 MHz system bus and an integrated 2nd-level cache (though Intel-fellow Fred Pollack has said publicly that the 2nd-level cache integration is electronically impossible).
The Pentium

The Pentium processor was a big departure from all past Intel x86 processors. The Pentium name signaled the end to the 80x86 nomenclature, and was spurred by losing a trademark dispute against AMD. The Pentium processor contained more than one execution unit -- making it superscalar. Intel no longer needed (or wanted) any second-source fabs manufacturing their microprocessors; Intel wanted all of the profits to themselves. The Pentium also added many programming enhancements though Intel tried to keep all of them secret.

Intel rapidly diversified the Pentium product line. The original Pentium product ran at 60 and 66 MHz. Shortly thereafter, Intel introduced 90, 100, 120, 133 MHz versions of the popular processor. Intel introduced low-power versions of the Pentium to be used in notebook computer applications. Finally, Intel introduced the MMX-enhanced processors.

AMD and Cyrix didn't sit idly by and watch Intel expand and dominate the market. AMD introduced the K5 processor -- their first in-house x86 design. However, the K5 was late to market, and was very slow. In response to their bleak outlook for the K5, AMD bought Nexgen. Nexgen created their own x86-compatible microprocessor, calling it the Nx586. At the time of the acquisition, Nexgen had already finished the design of their next-generation processor core -- the Nx686. AMD used the Nx686 core and created the successful K6 processor. AMD has continued to upgrade this processor to include MMX, and other enhancements.

During this time Cyrix introduced the 6x86. The 6x86 was pin-compatible with the Pentium, though the 6x86 nomenclature might lead the consumer to believe that it is a 6th-generation (Pentium Pro) compatible chip. The 6x86 has also been enhanced with MMX instructions as the 6x86 MX. Cyrix has continued to enhance this chip in their attempt to gain more market share.

During the Pentium era, a new Intel competitor emerged. Centaur Technologies (a wholly owned subsidiary of IDT) created a fast, cheap, and somewhat low power Pentium compatible chip. Centaur has a low-end (low-cost) market focus. Some industry pundits have called this marketing strategy "bottom-feeding." However, with the emergence and overwhelming popularity of the sub-$1000 PC, Centaur may end up having the last laugh.
The 80486

The 80486 offered little in the way of architectural enhancements over its 80386 predecessor. The most significant enhancement of the 486 family was the integration of the 80387-math coprocessor into the 80486-core logic. Now, all software that requires the math coprocessor could run on the 80486 without any expensive hardware upgrades.

Like the 80386 SX, Intel decided to introduce the 80486 SX as a cost-reduced 80486 DX. Unfortunately, Intel chose to ensure that these processors were neither pin-compatible, nor 100% software compatible with each other. Unlike the 80386 SX, the 80486 SX enjoyed the full data bus and address bus of its DX counterpart. Instead, Intel removed the math coprocessor, thereby rendering the 80486 SX somewhat software incompatible with its DX counterpart. To further complicate matters, Intel introduced the 80487 SX -- the "math coprocessor" to the 80486 SX. Intel convinced vendors to include a new socket on the motherboard that could accommodate the 80486 SX and 80487 SX as an expensive hardware upgrade option. Unbeknownst to the consumer, the 80486 SX was an 80486 DX with a non-functional math unit (though later versions of the chip actually removed the math unit). The 80487 SX was a full 80486 DX with a couple of pins relocated on the package -- to prevent consumers from using the cheaper 80486 DX as an upgrade option. In this regard, Intel created a marketing deception. Intel marketed the 80487 SX as a math coprocessor to the 80486 SX. In reality, the 80487 SX electronically disabled the 80486 SX when installed, thereby relegating this chip to the status of an expensive space heater. Sadly, the consumer never knew or even suspected Intel of playing such manipulative games.

Also like the 80386, the Intel began to diversify their 80486 offerings. Low-power versions of the chip were introduced. The 80486 SL was introduced along with the 80386 SL as an integrated, low-power chip for notebook applications. The 80486 DX2 and DX4 were introduced, which doubled and tripled the core clock frequency. Power-saving features from the SL were introduced in later versions of the DX4. Finally, after Intel introduced the Pentium chip, they produced a version of the Pentium that was pin-compatible with the 80486. They called this chip an "overdrive" processor.

Likewise, AMD and Cyrix continued to pursue their own 486-compatible chip solutions. AMD introduced many Am486 variants. Cyrix continued their nomenclature of calling an 80486-compatible chip, the Cyrix 5x86. TI continued to manufacture Cyrix chips, and eventually started their own in-house microprocessor design (though the effort eventually failed). UMC entered the CPU market, but later withdrew because of patent infringement problems. IBM began manufacturing for Cyrix, and still pursuing their own microprocessor designs (the Blue Lightning series).
The 80386

In 1985, Intel introduced the 80386. Like the 80286 before it, the 80386 added significant programming and addressibility enhancements. Protected mode was enhanced to allow easy transitions between it and real mode (without resetting the microprocessor). Another new operating mode (v86 mode) was introduced to allow DOS programs to execute within a protected mode environment. Addressibility was further enhanced to 32-bits, giving the 80386 four gigabytes of memory addressibility (2^32 = 4 GB).

Also like the 80286, the 80386 was not introduced in any computer systems for many years after its introduction. Compaq was the first mainstream company to introduce an 80386-based computer -- beating IBM to market. Regardless, the 80386 enjoyed a very long life for home and business computer users. This long life was largely due to the programming extensions in the 80386 -- namely the ability to create a protected mode operating system to take advantage of all 4 GB of potential memory while still being able to run legacy DOS applications.

Shortly after the 80386 was introduced, Intel introduced the 80386 SX. To avoid confusion, Intel renamed the 80386 to the 80386 DX. The SX was a cost-reduced 80386 with a 16-bit data bus, and 24-bit address bus. The 16-bit data bus meant the SX was destined to have lower memory throughput than its DX counterpart; while the 24-bit address bus mean that the SX could only address 16 MB of physical memory. Regardless of the address bus and data bus differences, the SX and DX were software compatible with each other. Intel also introduced the 80376 as part of the 80386 family. The 376 was an 80386 SX that exclusively ran in protected mode.

During its long reign, the 80386-based computer began to evolve. Chipset vendors began dreaming of ways they could help improve the performance of the computer, thus giving their products a competitive advantage. One of the innovations was the introduction of the memory cache. The memory cache within the chipset would play a huge role in Intel's future product plans. First, Intel introduced a cache . Later, they incorporated the cache into the microprocessor itself. Intel also made their second failed attempt at chip integration. The 80386 SL integrated core logic, chipset functionality, and power-saving features into the microprocessor.

During this time, the popularity of the personal computer, and most notably their Intel microprocessors, didn't escape the notice of many entrepreneurs wishing to cash in on Intel's business. AMD began their own "x86" microprocessor division. IIT began cloning the Intel math coprocessors. Other small startups, such as Cyrix and Nexgen, decided they too could design an Intel-compatible microprocessor. The aspirations of these companies didn't bode well within Intel. Shortly thereafter, Intel began taking measures to ensure their own dominance in the industry -- to the exclusion of everybody else. Hence, Intel began what many believe are anti-competitive (illegal) business practices.

In spite of Intel's business practices, many 80386 clones began to appear. AMD marketed the Am386 microprocessors in speeds from 16 MHz to 40 MHz, though it was possible to overclock this chip up to 80 MHz. IBM introduced the 386 SLC, which featured a low-power 386 with an integrated 8-KB cache. IBM created other 386/486 hybrid chips -- some that were pin-compatible with Intel, and others that were not. Chips and Technologies created their own 386 clone. Cyrix stunned everybody by offering a 386 pin-compatible CPU, but called it a 486 (a nomenclature pattern that Cyrix still uses). Texas Instruments served as a foundry for Cyrix, and negotiated rights to produce chips under their own name. Eventually, TI produced their own chips (based on the Cyrix core), with their own unique enhancements.
The 80286

In 1982, Intel introduced the 80286. For the first time, Intel did not simultaneously introduce an 8-bit bus version of this processor (ala 80288). The 80286 introduced some significant microprocessor extensions. Intel continued to extend the instruction set; but more significantly, Intel added four more address lines and a new operating mode called "protected mode." Recall that the number of address lines directly relates to amount of physical that can be addressed by the microprocessor. The 8086, 8088, 80186, and 80188 all contained 20 address lines, giving these processors one megabyte of addressibility (2^20 = 1MB). The 80286, with its 24 address lines, gives 16 megabytes of addressibility (2^24 = 16 MB).

For the most part, the new instructions of the 80286 were introduced to support the new protected mode. Real mode was still limited to the one megabyte program addressing of the 8086, et al. For all intents and purposes, a program could not take advantage of the 16-megabyte address space without using protected mode. Unfortunately, protected mode could not run real-mode (DOS) programs. These limitations thwarted attempts to adopt the 80286 programming extensions for mainstream consumer use.

IBM was spurred by the huge success of the IBM PC and decided to use the 80286 in their next generation computer, the IBM PC-AT. However, the PC-AT was not introduced until 1985 -- three years after introduction of the 80286.

During the reign of the 80286, the first "chipsets" were introduced. The computer chipset was nothing more than a set of chips that replaced dozens of other peripheral chips, while maintaining identical functionality. Chips and Technologies became one of the first popular chipset companies.

Like the IBM PC, the PC-AT was hugely successful for home and business use. Intel continued to second-source the chips to ensure an adequate supply of chips to the computer industry. Intel, AMD, IBM, and Harris were known to produce 80286 chips as OEM products; while Siemens, Fujitsu, and Kruger either cloned it, or were also second-sources. Between these various manufacturers, the 80286 was offered in speeds ranging from 6 MHz to 25 MHz.
The 80186 / 80188

Intel continued the evolution of the 8086 and 8088 by introducing the 80186 and 80188. These processors featured new instructions, new fault tolerance protection, and was Intel's first of many failed attempts at the x86 chip integration game.

The new instructions and fault tolerance additions were logical evolutions of the 8086 and 8088. Intel added instructions that made programming much more convenient for low-level (assembly language) programmers. Intel also added some fault tolerance protection. The original 8086 and 8088 would hang when they encountered an invalid computer instruction. The 80186 and 80188 added the ability to trap this condition and attempt a recovery method.

Intel integrated this processor with many of the peripheral chips already employed in the IBM-PC. The 80186 / 80188 integrated interrupt controllers, interval timers, DMA controllers, clock generators, and other core support logic. In many ways, this chip was produced a decade ahead of its time. Unfortunately, this chip didn't catch on with many hardware manufacturers; thus spelled the end of Intel's first attempt at CPU integration. However, this chip has enjoyed a tremendous success in the world of embedded processors. If you look on your high performance disk driver or disk controller, you might still see an 80186 being used.

Eventually, many embedded processor vendors began manufacturing these chips as a second source to Intel, or in clones of their own. Between the various vendors, the 80186/80188 was available in speeds ranging from 6 MHz to 40 MHz.
The 8086 / 8088

The 8086 and 8088 were binary compatible with each other, but not pin-compatible. Binary compatibility means that either microprocessor could execute the same programs. Pin-incompatibility means that you can’t plug the 8086 into the 8088 and visa versa, and expect the chips to work. The new "x86" chips implemented a Complex Instruction Set Computer (CISC) design methodology.

The 8086 and 8088 both feature twenty address pins. The number of address pins determines how much memory a microprocessor can access. Twenty address pins gave these microprocessors a total address space of one megabyte (2^20 = one megabyte).

The 8086 and 8088 featured different data bus sizes. The data bus size determines how many bytes of data the microprocessor can read in each cycle. The 8086 featured a 16-bit data bus, while the 8088 featured an 8-bit data bus. IBM chose to implement the 8088 in the IBM-PC, thus saving some cost and design complexity.

At the time IBM introduced the IBM-PC, a fledgling Intel Corporation struggled to supply enough chips to feed the hungry assembly lines of the expanding personal computer industry. Therefore to ensure sufficient supply to the personal computer industry, Intel subcontracted the fabrication rights of these chips to AMD, Harris, Hitachi, IBM, Siemens, and possibly others. Amongst Intel and their cohorts, the 8086 line of processors ran at speeds ranging from 4 MHz to 16 MHz.

It didn’t take long for the industry to start "cloning" the IBM-PC. Many companies tried; but mostly they all failed because their BIOS was not compatible with the IBM-PC BIOS. Columbia, Kayro and others went by the wayside because they were not totally PC-compatible. Compaq broke though the compatibility barrier with the introduction of the Compaq portable computer. Compaq's success created the turning point that enabled today's modern computer industry.

NEC was the first to "clone" this new Intel chip with their V20 and V30 designs. The V20 was pin-compatible with the 8088, while the V30 was pin-compatible with the 8086. The V-series ran approximately 20% faster than the Intel chips when running at the same clock speed. Therefore, the V-series chips provided a cheap "upgrade" to owners of the IBM-PC and other clones computers. The V-series chips were very interesting. These chips were introduced in 1985 at approximately the same time as Intel's introduction of the 80386. The 80386 was still years away from production, and the 80286 was just barely being accepted in the IBM-PC/AT. Even though these chips were pin-compatible with the 8086 and 8088, they also had some extensions to the architecture. They featured all of the "new" instructions on the 80186 / 80188, and also were capable of running in Z-80 mode (directly running programs written for the Z-80 microprocessor).

Branch prediction in the Pentium family

Branch prediction in the Pentium family

What is branch prediction?

Imagine a simple microprocessor where all instructions are handled in two steps: decoding and execution. The microprocessor can save time by decoding one instruction while the preceding instruction is executing. This assembly line-principle is called pipelining. In advanced microprocessors, the pipeline may have many steps so that many consecutive instructions are underway in the assembly line at the same time, one at each stage in the pipeline.

The problem now occurs when we meet a branch instruction. A branch instruction is the implementation of an if-then-else construct. If a condition is true then jump to some other location; if false then continue with the next instruction. This gives a break in the flow of instructions through the pipeline because the processor doesn't know which instruction comes next until it has finished executing the branch instruction. The longer the pipeline, the longer time it will have to wait until it knows which instruction to feed next into the pipeline. As modern microprocessors tend to have longer and longer pipelines, there has been a growing need for doing something about this problem.

The solution is branch prediction. The microprocessor tries to predict whether the branch instruction will jump or not, based on a record of what this branch has done previously. If it has jumped the last four times then chances are high that it will also jump this time. The microprocessor decides which instruction to load next into the pipeline based on this prediction, before it knows for sure. This is called speculative execution. If the prediction turns out to be wrong, then it has to flush the pipeline and discard all calculations that were based on this prediction. But if the prediction was correct, then it has saved a lot of time.

The detective work

Intel manuals have never been very specific about how the branch prediction works. However, since mispredictions are expensive in terms of execution time, I found it important to know how the prediction works in order to optimize my programs. I started to do a lot of experiments together with some clever persons I had met on the Internet, most importantly Karki J. Bahadur at the university of Colorado, and Terje Mathisen in Norway, the guy who reverse engineered system software to find out how to get access to the performance monitor counters on the Pentium chip. Well, my first finding was that the Pentium predicts a branch instruction to jump if it has jumped any of the last two times. This fitted all my experiments, but Karki pointed out that a branch which jumps every third time is predicted one time out of six, where, according to my first model, it should never be predicted correctly. Then followed a series of new experiments until Karki and I independently came out with the same state diagram, shown in fig. 1a. While we agreed on this mechanism, we disagreed on the interpretation and in particular on why it was asymmetric. In the meantime, another guy had found an old article in Microprocessor Report claiming that the mechanism was a symmetric one as illustrated in fig 1b. My opinion was that the designers had actually intended the mechanism to be as in fig. 1b and that the asymmetry was a bug. But Karki and Terje maintained that there had to be an intention behind this asymmetry. It didn't convince them that I demonstrated how the symmetric mechanism was superior to the asymmetric one in almost all cases.

Now I discovered a powerful tool to dig deeper into this mechanism. The Pentium has a set of test registers that make it possible to read or write directly into the area that holds the history information for all branches, the branch target buffer (BTB). I had found this information on the home page of another hacker, Christian Ludloff. His page was shut down (rumors say that this was due to pressure from Intel) but fortunately I had downloaded his page before it was too late. Having direct access to the BTB, I was able to see exactly what happened: When a branch does not have an entry in the BTB it is predicted to not jump. The first time it jumps it gets an entry in the BTB and immediately goes to state 3. The complication is that the designers have equated state 0 with 'vacant BTB entry'. This makes sense because state 0 is predicted to not jump anyway. But since it cannot distinguish between state 0 and a vacant BTB entry it will go to state 3 next time the branch jumps rather than to state 1. This is where the quirk comes from. Apparently, somebody at the design labs has done a lot of research to find a good branch prediction scheme, and then somebody else has messed it all up by letting state 0 mean vacant BTB entry without realizing the consequence. And the consequence is that a branch which seldom jumps will have three times as many mispredictions as it would with the symmetric design.

Figure 1 -- State diagram for branch prediction mechanism

a. asymmetric design in the Pentium:

The state follows the +arrows when the branch instruction jumps, and the -arrows when not jumping. The branch instruction is predicted to jump next time if in state 2 or 3, and to not jump when in state 0 or 1.

b. symmetric design:

This is how the branch prediction should work. The state is incremented when jumping (+arrows) and decremented when not jumping (-arrows).

More quirks

We soon found that there were more strange things about the Pentium's branch prediction. We couldn't make sense of what happened when more branch instructions came close after one another. This time Karki and Terje came with the 'wild' ideas that led to the solution, while I played the role of the sceptic. After a hectic period where we exchanged results by E-mail every day, we found that the BTB information may actually be stored several instructions ahead of the branch it refers to. If there happens to be another branch in between then the BTB information is likely to be misapplied to somewhere in the wrong branch. This can lead to many funny phenomena: a branch instruction can have more than one BTB entry; two branches can share the same BTB entry so that one branch is predicted to go to the target of the other one; an unconditional jump instruction can be predicted to not jump; and a non-jumping instruction can be predicted to jump. I will not go into detail with all these quirks here, but you can find it all on my homepage. None of these quirks are fatal, because all mispredictions eventually get corrected.

A much more powerful mechanism

The later processors in the Pentium family: the Pentium MMX, Pentium Pro, Pentium II, Celeron, and Xeon, all have a much more advanced branch prediction mechanism. I will refrain from more detective stories here and go right to the mechanism.

Figure 2 - Two level branch prediction in Pentium MMX, Pentium Pro, and Pentium II

Level two consists of 16 two-bit counters of the type in fig. 1b. Level one is a four bit shift register storing the history of the last four events. This four bit pattern is used to select which of the 16 two-bit counters to use for the next prediction.

This mechanism is based on the same fundamental idea of the state diagram in fig 1b. This is simply a two-bit counter with saturation. The counter is incremented when jumping and decremented when not jumping. The branch instruction is predicted to jump next time if the counter is in state 2 or 3, and to not jump if in state 0 or 1. This mechanism makes sure that the branch has to deviate twice from what it does most of the time before the prediction changes.

The improvement in the later processors comes from the so-called two-level branch prediction. The first level is a shift register that stores the history of the last four events for any branch instruction. This gives sixteen possible bit patterns. You get a pattern of 0000 if the branch did not jump the last four times, and a pattern of 1111 after four times of jumping. The second level in the branch prediction mechanism is constituted of sixteen 2-bit counters of the type in fig. 1b. It uses the 4-bit pattern in the first level to choose which of the sixteen counters to use in the second level. See fig. 2.

The advantage of this mechanism is that it can learn to recognize repetitive patterns. Imagine a branch that jumps every second time. You can write this pattern as 01010101 where 0 means no-jump and 1 means jump. After 0101 always comes an 0. Every times this happens, the counter with the binary number [0101] will be decremented until it reaches its lowest state. It has now learned that after 0101 comes an 0 and will therefore make this prediction correctly the next time. Similarly, counter number [1010] will be incremented until state three so that it will always predict a 1 after 1010. The remaining fourteen counters for this branch are never used as long as the pattern is the same.

This mechanism is quite powerful as it can handle complex repetitive patterns like 00101-00101-00101. In fact, it can handle any repetitive pattern with a period of up to five, most patterns of period six and seven, and even some patterns with periods as high as sixteen. To see if a pattern of period n can be handled without misprediction, write down the n 4-bit sub-sequences in the pattern. If they are all different, then you will have no mispredictions after an initial learning time of two periods.

But the two-level mechanism is more powerful than that. It is also extremely good at handling deviations from a regular pattern. If a branch instruction has an almost regular pattern with occasional deviations, then the processor will soon learn what the deviations look like, so that it can handle almost any kind of recurrent deviation with only one misprediction.

Furthermore, it can handle a situation where you alternate between two different repetitive patterns. Assume that you have given the processor one repetitive pattern until it has learned to handle it without mispredictions. Then another pattern. And then return to the first pattern. If the two patterns do not have any 4-bit subsequences in common, then they do not use the same counters, so the processor doesn't have to re-learn the first pattern. Therefore, it can handle the transitions back and forth between the two patterns with a minimum of mispredictions.

Conclusion

The first microprocessor in the Pentium family introduced a simple one-level branch prediction mechanism with many ludicrous quirks. The later versions, Pentium MMX, Pentium Pro, Pentium II, etc. have longer pipelines and therefore a higher need for effective branch prediction. This need has been met by the incredibly powerful two-level mechanism with its ability to learn and recognize repetitive patterns and even deviations from the regular patterns. This mechanism is also quite economical in terms of chip area as the history of a branch can be stored in only 32 bits.

The most important shortcoming of the two-level branch prediction is that it is not very good at predicting the branch pattern of a loop control. If, for example, you have a program with a loop that always repeats ten times, then the control instruction at the bottom of the loop will branch back nine times and fall through the tenth time at the cost of one misprediction. For the Pentium Pro and Pentium II, where branch misprediction costs a lot of time, it may acually be advantageous to replace a loop that executes ten times with two nested loops that execute five and two times, in order to avoid mispredictions.

AMD 3DNow! undocumented instructions

AMD 3DNow! undocumented instructions

Introduction

Being involved in computer architecture and computer graphics, I am quite familiar with contemporary processors and graphics stuff. With my hacking attitude I frequently dig into some undocumented details of CPUs and graphics hardware and later publish some of my findings on my website. This "professional hacking" turned out into one of my favorite hobbies, and my recent discovery is simply a result of this activity.
How it all started

June 23rd was unusually cold and rainy day in Warsaw. Approaching my computer at the University in the morning I decided to free some precious space on my tiny, 1GB "working" hard disk. Half-consciously I begun to browse through deep and complicated structure of folders, trying to find something that I would never need again. I entered the folder with various 3DNow! related files downloaded from the net long time ago. While viewing one of them I noticed the names and opcodes of instructions, and suddenly I realized that not all of the names look familiar to me. I've compared the names with instructions listed in official 3DNow! documents. Clearly, there were three names not matching anything from the manual: PF2IW, PI2FW and PSWAPW.
Tracking the instructions

Judging from their names, the instructions were not unwanted artifacts. Rather they looked like something carefully designed and later abandoned. Quickly I opened the text editor and entered the these three mnemonics together with the necessary assembler directives:
.model small
.586
.k3d
.code
pf2iw mm0, mm1
pi2fd mm0, mm1
pswapw mm0, mm1
end

Then I tried to assemble the program with MASM 6.13, which I use daily for the development of hybrid (C+assembler) utilities. MASM didn't complain about the instructions, it recognized them and translated properly, showing the machine codes (suffixes), that turned out to be 1C, 0C and BB hex respectively.
Checking the functions

I quickly wrote five short assembler functions with the purpose to simply invoke the three mentioned instructions as well as documented instructions, PF2ID and PI2FD, for comparing the results with the undocumented ones. Then I wrote a simple program in C, invoking these functions with user-entered parameters and printing the results. I compiled both modules using old 16-bit Borland C and MASM 6.13, getting a simple, DOS-based, command line driven program. It all took about 15 minutes. Then I realized that I have no computer to test the stuff (the PC on my desk has IDT C6 in it). Fortunately it turned out that my colleague in the next room has K6-2 in his PC. In a few minutes I was able to run several simple tests devised to discover the exact functions of the new instructions. Given the names of instructions, their functions were not surprising:
PF2IW is similar to PF2ID, but it returns the 16-bits results in lower halves of 32-bit words. The out-of range values are saturated at -32768 and 32767.
PI2FW is similar to PI2FD, but it considers as its input just the lower halves of two 32-bit words, treating them as signed integers.
PSWAPW swaps the order of 16-bit words inside 64-bit word -- the least significant 16-bit word becomes the most significant one etc..

The more formal description of the new instructions is right below.
PF2IW mmreg1, mmreg2/mem64

Opcode: 0Fh 0Fh / 1Ch

Converts packed floating-point operand to packed 16-bit integer. The instruction is similar to PF2ID, but the result qword contains two 16-bit signed integers in bits 47..32 and 15..0. Result bits 63..48 and 31..16 are cleared.
Function:
IF (mmreg2/mem64[31:0] >= 2^15)
THEN mmreg1[31:0] = 7FFFh
ELSEIF (mmreg2/mem64[31:0] <= -2^15)
THEN mmreg1[31:0] = 8000h
ELSE mmreg1[31:0] = int(mmreg2/mem64[31:0])
IF (mmreg2/mem64[63:32] >= 2^15 )
THEN mmreg1[63:32] = 7FFFh
ELSEIF (mmreg2/mem64[63:32] <= -2^15)
THEN mmreg1[63:32] = 8000h
ELSE mmreg1[63:32] = int(mmreg2/mem64[63:32])
PI2FW mmreg1, mmreg2/mem64

Opcode: 0Fh 0Fh / 0Ch

Packed 16-bit integer to floating-point conversion.
Function:
mmreg1[31:0] = float(mmreg2/mem64[15:0])
mmreg1[63:32] = float(mmreg2/mem64[47:32])
PSWAPW mmreg1, mmreg2/mem64

Opcode: 0Fh 0Fh / 0BBh

Swap 16-bit words within 64-bit MMX word.
Function:
mmreg1[15..0] = mmreg2/mem64[63..48]
mmreg1[31..16] = mmreg2/mem64[47..32]
mmreg1[47..32] = mmreg2/mem64[31..16]
mmreg1[63..48] = mmreg2/mem64[15..0]
Significance and usefulness

As usual, the undocumented instructions should not be treated too seriously -- they may disappear any time from a future product. From the reliable source somewhere on the net I got the information that the above three instructions were abandoned from 3DNow! spec because the lack of commitment from IDT and Cyrix, so I expect that IDT WinChip2 does not support them. In near future we will see if they are supported by AMD K7. Until then we can treat them just as a curiosity.

Thinking about using the undocumented instructions, I can't see any serious application for PF2ID instruction, although I believe that its designer had something in mind. The instruction returns its results in two non-adjacent 16-bit fields. It is not easy to convert the results to a more useful form. It's even easier to convert 4 floats to packed 16-bit integers using PF2ID and PACKSSDW than using PF2IW, since the MMX instructions set does not provide for packing dwords to words without signed saturation.

PI2FW in turn, can be effectively used to convert packed shorts to floats. The same task is not easy to achieve using only documented instructions, as it requires 16-bit to 32-bit signed int conversion, not available in MMX instruction set.

PSWAPW is just what it is -- it may be effectively used to reverse the order of 16-bit data in a quadword.

Tuesday, March 11, 2008

Introduction to the Streaming SIMD Extensions in the Pentium III: Part III

Introduction to the Streaming SIMD Extensions in the Pentium III: Part III

1. Data Swizzling

The speedup that the Pentium III SSE achieves on floating-point operations comes at a price. The data operated on by SSE instructions has to be stored in the new data type defined by SSE. If the application stores the data in its own format, the data has to be converted into the new data type before the SSE instructions can operate on it, and has to be converted back afterward.

This conversion of data from one format into another is termed "data swizzling."

This conversion takes time and machine cycles. If an application converts data from one format to another too often, the machine cycles saved by executing SSE instructions may well be lost. Hence, care is needed.
1.1 Data Organization

Usually, 3D applications store the coordinates of a point in one structure. When handling multiple points, applications use an array of structures, also called AoS. Typical geometric operations operate differently on the x, y and z coordinates of the point. The code given below lists the typical declaration used by applications processing 3D data. When handling large data sets, this structure amounts to an array-of-structures, as illustrated in figure 9.
struct point {
float x, y, z;
};
...
point dataset[...];

Figure 1: Array of structures.

To exploit the advantages of SSE, it would be better to operate on multiple points simultaneously. This can be done by operating on the coordinates of multiple points. This is possible if we collect together the x-, the y- and z-coordinates of the points. The application can then process multiple x-, y- and z-coordinates separately. For this, the application must rearrange the data into either three separate arrays, or a structure of arrays with one array each for one coordinate of the point. This arrangement is called the SoA arrangement.

The code given below lists the declaration of the struture of arrays, while figure 10 is the diagrammatic representation of the struture of arrays.
struct point {
float *x, *y, *z;
};

Figure 2: Structure of arrays.
2. Memory Issues
2.1 Alignment

Handling and manipulating simple variables of the new data type does not create problems. However, it is recommended that variables of the new data type be aligned to 16-byte boundaries. This alignment can be enforced either by setting the appropriate compiler flags or by explicitly using align commands in the program, during variable declaration.

A variable can be specified to be aligned to a 16-byte boundary using the __declspec compiler directive, as illustrated in the following example. The variable myVar will be aligned to a 16-byte boundary due to the align directive. It is not necessary to align the new data types to 16-byte boundaries, as the compiler aligns the data types when it comes across the new data type declarations. The alignment directive is issued as shown:
__declspec(align(16)) float[4] myVar;
2.2 Dynamic Memory

The condition on the new data types stipulating that pointers accessing memory locations be aligned to 16-byte boundaries creates problems when allocating memory dynamically or at the time of accessing allocated arrays through a pointer.

When accessing arrays through pointers, we have to ensure that the pointer is aligned to a 16-byte boundary.

To allocate memory at run time we use either the malloc function or the new command. The default behaviour of both is that they do not align the pointer address to a 16-byte boundary. Hence, we have to either allocate memory and then adjust the pointer to a 16-byte boundary, or allocate the memory using the _mm_malloc function. The _mm_malloc function allocates a memory block that is aligned to a 16-byte boundary.

Just as malloc has a free, the _mm_malloc function has the function _mm_free. Memory blocks allocated using _mm_malloc have to be freed using _mm_free.
2.3 Custom Datatype

The restriction that pointers be aligned to 16-byte boundaries can be troublesome. It would be much better to be able to ignore the alignment of pointers.

When operating on 128-bit data types, it may be necessary to access the floats stored in the data type. In assembly language there is not much choice but to use assembly language constructs. Using C or C++ and the intrinsics library, however, the data will be sortd in the data type __mm128. In this data type, once the value is set, it is not possible to access the individual floating-point numbers directly. One way to access them is to transfer all floating point numbers into an array of floats, change the values and load the array of floats back into the data type. The second method is to cast the data type into a float array and then access the required element. The first method is time consuming and the second method may cause problems if not used properly.

Defining a custom data type can overcome these problems. The custom data type is defined as a union of the data type (__m128) and an array of four floats. The declaration of the new data, called sse4 for now, is given below.
union sse4 {
__m128 m;
float f[4];
};

Using this data type, it is no longer necessary to align memory locations to 16-byte boundaries. When the compiler encounters the data type __m128, it aligns it to a 16-byte boundary. An added advantage of this data type is that the individual floating-point numbers stored in the 128-bit data can be acessed directly.
2.4 Detecting the CPU

As the usage of SSE depends on the presence Pentium III, it is important that applications be able to detect the Pentium III chip. This is done using the cpuid instruction.

For the cpuid instruction to work as desired, the eax register has to be set to the appropriate value. As we are interested only in the CPU ID, we need to set the eax register to 1 before invoking the cpuid instruction.

The source code to detect the presence of the Pentium III CPU is given below. To be able to compile the code, the file fvec.h has to be included.
BOOL CheckP3HW()
{
BOOL SSEHW = FALSE;
_asm {
// Move the number 1 into eax - this will move the
// feature bits into EDX when a CPUID is issued, that
// is, EDX will then hold the key to the cpuid
mov eax, 1

// Does this processor have SSE support?
cpuid

// Perform CPUID (puts processor feature info in EDX)
// Shift the bits in edx to the right by 26, thus bit 25
// (SSE bit) is now in CF bit in EFLAGS register.
shr edx,0x1A

// If CF is not set, jump over next instruction
jnc nocarryflag

// set the return value to 1 if the CF flag is set
mov [SSEHW], 1

nocarryflag:
}
return SSEHW;
}

The SSE SDK also has an SSE emulation mode that emulates the Pentium III and the SSE registers. The code given below can be used to detect this emulation. To be able to compile the code, the file fvec.h has to be included.
// Checking for SSE emulation support
BOOL CheckP3Emu()
{
BOOL SSEEmu = TRUE;
Fvec32 pNormal = (1.0, 2.0, 3.0, 4.0);
Fvec32 pZero = 0.0;
// Checking for SSE HW emulation
__try {
_asm {
// Issue a move instruction that will cause exception
// w/out HW support emulation
movups xmm1, [pNormal]
// Issue a computational instruction that will cause
// exception w/out HW support emulation
divps xmm1, [pZero]
}
}
// If there's an exception, set emulation variable to false
__except(EXCEPTION_EXECUTE_HANDLER) {
SSEEmu = FALSE;
}
return SSEEmu;
}

3. Additional Examples

In this section, we present additional examples to illustrate the usage of the Streaming SIMD Extensions.
3.1 Array Manipulation

In this example, we take two arrays, each with 400 floats. A multiplication operation is performed on each of the array elements. The result of the multiplication is stored in a third array. The two arrays used as operands are named A and B. The result of the multiplication is stored in array C. In all the sources given below, the following declartion is assumed
#include

#define ARRSIZE 400

__declspec(align(16)) float a[ARRSIZE], b[ARRSIZE], c[ARRSIZE];
3.1.1 Assembly Language
_asm {
push esi;
push edi;
mov edi, a;
mov esi, b;
mov edx, c;
mov ecx, 100;
loop:
movaps xmm0, [edi];
movups xmm1, [esi];
mulps xmm0, xmm1;
movups [edx], xmm0;
add edi, 16;
add esi, 16;
add edx, 16;
dec ecx;
jnz loop;
pop edi;
pop esi;
}IZE; i +="4" ) {
m1= _mm_loadu_ps(a+i);
m2= _mm_loadu_ps(b+i);
m3= _mm_mul_ps(m1," m2);
_mm_storeu_ps(c+i, m3);
}
3.1.3 C++
F32vec4 f1, f2, f3;

for ( int i = 0; i loadu(f1, a+i);
loadu(f2, b+i);
f3 = f1 * f2;
storeu(c+i, f3);
}
3.2 Vector for 3D

This example presents a vector in 3D. The vector is implemented as a class. The functionality of the class is implemented using the intrinsics library.

The class declaration is given below.
union sse4 {
__m128 m;
float f[4];
};

class sVector3 {
protected:
sse4 val;
public:
sVector3(float, float, float);
float& operator [](int);
sVector3& operator +=(const sVector3&);
float length() const;
friend float dot(const sVector3&, const sVector3&);
};

The class implementation is given below.
sVector3::sVector3(float x, float y, float z) {
val.m = _mm_set_ps(0, z, y, x);
}
float& sgmVector3::operator [](int i) {
return val.f[i];
}
sVector3& sVector3::operator +=(const sVector3& v) {
val.m = _mm_add_ps(val.m, v.val.m);
return *this;
}
float sVector3::length() const {
sse4 m1;
m1.m = _mm_sqrt_ps(_mm_mul_ps(val.m, val.m));
return m1.f[0] + m1.f[1] + m1.f[2];
}
float dot(const sVector3& v1, const sVector3& v2) {
sVector3 v(v1);
v.val.m = _mm_mul_ps(v.val.m, v2.val.m);
return v.val.f[0] + v.val.f[1] + v.val.f[2];
}
3.3 4x4 Matrix

This example presents a 4x4 matrix. The matrix is implemented as a class. The functionality of the class is implemented using the intrinsics library.

The class declaration is given below.
float const sEPSILON = 1.0e-10f;

union sse16 {
__m128 m[4];
float f[4][4];
};

class sMatrix4 {
protected:
sse16 val;
sse4 sFuzzy;
public:
sMatrix4(float*);
float& operator()(int, int);
sMatrix4& operator +=(const sMatrix4&);
bool operator ==(const sMatrix4&) const;
sVector4 operator *(const sVector4&) const;
private:
float RCD(const sMatrix4& B, int i, int j) const;
};

The class implementation is given below.
sMatrix4::sMatrix4(float* fv) {
val.m[0] = _mm_set_ps(fv[3], fv[2], fv[1], fv[0]);
val.m[1] = _mm_set_ps(fv[7], fv[6], fv[5], fv[4]);
val.m[2] = _mm_set_ps(fv[11], fv[10], fv[9], fv[8]);
val.m[3] = _mm_set_ps(fv[15], fv[14], fv[13], fv[12]);
float f = sEPSILON;
sFuzzy.m = _mm_set_ps(f, f, f, f);
}
float& sMatrix4::operator()(int i, int j) {
return val.f[i][j];
}
sMatrix4& sMatrix4::operator +=(const sMatrix4& M) {
val.m[0] = _mm_add_ps(val.m[0], M.val.m[0]);
val.m[1] = _mm_add_ps(val.m[1], M.val.m[1]);
val.m[2] = _mm_add_ps(val.m[2], M.val.m[2]);
val.m[3] = _mm_add_ps(val.m[3], M.val.m[3]);
return *this;
}
bool sMatrix4::operator ==(const sMatrix4& M) const {
int res[4];
res[0] = res[1] = res[2] = res[3] = 0;
res[0] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[0], M.val.m[0]),
_mm_min_ps(val.m[0], M.val.m[0])), sFuzzy.m));
res[1] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[1], M.val.m[1]),
_mm_min_ps(val.m[1], M.val.m[1])), sFuzzy.m));
res[2] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[2], M.val.m[2]),
_mm_min_ps(val.m[2], M.val.m[2])), sFuzzy.m));
res[3] = _mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(
_mm_max_ps(val.m[3], M.val.m[3]),
_mm_min_ps(val.m[3], M.val.m[3])), sFuzzy.m));
if ( (15 == res[0]) && (15 == res[1])
&& (15 == res[2]) && (15 == res[3]) )
return 1;
return 0;
}
sVector4 sMatrix4::operator *(const sVector4& v) const {
return sVector4(
val.f[0][0] * v[0] + val.f[0][1] * v[1]
+ val.f[0][2] * v[2] + val.f[0][3] * v[3],
val.f[1][0] * v[0] + val.f[1][1] * v[1]
+ val.f[1][2] * v[2] + val.f[1][3] * v[3],
val.f[2][0] * v[0] + val.f[2][1] * v[1]
+ val.f[2][2] * v[2] + val.f[2][3] * v[3],
val.f[3][0] * v[0] + val.f[3][1] * v[1]
+ val.f[3][2] * v[2] + val.f[3][3] * v[3]);
}
float sMatrix4::RCD(const sMatrix4& B, int i, int j) const {
return val.f[i][0] * B.val.f[0][j] + val.f[i][1] * B.val.f[1][j]
+ val.f[i][2] * B.val.f[2][j] + val.f[i][3] * B.val.f[3][j];
}

Introduction to the Streaming SIMD Extensions in the Pentium III: Part II

Introduction to the Streaming SIMD Extensions in the Pentium III: Part II

1. Using SSE

After describing SSE, let's look at how we can use SSE in applications.
1.1 Assembly Language

Traditionally, programmers wishing to exploit the advantages of a new processor have developed code in assembly language. This was necessary, as higher-level development tools supporting the new instructions only appeared well after the processor release.

This is also the case with the Pentium III. At present, only the Intel C/C++ compiler and the Microsoft Macro Assembler (Version 6.11d and above) are able to understand the new SSE instructions.

The usual trade-off applies here: Programming a large and complex application in assembly language is difficult, but the resulting application will be very fast.

With the SSE SDK (Software Developers Kit), Intel provides two more programming mechanisms for using the SSE instructions: the intrinsics library, and a C++ class for the new data type defined by SSE. These mechanisms are easier to use than assembly language, particularly because programmers do not have to explicitly manage the SSE registers, and they especially make it easier to develop large applications. This comes at the cost of application speed. Code developed using these mechanisms is fast, but not as fast as corresponding code written in assembly language. Figure 6 shows the inverse relationship between application speed and ease of development among the three programming methods.

Figure 1: Application speed versus ease of development for different environments.
1.1.1 Example: Multiplication

Assume that the two 128-bit numbers a and b are stored in the registers xmm1 and xmm2 respectively. The result of the computation will be stored in the register xmm0. The code is embedded in a C program.
#include
...
_asm {
push esi;
push edi;
; a is loaded into xmm1
; b is loaded into xmm2
mov xmm0, xmm1;
mulps xmm0, xmm2;
; store result into c
pop edi;
pop esi;
}
...

Figure 7 presents a diagrammatic representation of the packed multiplication.

Figure 2: Packed Multiplication.
1.2 Intrinsics

The first additional development mechanism is the intrinsics library. The instrinsics library provides a C programming language interface to the SSE instructions. All the SSE instructions have a corresponsing C function in the intrinsics library. For example, the assembly language mnemonic for the instruction to add two packed data types is addps. Correspondingly, the function in the intrinsics library to add two packed data types is _mm_add_ps. In addition to defining functions for all instructions, the intrinsics library also defines a data type (__m128) that is 128 bits long. It allows the storage of four floating point numbers.

To use the intrinsics library, the file xmmintrin.h must be included.
1.2.1 Example: Multiplication

Assume that a and b are two 128-bit numbers. The result of the computation will be stored in another 128-bit number, c. All the numbers are of the datatype __m128. __m128 is the class that represents the 128-bit number and is declared in the file xmmintrin.h. The function _mm_set_ps takes the most-significant number as the first parameter and the least-significant number as the last parameter.
#include
...
__m128 a, b, c;
a = _mm_set_ps(4, 3, 2, 1)
b = _mm_set_ps(4, 3, 2, 1)
c = _mm_set_ps(0, 0, 0, 0)
c = _mm_mul_ps(a, b);
...
1.3 C++

The second additional development mechanism uses C++. The SSE SDK provides a C++ class, F32vec4, that defines an abstraction for the new 128-bit data type. All operations on the data type are encapsulated by the class. Internally, the class uses the functions of the intrinsics library.

To use the C++ class, the file fvec.h must be included.
1.3.1 Example: Multiplication

Again, assume a and b are two 128-bit numbers. The result of the computation will be stored in another 128-bit number, c. All the numbers are of the datatype F32vec4. The order of parameters for the constructor of F32vec4 is the same as for the function _mm_set_ps.
#include
...
F32vec4 a(4, 3, 2, 1), b(4, 3, 2, 1), c(0, 0, 0, 0);
...
c =a * b;
...
1.4 Compiler Support

As mentioned earlier, only the Intel C/C++ compiler and the Microsoft Macro Assembler support the SSE instructions. The Intel compiler integrates into the Microsoft Visual Studio programming environment. The Visual Studio environment can be configured such that either the whole project or specific files from the project can be compiled using the Intel compiler.
2. SSE Instruction Details

Before we cover some simple examples to illustrate the usage of SSE instructions, we will list all instructions defined in SSE.
Arithmetic Instructions
addps, addss
subps, subss
mulps, mulss
divps, divss
sqrtps, sqrtss
maxps, maxss
minps, minss

Logical Instructions
andps
andnps
orps
xorps

Compare Instructions
cmpps, cmpss
comiss
ucomiss

Shuffle Instructions
shufps
unpchkps
unpcklps

Conversion Instructions
cvtpi2ps, cvtpi2ss
cvtps2pi, cvtss2si

Data Movement Instructions
movaps
movups
movhps
movlps
movmskps
movss

State Management Instructions
ldmxcsr
fxsave
stmxscr
fxstor

Cacheability Control Instructions
maskmovq
movntq
movntps
prefetch
sfence

Additional SIMD Integer Instructions
pextrw
pinsrw
pmaxub, pmaxsw
pminub, pminsw
pmovmskb
pmulhuw
pshufw
3. Examples

In this section, we present a few examples to ilustrate the usage of the Pentium III SSE instructions. For each example, we will present three solutions, one each in assembly language, intrinsics and C++. Additional examples are presented in the additional examples section.
3.1 Packed Multiplication

We have already covered the multiplication of two packed numbers, while describing the different development mechanisms. The packed multiplication was illustrated using assembly language, the intrinsics library, and the C++ class.
3.2 Comparison Operation

Let us consider the case of comparison. Without SSE, we compare floating-point numbers using the less than operator. Using SSE, we can compare 4 floating point numbers in one instruction.

To compare 4 floats in C or C++, we would write a loop and put a comparison condition in the loop. We would then take action on the comparion result. The code for comparison is illustrated below.
float a[4], b[4]
int i, c[4];

// assume that a contains 4.5 6.7 2.3 and 1.2
// assume that b contains 4.3 6.9 2.0 and 1.5

for (i = 0;i < h="">

Figure 3: Comparison operator.
3.3 Branch Removal

Usually, in applications, we have conditions like the one given below.
a =(a < h="">

Introduction to the Streaming SIMD Extensions in the Pentium III: Part I

Introduction to the Streaming SIMD Extensions in the Pentium III: Part I

Abstract

With the launch of the Intel Pentium III processor, many new features are available for application developers. Using these features, application developers can create better content for the end-user.

In addition to being faster than the Pentium II and the Pentium II Xeon, the Pentium III and the Pentium III Xeon processors have many new features, including a unique processor ID and new processor instructions. These new instructions do for the Pentium II what MMX did for the Pentium.

In this article, we introduce you to the Pentium III and describe in brief its features, concentrating on the details of the new instructions.
1. Introduction to the Pentium III

In February 1999, Intel unveiled its latest processor, the Pentium III. As with each new processor, the Pentium III has a host of new features in addition to the increase in clock speed. Previous processor releases from Intel have adhered relatively closely to Moore's Law, which states that the processor speed doubles every 18 months. The Pentium III does not double the speed of the Pentium II. It runs in the 450- to 550-MHz range, while the Pentium II and Pentium II Xeon ran at 333 MHz to 400 MHz. Though the increase in clock speed is more modest than previous gains, the Pentium III more than makes up for this with increased functionality.

A Pentium III is essentially a Pentium II running at higher speeds, with the addition of a new set of instructions: Streaming SIMD Extensions, or SSE. Though SSE adds new features, existing applications are not affected. The Pentium III architecture is compatible with the Pentium II's IA-32.

If the Pentium III does not double the MHz provided by the Pentium II, why should we go for the Pentium III at all?
2. New Features of the Pentium III

The Pentium III adds two interesting and useful features: the processor serial number and SSE. The CPU serial number has been at the center of a debate about privacy. This article will abstain from the debate, and focus instead on the new SIMD instructions.

SSE has an acronym embedded in it: SIMD, which stands for Single Instruction Multiple Data. Usually, processors process one data element in one instruction, a processing style called Single Instruction Single Data, or SISD. In contrast, processors having the SIMD capability process more than one data element in one instruction.
3. MMX versus SSE

MMX and SSE, both of which are instruction sets that have been added to existing architectures, share the concept of SIMD, but they differ in the data types they handle, and in the way they are supported in the processor.

MMX instructions are SIMD for integers, while SSE instructions are SIMD for single-precision floating-point numbers. MMX instructions operate on two 32-bit integers simultaneously, while SSE instructions operate on four 32-bit floats simultaneously.

A major difference between MMX and SSE is that no new registers were defined for MMX, while eight new registers have been defined for SSE. Each of the registers for SSE is 128 bits long and can hold four single-precision floating-point numbers (each being 32 bits long). The arrangement of the floating-point numbers in the new data type handled by SSE is illustrated in Figure 1.

Figure 1: Arrangement of numbers in the new data type.

The immediate question is: Where did the registers for MMX come from? The MMX registers were allocated out of the floating-point registers of the floating-point unit. A floating-point register is 80 bits long, of which 64 bits were used for an MMX register. A limitation of this architecture is that An application cannot execute MMX instructions and perform floating-point operations simultaneously. Additionally, a large number of processor clock cycles are needed to change the state of executing MMX instructions to the state of executing floating-point operations and vice versa. SSE does not have such a restriction. Separate registers have been defined for SSE. Hence, applications can execute SIMD integer (MMX) and SIMD floating-point (SSE) instructions simultaneously. Applications can also execute non-SIMD floating-point and SIMD floating-point instructions simultaneously.

The arrangement of the registers in MMX and SSE is illustrated in Figure 2. Figure 2(a) illustrates the mutually exclusive floating-point and MMX registers, while Figure 2(b) illustrates the SSE registers.

Figure 2: Registers in MMX and SSE.

MMX and SSE have one more similarity: Both have eight registers. MMX registers are named mm0 through mm7, while SSE registers are named xmm0 through xmm7.
4. Application Areas

The Pentium III SSE instructions allow for SIMD operations on four single-precision floating-point numbers in one instruction. Therefore, applications that utilize floating-point calculations stand to benefit the most from the usage of SSE. Applications related to 3D graphics, in particular, should see a substantial benefit. In fact, SSE was created specifically for 3D. Games and other applications that use a 3D back-end to display 2D or 2.5D, as well as applications that use vector graphics at the back end, also stand to benefit.
4.1 The Case of 3D

3D graphics typically consists of manipulating large sets of floating-point numbers used to specify the position of a point in 3D space. Manipulation involves performing floating-point calculations on the data set. With the help of SSE instructions, the application will be able to process more data per instruction. The resulting speed increase translates into a better experience for the user.

Application developers can also add more data or more complex calculations to create better effects and generate a richer experience for the user.

Some of the commonly used operations in 3D processing that stand to benefit from usage of SSE instructions are matrix multiplication, matrix transposition, matrix-matrix operations like addition, subtraction, and multiplication, matrix-vector multiplication, vector normalization, vector dot product, and lighting calculations.
5. Streaming SIMD Extensions

SSE adds 70 new instructions to the processor. With the new instructions, a new status/control word has been added. SSE requires support from the operating system, which can save and restore the processor state as required. At present, the only operating systems that support SSE are Microsoft's Windows 98 and Windows 2000 operating systems. SSE defines new instructions, new data types and instruction categories.

Not all of the new instructions are for SIMD floating point. Of the 70, 50 are SIMD for floating point, 12 are for SIMD integer, and the remaining eight are cacheability instructions. In this article, we will concentrate on the 50 instructions provided for SIMD on single-precision floating-point numbers.
5.1 Classification

The new SIMD floating-point instructions in the Pentium III can be classified in various ways, based on different criteria.

If we classify instructions based on the arrangement of the operands of the instructions, the classification will be as given in the "Data Packing" subsection.

If we classify instructions based on their behavioral characteristics, the classification will be as given in the "Instruction Categories" subsection.

If we classify instructions based on their computational characteristics, the classification will be as given in the "Instruction Groups" subsection.
5.1.1 Data Packing

If we classify the SIMD floating point instructions based on how the data to be manipulated is stored, we have two broad categories: instructions operating on packed data and instructions operating on scalar data. Hence, we have packed instructions and scalar instructions.

In the Pentium III instruction set, the packed and scalar instructions can be distinguished quite easily. Packed instructions have the "ps" suffix, while the scalar instructions have the "ss" suffix.

The new data type defined by SSE allows the storage of four single-precision floating-point numbers. A data element of the new data type can be represented as shown in Figure 3. An element "A" of the new data type can store four single-precision floating point numbers, here labeled a0, a1, a2 and a3.

Figure 3: New data type in SSE.

The arrangement of Figure 3 helps clarify the difference between packed and scalar instructions. SSE packed instructions operate on all four elements of the data type. Scalar instructions operate only on the lowest significant element of the data type, leaving the remaining three elements unchanged.

A packed instruction is illustrated in Figure 4. As shown, there are two operands, A and B, each of the new data type defined by SSE. The result of the operation op will be stored in C. C has the same data type as A and B. The elements of A are a0, a1, a2 and a3, while the elements of B are b0, b1, b2 and b3. The result of executing the instruction op is shown in Figure 4.

Figure 4: Packed operation.

The instruction is applied to each of the elements of A and B, and the results are stored in corresponding elements of C. The instruction op is applied to all the elements at the same time, computing all four elements in the same unit of computation.

A scalar instruction using the same operands, A and B, is illustrated in Figure 5. The instruction is applied to only the least-significant element of A and B. Again, the results are stored in corresponding elements of C. The instruction op is applied to all the elements at the same time, again computing all four elements in the same unit of computation, though only one element is affected.

Figure 5: Scalar operation.
5.1.2 Instruction Categories (collected according to behavioral characteristics)

Most of the packed instructions in SSE have a scalar equivalent. SSE instructions can also be classified, without respect to arrangement of data, into the following instruction categories:
computation instructions
branching instructions
cacheability instructions
data movement and ordering instructions
type conversion instructions, and
state management instructions
5.1.3 Instruction Groups (collected according to computational characteristics)

SSE instructions can also be classified into the following instruction groups:
arithmetic instructions
comparison instructions
logical instructions
shuffle instructions
conversion instructions
state management instructions
cacheability instructions
data management instructions, and
additional integer instructions
5.2 Benefits

The primary benefit of SSE is a reduction in the number of instructions executed for the given data set. Without SIMD and SSE, multiplying each of 400 floats by a number would require looping through the data set, executing the multiplication operator 400 times. With SIMD and SSE, 100 multiplications could perform the same task, as each multiplication can operate on four floats simultaneously.

The number of instructions when using SIMD and SSE is not an exact one-fourth of the non-SIMD case. Some instructions will be required to rearrange the data so that it is in a format acceptable to the SIMD instructions. In practice, SIMD instructions should be able to roughly double the efficiency of the operations.

Sunday, February 17, 2008

Understanding the Microprocessor

Understanding the Microprocessor

Basic Computing Concepts

Introduction

I've been writing on CPU technology here at Ars for almost five years now, and during that time I've done my best to communicate computing concepts in as plain and accessible a manner as possible while still retaining some level of technical sophistication. Without exception, though, all of my CPU articles have been oriented towards the investigation of technologies currently on the market; I've written no general introduction to any of the concepts that I've used in these investigations, opting instead to integrate some introductory material into the more advanced discussions as space allows. As a result, I always get feedback from people who express regret that there were portions of my articles that they didn't understand due to their lack of background in the material. This is unfortunate, and for some time I've considered doing a more generalized introduction to the basic concepts in computing. Events have recently conspired to afford me that opportunity, hence the present article, which is the first in a series on the basics of microprocessor technology.

There are a number of good reasons to do an article like this, now. One reason, as I've suggested above, is to provide readers with a better background for understanding my previous work. After reading this article you should be able to go back and revisit some older articles that you only half digested and get more out of them. But the main reason for doing a general introduction to microprocessor technology is forward-looking: that there are a number of new processors slated to come out in the next year and this article will help to lay the groundwork for my coverage of those designs. Itanium2, Yamhill, the PPC 970, AMD's Hammer, and even the Playstation3 are all on the horizon, and we at Ars want to be proactive about helping you get ready to understand what makes all of those technologies tick.

Due to the continuing success of the Ars RAM Guide, I've chosen to model the present series on it. I'll start out at a very basic level with this first article, and as the series progresses I'll advance along the axes of chronology and complexity from older, more primitive technologies to newer, more advanced ones. The one important difference between this article and the RAM guide is in this article's relative lack of real-world examples. There are a number of reasons why I've chosen to forego detailed discussions of present-day implementations, but the primary one is that such discussions constitute.

Judging by the steady stream of feedback I've gotten on it over the years, the following, which was part of my article on SIMD, has proven to be one of the most popular diagrams I've ever made. (It's my vain suspicion that it had some influence on the Intel hyper-threading ads that previously adorned certain pages here at Ars.) This being the case, I want to develop our general discussion of the types of tasks computers do by first presenting this simple conceptual diagram and then elaborating on it and nuancing it until a more complete picture of the microprocessor emerges.

The above diagram is a variation on the traditional way of representing a processor's arithmetic logic unit (ALU), which is the part of the processor that does the actual addition, subtraction, etc. of numbers. However, instead of showing two operands entering the top ports and a result exiting the bottom (as is the custom in the literature) I've depicted a code stream and a data stream entering and a results stream leaving. For the purposes of our initial discussion, we can generalize by saying that the code stream is made up of different types of operations and the data stream consists of the data on which those operations operate. To illustrate this point and to put a more concrete face on the diagram above, imagine that one of those little black boxes is an addition operator (a "+" sign) and two of the white boxes contain the two integers to be added.

The kind of simple arithmetic operation pictured above represents the sort of thing that we intuitively think computers do: like a pocket calculator, the computer takes numbers and arithmetic operators (like +, -, /, >, <, etc.) as input, performs the requested operation, and then spits out the results. These results might be in the form of pixel values that make up a rendered scene in a computer game, or they might be entries in a spreadsheet.

What is Microprocessor???

What is Microprocessor???

A microprocessor is a computer processor on a microchip. It's sometimes called a logic chip. It is the "engine" that goes into motion when you turn your computer on. A microprocessor is designed to perform arithmetic and logic operations that make use of small number-holding areas called registers. Typical microprocessor operations include adding, subtracting, comparing two numbers, and fetching numbers from one area to another. These operations are the result of a set of instructions that are part of the microprocessor design. When the computer is turned on, the microprocessor is designed to get the first instruction from the basic input/output system (BIOS) that comes with the computer as part of its memory. After that, either the BIOS, or the operating system that BIOS loads into computer memory, or an application progam is "driving" the microprocessor, giving it instructions to perform.

Microprocessor