Microprocessor: July 2008

Friday, July 4, 2008

Glossary of PC Processors by x86.org

CISC=Complex Instruction Set Computer. A CISC microprocessor is one in which the number of bytes needed to represent the opcode instruction is not a fixed, regular length (for example, 32-bits each).

Opcode=The data that represents a microprocessor instruction.

RISC=Reduced Instruction Set Computer. A RISC microprocessor has fewer, simpler instructions than its CISC counterpart. RISC instructions perform simple, rudimentary functions. The simplicity of these instructions results in a very simple microprocessor design that can execute very fast. RISC instruction sets are typically characterized by fixed-length instructions (for example, all instructions are 32 bits each).

EPIC=Explicitly Parallel Instruction Computing. EPIC is a fancy acronym that Intel invented to obscure the fact that their Merced microprocessor is actually a VLIW design. After all, Intel didn't invent VLIW, therefore they don't want to be publicly associated with a VLIW design.

VLIW=Very Long Instruction Word. A microprocessor that packs many simple RISC-like instructions into a much longer internal instruction word format. A VLIW microprocessor will usually have execution units, capable of executing all of the instructions contained in the instruction word, in parallel.

MMX=Multi-Media eXtensions. Intel claims that "MMX" is not an acronym for "Multi-Media eXtensions," because they have filed for a trademark on the name. In reality, MMX instructions are intended to speed up programs that have multimedia capabilities.

Address Lines=The number of address lines, or "address pins," on a microprocessor determines how much memory the chip can address. The amount of addressable memory can be calculated as 2^#address_lines (two raised to the power of the number of address lines). A microprocessor with 32 address lines can address 2^32 bytes of memory (4 GB).
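As a quick check of the formula, here is a minimal Python sketch. The helper names `addressable_bytes` and `human_readable` are made up for illustration. It works through the address-line counts that come up in this guide: 20, 24, and 32 lines, plus 36 lines (32 plus the four extra lines of the Pentium Pro).

```python
def addressable_bytes(address_lines: int) -> int:
    """Return 2**address_lines, the number of bytes that many lines can address."""
    return 2 ** address_lines


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    index = 0
    while num_bytes >= 1024 and index < len(units) - 1:
        num_bytes //= 1024
        index += 1
    return f"{num_bytes} {units[index]}"


if __name__ == "__main__":
    for lines in (20, 24, 32, 36):
        print(lines, "address lines ->", human_readable(addressable_bytes(lines)))
```

Each additional address line doubles the address space, which is why four extra lines turn 4 GB into 64 GB.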

Cache=A cache is a bank of high-speed memory that stores the most recently accessed code and data. When the microprocessor requests data that is in the cache, the time needed to retrieve the data is many times less than the time needed to access main memory. Many microprocessors have a cache inside the chip itself. In some cases, there is a cache for the cache (known as a 2nd-level cache). A cache may hold code, data, or even recently accessed data on a hard disk. In general, a cache can be created for faster access to any slower device, be it main memory or hard disks.
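The idea can be sketched in a few lines of Python. This is only an illustration, not how a hardware cache is built: the class name `TinyCache`, the `backing_read` callback, and the least-recently-used eviction rule are all assumptions chosen to keep the example short.

```python
from collections import OrderedDict


class TinyCache:
    """Keep the most recently accessed items in front of a slower backing store."""

    def __init__(self, backing_read, capacity=4):
        self.backing_read = backing_read  # slow lookup, e.g. main memory or disk
        self.capacity = capacity
        self.entries = OrderedDict()      # address -> data, least recent first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)  # mark as most recently used
            return self.entries[address]
        self.misses += 1
        data = self.backing_read(address)      # fall back to the slow device
        self.entries[address] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        return data


if __name__ == "__main__":
    cache = TinyCache(lambda address: address * 2, capacity=2)
    for address in (1, 2, 1, 3, 2):
        cache.read(address)
    print("hits:", cache.hits, "misses:", cache.misses)
```

Only the repeated read of address 1 is served from the cache here; every other access has to go to the slower backing store.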

PC Processors Guide by x86.org

"The Intel 8086, a new microcomputer, extends the midrange 8080 family into the 16-bit arena. The chip has attributes of both 8- and 16-bit processors. By executing the full set of 8080A/8085 8-bit instructions plus a powerful new set of 16-bit instructions, it enables a system designer familiar with existing 8080 devices to boost performance by a factor of as much as 10 while using essentially the same 8080 software package and development tools.

"The goals of the 8086 architectural design were to extend existing 8080 features symmetrically, across the board, and to add processing capabilities not to be found in the 8080. The added features include 16-bit arithmetic, signed 8- and 16-bit arithmetic (including multiply and divide), efficient interruptible byte-string operations, and improved bit manipulation. Significantly, they also include mechanisms for such minicomputer-type operations as reentrant code, position-independent code, and dynamically relocatable programs. In addition, the processor may directly address up to 1 megabyte of memory and has been designed to support multiple-processor configurations."

-- Intel Corporation, February, 1979

Introduction

Intel introduced the 8086 in 1978 and the 8088 in 1979 as microprocessor extensions to the 8080 product line. Since that time, the x86 product line has gone through six generations and become the most successful microprocessor family in history. Much of this success was due to the success of the IBM-PC and its clones. Intel was at the right place at the right time when IBM made their historic decision to use the 8088. Today, the x86 market is a multi-billion dollar industry, selling tens of millions of units per year.

The huge popularity of these x86 chips has led to a prosperous x86 clone industry. AMD, Cyrix, IBM, TI, UMC, Siemens, NEC, Harris, and others have all dabbled in the x86 chip industry. Today, AMD, Cyrix, and Centaur are still actively competing. However, there are numerous stories that approximately one dozen other companies are actively working to create an x86 clone chip. Ultimately, the market will decide who survives and who perishes in this cutthroat chip business.

Merced

Merced is Intel's future-generation microprocessor architecture; Merced is not Intel's next-generation x86 microprocessor. Instead, Merced is a completely new microprocessor design and instruction set. Merced will have the means to run legacy x86 programs via a hardware translation mechanism.

The Merced design is a further departure from its x86 predecessors. Merced is neither CISC (Complex Instruction Set Computer) nor RISC (Reduced Instruction Set Computer), but closely resembles a VLIW design (Very Long Instruction Word). Intel doesn't want to call this chip VLIW, ostensibly for political reasons (not invented here). Instead, Intel has coined the term EPIC -- Explicitly Parallel Instruction Computing. For all intents and purposes, EPIC is VLIW.

First versions of Merced will run between 600 MHz and 1000 MHz (1 GHz). x86 compatibility is achieved internally via a hardware translation mechanism. This translation mechanism enables Merced to run your existing Windows and DOS applications. Intel claims that Merced will never run x86 programs as fast as the current state-of-the-art x86 microprocessors. Therefore, Intel is aiming Merced at the server and workstation market, and will offer x86 compatibility as a matter of convenience.

Xeon

The Xeon is a Pentium II on steroids. The Pentium II contains a P6 (Pentium Pro) core with a 1/2-speed 2nd-level cache (L2 cache). Xeon differs from Pentium II in that the L2 cache runs at full processor speed. Pentium II connects to the motherboard in a slot named Slot-1. Xeon is not slot-compatible with Pentium II. Instead, Xeon uses a new slot named Slot-2.

Xeon is Intel's high-end microprocessor brand for the computer server market. As such, Xeon is priced much higher than Pentium II or Celeron.

Celeron

The Celeron microprocessor is a stripped-down Pentium II. This chip is affectionately known as "the Castrated One." The Celeron is offered without any 2nd-level cache, making its performance lackluster, and reportedly slower than a Pentium (with MMX) running at nearly one-half of its speed.

Intel was completely caught off-guard by the advent and popularity of the sub-$1000 PC. In fact, the sub-$1000 PC completely ruins Intel's marketing game plan. Therefore, Intel hastily created the Celeron in an attempt to regain some of the market share they lost to their competition.

Pentium II

The Pentium II is a glorified Pentium Pro with MMX extensions. The Pentium II has a different package than the Pentium Pro. This new package is called "Slot-1." Officially, Intel claimed technical reasons for needing Slot-1. Industry pundits claimed that Slot-1 was devised to thwart industry competition with the Pentium Pro, and further the Intel monopoly. Strangely, the "technical reasons for needing Slot-1" evaporated as soon as the Pentium Pro was dead, as Intel somehow made a "breakthrough" that they had previously claimed was impossible (thereby "needing" to invent the proprietary Slot-1).

Slot-1 was also promised to be the upgrade path for consumers -- leading many years into the future. Unfortunately, every time Intel has made this promise, first with the 80486, then with the Pentium, the promise has been broken. As soon as Intel saturated the market with Slot-1 computers, they announced the future high performance upgrade path would be "Slot-2."

The Pentium II can address up to 64 GB of main memory, but has cache limitations preventing memory use above 512 MB.

The Pentium Pro

The Pentium Pro was introduced in November 1995 as Intel's 6th generation x86 design -- code-named the "P6." The Pentium Pro offered some minor programming enhancements, four more address lines, and a large 2nd-level cache. By this time, the entire set of "secret" programming features of the Pentium had been revealed. Therefore, Intel abandoned their attempts at keeping most of their new P6 programming features secret. With the four additional address lines, the Pentium Pro could address up to 64 GB of main memory. The addition of the 2nd-level cache gave the Pentium Pro a nice performance boost, but was very expensive to manufacture.

Intel continued their attempts at closing the architecture to the exclusion and elimination of their competition. Intel managed to gain patent protection for some pins on the Pentium Pro socket, thus making this chip very difficult to clone without substantial legal liability. However, Intel wasn't paranoid enough. Intel introduced the Pentium II under the guise that the Pentium Pro could never achieve their performance goals. Intel alleged that the Pentium Pro 2nd-level cache could never run faster than 200 MHz, and therefore they must discontinue the development of this product line. The Pentium II abandoned the socket approach to microprocessors, and introduced the "slot" concept. The Slot-1 of the Pentium II was further enshrouded in patent protection, thereby further raising the cost of competition. Now that the Pentium Pro is dead (along with the possibility that their competition will clone this product), Intel has ironically announced a 400 MHz Pentium II with a full-speed 2nd-level cache.

Regardless of Intel's continued monopolistic business practices, the P6 product line has diversified and flourished. The Pentium II added MMX enhancements and a variety of 2nd-level cache options. Intel has created the Celeron brand to compete in the sub-$1000 market. The Xeon was introduced to compete in the server market with a 100 MHz system bus. As time goes on, Intel will continue to diversify the P6 family product line, most likely with a 200 MHz system bus.

Intel's competitors have gone in a different direction than Intel. Intel's competitors have stayed with a Pentium-compatible pin-out. AMD has continued to develop the K6 processor and added MMX enhancements. Cyrix has added MMX enhancements to the MII product line. Centaur products always contained MMX enhancements. These three companies combined their abilities to create a common set of MMX-3D instruction set extensions. All three companies have announced plans to create a 100 MHz system bus and an integrated 2nd-level cache (though Intel fellow Fred Pollack has said publicly that the 2nd-level cache integration is electronically impossible).

The Pentium

The Pentium processor was a big departure from all past Intel x86 processors. The Pentium name signaled the end to the 80x86 nomenclature, and was spurred by losing a trademark dispute against AMD. The Pentium processor contained more than one execution unit -- making it superscalar. Intel no longer needed (or wanted) any second-source fabs manufacturing their microprocessors; Intel wanted all of the profits to themselves. The Pentium also added many programming enhancements though Intel tried to keep all of them secret.

Intel rapidly diversified the Pentium product line. The original Pentium product ran at 60 and 66 MHz. Shortly thereafter, Intel introduced 90, 100, 120, and 133 MHz versions of the popular processor. Intel introduced low-power versions of the Pentium to be used in notebook computer applications. Finally, Intel introduced the MMX-enhanced processors.

AMD and Cyrix didn't sit idly by and watch Intel expand and dominate the market. AMD introduced the K5 processor -- their first in-house x86 design. However, the K5 was late to market, and was very slow. In response to their bleak outlook for the K5, AMD bought Nexgen. Nexgen created their own x86-compatible microprocessor, calling it the Nx586. At the time of the acquisition, Nexgen had already finished the design of their next-generation processor core -- the Nx686. AMD used the Nx686 core and created the successful K6 processor. AMD has continued to upgrade this processor to include MMX, and other enhancements.

During this time Cyrix introduced the 6x86. The 6x86 was pin-compatible with the Pentium, though the 6x86 nomenclature might lead the consumer to believe that it is a 6th-generation (Pentium Pro) compatible chip. The 6x86 has also been enhanced with MMX instructions as the 6x86 MX. Cyrix has continued to enhance this chip in their attempt to gain more market share.

During the Pentium era, a new Intel competitor emerged. Centaur Technologies (a wholly owned subsidiary of IDT) created a fast, cheap, and somewhat low-power Pentium-compatible chip. Centaur has a low-end (low-cost) market focus. Some industry pundits have called this marketing strategy "bottom-feeding." However, with the emergence and overwhelming popularity of the sub-$1000 PC, Centaur may end up having the last laugh.

The 80486

The 80486 offered little in the way of architectural enhancements over its 80386 predecessor. The most significant enhancement of the 486 family was the integration of the 80387 math coprocessor into the 80486 core logic. Now, all software that required the math coprocessor could run on the 80486 without any expensive hardware upgrades.

Like the 80386 SX, Intel decided to introduce the 80486 SX as a cost-reduced 80486 DX. Unfortunately, Intel chose to ensure that these processors were neither pin-compatible, nor 100% software compatible with each other. Unlike the 80386 SX, the 80486 SX enjoyed the full data bus and address bus of its DX counterpart. Instead, Intel removed the math coprocessor, thereby rendering the 80486 SX somewhat software incompatible with its DX counterpart. To further complicate matters, Intel introduced the 80487 SX -- the "math coprocessor" to the 80486 SX. Intel convinced vendors to include a new socket on the motherboard that could accommodate the 80486 SX and 80487 SX as an expensive hardware upgrade option. Unbeknownst to the consumer, the 80486 SX was an 80486 DX with a non-functional math unit (though later versions of the chip actually removed the math unit). The 80487 SX was a full 80486 DX with a couple of pins relocated on the package -- to prevent consumers from using the cheaper 80486 DX as an upgrade option. In this regard, Intel created a marketing deception. Intel marketed the 80487 SX as a math coprocessor to the 80486 SX. In reality, the 80487 SX electronically disabled the 80486 SX when installed, thereby relegating this chip to the status of an expensive space heater. Sadly, the consumer never knew or even suspected Intel of playing such manipulative games.

As with the 80386, Intel began to diversify their 80486 offerings. Low-power versions of the chip were introduced. The 80486 SL was introduced along with the 80386 SL as an integrated, low-power chip for notebook applications. The 80486 DX2 and DX4 were introduced, which doubled and tripled the core clock frequency. Power-saving features from the SL were introduced in later versions of the DX4. Finally, after Intel introduced the Pentium chip, they produced a version of the Pentium that was pin-compatible with the 80486. They called this chip an "overdrive" processor.

Likewise, AMD and Cyrix continued to pursue their own 486-compatible chip solutions. AMD introduced many Am486 variants. Cyrix continued their nomenclature by calling an 80486-compatible chip the Cyrix 5x86. TI continued to manufacture Cyrix chips, and eventually started their own in-house microprocessor design (though the effort eventually failed). UMC entered the CPU market, but later withdrew because of patent infringement problems. IBM began manufacturing for Cyrix, while still pursuing their own microprocessor designs (the Blue Lightning series).

The 80386

In 1985, Intel introduced the 80386. Like the 80286 before it, the 80386 added significant programming and addressability enhancements. Protected mode was enhanced to allow easy transitions between it and real mode (without resetting the microprocessor). Another new operating mode (v86 mode) was introduced to allow DOS programs to execute within a protected mode environment. Addressability was further enhanced to 32 bits, giving the 80386 four gigabytes of memory addressability (2^32 = 4 GB).

Also like the 80286, the 80386 was not introduced in any computer systems for many years after its introduction. Compaq was the first mainstream company to introduce an 80386-based computer -- beating IBM to market. Regardless, the 80386 enjoyed a very long life for home and business computer users. This long life was largely due to the programming extensions in the 80386 -- namely the ability to create a protected mode operating system to take advantage of all 4 GB of potential memory while still being able to run legacy DOS applications.

Shortly after the 80386 was introduced, Intel introduced the 80386 SX. To avoid confusion, Intel renamed the 80386 to the 80386 DX. The SX was a cost-reduced 80386 with a 16-bit data bus and a 24-bit address bus. The 16-bit data bus meant the SX was destined to have lower memory throughput than its DX counterpart, while the 24-bit address bus meant that the SX could only address 16 MB of physical memory (2^24 = 16 MB). Regardless of the address bus and data bus differences, the SX and DX were software compatible with each other. Intel also introduced the 80376 as part of the 80386 family. The 376 was an 80386 SX that exclusively ran in protected mode.

During its long reign, the 80386-based computer began to evolve. Chipset vendors began dreaming of ways they could help improve the performance of the computer, thus giving their products a competitive advantage. One of the innovations was the introduction of the memory cache. The memory cache within the chipset would play a huge role in Intel's future product plans. First, Intel introduced a cache. Later, they incorporated the cache into the microprocessor itself. Intel also made their second failed attempt at chip integration. The 80386 SL integrated core logic, chipset functionality, and power-saving features into the microprocessor.

During this time, the popularity of the personal computer, and most notably of its Intel microprocessors, didn't escape the notice of many entrepreneurs wishing to cash in on Intel's business. AMD began their own "x86" microprocessor division. IIT began cloning the Intel math coprocessors. Other small startups, such as Cyrix and Nexgen, decided they too could design an Intel-compatible microprocessor. The aspirations of these companies didn't sit well with Intel. Shortly thereafter, Intel began taking measures to ensure their own dominance in the industry -- to the exclusion of everybody else. Hence, Intel began what many believe are anti-competitive (illegal) business practices.

In spite of Intel's business practices, many 80386 clones began to appear. AMD marketed the Am386 microprocessors in speeds from 16 MHz to 40 MHz, though it was possible to overclock this chip up to 80 MHz. IBM introduced the 386 SLC, a low-power 386 with an integrated 8-KB cache. IBM created other 386/486 hybrid chips -- some that were pin-compatible with Intel, and others that were not. Chips and Technologies created their own 386 clone. Cyrix stunned everybody by offering a 386 pin-compatible CPU, but called it a 486 (a nomenclature pattern that Cyrix still uses). Texas Instruments served as a foundry for Cyrix, and negotiated rights to produce chips under their own name. Eventually, TI produced their own chips (based on the Cyrix core), with their own unique enhancements.

The 80286

In 1982, Intel introduced the 80286. For the first time, Intel did not simultaneously introduce an 8-bit bus version of this processor (à la 80288). The 80286 introduced some significant microprocessor extensions. Intel continued to extend the instruction set; but more significantly, Intel added four more address lines and a new operating mode called "protected mode." Recall that the number of address lines directly relates to the amount of physical memory that can be addressed by the microprocessor. The 8086, 8088, 80186, and 80188 all contained 20 address lines, giving these processors one megabyte of addressability (2^20 = 1 MB). The 80286, with its 24 address lines, gives 16 megabytes of addressability (2^24 = 16 MB).

For the most part, the new instructions of the 80286 were introduced to support the new protected mode. Real mode was still limited to the one megabyte program addressing of the 8086, et al. For all intents and purposes, a program could not take advantage of the 16-megabyte address space without using protected mode. Unfortunately, protected mode could not run real-mode (DOS) programs. These limitations thwarted attempts to adopt the 80286 programming extensions for mainstream consumer use.

IBM was spurred by the huge success of the IBM PC and decided to use the 80286 in their next generation computer, the IBM PC-AT. However, the PC-AT was not introduced until 1984 -- two years after introduction of the 80286.

During the reign of the 80286, the first "chipsets" were introduced. The computer chipset was nothing more than a set of chips that replaced dozens of other peripheral chips, while maintaining identical functionality. Chips and Technologies became one of the first popular chipset companies.

Like the IBM PC, the PC-AT was hugely successful for home and business use. Intel continued to second-source the chips to ensure an adequate supply of chips to the computer industry. Intel, AMD, IBM, and Harris were known to produce 80286 chips as OEM products, while Siemens, Fujitsu, and Kruger either cloned it, or were also second sources. Between these various manufacturers, the 80286 was offered in speeds ranging from 6 MHz to 25 MHz.

The 80186 / 80188

Intel continued the evolution of the 8086 and 8088 by introducing the 80186 and 80188. These processors featured new instructions and new fault tolerance protection, and were Intel's first of many failed attempts at the x86 chip integration game.

The new instructions and fault tolerance additions were logical evolutions of the 8086 and 8088. Intel added instructions that made programming much more convenient for low-level (assembly language) programmers. Intel also added some fault tolerance protection. The original 8086 and 8088 would hang when they encountered an invalid computer instruction. The 80186 and 80188 added the ability to trap this condition and attempt a recovery method.

Intel integrated this processor with many of the peripheral chips already employed in the IBM-PC. The 80186 / 80188 integrated interrupt controllers, interval timers, DMA controllers, clock generators, and other core support logic. In many ways, this chip was produced a decade ahead of its time. Unfortunately, this chip didn't catch on with many hardware manufacturers, which spelled the end of Intel's first attempt at CPU integration. However, this chip has enjoyed tremendous success in the world of embedded processors. If you look on your high performance disk drive or disk controller, you might still see an 80186 being used.

Eventually, many embedded processor vendors began manufacturing these chips as a second source to Intel, or as clones of their own. Between the various vendors, the 80186/80188 was available in speeds ranging from 6 MHz to 40 MHz.

The 8086 / 8088

The 8086 and 8088 were binary compatible with each other, but not pin-compatible. Binary compatibility means that either microprocessor could execute the same programs. Pin-incompatibility means that you can't plug an 8086 into an 8088 socket, or vice versa, and expect the chip to work. The new "x86" chips implemented a Complex Instruction Set Computer (CISC) design methodology.

The 8086 and 8088 both feature twenty address pins. The number of address pins determines how much memory a microprocessor can access. Twenty address pins gave these microprocessors a total address space of one megabyte (2^20 = one megabyte).

The 8086 and 8088 featured different data bus sizes. The data bus size determines how many bytes of data the microprocessor can read in each cycle. The 8086 featured a 16-bit data bus, while the 8088 featured an 8-bit data bus. IBM chose to implement the 8088 in the IBM-PC, thus saving some cost and design complexity.

At the time IBM introduced the IBM-PC, a fledgling Intel Corporation struggled to supply enough chips to feed the hungry assembly lines of the expanding personal computer industry. Therefore to ensure sufficient supply to the personal computer industry, Intel subcontracted the fabrication rights of these chips to AMD, Harris, Hitachi, IBM, Siemens, and possibly others. Amongst Intel and their cohorts, the 8086 line of processors ran at speeds ranging from 4 MHz to 16 MHz.

It didn't take long for the industry to start "cloning" the IBM-PC. Many companies tried, but most of them failed because their BIOS was not compatible with the IBM-PC BIOS. Columbia, Kayro and others went by the wayside because they were not totally PC-compatible. Compaq broke through the compatibility barrier with the introduction of the Compaq portable computer. Compaq's success created the turning point that enabled today's modern computer industry.

NEC was the first to "clone" this new Intel chip with their V20 and V30 designs. The V20 was pin-compatible with the 8088, while the V30 was pin-compatible with the 8086. The V-series ran approximately 20% faster than the Intel chips when running at the same clock speed. Therefore, the V-series chips provided a cheap "upgrade" to owners of the IBM-PC and other clone computers. The V-series chips were very interesting. These chips were introduced in 1985 at approximately the same time as Intel's introduction of the 80386. The 80386 was still years away from production, and the 80286 was just barely being accepted in the IBM-PC/AT. Even though these chips were pin-compatible with the 8086 and 8088, they also had some extensions to the architecture. They featured all of the "new" instructions on the 80186 / 80188, and also were capable of running in an 8080 emulation mode (directly running programs written for the 8080 microprocessor).

Branch prediction in the Pentium family

What is branch prediction?

Imagine a simple microprocessor where all instructions are handled in two steps: decoding and execution. The microprocessor can save time by decoding one instruction while the preceding instruction is executing. This assembly-line principle is called pipelining. In advanced microprocessors, the pipeline may have many steps so that many consecutive instructions are underway in the assembly line at the same time, one at each stage in the pipeline.

The problem now occurs when we meet a branch instruction. A branch instruction is the implementation of an if-then-else construct. If a condition is true then jump to some other location; if false then continue with the next instruction. This gives a break in the flow of instructions through the pipeline because the processor doesn't know which instruction comes next until it has finished executing the branch instruction. The longer the pipeline, the longer time it will have to wait until it knows which instruction to feed next into the pipeline. As modern microprocessors tend to have longer and longer pipelines, there has been a growing need for doing something about this problem.

The solution is branch prediction. The microprocessor tries to predict whether the branch instruction will jump or not, based on a record of what this branch has done previously. If it has jumped the last four times then chances are high that it will also jump this time. The microprocessor decides which instruction to load next into the pipeline based on this prediction, before it knows for sure. This is called speculative execution. If the prediction turns out to be wrong, then it has to flush the pipeline and discard all calculations that were based on this prediction. But if the prediction was correct, then it has saved a lot of time.

The detective work

Intel manuals have never been very specific about how the branch prediction works. However, since mispredictions are expensive in terms of execution time, I found it important to know how the prediction works in order to optimize my programs. I started to do a lot of experiments together with some clever persons I had met on the Internet, most importantly Karki J. Bahadur at the University of Colorado, and Terje Mathisen in Norway, the guy who reverse engineered system software to find out how to get access to the performance monitor counters on the Pentium chip. Well, my first finding was that the Pentium predicts a branch instruction to jump if it has jumped either of the last two times. This fitted all my experiments, but Karki pointed out that a branch which jumps every third time is predicted correctly one time out of six, where, according to my first model, it should never be predicted correctly. Then followed a series of new experiments until Karki and I independently came out with the same state diagram, shown in fig. 1a. While we agreed on this mechanism, we disagreed on the interpretation and in particular on why it was asymmetric. In the meantime, another guy had found an old article in Microprocessor Report claiming that the mechanism was a symmetric one as illustrated in fig. 1b. My opinion was that the designers had actually intended the mechanism to be as in fig. 1b and that the asymmetry was a bug. But Karki and Terje maintained that there had to be an intention behind this asymmetry. It didn't convince them that I demonstrated how the symmetric mechanism was superior to the asymmetric one in almost all cases.

Now I discovered a powerful tool to dig deeper into this mechanism. The Pentium has a set of test registers that make it possible to read or write directly into the area that holds the history information for all branches, the branch target buffer (BTB). I had found this information on the home page of another hacker, Christian Ludloff. His page was shut down (rumors say that this was due to pressure from Intel) but fortunately I had downloaded his page before it was too late. Having direct access to the BTB, I was able to see exactly what happened: When a branch does not have an entry in the BTB it is predicted to not jump. The first time it jumps it gets an entry in the BTB and immediately goes to state 3. The complication is that the designers have equated state 0 with 'vacant BTB entry'. This makes sense because state 0 is predicted to not jump anyway. But since it cannot distinguish between state 0 and a vacant BTB entry it will go to state 3 next time the branch jumps rather than to state 1. This is where the quirk comes from. Apparently, somebody at the design labs has done a lot of research to find a good branch prediction scheme, and then somebody else has messed it all up by letting state 0 mean vacant BTB entry without realizing the consequence. And the consequence is that a branch which seldom jumps will have three times as many mispredictions as it would with the symmetric design.

Figure 1 -- State diagram for branch prediction mechanism

a. asymmetric design in the Pentium:

The state follows the +arrows when the branch instruction jumps, and the -arrows when not jumping. The branch instruction is predicted to jump next time if in state 2 or 3, and to not jump when in state 0 or 1.

b. symmetric design:

This is how the branch prediction should work. The state is incremented when jumping (+arrows) and decremented when not jumping (-arrows).
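The difference between the two diagrams can be checked with a short simulation. The Python sketch below is not taken from any Intel documentation: it assumes the transitions exactly as described in the text (in the asymmetric design a jump from state 0 goes straight to state 3, every other transition is an ordinary saturating increment or decrement), and the function names are made up for illustration.

```python
def step_symmetric(state, jumped):
    """Fig. 1b: two-bit saturating counter (states 0..3)."""
    return min(state + 1, 3) if jumped else max(state - 1, 0)


def step_asymmetric(state, jumped):
    """Fig. 1a as described in the text: state 0 doubles as 'vacant BTB entry',
    so a jump from state 0 lands in state 3 instead of state 1 (assumed rule)."""
    if jumped and state == 0:
        return 3
    return step_symmetric(state, jumped)


def count_mispredictions(step, outcomes, state=0):
    """Predict 'jump' in states 2 and 3, 'no jump' in states 0 and 1."""
    misses = 0
    for jumped in outcomes:
        if (state >= 2) != jumped:
            misses += 1
        state = step(state, jumped)
    return misses


# A branch that seldom jumps: one jump followed by nine non-jumps, 100 times over.
rare = ([True] + [False] * 9) * 100

print(count_mispredictions(step_symmetric, rare))   # 100
print(count_mispredictions(step_asymmetric, rare))  # 300
```

Under these assumptions the asymmetric design mispredicts three times per jump (the jump itself and the two following non-jumps), which matches the factor of three mentioned above.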

More quirks

We soon found that there were more strange things about the Pentium's branch prediction. We couldn't make sense of what happened when more branch instructions came close after one another. This time Karki and Terje came with the 'wild' ideas that led to the solution, while I played the role of the sceptic. After a hectic period where we exchanged results by E-mail every day, we found that the BTB information may actually be stored several instructions ahead of the branch it refers to. If there happens to be another branch in between then the BTB information is likely to be misapplied to somewhere in the wrong branch. This can lead to many funny phenomena: a branch instruction can have more than one BTB entry; two branches can share the same BTB entry so that one branch is predicted to go to the target of the other one; an unconditional jump instruction can be predicted to not jump; and a non-jumping instruction can be predicted to jump. I will not go into detail with all these quirks here, but you can find it all on my homepage. None of these quirks are fatal, because all mispredictions eventually get corrected.

A much more powerful mechanism

The later processors in the Pentium family: the Pentium MMX, Pentium Pro, Pentium II, Celeron, and Xeon, all have a much more advanced branch prediction mechanism. I will refrain from more detective stories here and go right to the mechanism.

Figure 2 - Two level branch prediction in Pentium MMX, Pentium Pro, and Pentium II

Level two consists of 16 two-bit counters of the type in fig. 1b. Level one is a four bit shift register storing the history of the last four events. This four bit pattern is used to select which of the 16 two-bit counters to use for the next prediction.

This mechanism is based on the same fundamental idea as the state diagram in fig. 1b, which is simply a two-bit counter with saturation. The counter is incremented when jumping and decremented when not jumping. The branch instruction is predicted to jump next time if the counter is in state 2 or 3, and to not jump if in state 0 or 1. This mechanism makes sure that the branch has to deviate twice from what it does most of the time before the prediction changes.

The improvement in the later processors comes from the so-called two-level branch prediction. The first level is a shift register that stores the history of the last four events for any branch instruction. This gives sixteen possible bit patterns. You get a pattern of 0000 if the branch did not jump the last four times, and a pattern of 1111 after four times of jumping. The second level in the branch prediction mechanism is constituted of sixteen 2-bit counters of the type in fig. 1b. It uses the 4-bit pattern in the first level to choose which of the sixteen counters to use in the second level. See fig. 2.

The advantage of this mechanism is that it can learn to recognize repetitive patterns. Imagine a branch that jumps every second time. You can write this pattern as 01010101 where 0 means no-jump and 1 means jump. After 0101 always comes a 0. Every time this happens, the counter with the binary number [0101] will be decremented until it reaches its lowest state. It has now learned that after 0101 comes a 0 and will therefore make this prediction correctly the next time. Similarly, counter number [1010] will be incremented until it reaches state three, so that it will always predict a 1 after 1010. The remaining fourteen counters for this branch are never used as long as the pattern is the same.
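The learning behaviour described here can be reproduced with a small model. The following Python sketch is only an illustration of the description above: the class and function names are hypothetical, and the power-on state (history 0000, all counters in state 0) is an assumption, since the text does not specify it.

```python
class TwoLevelPredictor:
    """History of one branch: a 4-bit shift register that selects
    one of sixteen two-bit saturating counters."""

    def __init__(self):
        self.history = 0           # last four outcomes, newest in bit 0 (assumed start)
        self.counters = [0] * 16   # each counter holds a state 0..3 (assumed start)

    def predict(self):
        # States 2 and 3 predict 'jump', states 0 and 1 predict 'no jump'.
        return self.counters[self.history] >= 2

    def update(self, jumped):
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if jumped else max(c - 1, 0)
        self.history = ((self.history << 1) | int(jumped)) & 0xF


def run(outcomes):
    """Return a list of booleans, True wherever the prediction was wrong."""
    predictor = TwoLevelPredictor()
    misses = []
    for jumped in outcomes:
        misses.append(predictor.predict() != jumped)
        predictor.update(jumped)
    return misses


# The 0101... pattern from the text: no-jump, jump, no-jump, jump, ...
misses = run([False, True] * 50)
print(sum(misses[:20]), sum(misses[20:]))
```

In this model the mispredictions all fall in the first few iterations, while counters [0101] and [1010] are still being trained; after that the alternating pattern is predicted without error.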

This mechanism is quite powerful as it can handle complex repetitive patterns like 00101-00101-00101. In fact, it can handle any repetitive pattern with a period of up to five, most patterns of period six and seven, and even some patterns with periods as high as sixteen. To see if a pattern of period n can be handled without misprediction, write down the n 4-bit sub-sequences in the pattern. If they are all different, then you will have no mispredictions after an initial learning time of two periods.
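The pencil-and-paper test in this paragraph is easy to automate. The helper below is a minimal sketch with a made-up name; it takes one period of the pattern as a string of 0s and 1s, treats it as cyclic, and reports whether all n 4-bit sub-sequences are different. It only checks the sufficient condition stated above and says nothing about patterns that fail it.

```python
def all_windows_distinct(pattern, width=4):
    """True if the n cyclic sub-sequences of the given width are all different."""
    n = len(pattern)
    # Repeat the period often enough that every window can be sliced directly.
    extended = pattern * (width // n + 2)
    windows = {extended[i:i + width] for i in range(n)}
    return len(windows) == n


print(all_windows_distinct("00101"))   # True: 0010, 0101, 1010, 0100, 1001
print(all_windows_distinct("000001"))  # False: 0000 occurs twice
```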

But the two-level mechanism is more powerful than that. It is also extremely good at handling deviations from a regular pattern. If a branch instruction has an almost regular pattern with occasional deviations, then the processor will soon learn what the deviations look like, so that it can handle almost any kind of recurrent deviation with only one misprediction.

Furthermore, it can handle a situation where you alternate between two different repetitive patterns. Assume that you have given the processor one repetitive pattern until it has learned to handle it without mispredictions. Then another pattern. And then return to the first pattern. If the two patterns do not have any 4-bit subsequences in common, then they do not use the same counters, so the processor doesn't have to re-learn the first pattern. Therefore, it can handle the transitions back and forth between the two patterns with a minimum of mispredictions.

Conclusion

The first microprocessor in the Pentium family introduced a simple one-level branch prediction mechanism with many ludicrous quirks. The later versions, Pentium MMX, Pentium Pro, Pentium II, etc. have longer pipelines and therefore a higher need for effective branch prediction. This need has been met by the incredibly powerful two-level mechanism with its ability to learn and recognize repetitive patterns and even deviations from the regular patterns. This mechanism is also quite economical in terms of chip area as the history of a branch can be stored in only 32 bits.

The most important shortcoming of the two-level branch prediction is that it is not very good at predicting the branch pattern of a loop control. If, for example, you have a program with a loop that always repeats ten times, then the control instruction at the bottom of the loop will branch back nine times and fall through the tenth time at the cost of one misprediction. For the Pentium Pro and Pentium II, where branch misprediction costs a lot of time, it may actually be advantageous to replace a loop that executes ten times with two nested loops that execute five and two times, in order to avoid mispredictions.
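The loop example can be checked against a model of the two-level mechanism. The Python sketch below is an illustration only: the function name is hypothetical, the counters and history are assumed to start at zero, and each loop-control branch is modelled on its own with a private 4-bit history and sixteen two-bit counters.

```python
def steady_state_misses(pattern, warmup_periods=20):
    """Mispredictions in one period of a repeating branch pattern,
    counted after the predictor has had warmup_periods to learn it."""
    history, counters = 0, [0] * 16
    misses = 0
    for period in range(warmup_periods + 1):
        for ch in pattern:
            jumped = ch == "1"
            # Only the final period is scored; earlier periods are training.
            if period == warmup_periods and (counters[history] >= 2) != jumped:
                misses += 1
            c = counters[history]
            counters[history] = min(c + 1, 3) if jumped else max(c - 1, 0)
            history = ((history << 1) | int(jumped)) & 0xF
    return misses


print(steady_state_misses("1111111110"))  # ten iterations: 1 miss per run of the loop
print(steady_state_misses("11110"))       # inner loop of five iterations: 0
print(steady_state_misses("10"))          # outer loop of two iterations: 0
```

In the ten-iteration case the history 1111 is followed sometimes by a jump and once by the fall-through, so the counter for that history cannot get both right. With the five-by-two nesting every 4-bit history has a unique successor in each of the two control branches.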

AMD 3DNow! undocumented instructions

Introduction

Being involved in computer architecture and computer graphics, I am quite familiar with contemporary processors and graphics hardware. With my hacking attitude I frequently dig into undocumented details of CPUs and graphics hardware and later publish some of my findings on my website. This "professional hacking" has turned into one of my favorite hobbies, and my recent discovery is simply a result of this activity.

How it all started

June 23rd was an unusually cold and rainy day in Warsaw. Approaching my computer at the University in the morning, I decided to free some precious space on my tiny, 1GB "working" hard disk. Half-consciously I began to browse through the deep and complicated structure of folders, trying to find something that I would never need again. I entered the folder with various 3DNow! related files downloaded from the net a long time ago. While viewing one of them I noticed the names and opcodes of instructions, and suddenly I realized that not all of the names looked familiar to me. I compared the names with the instructions listed in the official 3DNow! documents. Clearly, there were three names not matching anything from the manual: PF2IW, PI2FW and PSWAPW.

Tracking the instructions

Judging from their names, the instructions were not unwanted artifacts. Rather, they looked like something carefully designed and later abandoned. Quickly I opened the text editor and entered these three mnemonics together with the necessary assembler directives:
.model small
.586
.k3d
.code
pf2iw mm0, mm1
pi2fw mm0, mm1
pswapw mm0, mm1
end

Then I tried to assemble the program with MASM 6.13, which I use daily for the development of hybrid (C+assembler) utilities. MASM didn't complain about the instructions: it recognized them and translated them properly, showing the machine codes (suffixes), which turned out to be 1C, 0C and BB hex respectively.

Checking the functions

I quickly wrote five short assembler functions whose only purpose was to invoke the three instructions mentioned above as well as the documented instructions PF2ID and PI2FD, so that the results could be compared with the undocumented ones. Then I wrote a simple program in C, invoking these functions with user-entered parameters and printing the results. I compiled both modules using old 16-bit Borland C and MASM 6.13, getting a simple, DOS-based, command line driven program. It all took about 15 minutes. Then I realized that I had no computer to test the stuff on (the PC on my desk has an IDT C6 in it). Fortunately it turned out that my colleague in the next room had a K6-2 in his PC. In a few minutes I was able to run several simple tests devised to discover the exact functions of the new instructions. Given the names of the instructions, their functions were not surprising:
PF2IW is similar to PF2ID, but it returns 16-bit results in the lower halves of the 32-bit words. Out-of-range values are saturated at -32768 and 32767.
PI2FW is similar to PI2FD, but it takes as its input just the lower halves of the two 32-bit words, treating them as signed integers.
PSWAPW swaps the order of the 16-bit words inside a 64-bit word: the least significant 16-bit word becomes the most significant one, and so on.

The more formal description of the new instructions is right below.
PF2IW mmreg1, mmreg2/mem64

Opcode: 0Fh 0Fh / 1Ch

Converts packed floating-point operand to packed 16-bit integer. The instruction is similar to PF2ID, but the result qword contains two 16-bit signed integers in bits 47..32 and 15..0. Result bits 63..48 and 31..16 are cleared.
Function:
IF (mmreg2/mem64[31:0] >= 2^15)
THEN mmreg1[31:0] = 7FFFh
ELSEIF (mmreg2/mem64[31:0] <= -2^15)
THEN mmreg1[31:0] = 8000h
ELSE mmreg1[31:0] = int(mmreg2/mem64[31:0])
IF (mmreg2/mem64[63:32] >= 2^15 )
THEN mmreg1[63:32] = 7FFFh
ELSEIF (mmreg2/mem64[63:32] <= -2^15)
THEN mmreg1[63:32] = 8000h
ELSE mmreg1[63:32] = int(mmreg2/mem64[63:32])
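
The pseudocode above can be tried out in software. The Python sketch below is an emulation written from this description alone, not from AMD documentation: the function names are invented, truncation toward zero is assumed for int(), the result is masked to 16 bits because the text says bits 63..48 and 31..16 are cleared, and NaN inputs are not modelled.

```python
import struct


def pack_floats(lo, hi):
    """Build a 64-bit value holding two single-precision floats (lo in bits 31..0)."""
    return struct.unpack("<Q", struct.pack("<2f", lo, hi))[0]


def pf2iw(src):
    """Emulate PF2IW as described above on a 64-bit integer operand."""
    lo, hi = struct.unpack("<2f", struct.pack("<Q", src))

    def convert(x):
        if x >= 2 ** 15:
            return 0x7FFF            # positive saturation
        if x <= -(2 ** 15):
            return 0x8000            # negative saturation
        return int(x) & 0xFFFF       # assumed: truncate, keep low 16 bits only

    return (convert(hi) << 32) | convert(lo)


print(hex(pf2iw(pack_floats(1.75, -2.5))))            # 0xfffe00000001
print(hex(pf2iw(pack_floats(100000.0, -100000.0))))   # 0x800000007fff
```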
PI2FW mmreg1, mmreg2/mem64

Opcode: 0Fh 0Fh / 0Ch

Packed 16-bit integer to floating-point conversion.
Function:
mmreg1[31:0] = float(mmreg2/mem64[15:0])
mmreg1[63:32] = float(mmreg2/mem64[47:32])
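
A matching software model of this conversion is sketched below in Python. It follows only the two assignments above; the function name is hypothetical, and the source operand is represented as a plain 64-bit integer.

```python
import struct


def pi2fw(src):
    """Emulate PI2FW as described above: convert the signed 16-bit words in
    bits 15..0 and 47..32 to single-precision floats; other bits are ignored."""

    def signed16(value):
        value &= 0xFFFF
        return value - 0x10000 if value & 0x8000 else value

    lo = float(signed16(src))
    hi = float(signed16(src >> 32))
    # Pack the two floats back into one 64-bit value, lo in bits 31..0.
    return struct.unpack("<Q", struct.pack("<2f", lo, hi))[0]


print(hex(pi2fw(0x0000FFFE00000001)))  # 0xc00000003f800000, i.e. -2.0 and 1.0
```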
PSWAPW mmreg1, mmreg2/mem64

Opcode: 0Fh 0Fh / 0BBh

Swap 16-bit words within 64-bit MMX word.
Function:
mmreg1[15..0] = mmreg2/mem64[63..48]
mmreg1[31..16] = mmreg2/mem64[47..32]
mmreg1[47..32] = mmreg2/mem64[31..16]
mmreg1[63..48] = mmreg2/mem64[15..0]
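
The four assignments amount to reversing the order of the 16-bit words. As a small illustration (a Python sketch with an invented function name, operating on a 64-bit integer in place of an MMX register):

```python
def pswapw(src):
    """Emulate PSWAPW as described above: reverse the four 16-bit words."""
    words = [(src >> (16 * i)) & 0xFFFF for i in range(4)]  # words[0] is bits 15..0
    result = 0
    for i, word in enumerate(words):
        result |= word << (16 * (3 - i))
    return result


print(hex(pswapw(0x0123456789ABCDEF)))  # 0xcdef89ab45670123
```

Applying the operation twice returns the original value.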
Significance and usefulness

As usual, the undocumented instructions should not be treated too seriously -- they may disappear at any time from a future product. From a reliable source somewhere on the net I got the information that the above three instructions were dropped from the 3DNow! spec because of the lack of commitment from IDT and Cyrix, so I expect that the IDT WinChip2 does not support them. In the near future we will see if they are supported by the AMD K7. Until then we can treat them just as a curiosity.

Thinking about using the undocumented instructions, I can't see any serious application for the PF2IW instruction, although I believe that its designer had something in mind. The instruction returns its results in two non-adjacent 16-bit fields, and it is not easy to convert the results to a more useful form. It's even easier to convert 4 floats to packed 16-bit integers using PF2ID and PACKSSDW than using PF2IW, since the MMX instruction set does not provide for packing dwords to words without signed saturation.

PI2FW, in turn, can be effectively used to convert packed shorts to floats. The same task is not easy to achieve using only documented instructions, as it requires a 16-bit to 32-bit signed integer conversion, which is not available in the MMX instruction set.

PSWAPW is just what it is -- it may be effectively used to reverse the order of 16-bit data in a quadword.
