Friday, January 23, 2009

Page Address Extensions on the Pentium Pro Processor


Most people never knew that the Pentium's original design included 36-bit addressing and the capability to access 2M page sizes. These extensions were known as Page Address Extensions (PAE) and were to be enabled through CR4. When CR4.PAE=1 (CR4[5]=1), page address extensions were enabled. When CR4.PAE=0, A[35..32] were forced to 0, regardless of what addresses could be generated in protected mode with a descriptor pointing near 4G and an offset pointing above the 4G address space. Even when CR4.PAE=1, addresses above 4G would not be generated unless they were the result of a page-mode paging translation; the only means to access memory above 4G was through these extensions to page mode. This document describes PAE based on what little I know from the Pentium and from preliminary P6 literature. It also covers extensions to PAE that are exclusive to the P6.

Whether or not PAE was ever implemented in the Pentium beyond the conceptual stage is not known. But vestiges of its existence are visible throughout the Pentium documentation and architecture. There are at least four references to 2M pages in the various Pentium manuals[1,2,3,4]. In addition to these documentation references, CR4[5] is marked reserved, and was to enable PAE; CPUID.flags[6] is marked reserved and was to indicate the existence of PAE; the MSR TR8 is marked reserved, and contained the upper 4 address bits used for TLB testability. Now it appears that the P6 is going to implement 36-bit addressing and 2M page sizes.

To support 36-bit addressing, it is necessary to make substantial changes to the paging mechanism. 32-bit linear addresses are still used, but they are translated to 36-bit physical addresses. Intel chose to use a three-tier paging mechanism to support PAE for 4K pages, and a two-tier mechanism for 2M pages. When CR4.PAE=1, CR3 points to a small table of Page Directory Pointers (PDPs). Each PDP entry references a separate page directory. Each page directory entry points to a page table, for 4K pages, or directly to the page frame, for 2M pages. Figure 1 gives a detailed description of all of the CPU structures associated with page translations while PAE is enabled. For comparative purposes, Figure 2 gives a detailed description of all of the CPU structures associated with page translations while Page Size Extensions (PSE) is enabled (4-Mbyte pages).

Figure 1 -- Paging Structures for PAE


Figure 2 -- Paging Structures for PSE


In addition to CR4.PAE, which enables Page Address Extensions, CR4 contains another addition to enhance page mode performance. CR4.PGE (bit-7) enables Paging Global Extensions (PGE). PGE determines whether moves to CR3 flush all of the PTEs from the TLB, or only those whose G-bit (global bit) is not set. Likewise, for task switches, which implicitly load CR3, CR4.PGE controls TLB flushing in the same manner.

As shown in Figure 1, CR3 is still a 32-bit register, and therefore the PDP table must reside within the first 4G of the address space. Each PDP entry is selected by the upper 2 bits of the linear address -- A[31..30]. Therefore the PDP table contains only 4 entries. Each PDP entry points to the physical address of a page directory, and is 64 bits wide, though only 36 bits are used. Therefore, each PDP entry can reference a page directory anywhere in the 64G address space. The index into the Page Directory (PDE) is determined by linear address bits A[29..21]. The Page Directory is therefore limited to 512 entries (2^9) of 8 bytes each. Even though the Page Directory has been reduced to 512 entries, its structure takes up the same amount of memory space as when CR4.PAE=0 (4096 bytes), because of the increase in its element size (to 8 bytes). For 4K pages, each 8-byte PDE points to the physical address of the Page Table. For 2M pages, each 8-byte PDE points to the physical address of the page frame itself. For 4K pages, the index of the Page Table Entry (PTE) is determined by linear address bits A[20..12]. Similar to the Page Directory, each Page Table is limited to 512 entries of 8 bytes each, with each 8-byte entry pointing to the physical Page Frame Address (PFA). Figure 3 shows the page translation for 4K pages while CR4.PAE=1.
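To make the bit fields concrete, here is a minimal C sketch of the index split just described. The function and type names are my own, and the field widths (2/9/9/12) are taken from the text, not from any official reference:

```c
#include <stdint.h>

/* Decompose a 32-bit linear address into PAE 4K-page indices:
   A[31..30] -> PDP entry, A[29..21] -> PDE, A[20..12] -> PTE,
   A[11..0] -> byte offset within the 4K page. */
typedef struct {
    unsigned pdp;     /* 2 bits: one of 4 PDP entries           */
    unsigned pde;     /* 9 bits: one of 512 directory entries   */
    unsigned pte;     /* 9 bits: one of 512 page table entries  */
    unsigned offset;  /* 12 bits: offset within the 4K page     */
} pae_4k_indices;

static pae_4k_indices split_pae_4k(uint32_t lin)
{
    pae_4k_indices ix;
    ix.pdp    = (lin >> 30) & 0x3;
    ix.pde    = (lin >> 21) & 0x1FF;
    ix.pte    = (lin >> 12) & 0x1FF;
    ix.offset = lin & 0xFFF;
    return ix;
}
```

For example, the linear address 0xC0ABCDEF splits into PDP entry 3, PDE index 5, PTE index 188, and offset 0xDEF.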

Figure 3 -- Page Translation for 4K Page Address Extensions


Page translation for 2M pages is virtually identical to 4M page translation under PSE. The main difference between the two translation mechanisms is the addition of the PDP reference, and the number of index bits in the PDE. Like 4K page translations with PAE enabled, each PDP entry points to the physical address of a page directory. The index into the Page Directory (PDE) is determined by linear address bits A[29..21]. The remaining linear address bits, A[20..00], are used to index directly into the page frame. Since the offset is 21 bits wide, the page size is 2M (2^21). Figure 4 shows a diagram of page translations for 2M pages.
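The 2M-page split can be sketched the same way, under the same assumptions (names are mine; widths 2/9/21 as described above): the PTE level disappears and the offset widens to 21 bits.

```c
#include <stdint.h>

/* PAE 2M-page split: A[31..30] -> PDP entry, A[29..21] -> PDE,
   A[20..0] -> 21-bit offset within the 2M page frame. */
static void split_pae_2m(uint32_t lin,
                         unsigned *pdp, unsigned *pde, uint32_t *offset)
{
    *pdp    = (lin >> 30) & 0x3;
    *pde    = (lin >> 21) & 0x1FF;
    *offset = lin & 0x1FFFFF;  /* 2^21 = 2M */
}
```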

Figure 4 -- Page Translation for 2M Page Address Extensions


Some distinction needs to be made as to whether PAE and PSE are mutually exclusive, and which has higher precedence. Likewise, what is the role of the PDE.PS bit when the page address extensions are enabled? I will assume the two features are mutually exclusive, and that PAE has higher precedence than PSE. Table 1 therefore details the possible combinations of PAE, PSE, and PDE.PS.

Table 1 -- Control bits for Paging Extensions


Definition of fields in paging structure figures:

Virtual Mode Extensions on the Pentium Processor


Searching for VME

Of all that is known about the secrets contained in the Supplement to the Pentium Processor User's Manual ("Appendix H"), nothing is guarded more closely than details of the Virtual Mode Extensions (VME) implemented in the Pentium processor and late-model Intel486s. Even when closely reading the Pentium manuals, it is possible that the reader doesn't notice that enhancements to virtual-8086 mode exist. Yet the Pentium Processor Family Developer's Manual, Volume 3 is not silent on the subject. There are at least 27 references to VME in the Pentium manual. In addition to these references, another good source of information is Intel's British VME patent application, which is publicly available. With a good understanding of Virtual-8086 mode (v86 mode), one could infer most of the remaining details of VME solely from those 27 references. All that's needed to characterize the complete details is an understanding of v86 mode, a little ingenuity, experimentation, persistence, and no qualms about hitting the reset button to restart a frozen computer. (For those with $12,000 to spare, an In-Circuit-Emulator (ICE) would be helpful too.)
The need for VME

When v86 mode was originally implemented, software writers found two main purposes for its use: 1) DOS memory managers, and 2) DOS sessions under a multitasking operating system (MTOS), like a DOS box in Windows. Under a memory manager, DOS remains a sequential single-tasking operating system. Therefore, hardware resources don't need to be arbitrated;[1] IOPL-sensitive instructions don't need to be restricted at all. Running a DOS session under Windows is quite different. Nearly all resources need to be restricted. The DOS session should not have direct access to the hardware with the ability to program any and every device. Nor should the DOS session have the ability to directly control the interrupt flag (IF) of the EFLAGS register. Virtual-8086 mode has support for restricting such access, through the use of IOPL to restrict IOPL-sensitive instructions which modify IF,[2] and the I/O permission bit map to restrict access to I/O ports. However, there are a few shortcomings with the standard v86 mode.

1. Setting IOPL to 3 provides better performance than setting IOPL less than 3. This setting reduces the trapping overhead, but lets Virtual DOS Machines (VDMs) disable interrupts, a potential integrity problem for the whole system.[3]
2. IOPL must be set less than 3 when the OS needs to virtualize the interrupt flag. When a Virtual Device Driver (VDD) needs to simulate a hardware interrupt into a VDM, it must be able to detect when the VDM is interruptible. Therefore IOPL must be less than 3 so that the interrupt flag can be virtualized.[3] Since IOPL is less than 3, performance is significantly degraded by attempts to execute interrupt-flag-sensitive (IF-sensitive) instructions, which always fault to the v86 monitor.
3. An operating system may allow a VDM to receive real (external) interrupts, or virtual interrupts. This is a policy decision made by the OS implementers. If a v86 task only receives virtual interrupts, then it can be demand-paged, whereby it is swapped out to disk when real memory is needed for other purposes. If the v86 task receives real (external) interrupts, then it cannot be demand-paged, since the interrupt handler may be paged out when the interrupt occurs. The OS may not have enough time to bring in the entire task before another interrupt occurs; likewise, it would be too complicated and still too time-consuming to bring in just the portion of the task which contains the interrupt service routine.[4]
4. All INT-n instructions cause a switch out of v86 mode. When IOPL<3, an INT-n faults to the v86 monitor. When IOPL=3, the INT-n attempts to invoke the protected mode interrupt handler associated with that particular interrupt (success depends on the DPL of the gate being used). The monitor or interrupt handler must either emulate the interrupt's functionality, or restart the v86 task such that it executes the interrupt itself (this is known as reflecting the interrupt). DOS kernel routines are accessed through software interrupts. Therefore, thousands of interrupt calls generate thousands of transitions in and out of v86 mode. This gives the v86 task a substantial performance disadvantage compared to the same program running under DOS.

VME fixes v86 problems

Enhanced v86 mode was designed to eliminate many of these problems, and significantly enhance the performance of v86 tasks running at all IOPL levels. When running in Enhanced virtual-8086 mode (Ev86) at IOPL=3, CLI and STI still modify IF. This behavior hasn't changed. Running at IOPL<3 has changed. CLI, STI, and all other IF-sensitive instructions no longer unconditionally fault to the Ev86 monitor. Instead, IF-sensitive instructions clear and set a virtual version of the interrupt flag in the EFLAGS register called VIF.[5] Clearing VIF does not block external interrupts, as clearing IF does; VIF does not control external interrupts at all. Rather, VIF is an indicator of the interruptibility state of the Ev86 task. Thus, the operating system is invulnerable to a bug in a DOS program which inadvertently attempts to disable interrupts and spin inside of a loop, as VIF will be cleared instead, while IF will remain set. This new behavior has some substantial benefits. First, performance is increased, as CLI and STI don't cause time-consuming faults. Second, the complexity of the monitor program is reduced, as it doesn't have to maintain its own virtual interrupt flag. When the old-style v86 monitor virtualized IF, it needed to emulate all changes to IF caused by IF-sensitive instructions (CLI, STI, PUSHF, POPF, INT, and IRET). Using Ev86 mode eliminates this complexity because the CPU automatically virtualizes IF; performance increases because IF-sensitive instructions don't fault to the Ev86 monitor.

When external interrupts are generated, such as timer ticks and keyboard strokes, the host operating system running at CPL-0 always intercepts these interrupts. When some interrupts occur, the current task may not be the v86 task, it may be swapped out to disk, or it may be in an uninterruptible state. When this occurs, the host OS must delay sending the interrupt to the v86 task until it is running, and ready to accept interrupts. Other interrupts may be intended for a specific VDM, but not all VDMs (like keystrokes). In this case, the v86 monitor needs to send a specific interrupt to a specific VDM -- ignoring all other VDMs. Delaying and filtering interrupts in this manner is known as interrupt virtualization. Once the VDM with a virtual interrupt pending becomes interruptible, the OS reflects the interrupt to the VDM as if a real interrupt had occurred.

Prior to Ev86 mode, the v86 monitor needed to maintain a virtual interrupt flag in software. The v86 monitor was forced to handle many exceptions which were unnecessary. For example, when a virtual interrupt was pending, further IOPL-sensitive instructions which attempted to clear IF caused undesired faults, which then caused the monitor to redundantly clear the virtual interrupt flag. This problem doesn't exist in Ev86 mode. These instructions which redundantly attempt to clear IF don't fault to the monitor. Therefore the source code which exists to clear the software virtual interrupt flag can be removed. In fact, while using Ev86 mode, all of the code needed to maintain the software virtual interrupt flag can be removed -- as the virtual interrupt flag is maintained by the CPU itself.

Prior to Ev86 mode, software interrupts (INT-n instructions) always caused a switch out of v86 mode. If IOPL=3, the transition occurs through a gate associated with the interrupt;[6] when IOPL<3, the transition occurs as the result of a general protection fault to the monitor. When IOPL=3, the monitor needs to determine whether or not the cause of the interrupt is a software interrupt, external interrupt or CPU-generated exception. When IOPL<3, software interrupts don't transition through their associated gates in the IDT (they transition through the #GP gate). In the case of software interrupts, the monitor must interpret the opcode to determine which interrupt number needs servicing. The monitor must then emulate the interrupt, or reflect the interrupt back to the v86 task. External interrupts and CPU-generated exceptions still transition through their associated gate in the IDT. For these cases, the monitor still needs to determine the source of the interrupt (external or CPU-exception), and take the appropriate action. Using Ev86 mode can simplify this process, and enhance the performance of handling software interrupts.

Software interrupt execution is controlled by a new structure in the TSS called the interrupt redirection bit map (IR bit map). Each bit in this new structure controls whether a specific software interrupt will be invoked in a manner compatible with the Intel386, or be invoked purely within the Ev86 task. In Ev86 mode, these interrupts may be invoked and executed without ever leaving the Ev86 task. Using this new technique reduces complexity in the monitor. Interrupts which would normally fault to the monitor no longer do; interrupts which would transition through the IDT no longer do.
Overview of VME Components

VME support is enabled and disabled by setting and clearing the VME bit in CR4 (bit 0). When enabled and running at IOPL=3, all INT-n instructions are controlled by the interrupt redirection bit map in the TSS.[7] When running at IOPL<3, in addition to the INT-n behavior, IF-sensitive instructions are allowed to execute without faulting to the Ev86 monitor.

The TSS has been extended to include a 32-byte interrupt redirection bit map. 32 bytes is exactly 256 bits, one bit for each software interrupt which can be invoked via the INT-n instruction. This bit map resides immediately below the I/O permission bit map (see Figure 1). The definition of the I/O Base field in the TSS is therefore extended and dual-purpose. Not only does the I/O Base field point to the base of the I/O permission bit map, but also to the end (tail) of the interrupt redirection bit map. This structure behaves much like the I/O permission bit map, except that it controls software interrupts. When an interrupt's corresponding bit is set, the interrupt is handled in the Intel386-compatible manner. When its bit is clear, the Ev86 task will service the interrupt without ever leaving Ev86 mode.
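The lookup arithmetic can be sketched in a few lines of C. The TSS layout here is a bare stand-in with names of my own; the only facts taken from the text are that the map's base is the I/O Base field minus 32 bytes, and that bit n of the 256-bit map lives at bit (n mod 8) of byte (n / 8):

```c
#include <stdint.h>

/* Test interrupt n's bit in the interrupt redirection bit map.
   tss_base points at the TSS; io_map_base is the TSS I/O Base field.
   Returns nonzero if the bit is set (Intel386-compatible handling),
   zero if clear (serviced inside the Ev86 task). */
static int ir_bit_set(const uint8_t *tss_base, uint16_t io_map_base,
                      unsigned n)
{
    /* 32 bytes = 256 bits, immediately below the I/O permission map */
    const uint8_t *ir_map = tss_base + io_map_base - 32;
    return (ir_map[n >> 3] >> (n & 7)) & 1;
}
```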

Figure 1 -- Interrupt redirection bit map in the TSS


VIF and VIP EFLAGS Bits

Two new flags were added to the EFLAGS register. These flags are intended for use when the IOPL of the Ev86 task is less than 3 (see sidebar Caveats Of VME (When CR4.VME=1)). They can only be purposely modified by the CPL-0 Ev86 monitor or an interrupt service routine.

VIF is a virtualized version of the standard interrupt flag (IF). While the Ev86 task is running, any CLI and STI instruction will not modify the actual IF, instead these instructions modify VIF.[5] This fact is completely hidden from the Ev86 task, as PUSHF, POPF, INT-n, and IRET have also been modified to help hide this behavior.

The VIP flag is a Virtual Interrupt Pending flag. VIP can assist the multitasking operating system in sending a virtual interrupt to the Ev86 task. The easiest way to understand VIP is to explain its use in the context of a program running on an 8086. When the 8086 is in an uninterruptible state, external interrupts remain pending but don't get serviced. After IF is set (because of STI, POPF, or IRET), the pending interrupt is serviced by the CPU. VIF and VIP are intended to serve this same purpose for the MTOS running an Ev86 task. Let's assume your Ev86 task is at the same uninterruptible point as in the previous 8086 example. A timer-tick interrupt occurs, and the MTOS services the interrupt. During the interrupt service routine, the MTOS decides that the Ev86 task needs to service this timer tick, and sets VIP. After returning, the Ev86 task is still in an uninterruptible state (VIF=0). At some later time, the Ev86 task attempts to set IF (STI, POPF, or IRET). When this happens, the Ev86 task becomes interruptible, and a general protection fault to the monitor immediately occurs (#GP(0)).[8]
IF-sensitive instructions

To support the new VIF and VIP flags, changes were needed to the instructions which read and write the interrupt flag of the EFLAGS register. CLI, STI, PUSHF, POPF, INT-n, and IRET all had to be changed to support Ev86 mode.

When an Ev86 task is running at IOPL<3, CLI, and STI clear and set the VIF flag, instead of faulting to the Ev86 monitor, or affecting the IF flag.[5]

PUSHF copies the contents of the VIF flag to the IF position as it pushes the FLAGS image onto the stack. This gives the appearance to the Ev86 task that STI and CLI are really setting and clearing IF. This appearance is necessary in case the software attempts to check for this condition. Such a code sequence, which tests the interrupt flag, is shown in Listing 1. In addition to moving the VIF to the IF on the stack image, PUSHF always pushes an IOPL image of 3 onto the stack. It is important to remember that the Pentium's IF-sensitive instructions behave identically to the Intel486's when IOPL=3, even when CR4.VME=1. Therefore, PUSHF simulates an IOPL of 3 to any software wishing to read the stack image to determine its IOPL. The actual IOPL of the Ev86 task never changes during this process.

Listing 1 -- Code demonstrating how software tests for the IF flag

STI ; Enable interrupts
PUSHF ; Store FLAGS on stack
POP AX ; Restore flags into register
TEST AX,200h ; Interrupt flag set?
Jcc label ; Jump on condition

POPF works similarly to PUSHF, copying the bit in the IF position to the VIF flag as it pops the FLAGS image from the stack. The Pentium is careful to make sure that the faked IF and IOPL aren't accidentally copied into the real flags during the POPF operation. Before the FLAGS image is merged into the EFLAGS register, the IF bit of the image is copied to the VIF slot, and the IF and IOPL fields of the image are cleared. The actual IF and IOPL bits of the EFLAGS register are preserved, while the remaining flag bits are taken from the filtered image. A side-effect of POPF is its handling of the TF in the stack image. If the TF on the stack image is set, then POPF causes a general protection fault before any FLAGS values are modified (#GP(0)).
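This filtering can be modeled in a few lines of C. The sketch below covers only the IF/VIF/IOPL handling described above, assuming the standard EFLAGS bit positions (TF = bit 8, IF = bit 9, IOPL = bits 12-13, VIF = bit 19); real POPF of course affects many more flags, and the #GP sentinel value is purely my own convention:

```c
#include <stdint.h>

#define FL_TF   (1u << 8)
#define FL_IF   (1u << 9)
#define FL_IOPL (3u << 12)
#define FL_VIF  (1u << 19)

#define EV86_POPF_GP 0xFFFFFFFFu  /* stand-in for #GP(0) */

/* Model Ev86 POPF at IOPL<3: the IF bit of the stack image moves
   into VIF, the image's IF and IOPL are discarded, and the task's
   actual IF and IOPL are preserved. */
static uint32_t ev86_popf(uint32_t eflags, uint32_t stack_image)
{
    if (stack_image & FL_TF)
        return EV86_POPF_GP;             /* faults before any change */
    uint32_t image = stack_image;
    if (image & FL_IF) image |=  FL_VIF; /* IF image -> VIF slot     */
    else               image &= ~FL_VIF;
    image &= ~(FL_IF | FL_IOPL);         /* strip faked IF and IOPL  */
    return (eflags & (FL_IF | FL_IOPL)) | image;
}
```

With IF=1 and IOPL=3 in the real EFLAGS (0x3200), popping an image of 0x0200 yields 0x83200: VIF has been set from the image's IF, while the real IF and IOPL are untouched.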

The IRET instruction behaves exactly as the POPF instruction does with respect to IF, VIF and IOPL. IRET and POPF differ in how they handle the trap flag from the stack image. If TF is set in the FLAGS stack image during POPF, a #GP(0) occurs, yet for IRET the #GP does not occur.

The INT-n instruction is the most complicated of the IOPL-sensitive instructions. INT-n behaves exactly like PUSHF in how it handles IF, VIF, and IOPL.[9] However, one of the enhancements to Ev86 mode is the ability of the Ev86 task to execute software interrupts without leaving Ev86 mode. This enhancement has been accomplished with the aid of the interrupt redirection bit map in the TSS. When the corresponding IR bit is set, the interrupt will be invoked in exactly the same manner as in a normal v86 task. When the corresponding bit is clear, the interrupt is invoked as if it were executing on an 8086 processor. In other words, neither a fault to the monitor nor a transition to the protected mode interrupt handler is generated. The interrupt transition and return are done entirely within the Ev86 task. The influence of the IR bit map is best described by the pseudo-code in Listing 2.

Listing 2 - Interrupt handling description in Ev86 mode

N = INTERRUPT_NUMBER;
INTERRUPT_BIT_MAP_PTR = TSS_BASE->IO_PERMISSION_BASE - 32;
IF (INTERRUPT_BIT_MAP_PTR->BIT_NUMBER[N])
    IF (IOPL < 3)
        #GP(0);
    ELSE
        GOTO INT-FROM-V86-MODE;
ELSE
    INVOKE_REAL_MODE_STYLE_INTERRUPT_FROM_Ev86_TASK(N);

Conclusions

The virtual mode extensions are very useful to memory managers and multitasking operating systems. Memory managers can primarily benefit by the use of the interrupt redirection bit map to reduce the number of switches to and from protected mode. This has the added benefit of reducing the complexity of the interrupt service routines, as they no longer have to reflect software interrupts back to the v86 task.

Multitasking operating systems can benefit in many ways. The MTOS benefits from interrupt redirection, and from the virtual interrupt support. The MTOS would run with virtual mode extensions enabled, and the Ev86 tasks running at IOPL<3. This gives the MTOS full benefit of the virtualization of interrupts. When the MTOS wishes to send a virtual interrupt (like a virtual timer-tick) to an uninterruptible Ev86 task, it will do so by setting VIP=1. When the task becomes interruptible, a general protection fault occurs, and the MTOS will send the virtual interrupt to the Ev86 task. This would give programs which are timer-dependent (such as games) a significant performance advantage. As an added benefit of using the virtualization features of the CPU, even more complexity of the Ev86 monitor can be removed. The result of using these new features, is an Ev86 monitor that is simpler to implement and maintain than its non-Ev86 counterpart, and software which runs faster.

Understanding Page Size Extensions on the Pentium Processor


Introduction

In the Pentium manuals, there are at least 9 references to 4MB pages. The Pentium Family User's Manual, Volume 1 (P/N 241428) mentions 4MB pages in sections 2.0, 3.7.2, and 3.7.4. Volume 3 refers to 4MB pages in sections 10.1.3, 11.3.3, 11.3.4, 16.5.3, 23.2.10.2, and 23.2.18.1. The Intel i860 XP processor documentation claims the i860 XP is page-level compatible with the Intel386, Intel486, and Pentium processors. This compatibility is noteworthy, as the i860 XP also supports 4MB pages, and its documentation provides a complete description of the 4MB paging mechanism(1). All that's needed to obtain an Appendix H description of 4MB pages are a few references from the Pentium manuals, and the description of 4MB pages from the i860 XP manual.
Making the jump to 4MB pages

With an understanding of the 4KB paging mechanism, it's not difficult to deduce the 4MB paging mechanism. Recall that each page directory entry controls 4MB of memory. Now imagine how Figure 111 would change if the page table lookup were eliminated. The page frame offset would increase from 12 bits to 22 bits, thus allowing direct control of a 4MB page size. The 20-bit pointer in the page directory would be reduced to a 10-bit pointer, pointing directly to the 4MB page frame of memory. With the page table lookup eliminated, the page directory points directly to a 4MB page frame. This is how 4MB pages are implemented in the i860 XP(1). But the question remains: are i860 XP 4MB pages compatible with Pentium 4MB pages? To answer that question, we need to compare the i860 and Pentium manuals.

The Pentium manual, volume 3, describes that CR4.PSE enables page-size extensions and 4MB pages but refers the reader to Appendix H for more information(4,5). Later in the Pentium manual, Intel shows that bit-7 of the page directory entry is the Page Size (PS) bit(3). Without CR4.PSE=1, the Pentium will always use Intel486-compatible (4KB) paging, regardless of the setting of the PDE.PS bit. Similarly, when CR4.PSE=1, and PDE.PS=0, Pentium still uses 4KB pages. But when CR4.PSE=1, and PDE.PS=1, Pentium uses an i860 XP-compatible 4MB page translation mechanism.
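The enable conditions in the paragraph above reduce to a single conjunction, which this one-line sketch (names mine) captures:

```c
/* 4MB translation is used only when both CR4.PSE and PDE.PS are set;
   otherwise Intel486-compatible 4KB paging applies. */
static unsigned long page_size(int cr4_pse, int pde_ps)
{
    return (cr4_pse && pde_ps) ? (4ul << 20) : (4ul << 10);
}
```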

The linear address for a 4MB page is converted to a physical address in much the same manner as for 4KB pages. In this case, however, the access to the page table is omitted. The high-order 10 bits form an index into the page directory. The page directory entry no longer contains a 20-bit pointer to a page table, but instead contains a 10-bit pointer to the 4MB page frame of memory. This convention mandates that all 4MB pages reside on 4MB boundaries. The 10-bit pointer in the page directory is then combined with the low-order 22 bits of the linear address to form the 32-bit physical address.
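As a sketch of this combination step (names mine, ignoring the present bit, PS bit, and other attribute checks), the translation is just a mask-and-or:

```c
#include <stdint.h>

/* 4MB-page translation: A[31..22] indexes the page directory; the
   PDE's top 10 bits give the 4MB-aligned frame base; A[21..0] is the
   offset. Attribute bits in the PDE's low bits are masked off. */
static uint32_t translate_4m(uint32_t lin, const uint32_t *page_dir)
{
    uint32_t pde = page_dir[lin >> 22];
    return (pde & 0xFFC00000u) | (lin & 0x003FFFFFu);
}
```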

Figure 1 shows a pictorial description of the 4MB and 4KB paging translation mechanism. Given all of the official documented references to 4MB pages in the Pentium manuals, all one needs to complete their understanding of 4MB pages is to study and understand this picture. Ironically, the 1993 edition of the Pentium manual, volume 3 contained a virtually identical picture(6). Intel obviously recognized the significance of this pictorial representation of 4MB pages, and substantially modified it in subsequent editions of their Pentium manual to remove the visual representation of the 4MB paging mechanism.

Figure 1-- Page Translation for 4MB and 4KB Page Sizes



Side-effects and caveats of 4MB pages

(Their existence is worth mentioning here. However, the details will be reserved for the magazine article.)

* Page fault error codes
* TLB Translation
* TLB Invalidation

Testing our hypothesis

After formulating our understanding of 4MB paging, it should be quite straightforward to write characterization code which would confirm our hypothesis. To detect whether or not 4MB pages are implemented in Pentium as they are in the i860 XP, we could do the following:

* Write the software assuming 4MB compatibility with the i860 XP.
* Enable paging.
* Before enabling 4MB paging, modify the second PDE (PDE which controls memory from 4MB-8MB) to point to the 0MB-4MB page frame, and mark it as a 4MB page. Install a PTE in the first entry pointed to by the modified PDE. This PTE should point back to the first page of memory at 4MB (which contains a signature of some sort).
* Read from the signature in memory. If 4MB paging works as expected, instead of getting the signature, you will retrieve the PTE we installed during the previous step. If 4MB paging does not work as expected, all is well, because the PTE is correctly formed, and you will retrieve your memory signature.

The key to this technique is to read from one location in memory if 4MB pages work, but another location if they don't (so we don't page fault). This approach is demonstrated in the source code listing 4MPAGES.ASM, which shows that 4MB pages work as described herein.

Now that we have demonstrated that 4MB pages work as expected, we could write more characterization code to prove other behavioral characteristics of enabling CR4.PSE. Distributed with this article is source code to demonstrate the page faulting behavior of PSE. Another program is included to detect the TLB size and associativity. Finally, another program will demonstrate that writing any value to CR4.PSE will not invalidate the TLB.
Conclusion

Since the Pentium was introduced, Intel has withheld the architectural details of 4MB pages. Only by signing a 15-year NDA would you be given access to the documents that describe their implementation and use. The earliest Pentium manuals documented enough details of 4MB pages to allow anyone to reverse-engineer the details. As newer Pentium manuals were introduced, Intel removed the most expository details. Unknown to most people outside of Intel, the entire implementation details are documented in the i860 XP data sheet which is readily available -- no NDA required.

Protected Mode Virtual Interrupts (PVI)


1. Introduction

In protected mode, sensitive instructions are opcodes related to I/O operations. They can be dangerous to the system, but it may be necessary to execute them at privilege levels less privileged than zero. To ensure protection and accessibility at the same time, a special two-bit field called IOPL (Input/Output Privilege Level) was introduced into the EFLAGS register. The sensitive instructions can be executed only if CPL <= IOPL (where CPL, or Current Privilege Level, is the PL of the CS selector). In this way, access to these instructions can be disabled. Additionally, to prevent user-level code from changing the IOPL field, POPF and IRET do not modify it when executed at CPL > 0.

Here is the list of sensitive instructions:

* IN - input,
* INS - input a string,
* OUT - output,
* OUTS - output a string,
* CLI - clear the interrupt enable flag (IF),
* STI - set the interrupt enable flag (IF).

Trying to execute any of these instructions when CPL > IOPL will cause a General Protection Fault (#GP), that is, exception 13 (for I/O instructions, the I/O permission bitmap may prevent the fault from occurring for selected I/O addresses). However, an implicit attempt to modify the IF bit using the POPF or IRET instructions will not cause any error -- the new value of the IF bit is simply discarded.
2. Virtual Interrupts (traditional approach)

Of all sensitive instructions, only CLI and STI are discussed here as only these instructions are related to the mechanism of interrupts.

It may happen that some application programs would like to service hardware interrupts originating from non-standard devices that were not known at the time the operating system (OS) was created. Additionally, such a program might not want to receive these interrupts during certain special operations. In this situation, the OS may detect execution of the CLI or STI instruction and set appropriate internal variables that decide whether to let the application service the interrupt or not. Note that there may be many concurrent tasks in the system, cyclically switched using, for example, a timer interrupt. Moreover, it may be necessary to service other peripherals, such as the keyboard. In such a case, disabling interrupts may lead to a system crash. However, it is possible to use the mechanism of so-called virtual interrupts, in which disabling interrupts in one task merely excludes that task from servicing those interrupts.

Another use of the CLI and STI instructions is to define critical sections in application-level programs. Executing the CLI instruction would guarantee uninterrupted execution of the following section until an STI instruction is encountered. In this situation too, it is necessary to preserve the system's vitality -- a malicious application could execute a CLI/JMP $ sequence, halting the system. To avoid this, the OS may treat the CLI instruction only as a request to ensure mutual exclusion, which could be achieved, for example, by temporarily blocking user-level task switching.

Meeting the above conditions may be realized in a #GP exception handler. Unfortunately, generating a #GP exception on every CLI and STI occurrence is very time-consuming. The routine has to be called through a gate, then it has to find out the exception's cause, perform the appropriate actions and, finally, return to the interrupted program. A typical action performed by this procedure in the case of CLI and STI is to modify a flag defined somewhere in the task state area, recording the actual state of the task's local interrupt subsystem.
3. Pentium Processor's Virtual Interrupts Improvement

To improve the processor's performance, Intel introduced in the Pentium and SL-enhanced i486s new mechanisms and the bits that control them. These bits are: the PVI bit of the CR4 register (Fig.1) and the VIF and VIP bits of the EFLAGS register (Fig.2). The effect of these bits is as follows:

* For PVI = 0:
  o If CPL <= IOPL, then CLI and STI operate on the IF flag.
  o If CPL > IOPL, then CLI and STI cause a #GP exception.
* For PVI = 1 and CPL < 3:
  o If CPL <= IOPL, then CLI and STI operate on the IF flag.
  o If CPL > IOPL, then CLI and STI cause a #GP exception.
* For PVI = 1 and CPL = 3:
  o If IOPL = 3, then CLI and STI operate on the IF flag.
  o If IOPL < 3 and VIP = 0 (no pending virtual interrupt present), then CLI/STI resets/sets the VIF flag.
  o If IOPL < 3 and VIP = 1 (a pending virtual interrupt is present), then an attempt to enable virtual interrupts (by setting VIF) using STI will cause a #GP exception.

3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |M|-|P|D|T|P|V|
| R E S E R V E D |C|-|S|E|S|V|M|
| |E|-|E| |D|I|E|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
* ^

^ -- these bits are referenced
* -- these bits are reserved

Fig.1. The CR4 Register (as defined for Pentium)

3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |I|V|V|A|V|R| |N|I O|O|D|I|T|S|Z| |A| |P| |C|
| R E S E R V E D |D|I|I|C|M|F|0|T|P L|F|F|F|F|F|F|0|F|0|F|1|F|
| | |P|F| | | | | | | | | | | | | | | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
^ ^ ^ * ^ ^ * * *

^ -- these bits are referenced
* -- these bits are reserved

Fig.2. The EFLAGS Register (as defined for Pentium)

This mechanism significantly simplifies user-level interrupt service routines. Before transferring control, the OS simply checks the VIF flag and, depending on its value, takes the appropriate further action. In particular, using this technique it is possible to execute some system-level programs or procedures (designed for execution at CPL = 0) at the application level (at CPL = 3).

Additionally, the VIP flag makes it easier to monitor the enabling of virtual interrupts. The OS may use this bit to mark that, while VIF = 0 (e.g. while servicing a virtual interrupt), another virtual interrupt was signaled. In this case the OS stores the interrupt cause and sets the VIP flag, meaning that a virtual interrupt is currently pending. At the moment virtual interrupts are enabled with CPL = 3 and IOPL < 3, a #GP exception is generated. The handler that services this situation may then reset the VIP flag and transfer control to the proper procedure, which will service the previously pending interrupt. Naturally, incoming interrupts may be queued; in that case VIP would be reset only after the queue is emptied.

Here is a summary of the flags used by the virtual interrupt mechanism and subject to protection:

* VIP -- It is modified by IRETD, provided that CPL = 0.
* VIF -- It is modified by IRETD, provided that CPL = 0; by CLI and STI -- as described above.
* IOPL -- It is modified by IRET(D) and POPF(D), provided that CPL = 0.
* IF -- It is modified by IRET(D) and POPF(D), provided that CPL <= IOPL; by CLI and STI -- as described above.

Certainly, all flags are modified at the time of a task switch. Note that, unlike the IF flag, VIF is not modified during transitions through interrupt or trap gates.

In the case of an (E)FLAGS-related protection violation when executing IRET(D) or POPF(D), no exception is generated and the protected bit remains unchanged. On the other hand, a #GP exception is generated when a program executes with both VIP and VIF set, provided that PVI = 1 and CPL = 3 (IOPL may have any value). That happens, for instance, after executing an IRETD instruction that changes the CPL from 0 to 3 and writes ones to VIP and VIF. The instruction finishes successfully and then, before the next opcode, a #GP exception is generated. Inspecting the EFLAGS image on the stack or in the Task State Segment (depending on the #GP exception gate type) shows both the VIP and VIF bits set.
4. Detecting the Presence of the Virtual Interrupt Mechanism

The protected-mode virtual interrupt mechanism is implemented along with the virtual mode extensions and is identified by the VME bit returned by the CPUID instruction. Processors that do not implement the CPUID instruction do not support virtual interrupts. To get the value of the VME bit, the following sequence has to be performed:

* Check for the presence of the EFLAGS ID bit (Fig.2) by attempting to toggle it.
* Execute the CPUID instruction with an EAX argument value of 0, ensure the chip was manufactured by Intel (GenuineIntel: EBX=756E6547, EDX=49656E69, ECX=6C65746E), and check that CPUID may be executed with an argument in EAX equal to 1.
* Execute the CPUID instruction with an EAX argument value of 1 and obtain the value of the feature flags from the EDX register (Fig.3).

3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |A|C|M|-|M|T|P|D|V|F|
| R E S E R V E D |P|X|C|-|S|S|S|E|M|P|
| |I|8|E|-|R|C|E| |E|U|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
* ^

^ these bits are referenced
* these bits are reserved

Fig.3. The Feature Flags (as defined for Pentium)

Sizing Memory in Protected Mode

The Concepts of Sizing Memory

The concepts of memory sizing wouldn't be complete without an explanation of how RAM works. Think of memory as a square, two-dimensional array; there are as many columns as rows. Each element in the array represents a bit (not byte) of memory. To address an individual bit in the array, we provide the RAM chip with a row address and a column address. To minimize the number of pins on the chip, RAM chips have only one set of address pins, and use them for both row and column addresses (see Figure 1). The memory controller must provide signals telling the RAM chip when the row address is valid, via the Row Address Strobe (RAS), and when the column address is valid, via the Column Address Strobe (CAS). The concept of using a single set of address lines for both row and column addressing is called multiplexing (MUX). As a result of multiplexing the address lines, RAM chip sizes are always the square of a power of 2 (i.e. a power of 4). Chips with 8 address lines hold 64k bits of information ((2^8)^2 = 65536); 9 address lines yield 256k chips ((2^9)^2 = 262144), 10 address lines yield 1M chips, and so on.


Figure 1 - 64k, 256k, 1M, 4M, 64M, 256M
DRAM pin out

Now, armed with this information, we can discuss the relationship between the CPU address bus and the memory address bus. The CPU has 24 address pins if it is a '286, 32 address pins if it's a '386, '486, or Pentium, and 36 address pins if it's a Pentium Pro. The RAM chips obviously have far fewer address pins than the CPU, so to convert the CPU address to RAM RAS & CAS addresses, the CPU address bus enters a memory controller. The memory controller converts the CPU address signals to RAS addresses, CAS addresses, and BANK addresses (more on BANK addressing later). Let's suppose our computer is a '286 design supporting only 4 banks of memory. (Even though the 80286 is obsolete, its memory configuration is perfect for this example.) This memory can consist of any combination of 64k, 256k, 1M, and 4M chips, but each bank must be populated with a consistent size of RAM chips. To support 4M memory chips, our memory controller must provide signals for 11 RAS addresses plus 11 CAS addresses. This takes 22 of the 24 address lines available from the CPU. The remaining two address lines from the CPU are interpreted by the memory controller to select which BANK of memory to access. Figure 2 shows a hypothetical relationship between the CPU address bus and the memory address bus. According to this diagram, anytime the CPU asserts A00..A10, the RAM chip receives a column address. Likewise, anytime the CPU asserts A13..A23, the RAM chip receives a row address. CPU address lines A11 and A12 are used to select between RAM banks. Since our hypothetical computer supports only four banks of memory, when A12..A11 = 00, the memory controller selects BANK0; when A12..A11 = 01, it selects BANK1; when A12..A11 = 10, BANK2; and when A12..A11 = 11, BANK3. With this information, we can write to any row of any bank of RAM chips in the computer.
In this manner, we can detect how many banks of RAM are installed in the computer, and determine the size of each RAM bank. All that we need to complete our RAM sizing discussion is some theory and algorithms to apply our knowledge.

In our hypothetical computer, the memory controller multiplexes the addresses according to the size of RAM chips. Figure 2 showed the CPU address bus/memory address bus relationship for 4M chips. When the chip size is smaller than 4M, the relationship between the CPU address bus and RAM address bus changes. In our computer, chip size in each RAM bank is programmable; therefore we need to determine the size of the chip in the socket and subsequently re-program the memory controller before we can have full access to all the memory in the computer.

To size memory, we start by assuming each RAM bank has the maximum-size RAM chips supported by the memory controller. In our hypothetical computer, the memory controller supports up to 4M chips; therefore, we program the memory controller for four banks of 4M chips. By programming the memory controller for this configuration, we can detect when a smaller memory chip is installed in a socket. But before we check the size of the chip, we need to detect whether any RAM is in the socket. RAM detection is achieved by writing to the RAM bank, completely reloading the prefetch queue, and checking the value we wrote. If we get back the value we wrote, we have sufficiently determined that RAM is available for that bank of memory, and we can then check the size of the chip.



Figure 2 - CPU Address Bus Conversion
for 4 MB DRAM Chips

To determine the actual size of RAM, we need to detect how many address lines are connected to the RAM chip. Consider the 11 address lines on our 4M RAM chip socket. If our RAM is a 4M chip, then all 11 address lines are connected (MA0 - MA10). If the RAM chip is a 1M chip, only 10 address lines are connected (MA0 - MA9); 256k has 9 lines connected (MA0 - MA8); and 64k has 8 lines connected (MA0 - MA7). To determine whether a 4M chip is installed, we need to determine whether MA10 is connected. We write to an address that asserts MA10, then read back at RAM address 0 (MA0 - MA10 not asserted). If a 4M chip is in the socket, then the data we wrote at MA10 will appear at MA10, and not at RAM address 0. But if the RAM chip isn't a 4M chip, then MA10 isn't connected; therefore the data we wrote at MA10 would have been written to address 0. Our algorithm repeats itself, using MA9 for 1M chips and MA8 for 256k chips. If MA8 - MA10 all fail, then it is safe to assume the bank is populated with 64k chips, as we already determined the presence of RAM in the socket.

To determine how many banks of memory are in the computer, we test the existence of RAM for each bank. This is done by writing to the CPU address lines that control RAM bank selection. For our computer, these are address lines A11 and A12.

By repeating this RAM sizing algorithm for each bank of memory, we can determine how much RAM is installed in the computer. Figure 2 summarizes the relationship between Multiplexed Addresses (MA) on the RAM address bus and CPU addresses on the CPU address bus. The CPU addresses in this figure are calculated from the table of MA addresses. If we need to assert row line MA10 (A23) and bank-select lines A11 and A12, then we calculate the CPU address as 2^23 + 2^11 + 2^12 = 801800h.



To summarize: knowing the relationship between CPU addresses, RAM row and column addresses, and bank selection addresses is essential to determining RAM size and quantity. RAM Bank selection is done by writing to the CPU address lines that are interpreted by the memory controller as controls for bank selection. RAM sizing is done by writing to the highest RAM ROW address for a given chip size and reading the RAM at 0. If the data appears at 0, you know the RAM ROW address line on the chip isn't connected. Continue this process until all RAM sizes are determined for each bank. At the completion of this process, we re-program the memory controller for the proper RAM configuration in the system.

Sizing RAM under program control is accomplished differently than it is during the power-on sequence. During the power-on sequence, the BIOS is guaranteed absolute control of the system and its resources. During this sequence, the BIOS can guarantee that the cache is disabled so that it won't interfere with the results. Under program control, we can't write a RAM sizing algorithm that makes any assumptions about the state of the hardware or the cache. If we re-programmed the memory controller, our program would most certainly fail, as CPU address translation changes with the RAM address translation. In determining the amount of RAM under program control, we must use a different approach, one that doesn't rely on knowledge of the hardware. This means we can't re-program the hardware, the memory controller, or the cache controller. Since we can't re-program the memory controller (or make any assumptions about the existence, state, or programmability of a cache controller), we must write an algorithm that can detect the presence of memory without the intrusion of cache RAM. Therefore, the algorithm must be able to invalidate the cache RAM contents while checking for the existence of memory.