Some Hardware Fundamentals and an Introduction to Software
In order to comprehend fully the function of system
software, it is vital to understand the operation of computer hardware and
peripherals. The reason for this is that software and hardware are inextricably
connected in a symbiotic relationship. First, however, we need to identify
types of software and their relationship to each other and, ultimately, to the hardware.
Figure 1 Software Hierarchy
Figure 1 represents the relationship between the various
types of software and hardware. The
figure appears as an inverted pyramid to reflect the relative size and number
of the various types of software, on one hand, and their proximity to computer
hardware, on the other.
First, application software is remote from, and rarely
interacts with, the computer’s hardware. This is particularly true of
applications that run on modern operating systems such as Windows NT 4.0, 2000,
XP and UNIX variants. By 2000, with the advent of the .NET and Java paradigms,
applications became even further removed from hardware, as .NET’s Common
Language Runtime (CLR) and the Java Virtual Machine (JVM) provide operating
system and hardware access.
Indeed, from the operating system's perspective, the JVM and the CLR are merely applications. Older operating systems such as MS-DOS permitted applications, chiefly computer games, to interact directly with the hardware; however, this meant that vendors of such applications had to write code that would interact with the computer's BIOS (Basic Input/Output System, the firmware held in the computer's read-only memory or ROM integrated circuits). Indeed,
when computers first appeared on the market this practice was the norm, rather
than the exception. Application
programmers soon tired of reinventing the wheel every time they wrote an
application, as they would have to include software routines that helped the application software communicate with and control hardware devices, including the
CPU (central processing unit). In order to overcome this, computer scientists
focused on developing a new type of software—operating or system software—whose
sole purpose was to provide an environment or interface for applications such
that the burden of managing and communicating with the computer hardware was removed
from application programs. This proved important as technological advances
resulted in computer hardware becoming more sophisticated and difficult to
manage. Thus, operating systems were
developed to manage a computer’s hardware resources and provide an application
programming interface, as well as a user or administrator interface, to permit
access to the hardware for use and configuration by application software
programmers and systems administrators.
In early computer systems, bootstrap code, loaded into the system manually via switches or from pre-coded punched cards or teletype tape, was required to load the operating system and boot the system so that an application program could be loaded and run. The
advent of read only memory (ROM) in the 1970s saw the emergence of firmware;
that is, system software embedded in the hardware. The developers of mainframe
computers, minicomputers, and early microprocessors saw the advantage of having
some operating system code integrated into a computer’s hardware to permit
efficient operation, particularly during the power-up and boot phases and
before the operating system was loaded from secondary storage—initially
magnetic tape and later floppy and hard disks. However, firmware came into its
own in microprocessor systems and, later, personal computers. By the turn of
the new millennium, entire operating systems, such as Windows NT and Linux,
appeared in the firmware of embedded systems. The most recent advances in this
area have been in the PDA or Pocket PC market, where Palm OS and Microsoft's Windows CE are competing for dominance. That said, while almost every type of electronic device possesses firmware of one form or another, the most prevalent appears in
personal computers (PCs). Likewise, PCs dominate the computer market due to
their presence in all areas of human activity. Hence, understanding PC hardware
has become a sine qua non for all who call themselves IT professionals. The
remainder of this chapter therefore focuses on delineating the basic
architecture of today’s PC.
A Brief Look Under the Hood of Today’s PC
This section provides a brief examination of the major
components of the PC.
The Power Supply
The most often ignored of a PC's components is the system power supply. Most household electrical appliances operate on alternating current (AC) at 110 volts (60 Hz, e.g. in the USA) or 220 volts (50 Hz, e.g. in Europe). However, electronic subassemblies or entire devices with embedded logic circuitry, whether microprocessor-based or not, operate exclusively on direct current (DC). The job of a PC's power supply is to transform and rectify the external AC supply to the range of DC voltages required by the computer logic, associated electronic components, the DC motors in the hard disk, floppy, CD-ROM, and DVD drives, and the system fans. Typical DC rails in a PC are rated at 1.5, 3.3, 5, -5, 12 and -12 volts. Also note that because notebook and laptop computers run from a rechargeable DC battery, they require special DC-DC converters to generate the required range of DC voltages. Several colour-designated cables emanate from a computer's power
supply unit, the largest of which is connected to the computer’s main circuit
board, called the motherboard. The various DC voltages are distributed via the
power supply rails printed onto the circuit board.
The Basic Input/Output System
The Basic Input/Output System (BIOS) is system software and
is a collection of hardware-related software routines embedded as firmware on a
read-only memory (ROM) integrated circuit (IC) ‘chip’ which is typically housed
on a computer’s motherboard. Usually
referred to as ROM BIOS, this software component provides the most fundamental
means for the operating system to communicate with the hardware. However, most
BIOS’s are 16 bit programs and must operate in real mode[1]
on machines with Intel processors. While this does not cause performance
problems during the boot-up phase, it means a degradation in PC performance as
the CPU switches from protected to real mode when BIOS routines are referenced
by an operating system. 32 bit BIOS’s are presently in use, but are not
widespread. Modern 32 bit operating systems such as LINUX do not use the BIOS
after bootup, as the designers of LINUX integrated 32 bit versions of the BIOS
routines into the LINUX kernel. Hence, the limitations of real mode switches in
the CPU are avoided. Nevertheless, the BIOS plays a critical role during the
boot-up phase, as it performs the power-on self test (POST) for the computer
and then loads the boot code from the hard disk’s master boot record (MBR),
which in turn copies the system software into RAM and loads it into the CPU.
When a computer is first turned on, DC voltages are applied
to the CPU and associated electrical and logic circuits. This would lead to
electronic mayhem if the CPU did not assert control. However, a CPU is merely a
collection of hundreds of thousands (and now millions) of logic circuits. CPU
designers therefore built in a predetermined sequence of programmed electronic
events, which are triggered when a signal appears on the CPU's reset pin. This causes the CPU's control unit to use the memory address in the instruction counter (IC) register to fetch the first instruction to be executed. The 32-bit value placed in the IC is the address of the first byte of the final 64 KB segment in the first 1 MB of the computer's address space (this is a hangover from the early days of the PC, when the last 384 KB of the first 1 MB of the address space was reserved for the system and peripheral BIOS routines, each of which was 64 KB in length). This is the address of the first of the many 16-bit BIOS instructions to be executed: remember, these instructions make up the various hardware-specific software routines in the system BIOS. These routines systematically check each basic hardware component, including the CPU, RAM, the system bus (including address, data and control lines), and the expansion buses. Peripheral devices such
as the video graphics chipset/adapter, hard disk drives, etc. are then checked.
Each hardware device also has its own BIOS routines that enable the CPU to
communicate with it. These are viewed as extensions of the system BIOS and on
boot up are invoked to ensure that the device is operating properly. The ROM
BIOS also runs a check on itself. Of course all of this happens under the
control of the CPU, which is itself controlled by the BIOS routines.
If any of the basic components, such as the CPU, RAM or system bus, malfunctions, a special sequence of beeps is emitted from the system speaker. Examples of errors detected by POST routines are: BIOS ROM checksum failure, RAM refresh failure, RAM parity check failure, RAM address line failures, base 64K RAM failure, timer malfunction, CPU malfunction, keyboard failure, video memory malfunction, and so on. Once the video has been tested, text error messages are displayed for malfunctioning components. These allow a repair technician to quickly diagnose the cause of the failure. Finally, the BIOS compares the system configuration with that stored in the CMOS chip. If a new device has been added (e.g. a hard disk drive), changed, or its configuration altered, the BIOS will alert the user and/or take remedial action as necessary. For example, it will register a newly installed hard disk in the CMOS table.
CMOS stands for complementary metal oxide semiconductor (or
silicon, in some texts). CMOS integrated circuits have low power consumption
characteristics and are therefore suitable as non-volatile RAM. CMOS chips can
contain 64 KB of data, but the core system data takes only 128 bytes. The BIOS
puts structure on the data much like a database management system would. It
also provides a software interface called the Setup Utility that can be
accessed at boot-up. The Setup Utility provides software control of the following system components: the CPU, BIOS routines, the motherboard chipset, integrated peripherals (e.g. the floppy disk drive), power management, Plug'n'Play, the Peripheral Component Interconnect (PCI) bus, and basic system security. The drawback of CMOS chips is that power has to be maintained even when the computer is powered down. This is achieved using a small long-life battery mounted on
the motherboard. However, with the advent of Flash ROM, the CMOS function has
been integrated with the BIOS itself. This has also been significant for BIOS
upgrades, which can be downloaded over the Internet and loaded or ‘flashed’
onto the ROM BIOS chip. Similar software is used to save or make changes to
system setup features.
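To make the CMOS index/data mechanism concrete, the following is a minimal sketch, assuming an x86 Linux machine and root privileges; it reads one CMOS register (the real-time-clock seconds counter) through the legacy port pair 0x70/0x71. It is illustrative only and is not part of any BIOS or setup utility.

/* Minimal sketch: reading a CMOS register (the RTC seconds counter)
 * through the legacy index/data port pair 0x70/0x71 on x86 Linux.
 * Requires root privileges for ioperm(); register 0x00 is the RTC
 * seconds value, stored as BCD on most PCs. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/io.h>

int main(void)
{
    if (ioperm(0x70, 2, 1) != 0) {          /* request access to ports 0x70-0x71 */
        perror("ioperm");
        return EXIT_FAILURE;
    }
    outb(0x00, 0x70);                       /* select CMOS register 0x00 (seconds) */
    unsigned char bcd = inb(0x71);          /* read its value from the data port   */
    unsigned char seconds = (bcd >> 4) * 10 + (bcd & 0x0F);  /* BCD to binary      */
    printf("RTC seconds: %u\n", seconds);
    return 0;
}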
The other major function of the BIOS is to identify the boot
device (CD-ROM, floppy disk or hard disk) and transfer the operating system
code to RAM. The boot strap loader is simply a short routine that polls each
bootable device and then uses the device's master boot record to locate the system
and/or boot partitions and thereby load the operating system files. In the
Windows NT/2000 world, if there is more than one partition or disk drive, with
one or more operating systems, then the first of these will be called the
system partition. In Windows NT/2000 machines the system partition will hold
the following files: NTLDR, BOOT.INI, NTDETECT.COM, NTBOOTDD.SYS. For example,
the boot-up routine in the MBR will start the NTLDR program. In a multiboot
system with more than one operating system, NTLDR will examine the BOOT.INI
file to identify the default operating system and/or present options to the
user to boot up a particular operating system such as Windows 98, XP etc. If
Windows NT/2000 is selected, then the NTDETECT.COM program determines the hardware
configuration of the computer. This will have been previously stored in the
HKEY_LOCAL_MACHINE Hive of the Registry. The registry is stored as a binary
file, but is a database of hardware and software components, as well as
authorized users, their passwords and personal settings.
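As a rough illustration of what a boot-strap loader finds in a master boot record, the sketch below parses the partition table of a raw disk image. The file name disk.img is a hypothetical example; the offsets (partition table at byte 446, signature bytes 0x55 0xAA at byte 510) follow the standard MBR layout.

/* Minimal sketch: inspecting the master boot record of a raw disk image.
 * The MBR is the first 512-byte sector: boot code, then four 16-byte
 * partition entries at offset 446, then the 0xAA55 signature at offset 510. */
#include <stdio.h>
#include <stdint.h>

#pragma pack(push, 1)
struct mbr_partition {
    uint8_t  boot_flag;      /* 0x80 = active/bootable                  */
    uint8_t  chs_start[3];
    uint8_t  type;           /* e.g. 0x07 = NTFS, 0x83 = Linux          */
    uint8_t  chs_end[3];
    uint32_t lba_start;      /* first sector of the partition           */
    uint32_t sector_count;
};
#pragma pack(pop)

int main(void)
{
    uint8_t sector[512];
    FILE *f = fopen("disk.img", "rb");       /* hypothetical raw disk image */
    if (!f || fread(sector, 1, 512, f) != 512) { perror("read"); return 1; }

    if (sector[510] != 0x55 || sector[511] != 0xAA) {
        fprintf(stderr, "no valid MBR signature\n");
        return 1;
    }
    const struct mbr_partition *pt = (const struct mbr_partition *)&sector[446];
    for (int i = 0; i < 4; i++)
        printf("partition %d: type 0x%02X, start LBA %u, %u sectors%s\n",
               i, pt[i].type, pt[i].lba_start, pt[i].sector_count,
               pt[i].boot_flag == 0x80 ? " (bootable)" : "");
    fclose(f);
    return 0;
}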
The Motherboard and the Chipset
The motherboard or system board houses all system
components, from the CPU, RAM, expansion slots (e.g. ISA and PCI), to the I/O
controllers. However, the key component on a motherboard is the chipset. While
motherboards are identified physically by their form factor, the chipset
designation indicates the capability of the motherboard to house system
components. The most popular form factor is ATX, a specification introduced by Intel to increase air movement for cooling on-board components and to allow easier access to the CPU and RAM. While the motherboard contains many
chips or ICs, such as the CPU, RAM, BIOS, and a variety of smaller chips, two
chips now handle most of the I/O functionality of a PC. The first is the
Northbridge chip, which handles all communication (address, data and control)
to the CPU, RAM, Accelerated Graphics Port and PCI devices. The frontside system
bus (FSB) terminates on the Northbridge chip and permits the CPU to access the
RAM, AGP and PCI devices and those serviced by the Southbridge chip (and vice
versa). The Southbridge chip permits communication with slow peripherals such
as the floppy disk drive, the hard disk drives/CD-ROMs, ISA devices, and the parallel, serial, mouse and keyboard ports, as well as the Flash ROM BIOS.
Figure 3 The Intel 850 Chipset
Intel and VIA are the leaders in chipset manufacture as of
2002, although there are several other manufacturers, such as ALi and SiS. While Intel
services its own CPUs, VIA manufactures for both Intel and its major competitor
AMD. In 2002, the basic Intel i850 chipset consisted of the 82850 Northbridge
MCH (Memory Controller Hub) and an ICH2 (I/O Controller Hub) Southbridge. The
chipset also contains a Firmware Hub (FH) that provides access to the Flash ROM
BIOS. This chipset permits up to 4 GB of RAM with ECC (error correction), AGP 4X mode, four Ultra ATA/100 IDE disk drives, and four USB ports. ISA is not supported. Different chipset designs support different
RAM types and speeds (e.g. DDR SDRAM or RAMBus DRAM), CPU types and packaging,
system bus speeds, and so on.
In 2000, Intel announced that the future of RAM in the PC
industry was RAMBus DRAM (RDRAM). This heralded the release of the Intel 820
‘Camino’ chipset, which supported three RAMBus memory slots. However, errors in
the design meant that only two memory slots could be used. A loss of confidence
in the marketplace led to the withdrawal of the ill-fated Camino and its replacement with the Intel 840 'Carmel' chipset. This included a 64-bit PCI controller, a
redesigned and improved RDRAM memory repeater, and an SDRAM memory repeater
that converts the RDRAM protocol to SDRAM. This was a smart move by Intel, but it backfired terribly, as the SDRAM hub had design errors that limited the number of SDRAM modules that could be used. In addition, the RDRAM to
SDRAM conversion protocol impaired overall memory throughput when using SDRAM.
Consequently, faster memory performance on Intel’s Pentium III Coppermine CPUs
with a 133 MHz frontside bus could only be achieved using VIA's Apollo Pro 133A. To make matters worse, the Intel 815 Solano chipset, which was introduced to support PC133 DIMMs (SDRAM memory modules) and to help regain market share from VIA, would not allow SDRAM modules to work at 133 MHz if CPUs (such as certain variants of Intel's Pentium III) rated for a 100 MHz external clock rate were fitted on the motherboard. This particularly applied to the Celeron
family, which ran at a 66 MHz external clock rate. It is significant that many of Intel's competitors promoted PC133 and PC266 DIMM standards over the more
expensive RAMBus DRAM. This further impeded the acceptance of RDRAM; however,
by late 2002, RDRAM had its own market niche as the price of SDRAM increased
once more.
Intel learned from its experience with Camino and Carmel
chipsets. Bowing to market pressure it designed two new chipset families for
use with its new Pentium IV CPU. The first of these, the i845 (see Figure 2)
was targeted at systems based on the Pentium IV and synchronous DRAM memory such as PC133, PC266 and PC333, with up to 3 GB of memory. The i850 (see Figure 3) was targeted at RDRAM-based systems of up to 4 GB, which supported PC800, PC1033 and PC1066 RAMBus memory.
In late 2002, the Intel 845GE chipset was released to support PC333 DDR SDRAM and the Pentium 4 processor. The chipset also included Intel's Extreme Graphics technology, which
ran at 266 MHz core speed. The basic member of the Intel 850 chipset family had support
for PC800 RDRAM memory and provided a balanced performance platform for the
Pentium 4 processor with a 400 MHz system bus and the NetBurst™ architecture. It also supported dual-channel access to RDRAM RIMMs, which increased overall throughput to 3.2 GB/s. Subsequent developments in this chipset family provided support
for RDRAM running at 1033 MHz and 1066 MHz and a 533 MHz FSB. Further advances in DDR SDRAM technologies saw DDR SDRAM-based Intel and VIA chipsets which accommodated PC2400 and PC2700 DDR SDRAM running at 150 MHz and 166 MHz respectively, which are double-clocked to 300 and 333 MHz (so-called DDR300 and DDR333). However, the evolution of DDR366 and chipset design led to PC3000 DDR SDRAM being released with even higher bandwidth.
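As a back-of-the-envelope illustration of where these DDR module names come from, the sketch below computes the peak bandwidth of a 64-bit-wide DDR module; the 166 MHz figure is the DDR333/PC2700 case mentioned above, and the formula (base clock x 2 transfers per cycle x 8 bytes per transfer) is the usual first-order approximation.

/* Rough bandwidth arithmetic for a DDR SDRAM module:
 * base clock x 2 (double data rate) x 8 bytes (64-bit bus width). */
#include <stdio.h>

int main(void)
{
    double base_clock_mhz = 166.0;                    /* SDRAM cell clock       */
    double transfers_mhz  = base_clock_mhz * 2.0;     /* DDR: two per cycle     */
    double bandwidth_mbs  = transfers_mhz * 8.0;      /* 64-bit-wide module     */
    printf("%.0f MHz DDR -> %.0f MT/s -> ~%.0f MB/s peak (marketed as PC2700)\n",
           base_clock_mhz, transfers_mhz, bandwidth_mbs);
    return 0;
}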
Basic CPU Architectures
CISC vs. RISC
There are two types of fundamental CPU architecture: complex
instruction set computers (CISC) and reduced instruction set computers
(RISC). CISC is the most prevalent and
established microprocessor architecture, while RISC is a relative newcomer.
Intel’s 80x86 and Pentium microprocessor families are CISC-based, although
RISC-type functionality has been incorporated into Pentium CPUs. Motorola’s
68000 family of microprocessors is another example of this type of
architecture. Sun Microsystems’ SPARC microprocessors and MIPS R2000, R3000 and
R4000 families dominate the RISC end of the market; however, Motorola's PowerPC and G4, Intel's i860, and Analog Devices Inc.'s digital signal processors (DSPs) are in wide use. In the PC/workstation market, Apple Computer and Sun employ RISC microprocessors as their choice of CPU.
Table 1 CISC and RISC
CISC | RISC
Large instruction set | Compact instruction set
Complex, powerful instructions | Simple hard-wired machine code and control unit
Instruction sub-commands microcoded in on-board ROM | Pipelining of instructions
Compact and versatile register set | Numerous registers
Numerous memory addressing options for operands | Compiler and IC developed simultaneously
The difference between the two architectures is the relative
complexity of the instruction sets and underlying electronic and logic circuits
in CISC microprocessors. For example, the original RISC I prototype had just 31
instructions, while the RISC II had 39. In the RISC II prototype, these
instructions are hard-wired into the microprocessor using 41,000 integrated
transistors, so that when a program instruction is presented for execution it
can be processed immediately. This typifies the pure RISC approach, which results in up to a fourfold increase in processing power over comparable CISC processors. In contrast, the Intel 386 has 280,000 transistors and uses microcode stored in on-board ROM to process the instructions. Complex instructions have to be decoded first in order to identify which microcode routine needs to be executed to implement the instructions. The Pentium II uses 9.5 million transistors and, while older microcode is retained, the most frequently used and simpler instructions, such as MMX, are hardwired. Thus Pentium CPUs are essentially a hybrid; however, they are still classified as CISC, as their basic instructions are complex.
Remember the internal transistor logic gates in a CPU are
opened and closed under the control of clock pulses (i.e. electrical voltage
values of 0 or 5 V (volts) being 0 or 1). These simply process the binary
machine code or data by producing predetermined outputs for given inputs.
Machine code or instructions (the binary equivalent of high level programming
code) control the operation of the CPU so that logical or mathematical
operations can be executed. In CISC processors, complex instructions are first
decoded and the corresponding microcode routine dispatched to the execution unit.
The decode activity can take several clock cycles depending on the complexity
of the instruction. In the 1970s, an IBM engineer discovered that 20% of the
instructions were doing 80% of the work in a typical CPU. In addition, he found
that a collection of simple instructions could perform the same operation as a complex instruction in fewer clock cycles. This led him to propose an
architecture based on reduced instruction set size, where small instructions
could be executed without decoding and in parallel with others. As indicated,
this simplified CPU design and made for faster processing of instructions with
reduced overhead in terms of clock cycles.
Inside the CPU
The basic function of a CPU is to fetch, decode and execute
instructions held in ROM or RAM. To accomplish this it must fetch data from an
external memory source and transfer it into its own internal memory, each
addressable component of which is called a register. It must also be able to
distinguish between instructions and operands, that is, the read/write memory locations containing the data to be operated on. These may be byte-addressable locations in ROM, RAM or in the CPU's own registers. In addition, the CPU must perform additional tasks such as responding to external events like resets and interrupts, providing memory management facilities to the operating system, and so on. A consideration of the fundamental components in a basic microprocessor is first undertaken before introducing more complex modern devices. Figure 2 illustrates a typical microprocessor architecture.
Microprocessors must perform the following activities:
- Provide temporary storage for addresses and data
- Perform arithmetic and logic operations
- Control and schedule all operations.
Registers
Registers are used for a variety of purposes, such as holding the address of instructions and data, storing the result of an operation, signaling the result of a logic operation, or indicating the status of the program or the CPU itself. Some registers may be accessible to programmers, while others are reserved for use by the CPU itself. Registers store binary values such as 1 or 0 as electrical voltages of, say, 5 volts or 0 volts. They consist of several integrated transistors which are configured as flip-flop circuits, each of which can be switched into a 1 or 0 state. They remain in that state until changed under control of the CPU or until power is removed from the processor. Each register has a specific name and is addressable; some, however, are dedicated to specific tasks, while the majority are 'general purpose'. The width of a register depends on the type of CPU, e.g. a 16-, 32- or 64-bit
microprocessor. In order to provide backward compatibility, registers may be sub-divided.
For example, the Pentium processor is a 32 bit CPU, and its registers are 32
bits wide. Some of these are sub-divided and named as 8 and 16 bit registers in
order to run 8 and 16 bit applications designed for earlier x86
microprocessors.
Instruction Register
When the Bus Interface Unit receives an instruction it
transfers it to the Instruction Register for temporary storage. In Pentium
processors the Bus Interface Unit transfers instructions to the L1 I-Cache; there is no instruction register as such.
Stack Pointer
A ‘stack’ is a small area of reserved memory used to store the data in the CPU's registers when: (1) system calls are made by a process to operating system routines; (2) hardware interrupts are generated by input/output (I/O) transactions on peripheral devices; (3) a process initiates an I/O transfer; or (4) a process rescheduling event occurs as a result of a hardware timer interrupt. This transfer of register contents is called a
‘context switch’. The stack pointer is the register which holds the address of
the most recent ‘stack’ entry. Hence, when a system call is made by a process
(to say print a document) and its context is stored on the stack, the called
system routine uses the stack pointer to reload the register contents when it
is finished printing. Thus the process can continue where it left off.
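The following sketch shows, purely illustratively, the kind of register snapshot that is pushed onto the stack during such a context switch; the structure and its field names are assumptions made for the example and do not correspond to any particular operating system's layout.

/* Illustrative only: a register snapshot of the sort an operating system
 * might save on a process's stack during a context switch. Field names
 * follow the IA-32 register names; the exact layout is OS-specific. */
#include <stdint.h>

struct cpu_context {
    uint32_t eax, ebx, ecx, edx;   /* general-purpose data registers          */
    uint32_t esi, edi, ebp;        /* index and base-pointer registers        */
    uint32_t esp;                  /* stack pointer at the time of the switch */
    uint32_t eip;                  /* where to resume execution               */
    uint32_t eflags;               /* status/flag bits                        */
    uint16_t cs, ds, es, ss;       /* segment selectors                       */
};

When the system routine completes, the saved EIP, EFLAGS and general registers are reloaded from this snapshot, which is how the interrupted process continues exactly where it left off.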
Instruction Decoder
The Instruction Decoder is an arrangement of logic elements
which act on the bits that constitute the instruction. Simple instructions with
corresponding logic hard-wired into the execution unit are simply passed to the Execution Unit (and/or the MMX unit in the Pentium II, III and IV); complex
instructions are decoded so that related microcode modules can be transferred
from the CPU’s microcode ROM to the execution unit. The Instruction Decoder
will also store referenced operands in appropriate registers so data at the
memory locations referenced can be fetched.
Program or Instruction Counter
The Program Counter (PC) is the register that stores the
address in primary memory (RAM or ROM) of the next instruction to be executed.
In 32 bit systems, this is a 32 bit linear or virtual memory address that
references a byte (the first of 4 required to store the 32 bit instruction) in
the process’s virtual memory address space. This value is translated to
determine the real memory address in which the instruction is stored. When the
referenced instruction is fetched, the address in the PC is incremented to the
address of the next instruction to be executed. If the current address is 00B0
hex, then the next address will be 00B4 hex. Remember, each byte in RAM is individually addressable; however, each complete instruction is 32 bits or 4 bytes, so the address of the next instruction in the process will be 4 bytes on.
Accumulator
The accumulator may contain data to be used in a
mathematical or logical operation, or it may contain the result of an
operation. General purpose registers are used to support the accumulator by
holding data to be loaded to/from the accumulator.
Computer Status Word or Flag Register
The result of an ALU operation may have consequences for subsequent operations; for example, changing the path of execution. Individual
bits in this register are set or reset in accordance with the result of
mathematical or logical operations. Also called a flag, each bit in the
register has a preassigned meaning and the contents are monitored by the
control unit to help control CPU related actions.
Arithmetic and Logic Unit
The Arithmetic and Logic Unit (ALU) performs all arithmetic
and logic operations in a microprocessor, viz. addition, subtraction, logical AND, OR, EX-OR, etc. A typical ALU is connected to the accumulator, general-purpose registers and other CPU components that help transfer the result of its
operations to RAM via the Bus Interface Unit and the system bus. The results
may also be written into internal or external caches.
Control Unit
The control unit coordinates and manages CPU activities, in
particular the execution of instructions by the arithmetic and logic unit
(ALU). In Pentium processors its role is complex, as microcode from decoded instructions is pipelined for execution by two ALUs.
The System Clock
The Intel 8088 had a clock speed of 4.77 MHz; that is, its
internal logic gates were opened and closed under the control of a square wave
pulsed signal that had a frequency of 4.77 million cycles per second.
Alternatively put, the logic gates opened and closed 4.77 million times per
second. Thus, instructions and data were pumped through the integrated transistor
logic circuits at a rate of 4.77 million bits per second. Later designs ran at higher speeds, viz. the i286 at 8-20 MHz, the i386 at 16-33 MHz, and the i486 at 25-50 MHz. Where does this clock signal come from? Each
motherboard is fitted with a quartz oscillator in a metal package that
generates a square wave clock pulse of a certain frequency. In i8088 systems
the crystal oscillator ran at 14.318 MHz and this was fed to the i8284 to generate the system clock frequency: 4.77 MHz in earlier systems, up to 10 MHz in later designs. Later, i286 PCs had a 12 MHz crystal which provided the i82284 multiplier/divider IC with the primary clock signal. This then divided/multiplied the basic 12 MHz to generate the system clock signal of 8-20 MHz. With the advent of the i486DX, the system clock signal, which ran at 25 or 33 MHz, was effectively multiplied by factors of 2 and 3 to deliver an internal CPU clock speed of 50, 66, 75 or 100 MHz. This approach is used in Pentium IV architectures, where the primary crystal source delivers a relatively slow 50 MHz clock signal that is then multiplied to the system clock speed of 100-133 MHz. The internal multiplier in the Pentium then multiplies this by a factor of 20+ to obtain speeds of 2 GHz and above.
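The clock arithmetic involved is simple, as the short sketch below shows; the 100 MHz front-side-bus figure and the x20 multiplier are taken from the Pentium example above.

/* Minimal sketch of the clock arithmetic described above: the front-side-bus
 * (system) clock is multiplied inside the CPU to give the core clock. */
#include <stdio.h>

int main(void)
{
    double fsb_mhz    = 100.0;            /* system (front-side bus) clock */
    double multiplier = 20.0;             /* internal CPU clock multiplier */
    double core_mhz   = fsb_mhz * multiplier;
    printf("core clock = %.0f MHz (%.1f GHz)\n", core_mhz, core_mhz / 1000.0);
    return 0;
}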
Instruction Cycle
An instruction cycle consists of the activities required to
fetch and execute an instruction. The length of time taken to fetch and execute is measured in clock cycles. In CISC processors this will take many clock cycles, depending on the complexity of the instruction and the number of memory references made to load operands. In RISC computers the number of clock cycles is reduced significantly. When the CPU finishes the execution of an instruction, it transfers the content of the program or instruction counter into the Bus Interface Unit (1 clock cycle). This is then gated onto the system address bus and the read signal is asserted on the control bus (1 clock cycle). This is a signal to the RAM controller that the value at this address is to be read from memory and loaded onto the data bus (4+ clock cycles). The instruction is read in from the data bus and decoded (2+ clock cycles). The fetch and decode
activities constitute the first machine cycle of the instruction cycle. The
second machine cycle begins when the instruction’s operand is read from RAM and
ends when the instruction is executed and the result written back to memory.
This will take at least another 8+ clock cycles, depending on the complexity of
the instruction. Thus an instruction cycle will take at least 16 clock cycles,
a considerable length of time. Together, RISC processors and fast RAM can keep
this to a minimum. However, Intel made advances by super pipelining
instructions, that is by interleaving fetch, decode, operand read, execute, and
retire (i.e. write the result of the instruction to RAM) activities into two
separate pipelines serving two ALUs. Hence, instructions are not executed
sequentially, but concurrently and in parallel—more about pipelining later.
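To make the fetch-decode-execute sequence concrete, here is a toy simulator of an imaginary accumulator machine; the instruction format (a 4-bit opcode and a 12-bit operand) is invented for the example and is not any real CPU's instruction set.

/* A toy fetch-decode-execute loop. A "program" is an array of one-word
 * instructions, each holding a 4-bit opcode and a 12-bit memory operand. */
#include <stdio.h>
#include <stdint.h>

enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

int main(void)
{
    uint16_t memory[16] = {
        (OP_LOAD  << 12) | 10,   /* acc = memory[10]   */
        (OP_ADD   << 12) | 11,   /* acc += memory[11]  */
        (OP_STORE << 12) | 12,   /* memory[12] = acc   */
        (OP_HALT  << 12)
    };
    memory[10] = 7; memory[11] = 35;

    uint16_t pc = 0, acc = 0;
    for (;;) {
        uint16_t instr   = memory[pc++];        /* fetch, then advance the PC */
        uint16_t opcode  = instr >> 12;         /* decode                     */
        uint16_t operand = instr & 0x0FFF;
        switch (opcode) {                       /* execute                    */
        case OP_LOAD:  acc = memory[operand];          break;
        case OP_ADD:   acc += memory[operand];         break;
        case OP_STORE: memory[operand] = acc;          break;
        case OP_HALT:  printf("result = %u\n", memory[12]); return 0;
        }
    }
}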
5th and 6th Generation Intel CPU Architecture
The Pentium microprocessor was the last of Intel's 5th
generation microprocessors and had several basic units: the Bus Interface Unit
(BIU); the I-Cache (8 KB of write-through Static RAM—SRAM); the Instruction
Translation Lookaside Buffer (TLB); The D-Cache (8KB of write-back SRAM); the
Data TLB; the Clock Driver/Multiplier; Instruction Fetch Unit; the Branch
Prediction Unit; the Instruction Decode Unit; Complex Instruction Support Unit;
Superscalar Integer Execution Unit; Pipelined Floating Point Unit. Figure 5
presents a block diagram of the original Pentium.
The Pentium was the first Intel chip to have a 64 bit
external data bus which was split internally into two separate pipelines, each
32 bits wide. This allowed the Pentium to execute two instructions
simultaneously; however, more than one instruction could be in the pipeline,
thus increasing instruction throughput.
Heat dissipation is the enemy of chip designers: the greater the number of integrated transistors and the higher the speed of operation and the operating voltage, the more power is consumed and the more heat generated. The
first two Pentium versions ran at 60 and 66 MHz respectively with an operating voltage of 5 V DC. Hence they ran quite hot. However, a change in package design (from Socket 4 to Sockets 5 and 7, Pin Grid Array—PGA) and a reduction in operating voltage to 3.3 volts lowered power consumption and heat dissipation. Intel also introduced a clock multiplier which multiplied the external clock signal and enabled the Pentium to run at 1.5, 2, 2.5 and finally 3 times this speed. Thus, while the system bus ran at 50, 60, and 66 MHz, the CPU ran at 75-200 MHz.
In 1997, Intel changed the Pentium design in several ways, the most significant of which were the inclusion of an MMX (multimedia extension) unit and 16 KB instruction and data caches. The MMX unit contains eight new 64-bit registers and 57 'simple' hardwired MMX instructions that operate on four new data types. The internal architecture and external operation of the Pentium family evolved from the Pentium MMX, with the Pentium Pro, Pentium II and Pentium III. However, major design changes came with the Pentium IV. Modifications and design changes centered on (a) the physical package; (b) the process by which instructions were decoded and executed; (c) support for memory beyond the 4 GB limit; (d) the integration and enhancement of L1 and L2 cache performance and size; (e) the addition of a new cache; and (f) the speed of internal and external operation.
Each of these issues receives attention in the following subsections.
Figure 5 Pentium CPU Block Diagram
Physical Packaging
Two terms are employed to describe the packaging employed
for the Pentium family of processors: the first refers to the motherboard
connection, and the second to the actual package itself. For example, the
original Pentium P5 was fitted to the Socket 5 type connection on the
motherboard using a Staggered Pin Grid Array (SPGA) for the die’s I/O (die is
the technical term for the physical structure that incorporates the chip).
Later variants used the Socket 7 connector. The Pin Grid Array (PGA) family of
packages are associated with different Socket types, which are numbered. A pin
grid array is simply an array of metal pin connectors used to form an
electrical connection between the internal electronics of the CPU (packaged on
the die) and other system components like the system chipsets. The pins plug
into corresponding receptacle pinholes in the CPU’s socket on the motherboard.
The different types of PGA reflect the type of packaging, e.g. ceramic to
plastic, the number of pins, and how they are arrayed. The Pentium Pro used an SPGA with a staggering 387 pins for connection to the motherboard socket, called Socket 8. The Pentium Pro was the first Intel processor to have an L2 cache connected to the CPU via a backside bus, but on a separate die. This was a significant technical achievement in packaging.
When Intel designed the Pentium II they decided to change the packaging
significantly and introduced a Single Edge Contact Connector (SECC) package
(with three variants: SECC for the Pentium II, SECC2 for the Pentium II and SEPP
for the Celeron), each of which plugged into the Slot 1 connector on the
motherboard. However, later variants of the Celeron and Pentium III used PGA
packaging for certain applications: the Celeron uses the Plastic PGA, the
Celeron III and Pentium III the Flip-Chip Pin Grid Array (FC-PGA). Both use the
370-pin socket. The Pentium IV saw a full return to the PGA for all chips. Here a Flip-Chip Pin Grid Array (FC-PGA) was employed in a 478-pin package.
Overall Architectural Comparison of the Pentium Family of Microprocessors
The Pentium (P5) first shipped in 1993 and had 3.1 million transistors. It used a 5-volt supply to power its core and I/O logic, used a PGA on Socket 4, had a 2x8 KB L1 cache, and operated at 50, 60 and 66 MHz. The system bus also operated at these speeds. The Pentium (P54C) was released in 1994 and had a PGA on Sockets 5 and 7 and a 3.3-volt supply for core and I/O logic. It was also the first to use a multiplier to give processor speeds of 75, 90, 100, 120, 133, 150, 166 and 200 MHz. The last version of this sub-generation was the Pentium MMX (P55C). This had 4.1 million transistors, fitted Socket 7, and had a 2x16 KB L1 cache with improved branch prediction logic. It operated at 2.8 V for its core logic and 3.3 V for I/O logic. Its 60 and 66 MHz system clock speed was multiplied on board the CPU to give CPU clock speeds of between 120 and 300 MHz. The chief features of the Pentium were as follows:
• Superscalar architecture: two integer (U (slow) and V (fast)) pipelines and one floating-point pipeline. The U and V pipelines contain five stages of instruction execution, while the floating-point pipeline has 8 stages. The U and V pipelines are served by two 32-byte prefetch buffers. This allows overlapping execution of instructions in the pipelines.
• Dynamic branch prediction using the Branch Target Buffer. The Pentium's branch prediction logic helps speed up program execution by anticipating branches and ensuring that branched-to code is available in cache.
• An Instruction and a Data Cache, each of 8 KB capacity.
• A 64-bit system data bus and 32-bit address bus.
• Dual processing capability.
• An on-board Advanced Programmable Interrupt Controller.
• The Pentium MMX version contains an additional MMX unit that speeds up multimedia and 3D applications. Processing multimedia data involves instructions operating on large volumes of packetized data. Intel proposed a new approach: single instruction, multiple data (SIMD), which could operate on video pixels or Internet audio streams. The MMX unit contains eight new 64-bit registers and 57 'simple' hardwired MMX instructions that operate on four new data types. To leverage the features of the MMX unit, applications must be programmed to include the new instructions.
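The sketch below illustrates the single instruction, multiple data idea using the MMX intrinsics that C compilers expose in <mmintrin.h>: one _mm_add_pi16 call adds four packed 16-bit lanes at once. The values and the compile command (e.g. gcc -mmmx) are illustrative assumptions for an x86 target.

/* Hedged SIMD sketch using MMX intrinsics: _mm_add_pi16 performs four
 * 16-bit additions with a single instruction (PADDW). */
#include <stdio.h>
#include <string.h>
#include <mmintrin.h>

int main(void)
{
    __m64 a = _mm_set_pi16(40, 30, 20, 10);   /* four 16-bit lanes            */
    __m64 b = _mm_set_pi16( 4,  3,  2,  1);
    __m64 r = _mm_add_pi16(a, b);             /* all four additions at once   */

    short out[4];
    memcpy(out, &r, sizeof out);              /* lanes stored low-to-high     */
    _mm_empty();                              /* leave MMX state (EMMS)       */
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44  */
    return 0;
}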
Pentium Pro
The Pentium Pro was designed around the 6th
generation P6 architecture, which was optimized for 32 bit instructions and
32-bit operating systems such as Windows NT and Linux. It was the first of the
P6 family, which included the Pentium II, the Celeron variants, and the Pentium
III. As indicated, the physical package was also a significant advance, as was
the incorporation of additional RISC features. However, aimed as it was at the
server market, the Pentium Pro did not incorporate MMX technology. It was
expensive to produce as it included the L2 cache on its substrate (but on a
separate die) and had 5.5 million transistors at its core and over 8 million in
its L2 cache. Its core logic operated at 3.3 volts. The microprocessor was
still, however, chiefly CISC in design, and optimized for 32 bit operation. The
chief features of the Pentium Pro were:
• A partly integrated L2 cache of up to 512 KB (on a specially manufactured separate SRAM die) that was connected via a dedicated 'backside' bus that ran at full CPU speed.
• Three 12-stage pipelines.
• Speculative execution of instructions.
• Out-of-order completion of instructions.
• 40 renamed registers.
• Dynamic branch prediction.
• Multiprocessing with up to 4 Pentium Pros.
• An increased address bus size of 36 bits (from 32) to enable up to 64 GB of memory to be used. (Please note that the 4 extra bits multiply the addressable range by 16; this gives 4 GB x 16 = 64 GB of memory.)
The following description, taken from Intel's introduction to its microprocessor architecture, is relevant to all members of the P6 family, including the Celeron, Pentium II and III.
The Intel Pentium Pro processor has a three-way superscalar architecture. The term “three-way superscalar” means that
using parallel processing techniques, the processor is able on average to
decode, dispatch, and complete execution of (retire) three instructions per
clock cycle. To handle this level of instruction throughput, the Pentium Pro
processor uses a decoupled, 12-stage superpipeline that supports out-of-order
instruction execution. It does this by incorporating even more parallelism than
the Pentium processor. The Pentium Pro processor provides Dynamic Execution
(micro-data flow analysis, out-of-order execution, superior branch prediction,
and speculative execution) in a superscalar implementation.
The centerpiece of the Pentium Pro
processor architecture is an innovative out-of-order execution mechanism called
“dynamic execution.” Dynamic execution incorporates three data-processing
concepts:
• Deep branch prediction.
• Dynamic data flow analysis.
• Speculative execution.
Branch prediction is a concept found in most mainframe and
high-speed RISC microprocessor architectures. It allows the processor to decode
instructions beyond branches to keep the instruction pipeline full. In the
Pentium Pro processor, the instruction fetch/decode unit uses a highly
optimized branch prediction algorithm to predict the direction of the
instruction stream through multiple levels of branches, procedure calls, and
returns.
Figure 6 Functional Block Diagram of the Pentium Pro Processor Micro-architecture
Dynamic data flow analysis involves real-time analysis of
the flow of data through the processor to determine data and register
dependencies and to detect opportunities for out-of-order instruction
execution. The Pentium Pro processor dispatch/execute unit can simultaneously
monitor many instructions and execute these instructions in the order that
optimizes the use of the processor’s multiple execution units, while
maintaining the integrity of the data being operated on. This out-of-order
execution keeps the execution units busy even when cache misses and data
dependencies among instructions occur.
Speculative execution refers to the processor’s ability to
execute instructions ahead of the program counter but ultimately to commit the
results in the order of the original instruction stream. To make speculative
execution possible, the Pentium Pro processor microarchitecture decouples the
dispatching and executing of instructions from the commitment of results. The
processor’s dispatch/execute unit uses data-flow analysis to execute all
available instructions in the instruction pool and store the results in
temporary registers. The retirement unit then linearly searches the instruction
pool for completed instructions that no longer have data dependencies with
other instructions or unresolved branch predictions. When completed
instructions are found, the retirement unit commits the results of these
instructions to memory and/or the Intel Architecture registers (the processor’s
eight general-purpose registers and eight floating-point unit data registers)
in the order they were originally issued and retires the instructions from the
instruction pool.
Through deep branch prediction, dynamic data-flow analysis,
and speculative execution, dynamic execution removes the constraint of linear
instruction sequencing between the traditional fetch and execute phases of
instruction execution. It allows instructions to be decoded deep into
multi-level branches to keep the instruction pipeline full. It promotes
out-of-order instruction execution to keep the processor’s six instruction
execution units running at full capacity. And finally it commits the results of
executed instructions in original program order to maintain data integrity and
program coherency.
Three instruction decode units work in parallel to decode
object code into smaller operations called “micro-ops” (microcode). These go
into an instruction pool, and (when interdependencies don’t prevent) can be
executed out of order by the five parallel execution units (two integer, two
FPU and one memory interface unit). The Retirement Unit retires completed
micro-ops in their original program order, taking account of any branches.
The power of the Pentium Pro
processor is further enhanced by its caches: it has the same two on-chip
8-KByte L1 caches as does the Pentium processor, and also has a 256-512 KByte
L2 cache that’s in the same package as, and closely coupled to, the CPU, using
a dedicated 64-bit (“backside”) full clock speed bus. The L1 cache is dual
ported, the L2 cache supports up to 4 concurrent accesses, and the 64-bit
external data bus is transaction-oriented, meaning that each access is handled
as a separate request and response, with numerous requests allowed while
awaiting a response. These parallel features for data access work with the
parallel execution capabilities to provide a “non-blocking” architecture in
which the processor is more fully utilized and performance is enhanced.
Pentium Pro Modes of Operation
The Intel Architecture supports three operating modes:
protected mode, real-address mode, and system management mode. The operating
mode determines which instructions and architectural features are accessible:
• Protected mode. The native state of the processor. In this mode all instructions and
architectural features are available, providing the highest performance and
capability. This is the recommended mode for all new applications and operating
systems. Among the capabilities of protected mode is the ability to directly
execute “real-address mode” 8086 software in a protected, multi-tasking
environment. This feature is called virtual-8086 mode, although it is
not actually a processor mode. Virtual-8086 mode is actually a protected mode
attribute that can be enabled for any task.
• Real-address mode. Provides the programming environment of the Intel 8086 processor with
a few extensions (such as the ability to switch to protected or system
management mode). The processor is placed in real-address mode following
power-up or a reset.
• System management mode. A standard architectural feature unique to all Intel
processors, beginning with the Intel386 SL processor. This mode provides an
operating system or executive with a transparent mechanism for implementing
platform-specific functions such as power management and system security. The
processor enters SMM when the external SMM interrupt pin (SMI#) is activated or
an SMI is received from the advanced programmable interrupt controller (APIC). In SMM, the processor
switches to a separate address space while saving the entire context of the
currently running program or task. SMM-specific code may then be executed
transparently. Upon returning from SMM, the processor is placed back into its
state prior to the system management interrupt.
The basic execution environment
is the same for each of these operating modes.
Basic Pentium Execution Environment
Any program or task running on an Intel Architecture
processor is given a set of resources for executing instructions and for
storing code, data, and state information. These resources (shown in Figure 7) include an address space of up to 2^32 bytes, a set of general data registers, a set of
segment registers, and a set of status and control registers. When a program
calls a procedure, a procedure stack is added to the execution environment.
(Procedure calls and the procedure stack implementation are described in
Chapter 4, Procedure Calls, Interrupts, and Exceptions.)
Figure 7 Basic Execution Environment
Pentium Pro Memory Organization
The memory that the processor addresses on its bus is called
physical memory. Physical memory is organized as a sequence of 8-bit
bytes. Each byte is assigned a unique address, called a physical address.
The physical address space ranges from zero to a maximum of 2^32 – 1 (4 gigabytes). Virtually any operating system or
executive designed to work with an Intel Architecture processor will use the
processor’s memory management facilities to access memory. These facilities
provide features such as segmentation and paging, which allow memory to be
managed efficiently and reliably. Memory management is described in detail
later. The following paragraphs describe the basic methods of addressing memory
when memory management is used. When employing the processor’s memory
management facilities, programs do not directly address physical memory.
Instead, they access memory using any of three memory models: flat, segmented,
or real-address mode.
With the flat memory model (see Figure 8), memory
appears to a program as a single, continuous address space, called a linear
address space. Code (a program’s instructions), data, and the procedure
stack are all contained in this address space. The linear address space is byte
addressable, with addresses running contiguously from 0 to 2^32 - 1. An address for
any byte in the linear address space is called a linear address. With the segmented memory
model, memory appears to a program as a group of independent address spaces
called segments. When using this model, code, data, and stacks are
typically contained in separate segments. To address a byte in a segment, a
program must issue a logical address, which consists of a segment
selector and an offset. (A logical address is often referred to as a far
pointer.) The segment selector identifies the segment to be accessed
and the offset identifies a byte in the address space of the segment. The
programs running on an Intel Architecture processor can address up to 16,383 segments of different sizes and types, and each segment can be as large as 2^32 bytes (4 GB).
Internally, all the segments that are defined for a system
are mapped into the processor’s linear address space. So, the processor translates
each logical address into a linear address to access a memory location. This
translation is transparent to the application program. The primary reason for
using segmented memory is to increase the reliability of programs and systems.
For example, placing a program’s stack in a separate segment prevents the stack
from growing into the code or data space and overwriting instructions or data,
respectively. And placing the operating system’s or executive’s code, data, and
stack in separate segments protects them from the application program and vice
versa.
With either the flat or segmented model, the Intel
Architecture provides facilities for dividing the linear address space into
pages and mapping the pages into virtual memory. If an operating system/executive
uses the Intel Architecture’s paging mechanism, the existence of the pages is
transparent to an application program.
The real-address mode model uses the memory model for
the Intel 8086 processor, the first Intel Architecture processor. It was
provided in all the subsequent Intel Architecture processors for compatibility
with existing programs written to run on the Intel 8086 processor. The real
address mode uses a specific implementation of segmented memory in which the
linear address space for the program and the operating system/executive
consists of an array of segments of up to 64K bytes in size each. The maximum
size of the linear address space in real-address mode is 2^20 bytes.
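A minimal sketch of the real-address-mode translation just described: the 16-bit segment value is shifted left four bits and added to the 16-bit offset, which is why the real-mode address space is limited to 2^20 bytes. The example address F000:FFF0 is the classic reset vector.

/* Real-address-mode translation: linear address = (segment << 4) + offset,
 * giving a 20-bit (1 MB) address space. */
#include <stdio.h>
#include <stdint.h>

static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

int main(void)
{
    /* F000:FFF0 is where the CPU fetches its first instruction after reset. */
    printf("F000:FFF0 -> %05Xh\n", real_mode_linear(0xF000, 0xFFF0));  /* FFFF0h */
    return 0;
}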
Figure 8 Three Memory Management Models
32-bit vs. 16-bit Address and Operand Sizes
The processor can be configured for 32-bit or 16-bit address
and operand sizes. With 32-bit address and operand sizes, the maximum linear
address or segment offset is FFFFFFFFH (2^32 – 1), and operand sizes are typically 8
bits or 32 bits. With 16-bit address and operand sizes, the maximum linear
address or segment offset is FFFFH (2^16 – 1), and operand sizes are typically 8 bits or
16 bits. When using 32-bit addressing, a logical address (or far pointer) consists
of a 16-bit segment selector and a 32-bit offset; when using 16-bit addressing,
it consists of a 16-bit segment selector and a 16-bit offset. Instruction
prefixes allow temporary overrides of the default address and/or operand sizes
from within a program. When operating in protected mode, the segment descriptor
for the currently executing code segment defines the default address and
operand size. A segment descriptor is a system data structure not normally
visible to application code. Assembler directives allow the default addressing
and operand size to be chosen for a program. The assembler and other tools then
set up the segment descriptor for the code segment appropriately. When
operating in real-address mode, the default addressing and operand size is 16
bits. An address-size override can be used in real-address mode to enable 32
bit addressing; however, the maximum allowable 32-bit address is still
0000FFFFH (2^16 – 1).
Figure 9 Application Programming Registers
REGISTERS
The processor provides 16 registers for use in general system
and application programming. As shown in Figure 9, these registers can be grouped
as follows:
• General-purpose data registers. These eight registers are available for storing operands and pointers.
• Segment registers. These registers hold up to six segment selectors.
• Status and control registers. These registers report and allow modification of the state of the processor and of the program being executed.
General-Purpose
Data Registers
The 32-bit general-purpose data registers EAX, EBX, ECX,
EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:
• Operands for logical and arithmetic operations
• Operands for address calculations
Although all of these registers are available for general
storage of operands, results, and pointers, caution should be used when
referencing the ESP register. The ESP register holds the stack pointer and as a
general rule should not be used for any other purpose. Many instructions assign
specific registers to hold operands. For example, string instructions use the
contents of the ECX, ESI, and EDI registers as operands. When using a segmented
memory model, some instructions assume that pointers in certain registers are
relative to specific segments. For instance, some instructions assume that a
pointer in the EBX register points to a memory location in the DS segment.
The following is a summary of these special uses:
• EAX—Accumulator for operands and results data.
• EBX—Pointer to data in the DS segment.
• ECX—Counter for string and loop operations.
• EDX—I/O pointer.
• ESI—Pointer to data in the segment pointed to by the DS register; source pointer for string operations.
• EDI—Pointer to data (or destination) in the segment pointed to by the ES register; destination pointer for string operations.
• ESP—Stack pointer (in the SS segment).
• EBP—Pointer to data on the stack (in the SS segment).
As shown in Figure 9, the lower 16 bits of the general-purpose
registers map directly to the register set found in the 8086 and Intel 286
processors and can be referenced with the names AX, BX, CX, DX, BP, SP, SI, and
DI. Each of the lower two bytes of the EAX, EBX, ECX, and EDX registers can be
referenced by the names AH, BH, CH, and DH (high bytes) and AL, BL, CL, and DL
(low bytes).
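The aliasing of AL, AH, AX and EAX can be modelled with a C union, as in the sketch below; this is only an illustration of the overlap on a little-endian machine, not how the registers are actually implemented.

/* Illustrative model of register aliasing: on a little-endian IA-32 machine
 * AL is the low byte of AX, AH the next byte, and AX the low 16 bits of EAX. */
#include <stdio.h>
#include <stdint.h>

union reg32 {
    uint32_t e;                 /* EAX               */
    uint16_t x;                 /* AX (low 16 bits)  */
    struct { uint8_t l, h; };   /* AL, AH            */
};

int main(void)
{
    union reg32 eax = { .e = 0x12345678 };
    printf("EAX=%08X  AX=%04X  AH=%02X  AL=%02X\n",
           eax.e, eax.x, eax.h, eax.l);   /* EAX=12345678 AX=5678 AH=56 AL=78 */
    return 0;
}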
Segment Registers
The segment registers (CS, DS, SS, ES, FS, and GS) hold
16-bit segment selectors. A segment selector is a special pointer that
identifies a segment in memory. To access a particular segment in memory, the
segment selector for that segment must be present in the appropriate segment
register. When writing application code, you generally create segment selectors
with assembler directives and symbols. The assembler and other tools then
create the actual segment selector values associated with these directives and
symbols. If you are writing system code, you may need to create segment
selectors directly.
How segment registers are used depends on the type of memory
management model that the operating system or executive is using. When using
the flat (unsegmented) memory model, the segment registers are loaded with
segment selectors that point to overlapping segments, each of which begins at
address 0 of the linear address space (as shown in Figure 10). These overlapping
segments then comprise the linear-address space for the program. (Typically,
two overlapping segments are defined: one for code and another for data and
stacks. The CS segment register points to the code segment and all the other
segment registers point to the data and stack segment.)
When using the segmented memory model, each segment register
is ordinarily loaded with a different segment selector so that each segment
register points to a different segment within the linear-address space (as
shown in Figure 11). At any time, a program can thus access up to six segments
in the linear-address space. To access a segment not pointed to by one of the
segment registers, a program must first load the segment selector for the
segment to be accessed into a segment register.
Figure 10 Use of Segment Registers for Flat Memory Model
Figure 11 Use of Segment Registers in Segmented Memory Model
Each of the segment registers is associated with one of three types of storage: code, data, or stack. For example, the CS register
contains the segment selector for the code segment, where the
instructions being executed are stored. The processor fetches instructions from
the code segment, using a logical address that consists of the segment selector
in the CS register and the contents of the EIP register. The EIP register
contains the linear address within the code segment of the next instruction to
be executed. The CS register cannot be loaded explicitly by an application
program. Instead, it is loaded implicitly by instructions or internal processor
operations that change program control (such as, procedure calls, interrupt
handling, or task switching).
The DS, ES, FS, and GS registers point to four data
segments. The availability of four data segments permits efficient and
secure access to different types of data structures. For example, four separate
data segments might be created: one for the data structures of the current
module, another for the data exported from a higher-level module, a third for a
dynamically created data structure, and a fourth for data shared with another
program. To access additional data segments, the application program must load
segment selectors for these segments into the DS, ES, FS, and GS registers, as
needed.
The SS register contains the segment selector for a stack
segment, where the procedure stack is stored for the program, task, or
handler currently being executed. All stack operations use the SS register to
find the stack segment. Unlike the CS register, the SS register can be loaded
explicitly, which permits application programs to set up multiple stacks and
switch among them.
The four segment registers CS, DS, SS, and ES are the same as the segment registers found in the Intel 8086 and Intel 286 processors; the FS and GS registers were introduced into the Intel Architecture with the Intel386 family of processors.
EFLAGS Register
The 32-bit EFLAGS register contains a group of status flags,
a control flag, and a group of system flags. Figure 3-7 defines the flags
within this register. Following initialization of the processor (either by
asserting the RESET pin or the INIT pin), the state of the EFLAGS register is
00000002H. Bits 1, 3, 5, 15, and 22 through 31 of this register are reserved.
Software should not use or depend on the states of any of these bits.
Some of the flags in the EFLAGS register can be modified
directly, using special-purpose instructions (described in the following
sections). There are no instructions that allow the whole register to be
examined or modified directly. However, the following instructions can be used
to move groups of flags to and from the procedure stack or the EAX register:
LAHF, SAHF, PUSHF, PUSHFD, POPF, and POPFD. After the contents of the EFLAGS
register have been transferred to the procedure stack or EAX register, the
flags can be examined and modified using the processor’s bit manipulation
instructions (BT, BTS, BTR, and BTC).
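As a small illustration of the PUSHF/POP route described above, the following sketch (assuming a GCC-compatible compiler on an x86 target; on a 64-bit build the instruction actually pushes RFLAGS, whose low 32 bits are EFLAGS) copies the flags into a general register so they can be examined with ordinary bit tests:

    #include <stdio.h>

    int main(void) {
        unsigned long flags;
        /* PUSHF pushes the flags register onto the procedure stack;
           POP then transfers it into a general-purpose register. */
        __asm__ volatile ("pushf\n\t"
                          "pop %0"
                          : "=r" (flags));
        printf("EFLAGS = %08lXH\n", flags & 0xFFFFFFFFUL);
        printf("Carry flag (bit 0) = %lu\n", flags & 1UL);
        return 0;
    }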
When suspending a task (using the processor’s multitasking
facilities), the processor automatically saves the state of the EFLAGS register
in the task state segment (TSS) for the task being suspended. When switching to a new task, the processor loads the EFLAGS register with data from the new task’s TSS.
When a call is made to an interrupt or exception handler
procedure, the processor automatically saves the state of the EFLAGS register
on the procedure stack. When an interrupt or exception is handled with a task
switch, the state of the EFLAGS register is saved in the TSS for the task being
suspended.
Instruction Pointer
The instruction pointer (EIP) register contains the offset
in the current code segment for the next instruction to be executed. It is
advanced from one instruction boundary to the next in straight-line code, or it is moved ahead or backwards by a number of instructions when executing JMP, Jcc,
CALL, RET, and IRET instructions.
The EIP register cannot be accessed directly by software; it
is controlled implicitly by control-transfer instructions (such as JMP, Jcc,
CALL, and RET), interrupts, and exceptions. The only way to read the EIP
register is to execute a CALL instruction and then read the value of the return
instruction pointer from the procedure stack. The EIP register can be loaded
indirectly by modifying the value of a return instruction pointer on the
procedure stack and executing a return instruction (RET or IRET).
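A minimal sketch of that CALL-based technique appears below (assuming a GCC-style compiler on an x86 target; on x86-64 the value recovered is RIP rather than EIP, and GCC's __builtin_return_address offers a portable alternative):

    #include <stdio.h>

    /* Returns the address of the instruction following the CALL by popping
       the return instruction pointer that CALL pushed onto the procedure stack. */
    static void *current_instruction_pointer(void) {
        void *ip;
        __asm__ volatile ("call 1f\n"
                          "1: pop %0"
                          : "=r" (ip));
        return ip;
    }

    int main(void) {
        printf("Executing near address %p\n", current_instruction_pointer());
        return 0;
    }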
All Intel Architecture processors prefetch instructions.
Because of instruction prefetching, an instruction address read from the bus
during an instruction load does not match the value in the EIP register. Even
though different processor generations use different prefetching mechanisms,
the function of the EIP register to direct program flow remains fully compatible
with all software written to run on Intel Architecture processors.
Operand-size and Address-size Attributes
When the processor is executing in protected mode, every code
segment has a default operand-size attribute and address-size attribute. These
attributes are selected with the D (default size) flag in the segment
descriptor for the code segment. When the D flag is set the 32-bit operand-size
and address-size attributes are selected; when the flag is clear, the 16-bit
size attributes are selected. When the processor is executing in real-address mode,
virtual-8086 mode, or SMM, the default operand-size and address-size attributes
are always 16 bits.
The operand-size attribute selects the sizes of operands
that instructions operate on. When the 16-bit operand-size attribute is in
force, operands can generally be either 8 bits or 16 bits, and when the 32-bit
operand-size attribute is in force, operands can generally be 8 bits or 32
bits. The address-size attribute selects the sizes of addresses used to address
memory: 16 bits or 32 bits. When the 16-bit address-size attribute is in force, segment offsets and displacements are 16 bits. This restriction limits the size of a segment that can be addressed to 64 KBytes. When the 32-bit address-size attribute is in force, segment offsets and displacements are 32 bits, allowing segments of up to 4 GBytes to be addressed. The default operand-size attribute
and/or address-size attribute can be overridden for a particular instruction by
adding an operand-size and/or address-size prefix to an instruction. The effect
of this prefix applies only to the instruction it is attached to.
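As a concrete (hypothetical) illustration of the operand-size prefix, the byte sequences below show the same B8H opcode in a code segment whose default operand size is 32 bits; prefixing it with 66H switches that single instruction to a 16-bit operand:

    #include <stdio.h>

    /* Assumed: 32-bit code segment (D flag set). Opcode B8H is MOV eAX, imm;
       the 66H operand-size prefix overrides the 32-bit default for one instruction. */
    static const unsigned char mov_eax_imm32[] = { 0xB8, 0x34, 0x12, 0x00, 0x00 }; /* mov eax, 0x1234 */
    static const unsigned char mov_ax_imm16[]  = { 0x66, 0xB8, 0x34, 0x12 };       /* mov ax,  0x1234 */

    int main(void) {
        printf("32-bit form: %zu bytes, 16-bit form: %zu bytes\n",
               sizeof mov_eax_imm32, sizeof mov_ax_imm16);
        return 0;
    }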
Pentium II
The Pentium II incorporates many of the salient features of
the Pentium Pro and Pentium MMX; however, its physical package was based on the
SECC/Slot 1 interface and its 512 KB L2 cache ran at only half the processor
internal clock rate. First-generation Pentium II Klamath CPUs operated at 233, 266, 300 and 333 MHz with an FSB of 66 MHz and a core voltage of 2.8 Volts. In 1998, Intel introduced the Pentium II Deschutes, which operated at 350, 400 and 450 MHz with a 100 MHz (and, in some models, 66 MHz) FSB and 2.0 Volts at the core. Its major improvements were:
- 16 KB L1 instruction and data caches
- L2 cache with non-proprietary, commercially available SRAM
- Improved 16-bit capability through segment register caches
- MMX unit.
- Standard Pentium II could only be used in dual multiprocessor configurations; however, Pentium II Xeon CPUs had up to 2 MB of L2 cache and could be used in multiprocessor configurations of up to 4 processors.
Celeron
The Celeron began as a scaled down version of the Pentium II
and was designed to compete against similar offerings from Intel’s competitors.
The Klamath-based Covington core ran at 266 and 300 MHz and was constructed without an L2 cache. However, adverse market reaction saw the Deschutes-based Mendocino core introduced with a 128 KB on-die L2 cache, running at 300, 333, 400, 433, 466, 500 and 533 MHz. Celerons have the same L1 cache as their bigger brothers—the Pentium II and III. The important distinction is that the L2 cache operates at the full CPU clock rate, unlike the Pentium II and the SECC-packaged Pentium III. (Later variants of the Pentium III had an on-die L2 cache which ran at the full CPU clock rate.) The Celeron III (Coppermine128 core) has the same internal features as the Pentium III, but has reduced functionality: a 66 MHz FSB, no error correction code for the data bus, no parity for the address bus, and a maximum of 4 GB of address space. Celeron III Coppermine128s with a 1.6 V core and a 100 MHz FSB were produced in 2001 and operated at core speeds of up to 1.1 GHz. Tualatin-core Celerons were put on the market in late 2001 and ran at 1.2 GHz. 2002 saw the final versions produced, running at 1.3 and 1.4 GHz.
Pentium III
The only significant difference between the Pentium III and its predecessor was the inclusion of 72 new instructions, known as the Internet Streaming Single Instruction Multiple Data Extensions (ISSE); they include integer and floating point operations. However, like the original MMX instructions, application programmers must explicitly use these extensions if any benefit is to be gained from them. The most controversial and short-lived addition was the CPU ID number, which could be used for software licensing and e-commerce. After protest from various sources, Intel disabled it by default, but did not remove it. Depending on the BIOS and motherboard manufacturer, it may remain disabled, but it can be enabled via the BIOS. The three variants of the Pentium III were the Katmai, Coppermine, and Tualatin. The Katmai introduced the ISSE (MMX/2) described above, with an FSB of 100 MHz. The Coppermine also introduced the Advanced Transfer Cache (ATC) for the L2 cache, which reduced cache capacity to 256 KB but saw the cache run at full processor speed. Also, the 64-bit Katmai cache bus was quadrupled to 256 bits. The Coppermine also uses an 8-way set associative cache, rather than the 4-way set associative cache in the Katmai and older Pentiums. Bringing the cache on-die also increased the transistor count to 30 million, from the 10 million on the Katmai. Another advance in the Coppermine was Advanced System Buffering (ASB), which simply increased the number of buffers to account for the increased FSB speed of 133 MHz. The Pentium III Tualatin had a reduced die size that allowed it to run at higher speeds. Tualatins use a 133 MHz FSB and have ATC and ASB.
Pentium IV: The Next Generation
The release of the Pentium IV in 2000 heralded the seventh generation of Intel microprocessors. The release was premature, however, because the Pentium III Coppermine, with its 1 GHz performance threshold, was being outperformed by Intel's major competitor in the microprocessor market, the AMD Athlon. Intel was not ready to answer the competition through the early release of the next member of its Pentium III family, the Pentium III Tualatin, which was designed to break the 1 GHz barrier. Previous attempts to do so with the 1.13 GHz Pentium III Coppermine had met with failure due to design flaws. Paradoxically, however, Intel was in a position to release the first of the Pentium IV family, the Willamette, which ran at 1.3, 1.4 and 1.5 GHz, using an FC-PGA package on the short-lived Socket 423, which was a design dead end for motherboard manufacturers and consumers. Worse still, the only Intel chipset available for the Pentium IV could only accommodate the highly expensive Rambus DRAM. In addition, the early versions of the Pentium IV CPU were outperformed by slower-clocked AMD Athlons. Nevertheless, the core capability of Intel's seventh generation processors is that they can run at ever-higher speeds. For example, Intel's sixth generation Pentiums began at 150 MHz with the Pentium Pro and ended at over 1.2 GHz, a roughly tenfold increase. The bottom line here is that Intel's seventh generation chips could end up running at speeds of 10 GHz or more. How has Intel achieved this? Through a radical redesign of the Pentium's core architecture. The following sections illustrate the major advances.
The most visible feature of the new Pentium IV is the Front Side Bus (FSB), which initially operated at an effective speed of 400 MHz, as compared with the 100 MHz (later 133 MHz) FSB of the Pentium III. The Pentium III has a 64-bit data bus that delivered a data throughput of approximately 1.066 GB/s (8 bytes x 133 MHz). The Pentium IV FSB is also 64 bits wide; however, its 100 MHz bus clock is 'quad-pumped', giving an effective bus speed of 400 MHz and a data transfer rate of 3.2 GB/s. The newer (as of late 2002) Pentium IV chipsets operate at a 133 MHz bus clock and deliver an effective bus speed of 533 MHz and a data transfer rate of about 4.2 GB/s. Thus, the Pentium IV exchanges data with the i845 and i850 chipsets faster than any other processor, thus removing the Pentium III's most significant bottleneck. Intel's 850 chipset for the Pentium IV uses two Rambus channels to 2-4 RDRAM RIMMs. Together, these two RDRAM channels are able to deliver the same data bandwidth as the Pentium IV FSB. As the later discussion on DRAM indicates, similar transfer rates are delivered using the i845 chipset and DDR DRAM. This configuration enables Pentium IV systems to have the highest data transfer rates between processor, system and main memory, which is a clear benefit.
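The bandwidth figures quoted above follow from simple arithmetic: bytes per second = (bus width in bytes) x (bus clock) x (transfers per clock). A small sketch of that calculation (the helper function here is ours, not Intel's):

    #include <stdio.h>

    /* FSB bandwidth in GB/s (decimal): width in bits, clock in MHz,
       and the number of transfers per clock ('pumping'). */
    static double fsb_bandwidth_gb(double width_bits, double clock_mhz, double transfers_per_clock) {
        return (width_bits / 8.0) * clock_mhz * transfers_per_clock / 1000.0;
    }

    int main(void) {
        printf("Pentium III, 133 MHz x 1: %.3f GB/s\n", fsb_bandwidth_gb(64, 133, 1)); /* ~1.06 */
        printf("Pentium IV,  100 MHz x 4: %.3f GB/s\n", fsb_bandwidth_gb(64, 100, 4)); /* 3.200 */
        printf("Pentium IV,  133 MHz x 4: %.3f GB/s\n", fsb_bandwidth_gb(64, 133, 4)); /* ~4.26 */
        return 0;
    }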
Advanced Transfer Cache
The first major improvement is the integration of the L2 cache and the evolution of the Advanced Transfer Cache introduced in the Pentium III Coppermine, which had just 256 KB of L2 cache. The first Pentium IV, the Willamette, had a similar-sized cache, but could transfer data into the CPU's core logic at 48 GB per second at a CPU clock speed of 1.5 GHz. In comparison, the Coppermine could only transfer 16 GB/s at 1 GHz to its L1 Instruction Cache. Note also that the Front Side Bus speed of the Pentium III was 133 MHz, while the Pentium IV Willamette had an FSB speed of 400 MHz. In addition, the Pentium IV L2 cache has 128-byte cache lines, which are divided into two 64-byte segments. When the Pentium IV fetches data from RAM, it does so in 64-byte burst transfers. If just four bytes (32 bits) are required, this block transfer becomes inefficient. However, the cache has advanced Data Prefetch Logic that predicts the data required and loads it into the L2 cache in advance. The Pentium IV's hardware prefetch logic significantly accelerates the execution of processes that operate on large data arrays. The read latency (the time it takes the cache to transfer data into the pipeline) of the Pentium IV's L2 cache is 7 clock cycles. However, its connection to the core logic (the Translation Lookaside Buffer in this case; there is no I-Cache in the Pentium IV) is 256 bits wide and clocked at the full processor speed. The second member of the Pentium IV family was the Northwood, which had a 512 KB L2 cache running at the processor's clock speed.
L1 Data Cache
The second major development in cache technology is that the Pentium IV has only one L1 cache, an 8 KB data cache. In place of the L1 instruction cache (I-Cache) in the 6th generation Pentiums, it has a much more efficient Execution Trace Cache.
Intel reduced the size of the L1 data cache to enable a very low latency of only 2 clock cycles. This results in an overall read latency (the time it takes to read data from cache memory) of less than half that of the Pentium III's L1 data cache.
7th Generation NetBurst Micro-Architecture
Intel’s NetBurst Micro-Architecture provides a firm
foundation for future advances in processor performance, particularly where
speed of operation is concerned. The NetBurst micro-architecture has four major components:
Hyper Pipelined Technology, Rapid Execution Engine, Execution Trace Cache and a
400 MHz system bus. Also incorporated are four significant improvements over
sixth generation architecture: Advanced Dynamic Execution, Advanced Transfer
Cache, Enhanced Floating Point & Multimedia Unit, and Streaming SIMD
Extensions 2.
Hyper Pipelined Technology
The traditional approach to increasing a CPU's clock speed was to make smaller processors by shrinking the die. An alternative strategy, evident in RISC processors, is to have the CPU do less per clock cycle and execute more clock cycles per second. To do this in a CISC-based processor, Intel simply increased the number of stages in the processor's pipeline. The upshot of this is that less is accomplished per clock cycle. This is akin to a 'bucket brigade' passing smaller buckets rapidly down a chain, rather than larger buckets at a slower rate. For example, the U and V integer pipelines in the original Pentium each had just five stages: instruction fetch, decode 1, decode 2, execute and write-back. The Pentium Pro introduced the P6 architecture with a pipeline consisting of 10 stages. The P7 NetBurst micro-architecture in the Pentium IV increased the number of stages to 20. Intel terms this its Hyper Pipelined Technology.
Enhanced Branch Prediction
The key to pipeline efficiency and operation is effective branch prediction, hence the much improved branch prediction logic in the Pentium IV's Advanced Dynamic Execution Engine (ADE). The Pentium IV's branch prediction logic delivers a 33% improvement in prediction efficiency over that of the Pentium III. The Pentium IV also contains a dedicated 4 KB Branch Target Buffer. When a processor's branch prediction logic predicts the flow of operation correctly, no changes need to be made to the code in the pipeline. However, when an incorrect prediction is made, the contents of the pipeline must be flushed and a new instruction cycle must begin at the start of the pipeline. 6th generation processors, with their 10-stage pipeline, suffer a lower overhead penalty for a mispredicted branch than the Pentium IV with its 20-stage pipeline. The longer the pipeline, the further back in a process's instruction execution path the processor needs to go in order to correct mispredicted branches. One critical element in overcoming problems with mispredicted branches is the Execution Trace Cache.
Execution Trace Cache
The Pentium IV's Execution Trace Cache is simply a 12K micro-op L1 instruction cache that sits between the decoders and the Rapid Execution Engine. The cache stores the microcode (micro-ops) of decoded complex instructions, especially those in a program loop, and minimises the wait time of the execution engine.
Rapid Execution Engine
The major advance in the Pentium IV's execution unit is that its two Arithmetic Logic Units operate at twice the CPU clock rate. This means that the 1.5 GHz Pentium IV had ALUs running at 3 GHz: the ALUs are effectively 'double pumped'. The Floating Point Unit has no such feature. Why the difference? Intel had to double pump the ALUs in order to deliver integer performance that was at least equal to that of a lower-clocked Pentium III. Why? Because of the length of the Pentium IV's 20-stage pipeline, and to ensure that any hit caused by poor branch prediction could be made up for by faster execution of microcode. The benefit here is that, as the Pentium IV's clock speed increases, the ALUs continue to run at twice that speed, so the processor's integer performance scales accordingly.
Enhanced Floating Point Processor
The Pentium IV has 128-bit floating point registers (up from the 80-bit registers in the 6th generation Pentiums) and a dedicated register for data movement. This enhances floating point operations, which are not prone to the same type of branch prediction inefficiencies as integer-based instructions.
Streaming SIMD Extensions 2
SSE2 is the follow-up to Intel's Streaming SIMD (Single Instruction Multiple Data) Extensions (SSE). SIMD is a technology that allows a single instruction to be applied to multiple datasets at the same time. This is especially useful when processing 3D graphics. SIMD-FP (Floating Point) extensions help speed up graphics processing by taking the multiplication, addition and reciprocal functions and applying them to multiple datasets simultaneously. Recall that SIMD first appeared with the Pentium MMX, which incorporated 57 MMX instructions; these are essentially SIMD-Int (integer) instructions. Intel first introduced SIMD-FP extensions in the Pentium III with the 72 Streaming SIMD Extensions (SSE). Intel introduced 144 new instructions in the Pentium IV that enable it to handle two 64-bit SIMD-INT operations and two double-precision 64-bit SIMD-FP operations. This is in contrast to the two 32-bit operations the Pentium MMX and III (under SSE) handle. The major benefit of SSE2 is greater performance, particularly with SIMD-FP instructions, as it increases the processor's ability to handle higher-precision floating point calculations. As with MMX and SSE, these instructions require software support, as illustrated in the sketch below.
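For example, with a compiler that exposes Intel's SSE2 intrinsics (a sketch assuming GCC or a compatible compiler, built with -msse2 on 32-bit targets), a single ADDPD instruction adds two pairs of double-precision values held in 128-bit XMM registers:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* Pack two 64-bit doubles into each 128-bit XMM register;
           _mm_add_pd performs both additions with a single instruction. */
        __m128d a = _mm_set_pd(3.0, 1.5);   /* high, low */
        __m128d b = _mm_set_pd(2.0, 0.5);
        __m128d sum = _mm_add_pd(a, b);

        double out[2];
        _mm_storeu_pd(out, sum);
        printf("%.1f %.1f\n", out[0], out[1]);  /* 2.0 5.0 */
        return 0;
    }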
Celeron IV
The Celeron IV first appeared in 2002. These CPUs were based on the Pentium IV and could be accommodated on Socket 478 motherboards. In the first models, based on the Willamette, the L2 cache was halved to 128 KB and the core ran at 1.7 GHz. Later models ran at 1.8, 1.9 and 2 GHz. The next member was based on the Northwood and had a 256 KB L2 cache. Paired with the i845 chipset, the new Celerons are now good-value entry-level processors.
Additional Resources
The following diagrams of the Pentium III, IV and AMD Athlon CPUs are provided to highlight the architectural features of these microprocessors and enhance the foregoing text. The figures have been obtained from Tom's Hardware Guide (NOT this Tom); further insights into the Intel architectures may be found at: http://www6.tomshardware.com/cpu/20001120/index.html.
[1] 16-bit applications operate in real mode on all Intel CPUs. This effectively limits the address space to 1 MB, using 16 x 64 KB program segments. Each 16-bit application can only address 64 KB (2^16 = 65,536 locations) within a segment; however, the CPU manages and uses an extra 4 address bits/lines to provide the BIOS, OS and applications with 16 (2^4 = 16) segment addresses.
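A small sketch of the real-mode address arithmetic described in this note (the helper function is illustrative only): the physical address is formed by shifting the 16-bit segment value left by four bits and adding the 16-bit offset, yielding a 20-bit (1 MB) address space.

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode physical address = (segment << 4) + offset. */
    static uint32_t real_mode_address(uint16_t segment, uint16_t offset) {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void) {
        /* e.g. segment 1234H, offset 0010H -> physical address 12350H */
        printf("%05X\n", real_mode_address(0x1234, 0x0010));
        return 0;
    }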