School of Computing and Mathematical Sciences, De Montfort University, Leicester, 24 September 1995
Modern processors, including microprocessors, use instruction pipelining and cache memory techniques first used in the large mainframe computers of the 1960's and 1970's (Foster 1976).
A program consists of a sequence of instructions in main memory. Under the control of the Control Unit each instruction is processed in a cyclic sequence called the fetch/execute or instruction cycle:
Fetch Cycle
A machine code instruction is fetched from main memory and moved into the Instruction Register, where it is decoded.
Execute Cycle
The instruction is executed, eg data is transferred from main memory and processed by the ALU.
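The cycle described above can be sketched as a short loop. This is only an illustration: the four opcodes (LOAD, ADD, STORE, HALT), the single accumulator and the `run` function are invented for the sketch and do not correspond to any real instruction set.

```python
# Toy machine: each instruction is an (opcode, operand) pair held in "main memory".
# The opcodes LOAD, ADD, STORE and HALT are hypothetical, chosen for illustration.

def run(program, data):
    """Repeat the fetch/execute cycle until HALT; return the final data memory."""
    pc = 0             # program counter: address of the next instruction
    acc = 0            # accumulator register used by the ALU
    data = list(data)  # work on a copy so the caller's memory is unchanged
    while True:
        # Fetch cycle: move the instruction into the instruction register, decode it.
        ir = program[pc]
        pc += 1
        opcode, operand = ir
        # Execute cycle: carry out the decoded operation.
        if opcode == "LOAD":
            acc = data[operand]
        elif opcode == "ADD":
            acc += data[operand]
        elif opcode == "STORE":
            data[operand] = acc
        elif opcode == "HALT":
            return data
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

# Add the words at data addresses 0 and 1, store the sum at address 2.
program = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0)]
print(run(program, [2, 3, 0]))  # [2, 3, 5]
```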
To speed up the overall operation of the CPU, modern microprocessors employ instruction prefetch or pipelining, which overlaps the execution of one instruction with the fetch of the next or following instructions. For example, the MC68000 uses a two-word (each 16 bits) prefetch mechanism comprising the IR (Instruction Register) and a one-word prefetch queue. When execution of an instruction begins, the machine code operation word and the word following are fetched into the instruction register and one-word prefetch queue respectively. In the case of a multi-word instruction, as each additional word of the instruction is used, a fetch is made to replace it. Thus while execution of an instruction is in progress the next instruction is in the prefetch queue and is immediately available for decoding. Powerful processors make extensive use of pipelining techniques in which extended sequences of instructions are prefetched, with the decoding, address calculation, operand fetch and execution of instructions being performed in parallel (Tanenbaum 1990). In addition, modern processors cater for the pipelining problems associated with conditional branch instructions, etc, eg the MC68040 (Edenfield et al 1990).
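The gain from overlapping fetch and execute can be estimated with a simple timing model. The sketch below assumes a single level of prefetch, no branches and no contention for the bus; the function names and the four-clock timings are illustrative assumptions, not measurements of any particular processor.

```python
def sequential_time(instructions):
    """Total clocks when each instruction is fetched, then executed, in turn."""
    return sum(fetch + execute for fetch, execute in instructions)

def overlapped_time(instructions):
    """Total clocks when the fetch of the next instruction overlaps the
    execution of the current one (one level of prefetch, no branches)."""
    if not instructions:
        return 0
    total = instructions[0][0]  # the first fetch cannot be overlapped
    for i, (_, execute) in enumerate(instructions):
        if i + 1 < len(instructions):
            next_fetch = instructions[i + 1][0]
        else:
            next_fetch = 0      # nothing left to prefetch
        # The stage that takes longer decides when the next step can start.
        total += max(execute, next_fetch)
    return total

# Four instructions, each needing 4 clocks to fetch and 4 clocks to execute.
work = [(4, 4)] * 4
print(sequential_time(work))  # 32
print(overlapped_time(work))  # 20
```

In this model the saving is largest when fetch and execute times are similar. A taken branch would discard the prefetched word, which is the problem the more elaborate pipelines have to cater for.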
There has always been a problem in keeping memory speed comparable with processor speed (Foster 1976, Stallings 1993, Tanenbaum 1990). Increasing processor speed is relatively cheap in comparison to corresponding increases in the speed of the bus and main memory configuration (hence the use of WAIT states to match processors to slower and cheaper memory).
A cache memory makes use of the locality of reference phenomenon already discussed in the section on virtual memory, ie over short periods of time references to both instructions and data tend to cluster. The cache is a fast memory (matched to CPU speed), typically between 4K and 256Kbytes in size, which is logically positioned between the processor and bus/main memory. When the CPU requires a word (instruction or data) a check is made to see if it is in the cache and if so it is delivered to the CPU. If it is not in the cache a block of main memory is fetched into the cache, and it is likely that future memory references will be to other words in the block (typically a hit ratio of 75% or better can be achieved). Clearly, memory writes have to be catered for, as does the replacement of blocks when a new block is to be read in. Modern microprocessors (Intel 80486 and Motorola MC68040) have separate on-chip instruction and data cache memories - additional external caches may also be used, see Fig 2. Cache memory is particularly important in RISC machines, where the one instruction execution per cycle makes heavy demands on main memory.
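The effect of block fetches on the hit ratio can be seen in a small simulation. The text does not say how blocks are placed in the cache; the sketch below assumes the simplest scheme, a direct-mapped cache in which each memory block can occupy only one cache line. The class name, sizes and access pattern are illustrative assumptions, and writes are not modelled.

```python
class DirectMappedCache:
    """Read-only model of a direct-mapped cache (an assumed organisation)."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [None] * num_lines  # block number currently held by each line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up one address; return True on a hit, False on a miss."""
        block = address // self.block_size
        line = block % self.num_lines
        if self.lines[line] == block:
            self.hits += 1
            return True
        # Miss: fetch the whole block, replacing whatever the line held before.
        self.lines[line] = block
        self.misses += 1
        return False

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Sequential references, as when stepping through straight-line code.
cache = DirectMappedCache(num_lines=64, block_size=16)
for address in range(1024):
    cache.access(address)
print(cache.hit_ratio())  # 0.9375: one miss per 16-word block
```

Real programs jump about more than this, so the ratio is lower in practice, but the clustering of references is what keeps it high.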
The concept of a cache has been extended to disk I/O. When a program requests a block or blocks, several more are read into the cache, where they are immediately available for future disk access requests. Disk caches may take two forms:
Software disk cache
in which the operating system or disk driver maintains the cache in main memory, ie using the main CPU of the system to carry out the caching operations.
Hardware disk cache
in which the disk interface contains its own cache RAM memory (typically 4 to 16Mbytes) and control circuits, ie the disk cache is independent of the main CPU.
Hardware disk caches are more effective but require a more complex (and expensive) disk controller and tend to be used with fast disks in I/O bound applications, eg databases.
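A software disk cache with read-ahead can be sketched in a few lines. The class below is a hypothetical illustration: `read_block` stands in for whatever routine the disk driver really provides, the read-ahead depth of four blocks is arbitrary, and no replacement policy is included, so the cache simply grows.

```python
class ReadAheadDiskCache:
    """Keeps disk blocks in main memory and reads ahead on a miss."""

    def __init__(self, read_block, read_ahead=4):
        self.read_block = read_block  # function: block number -> block contents
        self.read_ahead = read_ahead  # extra blocks fetched after the one requested
        self.cache = {}               # block number -> contents (unbounded here)
        self.disk_requests = 0        # number of times the disk had to be accessed

    def get(self, block_no):
        if block_no not in self.cache:
            # Miss: one disk request brings in the block and those following it.
            self.disk_requests += 1
            for n in range(block_no, block_no + self.read_ahead + 1):
                self.cache[n] = self.read_block(n)
        return self.cache[block_no]

# A stand-in for the real disk: every block is 512 copies of its own number.
def fake_disk(block_no):
    return bytes([block_no % 256]) * 512

disk_cache = ReadAheadDiskCache(fake_disk, read_ahead=4)
for block_no in range(10):      # a program reading a file sequentially
    disk_cache.get(block_no)
print(disk_cache.disk_requests)  # 2: blocks 0-4, then blocks 5-9
```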

4.3 Example Processor Evolution: Intel and Motorola Microprocessors
The Motorola MC68000 family has evolved considerably since the introduction of the MC68000 in 1979 (the Intel 8086 family has evolved along similar lines - see Fig. 3):
MC68000 - 1979
NMOS technology, approximately 68000 transistors. 16-bit data bus, 24-bit address bus (maximum 16 Mbyte memory)
2 word prefetch queue (including IR)
approximately 0.6 Mips at 8MHz
MC68008 - 1982
NMOS technology - from a programmer's viewpoint almost identical to 68000
8-bit data bus, 20-bit address bus (maximum 1Mbyte memory)
approximately 0.5 Mips at 8MHz
MC68010 - 1982
as 68000 with the following enhancements:
three word prefetch queue (tightly looped software runs in 'loop mode')
memory management support (for virtual memory)
approximately 0.65 Mips at 8MHz
MC68020 - 1984
CMOS technology with 200000 transistors
true 32-bit processor with 32-bit data and address busses (4 Gbyte address space)
extra instructions and addressing modes
three clock bus cycles (68000 bus cycles take four clock cycles)
extended instruction pipeline
on-chip 256 byte instruction cache
co-processor interface, eg for MC68881 floating-point co-processor
approximately 2.2 Mips at 16MHz
MC68030 - 1987
300000 transistors
extended pipelining
256 byte on-chip instruction cache and 256 byte on-chip data cache
on-chip memory management unit
approximately 5.0 Mips at 16MHz
MC68040 - 1989
1200000 transistors
4Kbyte on-chip instruction cache and 4Kbyte on-chip data cache
on-chip memory management unit and floating point processor
pipelined integer and floating point execution units operating concurrently
approximately 22.0 Mips at 25MHz
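The figures above show that performance has grown much faster than clock rate. Dividing the clock rate by the instruction rate gives a rough average number of clock cycles per instruction. The sketch below does this arithmetic using only the approximate Mips and MHz values quoted in the list, so the results are indicative at best; the function name is chosen for illustration.

```python
# (name, approximate Mips, clock in MHz) as quoted in the list above
FAMILY = [
    ("MC68000", 0.6, 8),
    ("MC68008", 0.5, 8),
    ("MC68010", 0.65, 8),
    ("MC68020", 2.2, 16),
    ("MC68030", 5.0, 16),
    ("MC68040", 22.0, 25),
]

def cycles_per_instruction(mips, clock_mhz):
    """Average clock cycles per instruction = clock rate / instruction rate."""
    return clock_mhz / mips

for name, mips, clock_mhz in FAMILY:
    cpi = cycles_per_instruction(mips, clock_mhz)
    print(f"{name}: about {cpi:.1f} clock cycles per instruction")
```

On these figures the average falls from roughly 13 clock cycles per instruction on the MC68000 to just over one on the MC68040, which reflects the pipelining, caches and faster bus cycles listed above rather than the clock rate alone.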
4.4 CISC and RISC processors (Stallings 1993, Wilson 1989, Tanenbaum 1990)
During the 1970's and 1980's, as the size of silicon wafers increased and circuit elements shrank, the architecture of processors became more and more complex. In an attempt to close the semantic gap between high-level language operations and processor instructions, more and more powerful and complex instructions and addressing modes were implemented. As microprocessors evolved this continued until many of today's advanced microprocessors (eg Intel 80486, Motorola 68040) have hundreds of instructions and tens of addressing modes. This type of processor architecture is called a complex instruction set computer or CISC. There are a number of drawbacks with this approach:
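| Statement | SAL | XPL | Fortran | C | Pascal | Average |
|---|---|---|---|---|---|---|
| Assignment | 47 | 55 | 51 | 38 | 45 | 47 |
| IF | 17 | 17 | 10 | 43 | 29 | 23 |
| CALL | 25 | 17 | 5 | 12 | 15 | 15 |
| LOOP | 6 | 5 | 9 | 3 | 5 | 6 |
| GOTO | 0 | 1 | 9 | 3 | 0 | 3 |
| other | 5 | 5 | 16 | 1 | 6 | 7 |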
An alternative approach to processor architecture evolved, called the reduced instruction set computer or RISC. The number of instructions was reduced by an order of magnitude and the space created used for more processor registers (a CISC machine typically has 20 registers, a RISC machine 500) and large on-chip cache memories. All data manipulation is carried out on data stored in registers within the processor; only LOAD and STORE instructions move data between main memory and registers (RISC machines do not allow direct manipulation of data in main memory). There are a number of advantages to this approach:
The disadvantages are:
Until recently (Wilson 1989) there was no out and out winner
with RISC and CISC machines of similar price giving similar overall
performance. However, problems have arisen with the latest generations
of CISC microprocessors which incorporate sophisticated on-chip
instruction pipelines, memory management units, large instruction
and data caches, floating point units, etc. As clock speeds were
increased (to improve performance) severe problems occurred in
maintaining reliable production runs with commercially available
machines appearing up to a year after the announcement of the
microprocessor concerned. An interesting pointer to the current
trend towards RISC technology is that all the latest high performance
workstations are RISC based (in some cases replacing CISC models),
eg IBM 6000, DEC 5000, Hewlett Packard 9000/700.
| CPU | Transistors | Design (person-months) | Layout (person-months) |
|---|---|---|---|
| RISC I | 44,000 | 15 | 12 |
| RISC II | 41,000 | 18 | 12 |
| MC68000 | 68,000 | 100 | 70 |
| Z8000 | 18,000 | 60 | 70 |
| Intel iAPX 432 | 110,000 | 170 | 90 |
A new approach to processor architecture is called Very Long Instruction Word (VLIW).
4.5 Special Purpose Processors, Multi-processors, etc.
Adding extra processors can significantly enhance the overall performance of a system by allowing tasks to be performed by specialised hardware and/or in parallel with 'normal' processing.
4.5.1 Special Purpose Processors.
The use of specialised processors to perform specific functions was implemented in the large mainframe computer systems of the 1970's, eg the PPU's (peripheral processing units) of the CDC 6600 (Foster 1978). Today's high performance systems may contain a number of specialised processors:
Floating point co-processor
to carry out real number calculations.
Graphics processor
to control the graphics display. This can range from a fairly simple graphics controller chip which provides basic text, pixel and line drawing capabilities up to specialised processors which support advanced graphics standards such as X windows.
Input/Output control processors
which carry out complex I/O tasks without the intervention of the CPU, eg network, disk, intelligent terminal I/O, etc. For example, consider a sophisticated network where the network communications and protocols are handled by a dedicated processor (sometimes the network processor and associated circuits are more powerful and complex than the main CPU of the system).
In a 'simple' system all the above tasks would be carried out by sequences of instructions executed by the CPU. Implementing functions in specialised hardware has the following advantages which enhance overall system performance:
(a) the specialised hardware can execute functions much faster than the equivalent instruction sequence executed by the general purpose CPU; and
(b) it is often possible for the CPU to do other processing while a specialist processor is carrying out a function (at the request of the CPU), eg overlapping a floating point calculation with the execution of further instructions by the CPU (assuming the further instructions are not dependent upon the result of the floating point calculation).
4.5.2 Multi-processors and Parallel Processors
The stored program computer design associated with John von Neumann (Foster 1978, Tanenbaum 1990), of which EDSAC (1949) was one of the first working examples, has a single CPU which sends sequential requests over a bus to memory for instructions and data. The vast majority of computer systems (CISC and RISC) built since that time are essentially developments of the basic von Neumann machine.
One of the major limitations when increasing processor clock rate is the speed, approximately 20cm/nsec, at which the electrical signals travel around the system. Therefore to build a computer with 1nsec instruction timing, signals must travel less than 20cm to and from memory. Attempting to reduce signal path lengths by making systems very compact leads to cooling problems which require large mainframes and supercomputers to have complex cooling systems (often the downtime of such systems is not caused by failure of the computer but by a fault in the cooling system). In addition, many of the latest 32-bit microprocessors have experienced over-heating problems. It therefore becomes harder and harder to make single processor systems go faster and an alternative is to have a number of slower CPUs working together. In general modern computer systems can be categorised as follows:
The von Neumann machine is an SISD architecture in which some parallel processing is possible using pipelining and co-processors.
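The signal speed limit mentioned earlier in this section is easy to check with a little arithmetic. The sketch below takes the 20cm/nsec figure quoted in the text and works out how far a signal can travel in one clock period; the function name is chosen for illustration and the model ignores gate and memory delays, which in practice use up most of the period.

```python
SIGNAL_SPEED_CM_PER_NS = 20.0  # approximate figure quoted in the text

def distance_per_clock_cm(clock_mhz):
    """Distance a signal can travel during one clock period, in cm."""
    period_ns = 1000.0 / clock_mhz  # 1 MHz corresponds to a 1000 ns period
    return SIGNAL_SPEED_CM_PER_NS * period_ns

print(distance_per_clock_cm(25))    # 800.0 cm at 25MHz
print(distance_per_clock_cm(1000))  # 20.0 cm at 1000MHz, ie a 1nsec period
```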
4.5.2.1 Data parallel processing
In data parallel processing one operation acts in parallel on different data. In the SIMD (single-instruction multiple-data) architecture, for example, a control unit issues the same instruction to a number of identical processing elements or PEs. Such an architecture is useful in specialised applications where a sequence of instructions is to be applied to a regular data structure. Image processing applications (from pattern recognition to flight simulators), for instance, require sequences of operations to be applied to all pixels (picture elements) of an image, which may be done pixel by pixel in a single processor system or in parallel in a SIMD system. Many complex application areas (aerodynamics, seismology, meteorology) require high precision floating point operations to be carried out on large arrays of data. 'Supercomputers' designed to tackle such applications are typically capable of hundreds of millions of floating point operations per second.
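The difference between the pixel by pixel and SIMD views of the same job can be sketched as follows. The threshold operation, the tiny image and the function names are invented for the illustration. Note that standard Python still executes the second version serially; it only models the idea of one instruction being broadcast to every processing element.

```python
def threshold(pixel, level=128):
    """The single operation to be applied to every pixel."""
    return 255 if pixel >= level else 0

image = [
    [12, 200, 130],
    [90, 128, 255],
]

# Single processor view: visit the pixels one at a time.
def process_sequentially(image):
    result = []
    for row in image:
        new_row = []
        for pixel in row:
            new_row.append(threshold(pixel))
        result.append(new_row)
    return result

# SIMD view: the control unit broadcasts one operation and every
# processing element applies it to the pixel it holds.
def simd_broadcast(operation, image):
    return [list(map(operation, row)) for row in image]

print(process_sequentially(image))      # [[0, 255, 255], [0, 255, 255]]
print(simd_broadcast(threshold, image)) # same result
```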
4.5.2.2 Control parallel processing
Data parallel processing is applicable to a limited range of applications where a sequence of instructions is applied to a data structure. General purpose computing, however, requires multiple instruction multiple data processing. Such an environment is called control parallel processing in which different instructions act on different data in parallel.
The MIMD (multiple-instruction multiple-data) architecture is one in which multiple processors autonomously execute different instructions on different data. For example:
Multi-processing
in which a set of processors (eg in a large mini or mainframe system) share common main memory and are under the integrated control of an operating system, eg the operating system would schedule different programs to execute on different processors.
Parallel processing
in which a set of processors cooperatively work on one task in parallel. The executable code for such a system can either be generated by:
(a) submitting 'normal' programs to a compiler which can recognise parallelism (if any) and generate the appropriate code for different processors; or
(b) programmers working in a language which allows the specification of sequences of parallel operations (not easy - the majority of programmers have difficulty designing, implementing and debugging programs for a single processor computer).
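Both forms of MIMD working can be sketched with a pool of workers. This is an illustration only: it uses Python's standard `concurrent.futures` thread pool to stand in for a set of processors, the function names and data are invented, and in the standard CPython interpreter threads do not actually run compute-bound code simultaneously.

```python
from concurrent.futures import ThreadPoolExecutor

def total(values):
    return sum(values)

def longest(words):
    return max(words, key=len)

# Multi-processing style: different instructions acting on different data.
def run_independent_jobs():
    with ThreadPoolExecutor(max_workers=2) as pool:
        job1 = pool.submit(total, [1, 2, 3, 4])
        job2 = pool.submit(longest, ["bus", "cache", "pipeline"])
        return job1.result(), job2.result()

# Parallel processing style: several workers cooperate on one task.
def parallel_sum(values, workers=4):
    values = list(values)
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sum, chunks)   # each worker sums one chunk
        return sum(partial_sums)               # combine the partial results

print(run_independent_jobs())        # (10, 'pipeline')
print(parallel_sum(range(1, 101)))   # 5050
```

Even in this small example the programmer has to decide how to divide the data and how to combine the partial results, which hints at why option (b) is not easy.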
