C SC 340 Lecture 8: Main Memory


The overall goal

Every fetch/execute cycle requires instructions/data to be transferred to/from main memory. The goal of memory management is to satisfy these requests as efficiently as possible in a multiprogramming environment. Our initial focus is on mechanism over policy: the CPU scheduler decides which process will run next; the memory manager ensures that process is loaded in main memory.

Mapping logical to physical addresses

We start with three assumptions:
  1. a process' instructions are in binary executable format
  2. the memory unit will at any given time contain instructions and data from more than one process
  3. a process can be loaded into any available memory location

This is reasonable: even if only one process is currently running, a quick context switch to a "ready" process cannot occur unless at least a portion of that process is already contained in memory. And if there are multiple processes, the OS will not be able to guarantee them a "reserved" place in memory.

Given those assumptions, address references contained in the binary executable code will in general not match the target physical addresses when the process runs. The address references in the program are thus logical addresses, not physical addresses. In order for the program to run correctly, logical addresses must be mapped, or bound, to physical addresses at some point. Logical addresses are also known as virtual addresses.

Historically, address binding has been available at any of three steps:
  1. compile time: if the memory location is known in advance, absolute code can be generated; the program must be recompiled if it is ever to be loaded anywhere else
  2. load time: the compiler generates relocatable code, and binding is performed when the program is loaded into memory
  3. execution time: binding is delayed until the process actually runs, so the process can be moved during execution; this requires hardware support

Computers and OSs in the modern era use execution time binding almost exclusively. Exceptions include certain OS kernel code which resides in fixed low memory locations (e.g. interrupt vectors and handlers).

Execution time binding requires special hardware consisting at a minimum of a relocation (base) register and limit register, located in the memory-management unit (MMU). Recall these were introduced in lecture 1.
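To make the mechanism concrete, here is a minimal sketch in C of what the MMU does on every memory reference under execution time binding. The register values and trap behavior are illustrative, not any particular machine's.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative MMU state: relocation (base) and limit registers,
       reloaded by the OS dispatcher at each context switch. */
    uint32_t relocation_reg = 0x00140000;  /* where this process was loaded */
    uint32_t limit_reg      = 0x00020000;  /* size of this process's space  */

    /* What the MMU does, in hardware, on every reference. */
    uint32_t translate(uint32_t logical) {
        if (logical >= limit_reg) {
            fprintf(stderr, "trap: addressing error at %#x\n", logical);
            exit(1);                       /* hardware would trap to the OS */
        }
        return logical + relocation_reg;   /* bound at execution time */
    }

    int main(void) {
        printf("%#x\n", translate(0x1234));  /* prints 0x141234 */
        return 0;
    }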

Dynamic linking

Since the development of GUIs, processes have become very large due to the size of the user interface library functions. Moreover, many processes use the same library functions. Traditionally, library functions were statically linked into the user's program at link time (actually an intermediate step between compilation and loading).

When statically-linked programs are run, the result may be multiple copies of the same library functions in memory at the same time. This is wasteful, so the concept of dynamically linked libraries (DLL) was developed. This is a special form of execution-time binding: programs are compiled with a stub substituted for each library call, and the stub contains the information needed to locate the function at runtime.

When the stub is executed the first time, the corresponding library function is loaded into memory (if not already there). Its loaded address is substituted into the stub and subsequent executions of the call work as if the function were part of the original process image.

This does require some OS assistance to allow the process to call a function located outside its address space. The loaded library function can be shared by multiple processes, thus saving memory.

The other major advantage is the ability to update the DLL without having to recompile the programs that use it.
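On POSIX systems a process can even perform this execution-time linking explicitly through the dlopen API. The following sketch does by hand what a stub does automatically on its first execution; the library name libm.so.6 assumes a Linux system, and the program is compiled with -ldl.

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void) {
        /* Load the shared math library into memory (if not already there). */
        void *handle = dlopen("libm.so.6", RTLD_LAZY);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        /* Look up the loaded address of cos(), as a stub would, and
           call through it as if it were part of the process image. */
        double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
        if (!cosine) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        printf("cos(0) = %f\n", cosine(0.0));
        dlclose(handle);
        return 0;
    }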

An early version of this was implemented in the original Macintosh 20 years ago. This machine included a 64K ROM module containing GUI library functions that the user process could access directly. Come see the 1984 vintage Macintosh in my office and bet me how long it will take to boot up.

A Word about Swapping versus Virtual Memory

For efficient system performance, there must be a balance between the number of processes the CPU can work with concurrently and the number of processes the memory can contain concurrently.

If the CPU can juggle more processes than memory can contain simultaneously, the memory manager must do something to address the imbalance.

The two major approaches are swapping and virtual memory.

In a swapping system, space for a loading process is cleared out by writing a non-running process out to the swap device (special disk sectors). It is later swapped back in for its next CPU burst.

In a virtual memory system, only the currently active portions of a process need be resident in memory. Other parts of the process are not in use at any given moment -- a thread is a single instruction sequence that is normally linear or in a localized loop. Portions can be loaded as needed, and no-longer-active portions can be aged and overwritten.

A side benefit of virtual memory is it allows the logical address space of a process to exceed the physical address space of the machine.

Nearly all modern systems use the virtual memory approach, and the algorithms we study were designed with that in mind.


Partitioned Allocation

Partitioning involves loading the entire process space into memory. Physical memory is thus partitioned among the various processes, with each process stored in a contiguous chunk of memory.

This makes address limit checking and relocation very simple! It is not an efficient use of space, however, because different processes have different sizes.

Early partitioning techniques used fixed partitions: a fixed number of partitions, each of fixed size. Each partition held exactly one process, and the OS made an effort to fit processes to partitions.

This soon gave way to variable partitions in which the partition was only as large as the process, and could be allocated in any available contiguous hole of memory large enough to contain it. The OS must maintain a list of holes. The trick then is matching processes to holes, and several strategies emerged:
  1. first fit: allocate the first hole that is large enough
  2. best fit: allocate the smallest hole that is large enough
  3. worst fit: allocate the largest hole

Best fit uses space efficiently but creates small holes that are hard to fill and can add up to lots of wasted space. Worst fit creates large holes that are more likely to be used by later processes. First fit allocates faster because there is little decision logic. Note these are all policy decisions.
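As a sketch of the mechanism behind these policies, here is first fit over a singly linked hole list in C. The data structure is illustrative; a real allocator would also unlink empty holes and coalesce adjacent ones.

    #include <stddef.h>

    /* One hole in the free list: starting address and length. */
    struct hole {
        size_t start, length;
        struct hole *next;
    };

    /* First fit: take the first hole large enough for the request.
       Best fit would scan the whole list for the smallest such hole;
       worst fit would scan for the largest. */
    size_t first_fit(struct hole *list, size_t request) {
        for (struct hole *h = list; h != NULL; h = h->next) {
            if (h->length >= request) {
                size_t addr = h->start;
                h->start  += request;   /* shrink the hole; the       */
                h->length -= request;   /* leftover may be a fragment */
                return addr;
            }
        }
        return (size_t)-1;  /* no hole large enough */
    }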

All these approaches result in external fragments, holes between processes that are not large enough to be usable. Fragments can be periodically removed through compaction but this requires OS overhead.


Paged Allocation

The fragmentation problems of variable partitioning were caused by the requirement of contiguous memory allocation. Paging allows the physical process space to be non-contiguous.

Here we consider only the mechanism, which is loading pages into frames and translating logical addresses to physical addresses. The policy, which concerns which frames to load, when to load them, where to load them and when to replace them, is covered in the virtual memory lecture.

Paging eliminates external fragmentation but introduces internal fragmentation. This is simply the unused portion of the last page of a process, on average one half a page in size. Fragmentation is reduced only by reducing the page size, which has performance costs of its own (it results in more frames and thus a longer page table; see below).

Another cost of paging: since any frame can be allocated to any page, the OS has to keep track of which frames are available (for future allocations) and which frames are allocated to which process (for protection -- to prevent a process from accessing a frame allocated to a different process). It keeps track using a frame table with one entry per physical frame.

Hardware support for paged address translation

The hardware required to implement paged addressing includes:
  1. a logical address register, divided into page number and page displacement fields
  2. a page table, indexed by page number, with each entry containing a frame number
  3. a physical address register, divided into frame number and frame displacement fields

The translation process is simple (and not strictly sequential, as the numbered steps might imply):
  1. logical address is loaded into logical address register
  2. page displacement is copied to frame displacement
  3. page number is used to index into page table
  4. page table entry is copied to frame number
  5. physical address used to access memory
For example, assume a 4 KB page size, so the low-order 12 bits of the address are the displacement. Logical address 0x5ABC then has page number 5 and displacement 0xABC. If page table entry 5 contains frame number 9, the resulting physical address is 0x9ABC.
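Expressed in C, with the page table shown as a simple array (a sketch, assuming 4 KB pages and 32-bit addresses):

    #include <stdint.h>

    #define PAGE_BITS 12                  /* 4 KB pages */
    #define PAGE_MASK ((1u << PAGE_BITS) - 1)

    extern uint32_t page_table[];         /* index = page, value = frame */

    uint32_t translate(uint32_t logical) {
        uint32_t page  = logical >> PAGE_BITS;   /* step 3: index into table  */
        uint32_t disp  = logical & PAGE_MASK;    /* step 2: copy displacement */
        uint32_t frame = page_table[page];       /* step 4: get frame number  */
        return (frame << PAGE_BITS) | disp;      /* step 5: physical address  */
    }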

Implementing the page table

Each process has its own page table. The page table (or at least a pointer to it) is part of the Process Control Block and must be saved/restored upon context switch.

The page table can be implemented using registers only if the table is very small. The benefits of a large page table (e.g. 1 million entries) are so great that the sacrifice is made to store it in main memory. In this case, the base address of the page table itself is stored in a register. Getting the page table entry thus requires an additional memory access to accomplish steps 3 and 4 above.

Protection bits can also be added to each page table entry. Two examples are:
  1. a valid/invalid bit, indicating whether the page belongs to the process' logical address space (a reference to an invalid page traps to the OS)
  2. a read-only bit, protecting pages such as shared code from modification

If multiple processes are running the same application, it is advantageous to keep only one shared copy of the application's binary code in memory. The page tables for those processes will all contain entries pointing to the same set of shared frames occupied by the application.

Speeding up the translation process

As described just above, the logical-to-physical address translation itself requires a memory access to get the page table entry. Thus each memory request requires two memory accesses. Can this process be made faster? You know the answer.

The answer is a small associative cache of recently-accessed page table entries called the Translation Look-aside Buffer (TLB).

Address translation now becomes:
  1. logical address is loaded into logical address register
  2. page displacement is copied to frame displacement of physical address register
  3. page number is used to search TLB
  4. If TLB hit, copy frame number into physical address
  5. If TLB miss, get the page table entry from memory, copy the frame number into the physical address, and add the page/frame pair to the TLB (replacing an existing entry if the TLB is full)
  6. physical address used to access memory
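To quantify the benefit, consider an effective access time (EAT) calculation with illustrative figures: a 20 ns TLB search, a 100 ns memory access, and an 80% TLB hit ratio. Then

    EAT = 0.80 × (20 + 100) + 0.20 × (20 + 100 + 100) = 96 + 44 = 140 ns

compared with 200 ns for the two memory accesses required without a TLB.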

Multi-level Paging

Page table size is a huge concern. Consider a 32 bit virtual address with a reasonable 4KB page size; the displacement field uses the low order 12 bits, leaving 20 for the page number. The page table thus has 2^20 (over 1 million) entries, with each entry requiring 4 bytes or more -- 4 MB of RAM per process!

Wait, it gets worse... the page table must be stored in contiguous locations to allow the page number to be used as an index!

Solution? Page the page table! Instead of having one page table with 2^20 entries, you could have, say, 2^10 page tables each with 2^10 entries, i.e. 1024 page tables of 1024 entries each. The contiguous storage requirement then drops from 4MB to 4KB (e.g. each piece could be stored in one frame).

Here's what's required:
  1. the 20-bit page number is itself split into two 10-bit fields: an outer page number p1 and an inner page number p2
  2. an outer page table of 2^10 entries, each pointing to the frame holding one page of the page table
  3. translation uses p1 to index the outer page table, then p2 to index the selected page-table page, which yields the frame number

Advantage? Page table can be split up and stored in non-contiguous frames, facilitating memory management.

Disadvantage? Now two memory accesses are required to find the frame number, one to access the outer page table and a second to access the page table page. This can be overcome using a TLB, since a TLB hit would eliminate both accesses.
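A sketch of two-level translation in C, assuming the 10/10/12 split above. The tables are shown as plain arrays for clarity; in a real system the outer entry would name a frame that is then addressed in memory.

    #include <stdint.h>

    /* 32-bit logical address split 10/10/12.  Illustrative tables:
       outer_table[p1] names which inner table to use, and each inner
       table maps p2 to a frame number. */
    extern uint32_t outer_table[1024];
    extern uint32_t inner_tables[1024][1024];

    uint32_t translate(uint32_t logical) {
        uint32_t p1   = (logical >> 22) & 0x3FF;   /* outer page number   */
        uint32_t p2   = (logical >> 12) & 0x3FF;   /* inner page number   */
        uint32_t disp = logical & 0xFFF;           /* 12-bit displacement */

        uint32_t inner = outer_table[p1];          /* 1st memory access   */
        uint32_t frame = inner_tables[inner][p2];  /* 2nd memory access   */
        return (frame << 12) | disp;
    }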

Can this be extended to 3-levels? 4? Sure -- the Motorola 68030, used in Macintoshes for years, implemented 4-level paging.


Hashed Page Table

For address spaces larger than 32 bits, especially the new 64 bit processors, either page tables get prohibitively long or the number of memory accesses needed to access the page table entry gets prohibitively large (e.g. N-level paging, N > 4). The problem is using page number as table index.

One alternative is a hashed page table. The page table is replaced by a hash table. To perform a translation, a hash function is applied to the page number; the resulting hash value is the table index, and the entry at that index contains the corresponding frame number.

The major concern for this technique, as for any hash, is collision, where two different page numbers hash to the same hash table index. One solution is to have each hash table entry contain a pointer to the head of a linked list of page/frame pairs. The hash result is used to access the linked list, which is then traversed sequentially until the page number is matched. Performance is very poor in the worst case, but on average each linked list will have only one or two members.
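A sketch of the lookup in C, using a deliberately simple hash function; a real implementation would hash more carefully and trap to the OS on a miss.

    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_SIZE 1024          /* illustrative */

    struct hnode {                   /* one page/frame pair in a chain */
        uint64_t page, frame;
        struct hnode *next;
    };

    extern struct hnode *hash_table[TABLE_SIZE];

    /* Returns the frame for a page, or (uint64_t)-1 on a miss. */
    uint64_t lookup(uint64_t page) {
        size_t index = page % TABLE_SIZE;            /* hash the page number */
        for (struct hnode *n = hash_table[index]; n; n = n->next)
            if (n->page == page)                     /* traverse the chain   */
                return n->frame;
        return (uint64_t)-1;
    }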

The average linked list length can be shortened, thus speeding the translation, by making each list node a short page table (e.g. 16 entries). This technique is called clustered page tables. In this case, we must assure the hash function will map all possible entries in a given page table to the same linked list. Easily done: if the number of pages is 2^N and each little page table has 2^K entries, then hash on the high-order N-K bits of the page number and use the low-order K bits for the page table displacement.


Topsy-turvy: the Inverted page table

Remember the frame table mentioned above? It has one entry per frame, and is used by the OS to track which frames are occupied and which are available for allocation.

Consider a translation approach that uses a page table with one entry per frame. It is indexed by frame number, but the index is not used for searching; instead the index is the result of a successful search! The search is based on combined process ID and virtual page number. Because the index is the result instead of the basis for the search, this technique is called an inverted page table.

Each page table entry contains a process ID and a page number from that process. The process ID is needed because all processes share the same page table.

Here's what's required:
  1. one system-wide table with one entry per physical frame; each entry contains a <process ID, page number> pair
  2. each address reference presents <pid, page number, displacement>
  3. the table is searched for an entry matching <pid, page number>; the index at which the match occurs is the frame number, to which the displacement is appended
  4. since a linear search is too slow, the search is normally assisted by a hash table or associative registers
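A sketch of the search in C, as a linear scan for clarity (as just noted, real systems use a hash or associative search); sizes are illustrative.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_FRAMES 65536         /* illustrative */

    struct ipt_entry {               /* one entry per physical frame */
        uint32_t pid, page;
    };

    extern struct ipt_entry ipt[NUM_FRAMES];

    /* The matching *index* is the frame number -- the index is the
       result of the search, not its basis.  Returns -1 on a miss. */
    int64_t find_frame(uint32_t pid, uint32_t page) {
        for (size_t frame = 0; frame < NUM_FRAMES; frame++)
            if (ipt[frame].pid == pid && ipt[frame].page == page)
                return (int64_t)frame;
        return -1;                   /* page fault */
    }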


Alternative to paging: Segmentation

Paging is fine but results in memory organization that bears no resemblance to process structures. Examples of process structures are functions, classes, modules, data, and so forth.

Organizing memory by segmentation means thinking of the logical address space as a collection of segments.

Assuming each segment is stored in a contiguous chunk of memory, here's what's required:
  1. logical addresses of the form <segment number, offset>
  2. a segment table with one entry per segment, each containing the segment's base (starting physical address) and limit (segment length)
  3. translation: if offset < limit, the physical address is base + offset; otherwise the hardware traps to the OS (addressing error)
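In C, the check and translation look like this (a sketch; the table contents and trap behavior are illustrative):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct seg_entry {
        uint32_t base;    /* starting physical address of the segment */
        uint32_t limit;   /* length of the segment                    */
    };

    extern struct seg_entry seg_table[];

    /* Logical address is the pair <segment, offset>. */
    uint32_t translate(uint32_t segment, uint32_t offset) {
        if (offset >= seg_table[segment].limit) {
            fprintf(stderr, "trap: offset beyond segment limit\n");
            exit(1);                 /* hardware would trap to the OS */
        }
        return seg_table[segment].base + offset;
    }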

Advantage of segmentation: it facilitates sharing and protection. For read-only contents, such as a program module, several processes can share one copy of a memory-resident segment. Access to and manipulation of a segment can be controlled through protection bits stored in its segment table entry.

Disadvantage of segmentation: external fragmentation of memory, since allocation is based on variable segment sizes.


The ultimate: Paged Segmentation

The advantages of segmentation are considerable, but how can we control the fragmentation problem? You guessed it: page the segments! We relax the requirement that a segment be stored in contiguous memory. A generic solution involves defining a segment table where each entry points to the page table for that segment. You should be able to figure it out from there.

We will not go into the details of this solution. However you should be aware that this is not just of theoretical interest; the Intel Pentium processor implements the technique of paged segments.

 

