Difference between revisions of "Techwiki:Memory management in the Windows XP kernel"

From ReactOS Wiki

Revision as of 10:19, 7 December 2009

This article outlines the structures and API as used in the Memory Manager. It assumes that the reader is already familiar with low-level memory management operations, and covers in-depth the behavior of both kernel-mode and user-mode functionality.

Prerequisites

This article is intended for readers who have already worked with memory in kernel mode, can distinguish between MmProbeAndLockPages and MmMapLockedPagesSpecifyCache, and are familiar with the processor's hardware memory management: page directory entries (PDE), page table entries (PTE) and the page fault exception (#PF). Otherwise, the following two articles are recommended pre-reading:

  1. Kernel mode drivers, parts 6 and 9, by Four-F, to understand kernel-mode memory management
  2. Intel processors in protected mode, parts 6 and 7, by BrokenSword, to understand hardware memory management in a processor (note that Part 7 has an error in one figure: the picture shown for the PDE of 4M pages is actually the one for the PDE of 4K pages)

Layout of the PDE/PTE, invalid PTEs

Consider first how Windows uses the PTE fields that Intel marks as available to operating system software (Avail.). Windows uses these three bits as follows (the structures with PAE turned off and on, respectively):

     typedef struct _MMPTE_HARDWARE {
        ULONG Valid : 1;
        ULONG Write : 1;
        ULONG Owner : 1;
        ULONG WriteThrough : 1;
        ULONG CacheDisable : 1;
        ULONG Accessed : 1;
        ULONG Dirty : 1;
        ULONG LargePage : 1;
        ULONG Global : 1;
        ULONG CopyOnWrite : 1; // software field
        ULONG Prototype : 1;   // software field
        ULONG reserved : 1;    // software field
        ULONG PageFrameNumber : 20;
    } MMPTE_HARDWARE, *PMMPTE_HARDWARE;
    typedef struct _MMPTE_HARDWARE_PAE {
        ULONGLONG Valid : 1;
        ULONGLONG Write : 1;
        ULONGLONG Owner : 1;
        ULONGLONG WriteThrough : 1;
        ULONGLONG CacheDisable : 1;
        ULONGLONG Accessed : 1; 
        ULONGLONG Dirty : 1;
        ULONGLONG LargePage : 1;
        ULONGLONG Global : 1;
        ULONGLONG CopyOnWrite : 1; // software field
        ULONGLONG Prototype : 1; // software field 
        ULONGLONG reserved0 : 1; // software field
        ULONGLONG PageFrameNumber : 24;
        ULONGLONG reserved1 : 28; // software field
    } MMPTE_HARDWARE_PAE, *PMMPTE_HARDWARE_PAE;

The comments mark these software fields.

CopyOnWrite indicates that the page is copy-on-write. Such pages are requested by the user with the PAGE_WRITECOPY or PAGE_EXECUTE_WRITECOPY attribute, which means that a process gets its own private copy of the page when it tries to write to it, while the other processes keep using the shared unmodified copy. The Prototype field in a valid PTE means that this is a so-called prototype PTE, used to share memory between processes through memory-mapped files (MMF; see the Win32 API documentation for CreateFileMapping, OpenFileMapping, MapViewOfFile(Ex)). The third field is reserved and unused in a valid PTE; in an invalid PTE this bit is called Transition and is set when the PTE is considered transitional.

I am not going to cover hardware memory management or the other fields of the PDE/PTE structures: that has been written about dozens of times already. The rest of this section deals with the PTE formats that Windows uses when the flag Valid = 0, that is, with invalid PTEs.

  • Paged out PTE - an invalid PTE describing a page that has been written out to the paging file. On first access it is read back in and added to the working set. This PTE is described by the following structure:
           typedef struct _MMPTE_SOFTWARE {
              ULONG Valid : 1;
              ULONG PageFileLow : 4;
              ULONG Protection : 5;
              ULONG Prototype : 1;
              ULONG Transition : 1;
              ULONG PageFileHigh : 20;
          } MMPTE_SOFTWARE;

On PAE systems it is:

           typedef struct _MMPTE_SOFTWARE_PAE {
              ULONGLONG Valid : 1;
              ULONGLONG PageFileLow : 4;
              ULONGLONG Protection : 5;
              ULONGLONG Prototype : 1;
              ULONGLONG Transition : 1;
              ULONGLONG Unused : 20;
              ULONGLONG PageFileHigh : 32;
          } MMPTE_SOFTWARE_PAE;

Here Valid = 0 and PageFileLow contains the number of the paging file (of which, as you may have guessed, there can be at most 16). Protection holds the access attributes of the page, given as MM_* constants:

          #define MM_ZERO_ACCESS         0  // this value is not used. 
          #define MM_READONLY            1 
          #define MM_EXECUTE             2 
          #define MM_EXECUTE_READ        3 
          #define MM_READWRITE           4  // bit 2 is set if this is writable. 
          #define MM_WRITECOPY           5 
          #define MM_EXECUTE_READWRITE   6 
          #define MM_EXECUTE_WRITECOPY   7 
          #define MM_NOCACHE             8 
          #define MM_DECOMMIT         0x10 
          #define MM_NOACCESS         (MM_DECOMMIT|MM_NOCACHE)

Prototype = 0, Transition = 0, and PageFileHigh is the page number within the paging file (paging files and paging are discussed later).

  • Demand zero PTE - an invalid PTE describing a page that is not in the working set; when it is accessed, a page must be taken either from the list of zeroed pages or from the list of free pages, zeroed, and added to the working set. It is described in the same way as a paged out PTE, except that PageFileHigh = 0.
  • Prototype PTE - an invalid PTE describing a page shared by several processes, such as a page of a memory-mapped file. More precisely, such a PTE exists as a single instance and is not itself part of any process page table; the page tables of the processes instead contain the following invalid PTEs, which refer to the prototype PTE (versions for systems without PAE and with PAE, respectively):
           typedef struct _MMPTE_PROTOTYPE {
              ULONG Valid : 1;
              ULONG ProtoAddressLow : 7;
              ULONG ReadOnly : 1;
              ULONG WhichPool : 1;
              ULONG Prototype : 1;
              ULONG ProtoAddressHigh : 21;
          } MMPTE_PROTOTYPE;
          typedef struct _MMPTE_PROTOTYPE_PAE {
              ULONGLONG Valid : 1;
              ULONGLONG Unused0: 7;
              ULONGLONG ReadOnly : 1;
              ULONGLONG Unused1: 1;
              ULONGLONG Prototype : 1;
              ULONGLONG Protection : 5;
              ULONGLONG Unused: 16;
              ULONGLONG ProtoAddress: 32;
          } MMPTE_PROTOTYPE_PAE;

In this case: Valid = 0; ProtoAddress (ProtoAddressLow / ProtoAddressHigh) contains a reference to the prototype PTE describing the shared page; Prototype = 1; Protection holds the protection attributes of the page (MM_*); ReadOnly is set if the page must be read-only. ReadOnly is ignored when images are loaded into session space: the loader is allowed to write to these pages in order to process imports or apply relocations. The purpose of the WhichPool field is unknown to me.

  • Transition PTE - an invalid PTE describing a page that is on the Standby, Modified or ModifiedNoWrite list (these lists are described later). When the page is accessed, it is returned to the working set. It is described by the following structures:
           typedef struct _MMPTE_TRANSITION {
              ULONG Valid : 1;
              ULONG Write : 1;
              ULONG Owner : 1;
              ULONG WriteThrough : 1;
              ULONG CacheDisable : 1;
              ULONG Protection : 5;
              ULONG Prototype : 1;
              ULONG Transition : 1;
              ULONG PageFrameNumber : 20;
          } MMPTE_TRANSITION;
          typedef struct _MMPTE_TRANSITION_PAE {
              ULONGLONG Valid : 1;
              ULONGLONG Write : 1;
              ULONGLONG Owner : 1;
              ULONGLONG WriteThrough : 1; 
              ULONGLONG CacheDisable : 1;
              ULONGLONG Protection : 5;
              ULONGLONG Prototype : 1;
              ULONGLONG Transition : 1;
              ULONGLONG PageFrameNumber : 24;
              ULONGLONG Unused : 28;
          } MMPTE_TRANSITION_PAE;

In this case: Valid = 0, Prototype = 0, Transition = 1. The remaining fields have the same meaning as in a valid PTE.
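
The invalid PTE formats above can be decoded with plain shifts and masks. The following is a minimal sketch for the non-PAE layouts only, assuming that bit fields are allocated from the least significant bit upward (as the Microsoft compiler does). All helper names are hypothetical and do not come from the kernel. It also assumes that ProtoAddressHigh and ProtoAddressLow are simply the upper and lower parts of one 28-bit value; how the kernel turns that value into an address is not covered here.

```c
#include <stdint.h>

/* Flag bits shared by the non-PAE invalid-PTE formats described above. */
#define PTE_VALID       0x001u   /* bit 0  */
#define PTE_PROTOTYPE   0x400u   /* bit 10 */
#define PTE_TRANSITION  0x800u   /* bit 11 */

/* Paged out PTE: PageFileLow (bits 1-4) selects one of up to 16 paging files. */
static uint32_t pte_page_file_number(uint32_t pte)
{
    return (pte >> 1) & 0xFu;
}

/* Paged out PTE: PageFileHigh (bits 12-31) is the page number inside that file. */
static uint32_t pte_page_file_offset(uint32_t pte)
{
    return pte >> 12;
}

/* Software and transition PTE: Protection (bits 5-9) holds an MM_* constant. */
static uint32_t pte_protection(uint32_t pte)
{
    return (pte >> 5) & 0x1Fu;
}

/* PTE referring to a prototype PTE: join ProtoAddressHigh (bits 11-31) and
 * ProtoAddressLow (bits 1-7) into one 28-bit value (an assumption, see above). */
static uint32_t pte_proto_index(uint32_t pte)
{
    uint32_t low  = (pte >> 1) & 0x7Fu;
    uint32_t high = pte >> 11;
    return (high << 7) | low;
}

/* Transition PTE: Valid = 0, Prototype = 0, Transition = 1. */
static int pte_is_transition(uint32_t pte)
{
    return (pte & (PTE_VALID | PTE_PROTOTYPE | PTE_TRANSITION)) == PTE_TRANSITION;
}

/* Turn a valid PTE into a transition PTE: the page frame number and bits 1-4
 * (Write, Owner, WriteThrough, CacheDisable) are kept, Protection replaces
 * bits 5-9, Valid and Prototype are cleared and Transition is set. */
static uint32_t pte_make_transition(uint32_t valid_pte, uint32_t protection)
{
    uint32_t pte = valid_pte & (0xFFFFF000u | 0x1Eu);
    pte |= (protection & 0x1Fu) << 5;
    pte |= PTE_TRANSITION;
    return pte;
}
```

For example, the value 0x01234084 decodes as page 0x1234 of paging file 2 with protection MM_READWRITE (4).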

Page fault handling

When the processor encounters an invalid PTE, a page fault exception occurs (#PF, Page Fault). In Windows, the handler _KiTrap0E calls MmAccessFault() to handle this exception; after a number of checks, MmAccessFault() calls MiDispatchFault if the fault is to be resolved.

MiDispatchFault calls one of the following functions to resolve the page fault:

  • MiResolveProtoPteFault is called on a page fault on a PTE with the flag Prototype = 1. It examines the prototype PTE to which the faulting PTE points, and:
    1. If the prototype PTE also has the Prototype flag set, it is a page of a file mapped into shared memory. MiResolveMappedFileFault is called.
    2. If the prototype PTE has the Transition flag set, it is a transition PTE and its page is on the list of modified or standby pages. It got there as a result of working set trimming. MiResolveTransitionFault is called.
    3. If the prototype PTE has Transition == 0 && Prototype == 0 && PageFileHigh == 0, it is a demand-zero PTE. MiResolveDemandZeroFault is called.
    4. If the prototype PTE has Transition == 0 && Prototype == 0 && PageFileHigh != 0, the page has been written out to the paging file. MiResolvePageFileFault is called.
  • MiResolveTransitionFault is called when the faulting PTE has the flag Transition = 1, or when it points to a prototype PTE that has the Transition flag set. Pages end up in this state as a result of working set trimming, or in other circumstances when physical pages were needed, so resolving this fault consists of returning the page to the working set. Since the page has not been paged out to disk, this is very easy: a valid PTE just has to be written in place of the invalid one. For example, the function MmTrimAllSystemPagableMemory(0) puts pages into the Transition state, but more on that later, in the part of the article devoted to paging.
  • MiResolveDemandZeroFault is called to handle a fault on a demand-zero page. If the request came from user mode, an attempt is made to allocate a physical page from the list of zeroed pages (the lists of physical pages are described later). If this fails, a free page is allocated and zeroed. For a request from kernel mode, zeroing is not forced when a page is allocated from the list of free pages. Reserved system PTEs or hyperspace are used for zeroing.
  • MiResolvePageFileFault is called to handle a fault on a page that has been written out to the paging file. A read from the paging file is initiated by returning the status STATUS_ISSUE_PAGING_IO; pages are read from the paging file in clusters to reduce the number of page faults. When MiDispatchFault receives the status STATUS_ISSUE_PAGING_IO, it reads the pages using IoPageRead, which builds an ordinary IRP for IRP_MJ_READ but sets the special flag IRP_PAGING_IO in it. The page is taken from the list of free or zeroed pages.
  • MiResolveMappedFileFault is called from MiResolveProtoPteFault when Prototype == 1 in the prototype PTE. The prototype PTE is then interpreted as follows (versions without PAE and with PAE):
           typedef struct _MMPTE_SUBSECTION {
              ULONG Valid : 1;
              ULONG SubsectionAddressLow : 4;
              ULONG Protection : 5;
              ULONG Prototype : 1;
              ULONG SubsectionAddressHigh : 20;
              ULONG WhichPool : 1;
          } MMPTE_SUBSECTION;
          
          typedef struct _MMPTE_SUBSECTION { 
              ULONGLONG Valid : 1; 
              ULONGLONG Unused0 : 4; 
              ULONGLONG Protection : 5; 
              ULONGLONG Prototype : 1; 
              ULONGLONG Unused1 : 21; 
              ULONGLONG SubsectionAddress : 32; 
          } MMPTE_SUBSECTION;

It contains the address of the SUBSECTION object that backs the mapped file. For example, SUBSECTION::ControlArea->FilePointer holds the FILE_OBJECT of the file.
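
The choice that MiResolveProtoPteFault makes among the four resolution routines can be sketched as a small decision function over the contents of the prototype PTE. This is only an illustration for the non-PAE bit layout, not the kernel's code: the enum and function names are hypothetical, and the sketch covers only the case of an invalid prototype PTE, which is the case described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the four resolution routines named in the text. */
typedef enum {
    RESOLVE_MAPPED_FILE,   /* MiResolveMappedFileFault */
    RESOLVE_TRANSITION,    /* MiResolveTransitionFault */
    RESOLVE_DEMAND_ZERO,   /* MiResolveDemandZeroFault */
    RESOLVE_PAGE_FILE      /* MiResolvePageFileFault   */
} resolver_t;

/* Pick a routine from the contents of an invalid, non-PAE prototype PTE. */
static resolver_t choose_proto_resolver(uint32_t proto_pte)
{
    /* Only invalid prototype PTEs are covered by this sketch. */
    assert((proto_pte & 0x001u) == 0);

    /* Prototype must be tested first: in the subsection format the bits
     * from 11 upward belong to SubsectionAddressHigh, not to Transition. */
    if (proto_pte & 0x400u)
        return RESOLVE_MAPPED_FILE;
    if (proto_pte & 0x800u)
        return RESOLVE_TRANSITION;
    if ((proto_pte >> 12) == 0)          /* PageFileHigh == 0 */
        return RESOLVE_DEMAND_ZERO;
    return RESOLVE_PAGE_FILE;
}
```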

Management of physical memory

Physical memory in the system is described by a number of kernel-mode structures. They are needed to maintain the lists of free and used pages, to satisfy memory allocations and to perform other memory operations. Let us start with the main parts of the kernel that describe and allocate the system's physical memory. The first structure we consider is MmPhysicalMemoryDescriptor, declared as follows:

   typedef struct _PHYSICAL_MEMORY_RUN {
       PFN_NUMBER BasePage;
       PFN_NUMBER PageCount;
   } PHYSICAL_MEMORY_RUN, *PPHYSICAL_MEMORY_RUN;
   typedef struct _PHYSICAL_MEMORY_DESCRIPTOR {
       ULONG NumberOfRuns;
       PFN_NUMBER NumberOfPages;
       PHYSICAL_MEMORY_RUN Run[1];
   } PHYSICAL_MEMORY_DESCRIPTOR, *PPHYSICAL_MEMORY_DESCRIPTOR;
   PPHYSICAL_MEMORY_DESCRIPTOR MmPhysicalMemoryDescriptor;

The kernel variable MmPhysicalMemoryDescriptor describes all physical memory in the system that is available and suitable for use, and is initialized at boot time.

The kernel maintains six lists of pages (out of eight possible page states), which together contain almost all physical pages, with the exception of those used by the memory manager itself. The lists are linked through the u1.Flink and u2.Blink fields of the MMPFN structure (described later). The lists are:
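
To illustrate how the runs are meant to be read, here is a minimal user-mode sketch that sums the pages of all runs and checks whether a given page frame number is covered by some run. The type names are stand-ins for PFN_NUMBER and PHYSICAL_MEMORY_RUN, and the two sample runs are made-up values, not taken from a real system.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t pfn_number_t;          /* stand-in for PFN_NUMBER */

typedef struct {
    pfn_number_t base_page;              /* first page frame of the run */
    pfn_number_t page_count;             /* number of pages in the run  */
} memory_run_t;                          /* mirrors PHYSICAL_MEMORY_RUN */

/* Made-up example: low memory, then a hole, then the rest of 512 MB. */
static const memory_run_t sample_runs[] = {
    { 0x001, 0x0009E },
    { 0x100, 0x1FF00 },
};
#define SAMPLE_RUN_COUNT (sizeof sample_runs / sizeof sample_runs[0])

/* Sum of the page counts of all runs (what NumberOfPages would hold). */
static pfn_number_t total_pages(const memory_run_t *runs, size_t count)
{
    pfn_number_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += runs[i].page_count;
    return total;
}

/* Returns 1 if the page frame number falls inside one of the runs. */
static int pfn_is_described(const memory_run_t *runs, size_t count,
                            pfn_number_t pfn)
{
    for (size_t i = 0; i < count; i++) {
        if (pfn >= runs[i].base_page &&
            pfn - runs[i].base_page < runs[i].page_count)
            return 1;
    }
    return 0;
}
```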

  • ZeroedPageList - the list of zeroed pages, which can be handed out on requests from user code. The background thread MmZeroPageThread (the primary thread of KiSystemStartup turns into it once all initialization is done) zeroes free pages and moves them to this list. When user code requests a page, this list has the highest priority among the lists from which a page can be taken.
  • FreePageList - the list of free pages. They can be handed to a user after being cleared: either the thread signals the event MmZeroingPageEvent, after which the page zeroing thread MmZeroPageThread zeroes the page, or in some exceptional cases the thread zeroes the page itself, for example when handling a #PF on a demand-zero PTE (in this case going through the page zeroing thread would cost additional time). For a user request this is the second-priority list after ZeroedPageList.
  • StandbyPageList - the list of standby pages. These pages were previously part of a working set (of a process or of the system) but were later removed from it. The page has not been modified since it was last written to disk, the PTE referring to it is in the transition state, and the page can be used to satisfy a memory allocation request, but only after the lists of zeroed and free pages have been examined. (Windows 2003 has 8 standby sublists, one per priority, described in the array MmStandbyPageListByPriority[]. Windows XP and earlier have a single list.)
  • ModifiedPageList - the list of modified pages. They too were part of a working set but were later removed from it when the working set was trimmed for some reason. The pages have been modified since they were last written to disk and must be written to the paging file. The PTE still refers to the page, but is invalid and in the transition state.
  • ModifiedNoWritePageList - the list of modified pages that are not to be written. As above, but the pages must not be written to disk.
  • BadPageList - the list of pages that the memory manager has marked as bad for some reason. They must not be used. For example, the page zeroing thread temporarily marks pages as bad while it looks through a range of pages waiting to be zeroed, so that they are not suddenly handed to some process requesting a contiguous region of memory (MmAllocateContiguousMemory). No PTE should refer to such a page.

Page states that are not lists:

  • ActiveAndValid - the page is active and valid and is not on any list. Such pages are part of a working set, or do not belong to any working set and are part of nonpaged memory. Valid PTEs refer to them.
  • TransitionPage - a temporary state of a page while it waits for an I/O operation.

Pointers to the lists are stored in the kernel variable MmPageLocationList[], declared as follows:

   PMMPFNLIST MmPageLocationList[8] =
   {
       &MmZeroedPageListHead,
       &MmFreePageListHead,
       &MmStandbyPageListHead,
       &MmModifiedPageListHead,
       &MmModifiedNoWritePageListHead,
       &MmBadPageListHead,
       NULL,
       NULL 
   };
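
The eight page states and the allocation priority described above can be summarized in a short sketch. The numeric order of the states is inferred here from the order of MmPageLocationList[] and from the comparison PageLocation < ActiveAndValid used later in the article; the helper functions are hypothetical and are not kernel routines.

```c
/* Page states in the order suggested by MmPageLocationList[]. */
typedef enum {
    ZeroedPageList = 0,
    FreePageList,
    StandbyPageList,
    ModifiedPageList,
    ModifiedNoWritePageList,
    BadPageList,
    ActiveAndValid,      /* state only, no list */
    TransitionPage       /* state only, no list */
} page_location_t;

/* The first six states correspond to real lists linked through Flink/Blink. */
static int page_is_on_list(page_location_t location)
{
    return location < ActiveAndValid;
}

/* Which list a user-mode request would be served from, given how many pages
 * each candidate list currently holds: zeroed first, then free, then standby.
 * Returns -1 if all three lists are empty. */
static int pick_list_for_user_request(unsigned long zeroed,
                                      unsigned long free_pages,
                                      unsigned long standby)
{
    if (zeroed > 0)
        return ZeroedPageList;
    if (free_pages > 0)
        return FreePageList;
    if (standby > 0)
        return StandbyPageList;
    return -1;
}
```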

Two important threads operate on the page lists: the page zeroing thread and the modified page writer thread.

  • The page zeroing thread. KiSystemStartup turns into it after all system components have been initialized and the session manager smss has been started. It loops waiting for the event MmZeroingPageEvent. When the event is signaled (which happens when the system has enough free pages for the zeroing thread to clear), the thread acquires the spin lock of the PFN database, takes a page from the list of free pages, maps it into hyperspace and zeroes it, after which the page is put on the list of zeroed pages and the cycle repeats.
  • The modified page writer thread. After the memory management subsystem starts, MmInitSystem() uses PsCreateSystemThread to create the thread MiModifiedPageWriter, which starts a second, auxiliary thread MiMappedPageWriter and itself proceeds into MiModifiedPageWriterWorker. The main function that writes pages out to the paging file is MiGatherMappedPages; writing pages out is discussed in the next section.

MmPfnDatabase

MmPfnDatabase is an array of MMPFN structures, one for each physical page in the system. It is perhaps the second most important object, after the PDE/PTE tables, supporting low-level memory operations. Each PFN entry stores information about a particular physical page. MMPFN looks schematically as follows (the full declaration is attached in the sources accompanying the article, including those for other versions of the OS: Windows 2000 and Windows 2003 Server):

   typedef struct _MMPFN {
       union {
           PFN_NUMBER Flink;             // Used if (u3.e1.PageLocation < ActiveAndValid)
           WSLE_NUMBER WsIndex;          // Used if (u3.e1.PageLocation == ActiveAndValid)
           PKEVENT Event;                // Used if (u3.e1.PageLocation == TransitionPage)
           NTSTATUS ReadStatus;          // Used if (u4.InPageError == 1)
       } u1;
       PMMPTE PteAddress;
       union {
           PFN_NUMBER Blink;             // Used if (u3.e1.PageLocation < ActiveAndValid)
           ULONG ShareCount;             // Used if (u3.e1.PageLocation >= ActiveAndValid)
           ULONG SecondaryColorFlink;    // Used if (u3.e1.PageLocation == FreePageList or == ZeroedPageList)
       } u2;
       union {
           struct _MMPFNENTRY {
               ULONG Modified : 1;
               ULONG ReadInProgress : 1;
               ULONG WriteInProgress : 1;
               ULONG PrototypePte: 1;
               ULONG PageColor : 3;
               ULONG ParityError : 1;
               ULONG PageLocation : 3;
               ULONG RemovalRequested : 1;
               ULONG CacheAttribute : 2;
               ULONG Rom : 1;
               ULONG LockCharged : 1;
               ULONG ReferenceCount : 16;
           } e1;
           struct {
               USHORT ShortFlags;
               USHORT ReferenceCount;
           } e2;
       } u3;
       MMPTE OriginalPte;
       union {
           ULONG EntireFrame;
           struct {
               ULONG PteFrame : 26;
               ULONG InPageError : 1;
               ULONG VerifierAllocation : 1;
               ULONG AweAllocation : 1;
               ULONG LockCharged : 1;
               ULONG KernelStack : 1;
               ULONG Reserved : 1;
           };
       } u4;
   } MMPFN, *PMMPFN;

The fields u1.Flink / u2.Blink link the six page lists mentioned above and are used when u3.e1.PageLocation < ActiveAndValid. If u3.e1.PageLocation >= ActiveAndValid, the second union is interpreted as u2.ShareCount and contains the number of users, that is, the number of PTEs referring to this page. For pages that contain arrays of PTEs, it contains the number of valid PTEs in the page. [Remark: Not entirely correct, there are additional references. 1 for the PDE pointing to this page and another unknown one. It also includes invalid PTEs, that still refer to a PFN (transitional)] If u3.e1.PageLocation == ActiveAndValid, u1 is interpreted as u1.WsIndex, the index of the page in the working set (or 0 if the page is in the nonpaged memory area). If u3.e1.PageLocation == TransitionPage, u1 is interpreted as u1.Event, the address of the event object on which the memory manager waits until access to the page is allowed. If u4.InPageError == 1, u1 is interpreted as ReadStatus and contains the status of the read error.

ReferenceCount contains a count of references to this page, either from valid PTEs pointing to it or from uses inside the memory manager (for example, the reference count is incremented while the page is being written to disk). It is always >= ShareCount [Remark: This is not correct. For page tables the ShareCount is usually bigger]. PteAddress contains a back link to the PTE that points to this physical page. A set low-order bit means that the PFN has been deleted. OriginalPte contains the original PTE, used to restore it when the page is written out. u4.PteFrame is the number of the page frame holding the PTE that maps the page described by the current MMPFN structure. The union u4 also contains the following additional flags:
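
Note that Flink and Blink are page frame numbers, that is, indexes into the PFN database, rather than pointers. The following sketch shows how such an index-linked list can work on a tiny simulated database. The structure names, the end-of-list marker and the demo list are hypothetical simplifications and do not reproduce the kernel's list head structure or its routines.

```c
#include <stddef.h>

#define PFN_COUNT   16u
#define EMPTY_LINK  ((size_t)-1)      /* hypothetical end-of-list marker */

typedef struct {
    size_t flink;                     /* next page on the list, by index     */
    size_t blink;                     /* previous page on the list, by index */
} pfn_entry_t;

typedef struct {
    size_t total;                     /* number of pages on the list */
    size_t flink;                     /* first page */
    size_t blink;                     /* last page  */
} pfn_list_t;

static pfn_entry_t pfn_db[PFN_COUNT]; /* simulated PFN database */
static pfn_list_t demo_free_list = { 0, EMPTY_LINK, EMPTY_LINK };

/* Append the page with the given frame number to the tail of the list. */
static void list_insert_tail(pfn_list_t *list, size_t pfn)
{
    pfn_db[pfn].flink = EMPTY_LINK;
    pfn_db[pfn].blink = list->blink;
    if (list->blink == EMPTY_LINK)
        list->flink = pfn;            /* list was empty */
    else
        pfn_db[list->blink].flink = pfn;
    list->blink = pfn;
    list->total++;
}

/* Remove and return the first page, or EMPTY_LINK if the list is empty. */
static size_t list_remove_head(pfn_list_t *list)
{
    size_t pfn = list->flink;
    if (pfn == EMPTY_LINK)
        return EMPTY_LINK;
    list->flink = pfn_db[pfn].flink;
    if (list->flink == EMPTY_LINK)
        list->blink = EMPTY_LINK;     /* list became empty */
    else
        pfn_db[list->flink].blink = EMPTY_LINK;
    list->total--;
    return pfn;
}
```

Linking by index keeps each entry small and lets the same two fields be reused for other purposes (WsIndex, ShareCount) once the page leaves the list.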

  • InPageError - indicates that an error occurred while reading the page from disk. u1.ReadStatus holds the status of this error.
  • VerifierAllocation - set to 1 for allocations protected by Driver Verifier.
  • AweAllocation - set to 1 for Address Windowing Extensions allocations.
  • The purpose of the LockCharged field, and of the field of the same name in MMPFNENTRY, is unfortunately unknown to me. If anyone knows, please share.
  • KernelStack - apparently set to 1 for pages that belong to a kernel stack.

If the page is on the list of zeroed or free pages, the second union is interpreted as a link chaining the lists of zeroed or free pages by secondary color (the so-called Secondary Color). Colors are distinguished for the following reason: the number of colors is set to the number of pages that fit in the processor's second-level cache, and the distinction ensures that two neighboring memory allocations do not use pages of the same color, so that the cache is used effectively.

The union u3 contains the flags of the PFN entry. Their meanings are as follows:
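
As a rough illustration of the idea, the number of colors can be derived from the cache size and a color assigned to each page frame. The mapping used below (the low bits of the page frame number) is an assumption made for this sketch, as are the function names and the 4 KB page size; the article itself only states how the number of colors is chosen.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u

/* Number of colors = number of pages that fit in the second-level cache. */
static uint32_t secondary_color_count(uint32_t l2_cache_bytes)
{
    return l2_cache_bytes / PAGE_SIZE_BYTES;
}

/* Assumed mapping: the color is the page frame number modulo the number of
 * colors, which must be a power of two for the mask to work. */
static uint32_t secondary_color_of(uint32_t pfn, uint32_t color_count)
{
    assert(color_count != 0 && (color_count & (color_count - 1)) == 0);
    return pfn & (color_count - 1);
}
```

With a 256 KB cache this gives 64 colors, so physically adjacent pages get different colors and two pages share a color only when their frame numbers differ by a multiple of 64.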

  • Modified. Set for pageable pages or pages mapped from disk whose contents have been changed and must be flushed to disk.
  • ReadInProgress, also known as StartOfAllocation
  • WriteInProgress, also known as EndOfAllocation
  • PrototypePte, also known as LargeSessionAllocation. For nonpaged system addresses these three fields are interpreted as StartOfAllocation, EndOfAllocation and LargeSessionAllocation and mean the following:
    • StartOfAllocation is set to 1 if the page is the start of a nonpaged pool allocation.
    • EndOfAllocation is set to 1 if the page is the end of a nonpaged pool allocation.
    • LargeSessionAllocation is set to 1 for large allocations in session space.

For pageable addresses these fields mean the following:

    • ReadInProgress is set while the page is being read from disk.
    • WriteInProgress is set while the page is being written to disk.
    • PrototypePte is set when the PTE that refers to this PFN is a prototype PTE.
  • PageColor, sometimes called the Primary Page Color, or simply the page color. It is used on some platforms to distribute pages across lists (an allocation such as MiRemoveAnyPage takes pages from a different list of a different color each time, so that the several lists backing, for example, free pages are used evenly). On x86 and x64 only one page color is used, and this field is always zero. It should not be confused with the Secondary Color, which is used to distribute pages across the second-level cache and is used in the functions MiRemoveZeroPage, MiRemoveAnyPage, etc. In addition to the plain lists of free and zeroed pages, per-color lists of free and zeroed pages are maintained: MmFreePagesByColor[list][SecondaryColor], where list is ZeroedPageList or FreePageList. These lists are maintained together with the common lists of free and zeroed pages; if a mismatch is detected, the blue screen PFN_LIST_CORRUPT is generated.
  • PageLocation - the state of the page (one of the eight listed above, from ZeroedPageList to TransitionPage).
  • RemovalRequested - this bit marks pages requested for removal. After the reference count drops to zero, the page does not stay in transition but is placed on the list of bad pages (BadPageList).
  • CacheAttribute - the caching attribute of the page: MmNonCached or MmCached.
  • Rom - new in Windows XP: the physical page is read-only.
  • ParityError - a parity error occurred on the page.

What has been described is best absorbed with the help of the example in the appendix to the article: a driver that displays the available memory runs and demonstrates working with PDE/PTE/PFN. The example code is well commented and, given the material in the article, should not cause difficulties.

Managing virtual memory - paging file

However, keeping all data permanently in physical memory is wasteful: some data is accessed rarely, some often, and sometimes more memory is needed than the physical memory available in the system. For this reason every modern operating system has a paging mechanism. It goes by different names: paging out, swapping. In Windows this mechanism is the part of the memory manager that manages paging, together with up to 16 different paging files (in Windows terminology). Windows has pageable and nonpageable memory, which respectively can and cannot be written out to disk. In the kernel, pageable memory can be allocated from the paged pool, and nonpageable memory from the nonpaged pool (for small allocations). In user mode memory is normally pageable, unless it has been locked into the working set by calling VirtualLock. Paging files are represented in the Windows kernel by the kernel variable MmPagingFile[MAX_PAGE_FILES] (the maximum number of paging files, as you might have guessed at the very beginning from the 4-bit size of the paging file number field in the PTE, is 16). Each paging file is represented in this array by a pointer to a structure of the following form:

     typedef struct _MMPAGING_FILE {
        PFN_NUMBER Size;
        PFN_NUMBER MaximumSize;
        PFN_NUMBER MinimumSize;
        PFN_NUMBER FreeSpace;
        PFN_NUMBER CurrentUsage;
        PFN_NUMBER PeakUsage;
        PFN_NUMBER Hint;
        PFN_NUMBER HighestPage;
        PVOID Entry[MM_PAGING_FILE_MDLS];
        PRTL_BITMAP Bitmap;
        PFILE_OBJECT File;
        UNICODE_STRING PageFileName;
        ULONG PageFileNumber;
        BOOLEAN Extended;
        BOOLEAN HintSetToZero;
        BOOLEAN BootPartition;
        HANDLE FileHandle;
    } MMPAGING_FILE, *PMMPAGING_FILE;
  • Size - the current size of the paging file (in pages)
  • MaximumSize - the maximum size of the paging file (in pages)
  • MinimumSize - the minimum size of the paging file (in pages)
  • FreeSpace - the number of free pages
  • CurrentUsage - the number of pages in use. The formula Size = FreeSpace + CurrentUsage + 1 always holds (the first page is not used)
  • PeakUsage - the peak usage of the paging file
  • Hint, HighestPage, HintSetToZero - [unknown purpose]
  • Entry - an array of two pointers to MMMOD_WRITER_MDL_ENTRY blocks, used by the modified page writer thread.
  • Bitmap - an RTL_BITMAP bitmap of page usage in the paging file.
  • File - the file object used by the file system to read and write the paging file
  • PageFileName - the name of the paging file, for example \??\C:\pagefile.sys
  • PageFileNumber - the number of the paging file
  • Extended - a flag that presumably indicates whether the paging file has ever been extended since its creation
  • BootPartition - a flag indicating whether the paging file is on the boot partition. If no paging file is located on the boot partition, no crash dump is written during a BSoD.
  • FileHandle - the handle of the paging file.
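
The relationship between Size, FreeSpace, CurrentUsage and the bitmap can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the kernel's allocator: the structure, the first-fit search, the 64-page size and all names are hypothetical. Page 0 is permanently marked as used, which matches the remark that the first page is not used (presumably because PageFileHigh == 0 denotes a demand-zero PTE).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PF_PAGES 64u                     /* hypothetical tiny paging file */

typedef struct {
    uint32_t size;                       /* Size, in pages */
    uint32_t free_space;                 /* FreeSpace      */
    uint32_t current_usage;              /* CurrentUsage   */
    uint8_t  bitmap[PF_PAGES / 8];       /* one bit per page, 1 = in use */
} paging_file_t;

static paging_file_t demo_file;          /* demo instance */

static void pf_init(paging_file_t *pf)
{
    memset(pf->bitmap, 0, sizeof pf->bitmap);
    pf->bitmap[0] |= 1u;                 /* page 0 is never handed out */
    pf->size = PF_PAGES;
    pf->current_usage = 0;
    pf->free_space = PF_PAGES - 1;
}

/* First-fit allocation; returns the page number, or 0 if the file is full. */
static uint32_t pf_alloc(paging_file_t *pf)
{
    for (uint32_t page = 1; page < pf->size; page++) {
        uint8_t mask = (uint8_t)(1u << (page % 8));
        if (!(pf->bitmap[page / 8] & mask)) {
            pf->bitmap[page / 8] |= mask;
            pf->free_space--;
            pf->current_usage++;
            return page;
        }
    }
    return 0;
}

/* Release a previously allocated page. */
static void pf_free(paging_file_t *pf, uint32_t page)
{
    uint8_t mask = (uint8_t)(1u << (page % 8));
    assert(page != 0 && page < pf->size);
    assert(pf->bitmap[page / 8] & mask);
    pf->bitmap[page / 8] &= (uint8_t)~mask;
    pf->free_space++;
    pf->current_usage--;
}

/* Size = FreeSpace + CurrentUsage + 1 must hold at all times. */
static int pf_invariant_holds(const paging_file_t *pf)
{
    return pf->size == pf->free_space + pf->current_usage + 1;
}
```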

The appendix to the article contains example commented output of the fields of the structure MmPagingFile[0] on a running system.

When the system needs a page and there are few free pages, the working sets of processes are trimmed (trimming also occurs for other reasons; this is only one of them). Assume that working set trimming was initiated by the function MmTrimAllSystemPagableMemory(0). During trimming, the PTEs of the pages are put into the Transition state and the reference count Pfn->u3.e2.ReferenceCount is decremented by 1 (this is done by the function MiDecrementReferenceCount). If the reference count reaches zero, the pages themselves are put on StandbyPageList or ModifiedPageList, depending on Pfn->u3.e1.Modified. Pages on StandbyPageList can be reused as soon as they are needed: it is enough to put the PTE into the Paged-Out state. Pages on ModifiedPageList must first be written to disk by the modified page writer thread, and only then are they moved to StandbyPageList and can be reused (the function MiGatherPagefilePages() is responsible for writing them out). Pseudocode for removing a page from the working set (heavily trimmed code of MiEliminateWorkingSetEntry and the functions it calls):

    TempPte = *PointerPte;
    PageFrameNumber = PointerPte->u.Hard.PageFrameNumber;
    
    if( Pfn->u3.e1.PrototypePte == 0)
    {
        //
        // Private page, make it transition.
        //
    	
        MI_ZERO_WSINDEX (Pfn);  // Pfn->u1.WsIndex = 0;
    	
        //
        // The following macro does this:
        //
        // TempPte.u.Soft.Valid = 0;
        // TempPte.u.Soft.Transition = 1;
        // TempPte.u.Soft.Prototype = 0;
        // TempPte.u.Trans.Protection = PROTECT;
        //
        
        MI_MAKE_VALID_PTE_TRANSITION (TempPte, 
                                      Pfn->OriginalPte.u.Soft.Protection);
    								  
        //
        // This call actually replaces the current PTE with TempPte and flushes
        // the translation lookaside buffer entry
        //
        // ( *PointerPte = TempPte );
        //
        
        PreviousPte.u.Flush = KeFlushSingleTb(Wsle[WorkingSetIndex].u1.VirtualAddress,
                                              TRUE,
                                              (Wsle == MmSystemCacheWsle),
                                              &PointerPte->u.Hard,
                                              TempPte.u.Flush);
        //
        // Decrement the share count. If it drops to zero, the page goes into the
        // transition state and the reference count is decremented.
        //
        
        // MiDecrementShareCount()
        Pfn->u2.ShareCount -= 1;
        
        if( Pfn->u2.ShareCount == 0 )
        {
            if( Pfn->u3.e1.PrototypePte == 1 )
            {
                // ... Additional processing of the prototype PTE ...
            }
            
            Pfn->u3.e1.PageLocation = TransitionPage;
            
            //
            // Decrement the reference count by 1. If it also drops to zero, move the
            // page to the modified or standby list, or release it completely (to the
            // free or bad page list), depending on MI_IS_PFN_DELETED () and RemovalRequested.
            //
            
            // MiDecrementReferenceCount()
            Pfn->u3.e2.ReferenceCount -= 1;
            
            if( Pfn->u3.e2.ReferenceCount == 0 )
            {
                if( MI_IS_PFN_DELETED(Pfn) )
                {
                    // No PTE refers to this page any more. Move it to the free list, or retire it if requested.
                    
                    MiReleasePageFileSpace (Pfn->OriginalPte);
                    
                    if( Pfn->u3.e1.RemovalRequested == 1 )
                    {
                        // The page is marked for removal. Move it to the bad page list. It will not be
                        // used until someone takes it off that list.
                        
                        MiInsertPageInList (MmPageLocationList[BadPageList],
                                            PageFrameNumber);
                    }
                    else
                    {
                        // Put the page in the list of free
                        MiInsertPageInList (MmPageLocationList[FreePageList],
                                            PageFrameNumber);
                    }
                    return;
                }
                
                if( Pfn->u3.e1.Modified == 1 )
                {
                    // The page is modified. Insert it into the modified page list;
                    // the modified page writer thread will write it to disk.
                    MiInsertPageInList (MmPageLocationList[ModifiedPageList], PageFrameNumber);
                }
                else
                {
                    if (Pfn->u3.e1.RemovalRequested == 1)
                    {
                        // Retire the page, but leave its state as standby.
                        Pfn->u3.e1.PageLocation = StandbyPageList;
                        
                        MiRestoreTransitionPte (PageFrameNumber);
                        MiInsertPageInList (MmPageLocationList[BadPageList],
                                            PageFrameNumber);
                        return;
                    }
                    
                    // Put the page on the standby list.
                    if (!MmFrontOfList) {
                        MiInsertPageInList (MmPageLocationList[StandbyPageList],
                                            PageFrameNumber);
                    } else {
                        MiInsertStandbyListAtFront (PageFrameNumber);
                    }
                }
            }
        }
    }

The appendix to this article contains a program with source code that demonstrates trimming of working sets from user mode by calling SetProcessWorkingSetSize (hProcess, -1, -1).
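
The decision made in the trimming pseudocode above can be tried out in user mode. The following is only a sketch under my own assumptions: sim_pfn_t and sim_trim_page are hypothetical names, the structure keeps just the MMPFN fields the pseudocode touches, and no real page lists are maintained. It shows which list a page would land on once both counters drop to zero.

```c
#include <assert.h>

/* Hypothetical stand-ins for the page lists named in the pseudocode. */
typedef enum {
    LIST_NONE,      /* page stays in use */
    LIST_FREE,
    LIST_STANDBY,
    LIST_MODIFIED,
    LIST_BAD
} page_list_t;

/* Simplified model of the MMPFN fields used while trimming. */
typedef struct {
    unsigned share_count;      /* Pfn->u2.ShareCount */
    unsigned reference_count;  /* Pfn->u3.e2.ReferenceCount */
    int modified;              /* Pfn->u3.e1.Modified */
    int removal_requested;     /* Pfn->u3.e1.RemovalRequested */
    int deleted;               /* result of MI_IS_PFN_DELETED (Pfn) */
} sim_pfn_t;

/* Returns the list the page ends up on, or LIST_NONE if it is still in use. */
page_list_t sim_trim_page(sim_pfn_t *pfn)
{
    assert(pfn->share_count > 0);
    if (--pfn->share_count != 0)
        return LIST_NONE;            /* another working set still shares it */

    assert(pfn->reference_count > 0);
    if (--pfn->reference_count != 0)
        return LIST_NONE;            /* something still holds a reference */

    if (pfn->deleted)
        return pfn->removal_requested ? LIST_BAD : LIST_FREE;
    if (pfn->modified)
        return LIST_MODIFIED;        /* must be written to disk first */
    return pfn->removal_requested ? LIST_BAD : LIST_STANDBY;
}
```

A clean private page with both counters at 1 ends up on the standby list, the same page with Modified set ends up on the modified list, and a page whose share count is still above 1 does not move at all.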

Conversely, when a thread accesses a page that has been removed from the working set, a page fault occurs. Two kinds of PTE relate to the paging file: Transition and Paged-Out. If the page has been removed from the working set but has not yet been written to disk (or does not need to be), and it is still in physical memory (the PTE is in the Transition state), MiResolveTransitionFault () is called: the PTE is simply made Valid again, the MMPFN is adjusted accordingly and the page is removed from the standby or modified list. If the page has been written to disk (or did not need to be) and its frame has already been reused for something else (the PTE is in the Paged-Out state), MiResolvePageFileFault () is called and a read of the page from the paging file is started, clearing the corresponding bit in the bitmap. Pseudocode for resolving a transition fault (trimmed code of MiResolveTransitionFault):

    if( Pfn->u4.InPageError )
    {
        return Pfn->u1.ReadStatus;  // Page fault on a page whose read failed.
    }
    if (Pfn->u3.e1.ReadInProgress)
    {
        // Repeated page fault. If it comes from the same thread,
        // STATUS_MULTIPLE_FAULT_VIOLATION is returned;
        // if it comes from another thread, wait for the read to complete.
    }
    
    MiUnlinkPageFromList (Pfn);
    Pfn->u3.e2.ReferenceCount += 1;
    Pfn->u2.ShareCount += 1;
    Pfn->u3.e1.PageLocation = ActiveAndValid;
    
    MI_MAKE_TRANSITION_PTE_VALID (TempPte, PointerPte);
    MI_WRITE_VALID_PTE (PointerPte, TempPte);
    
    MiAddPageToWorkingSet (...);
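
The bookkeeping of a transition ("soft") fault is the mirror image of trimming, and it can also be modelled in user mode. This is a sketch under my own assumptions: the names sim_tpfn_t, sim_lists_t and sim_resolve_transition_fault are hypothetical, the page lists are reduced to simple counters, and the ReadInProgress case is left out.

```c
#include <assert.h>

typedef enum { LOC_STANDBY, LOC_MODIFIED, LOC_ACTIVE } sim_location_t;

/* Simplified model of the MMPFN fields touched by the pseudocode. */
typedef struct {
    unsigned share_count;      /* Pfn->u2.ShareCount */
    unsigned reference_count;  /* Pfn->u3.e2.ReferenceCount */
    sim_location_t location;   /* Pfn->u3.e1.PageLocation */
    int in_page_error;         /* Pfn->u4.InPageError */
} sim_tpfn_t;

/* The page lists, reduced to their lengths. */
typedef struct {
    unsigned standby_pages;
    unsigned modified_pages;
} sim_lists_t;

/* Returns 0 on success, -1 if an earlier read of this page failed. */
int sim_resolve_transition_fault(sim_tpfn_t *pfn, sim_lists_t *lists)
{
    if (pfn->in_page_error)
        return -1;

    /* Counterpart of MiUnlinkPageFromList. */
    assert(pfn->location != LOC_ACTIVE);
    if (pfn->location == LOC_STANDBY) {
        assert(lists->standby_pages > 0);
        lists->standby_pages--;
    } else {
        assert(lists->modified_pages > 0);
        lists->modified_pages--;
    }

    pfn->reference_count += 1;
    pfn->share_count += 1;
    pfn->location = LOC_ACTIVE;   /* ActiveAndValid */
    return 0;
}
```

Nothing is read from disk here, which is the whole point of the Transition state: the frame still holds the data, so only the counters and the list membership change.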

Pseudocode for loading a page from disk (trimmed code of MiResolvePageFileFault):

    TempPte = *PointerPte;
    
    // Prepare the parameters for the read
    PageFileNumber = TempPte.u.Soft.PageFileLow;
    StartingOffset.QuadPart = TempPte.u.Soft.PageFileHigh << PAGE_SHIFT;
    FilePointer = MmPagingFile[PageFileNumber]->File;
    
    // Take a free page
    PageColor = (PFN_NUMBER)((MmSystemPageColor++) & MmSecondaryColorMask);
    PageFrameIndex = MiRemoveAnyPage( PageColor );
    
    // build MDL...
    
    // Set up its entry in the PFN database
    Pfn = MI_PFN_ELEMENT (PageFrameIndex);
    Pfn->u1.Event = &Event;
    Pfn->PteAddress = PointerPte;
    Pfn->OriginalPte = *PointerPte;
    Pfn->u3.e2.ReferenceCount += 1;
    Pfn->u2.ShareCount = 0;
    Pfn->u3.e1.ReadInProgress = 1;
    Pfn->u4.InPageError = 0; 
    if( !MI_IS_PAGE_TABLE_ADDRESS(PointerPte) ) Pfn->u3.e1.PrototypePte = 1;
    Pfn->u4.PteFrame = MiGetPteAddress(PointerPte)->u.Hard.PageFrameNumber;
    
    // Temporarily put the PTE into the Transition state for the duration of the read
    MI_MAKE_TRANSITION_PTE ( TempPte, ... ); 
    MI_WRITE_INVALID_PTE (PointerPte, TempPte);
    
    // Read the page. 
    Status = IoPageRead (FilePointer,
                         Mdl,
                         StartingOffset,
                         &Event,
                         &IoStatus);
    
    if( Status == STATUS_SUCCESS )
    {
        MI_MAKE_TRANSITION_PTE_VALID (TempPte, PointerPte);
        MI_WRITE_VALID_PTE (PointerPte, TempPte);
        MiAddValidPageToWorkingSet (...);
    }
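
The first step of the pseudocode above, turning a Paged-Out PTE into a paging file number and a byte offset, is plain bit arithmetic. The sketch below assumes the non-PAE x86 layout of the software PTE (bit 0 Valid, bits 1-4 PageFileLow, bits 5-9 Protection, bit 10 Prototype, bit 11 Transition, bits 12-31 PageFileHigh); the layout and the names sim_pagefile_location_t and sim_decode_pagefile_pte are my assumptions, not something taken from the pseudocode.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PAGE_SHIFT 12u

typedef struct {
    unsigned page_file_number;  /* index into MmPagingFile[] */
    uint64_t byte_offset;       /* StartingOffset.QuadPart */
} sim_pagefile_location_t;

/* Decodes a Paged-Out PTE (assumed non-PAE x86 software PTE layout). */
sim_pagefile_location_t sim_decode_pagefile_pte(uint32_t pte)
{
    sim_pagefile_location_t loc;

    assert((pte & 1u) == 0);           /* Valid must be clear */
    assert((pte & (3u << 10)) == 0);   /* neither Prototype nor Transition */

    loc.page_file_number = (pte >> 1) & 0xFu;                   /* PageFileLow */
    loc.byte_offset = (uint64_t)(pte >> 12) << SIM_PAGE_SHIFT;  /* PageFileHigh << PAGE_SHIFT */
    return loc;
}
```

For example, a PTE with PageFileLow = 2 and PageFileHigh = 0x12345 refers to byte offset 0x12345000 in paging file number 2. The value is widened to 64 bits before the shift because StartingOffset is a 64-bit quantity.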

Working set

A working set is, by definition, the set of resident pages of a process (or of the system). There are three kinds of working sets:

  • The process working set contains the resident pages that belong to a process: its code, its data and all subsequent user-mode allocations. It is stored in EPROCESS::Vm.
  • The system working set contains the resident pages of the pageable part of the system. It includes pageable kernel and device driver code and data, the system cache and the paged pool. A pointer to it is stored in the kernel variable MmSystemCacheWs.
  • The session working set contains the resident pages of a session, for example those of the Windows graphics subsystem (win32k.sys). A pointer to it is stored in MmSessionSpace->Vm.

When the system needs free pages, working set trimming is initiated: pages are sent to the Standby or Modified list, depending on whether they have been written to, and their PTEs are put into the Transition state. When a page is finally taken away, its PTE is switched to the Paged-Out state (if the page was written to the paging file) or to Invalid (if it was a page of a mapped file). When a process accesses such a page, the page is either removed from the Standby / Modified list and becomes ActiveAndValid again, or, if it was completely unloaded, a read from disk is started. If there is enough memory, the process is allowed to grow its working set, even beyond its maximum, in order to load the page; otherwise loading a page evicts some other page, that is, the new page replaces an old one. Working sets are controlled by a system thread, the so-called balance set manager. It waits on two KEVENT objects: the first is signaled by a timer once per second, the second when the working sets need to be changed. The balance set manager also checks the lookaside lists, adjusting their depth for optimum performance.
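
The grow-or-replace rule from the previous paragraph can be written down as a toy model. Everything here is an assumption made for illustration: the names sim_working_set_t and sim_ws_add are hypothetical, the working set is a small fixed array, and the victim is simply the oldest entry, whereas the text above does not say which page the real memory manager picks.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_WS_CAPACITY 8

typedef struct {
    unsigned long pages[SIM_WS_CAPACITY];  /* virtual page numbers, oldest first */
    size_t count;
    size_t maximum;                        /* soft limit, 1..SIM_WS_CAPACITY */
} sim_working_set_t;

/*
 * Adds vpn to the working set. Returns 0 if the set simply grew, or 1 if an
 * old page had to be replaced; the replaced page is then stored in *evicted.
 */
int sim_ws_add(sim_working_set_t *ws, unsigned long vpn,
               int memory_plentiful, unsigned long *evicted)
{
    size_t i;
    int may_grow;

    assert(ws->maximum > 0 && ws->maximum <= SIM_WS_CAPACITY);
    assert(ws->count <= SIM_WS_CAPACITY);

    /* Below the maximum the set always grows; above it only if memory allows. */
    may_grow = ws->count < ws->maximum ||
               (memory_plentiful && ws->count < SIM_WS_CAPACITY);
    if (may_grow) {
        ws->pages[ws->count++] = vpn;
        return 0;
    }

    /* Replace the oldest page (an arbitrary policy chosen for this sketch). */
    *evicted = ws->pages[0];
    for (i = 1; i < ws->count; i++)
        ws->pages[i - 1] = ws->pages[i];
    ws->pages[ws->count - 1] = vpn;
    return 1;
}
```

With a maximum of 2, a third page is still accepted while memory is plentiful, but a fourth page added under memory pressure pushes out the oldest one and leaves the size unchanged.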

Atomic functions of memory management

This part discusses some useful kernel-mode memory management functions. The kernel memory management functions can be divided into the following layers, from the lowest to the highest level:

  • the macros MI_WRITE_VALID_PTE / MI_WRITE_INVALID_PTE
  • low-level functions: MiResolve .. Fault, MiDeletePte and other functions that work with PDEs / PTEs, as well as functions that work with MMPFN and the page lists: MiRemovePageByColor, MiRemoveAnyPage, MiRemoveZeroPage.
  • functions provided to drivers for working with physical memory: MmAllocatePagesForMdl, MmFreePagesFromMdl, MmAllocateContiguousMemory.
  • functions provided to drivers for working with the pool: ExAllocatePoolWith ..., ExFreePoolWith ..., MmAllocateContiguousMemory (which also belongs to the previous layer).

For user memory the picture is slightly different:

  • the macros MI_WRITE_VALID_PTE / MI_WRITE_INVALID_PTE
  • functions that work with VADs and user memory: MiAllocateVad, MiCheckForConflictingVad, etc.
  • the virtual memory functions: NtAllocateVirtualMemory, NtFreeVirtualMemory, NtProtectVirtualMemory.

I will describe them from the lowest to the highest level, first for kernel memory management, then for user memory.

Kernel-mode memory management

MI_WRITE_VALID_PTE / MI_WRITE_INVALID_PTE

These macros are used in every function that in any way affects the allocation or release of (ultimately) physical memory. They write a valid and an invalid PTE, respectively, into the page table of the process.
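
To give a feel for what such macros amount to, here is a user-mode sketch written as two functions. The names sim_write_valid_pte and sim_write_invalid_pte are hypothetical, a PTE is modelled as a bare 32-bit value with the Valid bit in bit 0, and the sanity checks are my assumption of what a debug build would reasonably verify, not a copy of the real macros.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PTE_VALID 0x1u   /* bit 0 of a hardware PTE */

/* Stores a PTE whose Valid bit is set. */
void sim_write_valid_pte(volatile uint32_t *pte, uint32_t contents)
{
    assert((contents & SIM_PTE_VALID) != 0);
    *pte = contents;   /* a single aligned 32-bit store */
}

/* Stores a PTE whose Valid bit is clear (transition, paged-out and so on). */
void sim_write_invalid_pte(volatile uint32_t *pte, uint32_t contents)
{
    assert((contents & SIM_PTE_VALID) == 0);
    *pte = contents;
}
```

Funnelling every PTE update through one pair of helpers makes it easy to catch a caller that writes a valid PTE where an invalid one was intended, or the other way round.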

What the low-level functions for working with PDEs/PTEs and the physical page lists do is everything I described when discussing the page lists, MMPFN and so on, so here I give only the function prototypes with brief descriptions of their actions:

 PFN_NUMBER FASTCALL MiRemoveAnyPage (IN ULONG PageColor);

Takes a physical page of the given color (SecondaryColor) from the lists of free, zeroed or standby pages.

 PFN_NUMBER FASTCALL MiRemoveZeroPage (IN ULONG PageColor);

Takes a physical page of the given color (SecondaryColor) from the list of free pages.

 VOID MiRemovePageByColor (IN PFN_NUMBER Page, IN ULONG Color);

Takes the specified page, removing it from the list of free pages of this color.
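
The idea of per-color page lists can be illustrated with a small user-mode model. All names here (sim_free_lists_t, sim_free_page, sim_remove_any_page, sim_next_color) are hypothetical, the color of a frame is assumed to be the low bits of its page frame number, and the fallback to neighbouring colors when the requested list is empty is my own simplification rather than the documented behaviour of the functions above.

```c
#include <assert.h>

#define SIM_COLORS      4u                 /* must be a power of two */
#define SIM_COLOR_MASK  (SIM_COLORS - 1u)  /* plays the role of MmSecondaryColorMask */
#define SIM_PER_COLOR   4u
#define SIM_NO_PAGE     0xFFFFFFFFu

typedef struct {
    unsigned pages[SIM_COLORS][SIM_PER_COLOR];  /* free frame numbers per color */
    unsigned count[SIM_COLORS];
    unsigned next_color;                        /* plays the role of MmSystemPageColor */
} sim_free_lists_t;

/* Assumed mapping from page frame number to color. */
unsigned sim_color_of(unsigned pfn)
{
    return pfn & SIM_COLOR_MASK;
}

/* Puts a frame on the free list of its color. */
void sim_free_page(sim_free_lists_t *l, unsigned pfn)
{
    unsigned c = sim_color_of(pfn);
    assert(l->count[c] < SIM_PER_COLOR);
    l->pages[c][l->count[c]++] = pfn;
}

/* Takes a frame of the requested color, trying the other colors if needed. */
unsigned sim_remove_any_page(sim_free_lists_t *l, unsigned color)
{
    unsigned i;
    for (i = 0; i < SIM_COLORS; i++) {
        unsigned c = (color + i) & SIM_COLOR_MASK;
        if (l->count[c] > 0)
            return l->pages[c][--l->count[c]];
    }
    return SIM_NO_PAGE;
}

/* Same pattern as (MmSystemPageColor++) & MmSecondaryColorMask. */
unsigned sim_next_color(sim_free_lists_t *l)
{
    return (l->next_color++) & SIM_COLOR_MASK;
}
```

Rotating the requested color on every allocation spreads consecutive allocations over different colors, which is the reason the counter is incremented each time.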

The functions provided to drivers for working with physical memory are as follows:

MmAllocatePagesForMdl

 PMDL MmAllocatePagesForMdl (IN PHYSICAL_ADDRESS LowAddress, IN PHYSICAL_ADDRESS HighAddress, 
                             IN PHYSICAL_ADDRESS SkipBytes, IN SIZE_T TotalBytes);

This function allocates physical pages (not necessarily contiguous, unlike, for example, MmAllocateContiguousMemory), trying to provide a total size of TotalBytes from pages between the physical addresses LowAddress and HighAddress, "stepping" by SkipBytes. It scans the list of zeroed pages first, then the list of free pages. The pages are, of course, non-pageable. If there are not enough pages, the function allocates as many as it can. The return value is a Memory Descriptor List (MDL) describing the allocated pages. The pages must be released with a call to MmFreePagesFromMdl, and the MDL structure itself with ExFreePool. The pages are not mapped at any virtual address; the programmer must take care of that by calling MmMapLockedPages.

MmAllocateContiguousMemory

 PVOID MmAllocateContiguousMemory (IN ULONG NumberOfBytes, IN PHYSICAL_ADDRESS HighestAcceptableAddress);

This function allocates a physically contiguous range of pages with a total size of NumberOfBytes, not above HighestAcceptableAddress, and immediately maps it into the kernel address space. It first tries to allocate the pages from the nonpaged pool; if that is not enough, it scans the lists of free and zeroed pages; and if those are not enough either, it scans the standby list. It returns the base address of the allocated memory. The memory must be freed with a call to MmFreeContiguousMemory.

The functions provided to drivers for working with the pool are described in the article by Four-F, so I will not dwell on them.

MmIsAddressValid

MmIsAddressValid checks whether a page fault would occur on access to the given address. In other words, it checks that the page is currently in physical memory. Note that it does not care about the transition, paged-out and prototype states, so it can be used only to check addresses at high IRQL (>= DPC / Dispatch), because page faults are not allowed at those IRQLs (a page fault there ends in the IRQL_NOT_LESS_OR_EQUAL blue screen). If you want to check an address for accessibility at low IRQL, there is, to my knowledge, no documented way to do so. Apparently the driver is supposed to know which of its addresses are valid and which are not, and not touch the invalid ones.
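
The kind of check described above can be modelled in user mode on a fake two-level page table. This is a sketch under stated assumptions: non-PAE x86 address splitting (10 bits of PDE index, 10 bits of PTE index, 12 bits of offset), no large pages, and a hypothetical sim_address_space_t that stores a pointer to each page table next to its PDE, which real hardware obviously does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_VALID 0x1u   /* Valid bit of a PDE or PTE */

typedef struct {
    uint32_t pde[1024];            /* page directory, one entry per 4 MB */
    const uint32_t *tables[1024];  /* page table behind each valid PDE (simulation only) */
} sim_address_space_t;

/* Returns 1 if both the PDE and the PTE for va have the Valid bit set. */
int sim_is_address_valid(const sim_address_space_t *as, uint32_t va)
{
    uint32_t pde_index = va >> 22;
    uint32_t pte_index = (va >> 12) & 0x3FFu;

    if ((as->pde[pde_index] & SIM_VALID) == 0)
        return 0;                  /* the page table itself is not resident */

    assert(as->tables[pde_index] != NULL);
    return (as->tables[pde_index][pte_index] & SIM_VALID) != 0;
}
```

A PTE in the Transition state has its Valid bit clear, so this check reports the address as invalid even though touching it at low IRQL would only cause a cheap soft fault. That is exactly the limitation discussed in the paragraph above.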

The appendix to this article contains a function I wrote, MmIsAddressValidEx, which checks whether an address can be accessed at low IRQL, taking into account that the PTE may be in an invalid state in which a page fault will nonetheless not cause a blue screen or an exception (in the software sense). Given the invalid PTE structures described earlier, its source code should not be hard to understand.

MmIsNonPagedSystemAddressValid

MmIsNonPagedSystemAddressValid has been somewhat unfairly shunned by the Windows developers and marked as obsolete although it is, in fact, useful. It is an order of magnitude simpler than MmIsAddressValid (which, incidentally, is what Microsoft recommends instead) and only checks whether the address passed to it belongs to the pageable or the non-pageable area of kernel memory. The address is not checked for validity, and the result is not equivalent to that of MmIsAddressValid (memory may belong to the paged pool and be either swapped out to disk or resident, so a return value of FALSE says nothing about whether the memory can be accessed), so I do not understand why Microsoft considers it "obsolete" and discourages its use, foisting MmIsAddressValid on us instead. MmIsNonPagedSystemAddressValid can be used, for example, in the MMPFN dump function in the appendix, where it is necessary to determine whether an address belongs to the paged pool (the MMPFN fields, as you will recall, differ for the paged and nonpaged pools).

User-mode memory management

To begin with, user memory management uses an additional mechanism, Virtual Address Descriptors (VADs), which describe section mappings as well as memory allocated through NtAllocateVirtualMemory (VirtualAlloc in the Win32 API). The VADs form a tree; a pointer to its root is kept in the field EPROCESS->VadRoot. Sections can be created and mapped at user addresses with NtCreateSection and NtMapViewOfSection (their Win32 API counterparts are CreateFileMapping and MapViewOfFileEx). Memory addresses can be reserved, and memory at those addresses can later be committed to the process. This is handled by NtAllocateVirtualMemory.

Functions for working with VADs and user memory

A VAD element is represented by the following structure:

     typedef struct _MMVAD_FLAGS {
        ULONG_PTR CommitCharge : COMMIT_SIZE;
        ULONG_PTR PhysicalMapping : 1;
        ULONG_PTR ImageMap : 1;
        ULONG_PTR UserPhysicalPages : 1;
        ULONG_PTR NoChange : 1;
        ULONG_PTR WriteWatch : 1;
        ULONG_PTR Protection : 5;
        ULONG_PTR LargePages : 1;
        ULONG_PTR MemCommit: 1;
        ULONG_PTR PrivateMemory : 1;
    } MMVAD_FLAGS;
    
    typedef struct _MMVAD_SHORT {
        ULONG_PTR StartingVpn;
        ULONG_PTR EndingVpn;
        struct _MMVAD *Parent;
        struct _MMVAD *LeftChild;
        struct _MMVAD *RightChild;
        union {
            ULONG_PTR LongFlags;
            MMVAD_FLAGS VadFlags;
        } u;
    } MMVAD_SHORT, *PMMVAD_SHORT;

There is also the MMVAD structure, which is similar to MMVAD_SHORT but contains more fields and is used for mapped files (the additional fields include PCONTROL_AREA, which is required to support mapped files and contains such important pointers as PFILE_OBJECT; file mappings will have to wait for another time, since this article has already reached 50 kilobytes =\), whereas MMVAD_SHORT is used for ordinary user memory allocations. To tell which kind of VAD a pointer refers to, the flag u.VadFlags.PrivateMemory is used: if it is set, this is "private memory", that is, an ordinary memory allocation; if it is clear, this is a file mapping. The fields StartingVpn and EndingVpn indicate the first and last virtual page (Virtual Page Number) of the described region (virtual addresses are converted to page numbers with MI_VA_TO_VPN, which simply shifts the virtual address right by PAGE_SHIFT bits). The fields Parent, LeftChild and RightChild link the virtual address descriptors into a tree. u.VadFlags contains some useful flags, namely:

  • CommitCharge. Contains the actual number of pages allocated and committed to the process if the VAD describes committed memory, or 0 if it describes reserved memory.
  • PhysicalMapping. Indicates that the memory is actually a mapping of physical pages created with MmMapLockedPages with AccessMode == UserMode.
  • ImageMap. Indicates that the VAD describes a loaded executable image (loaded with LoadLibrary and similar functions, which ultimately boil down to NtCreateSection with SEC_IMAGE).
  • UserPhysicalPages. Set when NtAllocateVirtualMemory is called with MEM_PHYSICAL | MEM_RESERVE, which is used to reserve windows for physical pages when using AWE (Address Windowing Extensions).
  • NoChange. Set when changing the protection attributes of the region described by this VAD is prohibited. To create such a region, pass the SEC_NO_CHANGE flag to NtCreateSection.
  • WriteWatch. Set when memory is allocated with the MEM_WRITE_WATCH flag; a bitmap of pages is then created in which the pages that have been written to are marked. This information can later be retrieved with the Win32 API GetWriteWatch () and the bitmap reset with ResetWriteWatch ().
  • Protection. The original memory protection attributes.
  • LargePages. Contains 1 when large pages are used via MEM_LARGE_PAGES. Not supported in Windows XP/2000.
  • MemCommit. Contains 1 if the memory has been committed to the process.
  • PrivateMemory. As already mentioned, distinguishes MMVAD from MMVAD_SHORT.
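
The page-number arithmetic behind StartingVpn and EndingVpn is short enough to show in full. This is a user-mode sketch with hypothetical names (sim_vad_t, sim_va_to_vpn, sim_vad_for_range, sim_vad_contains); it assumes 4 KB pages and treats EndingVpn as the last page that still belongs to the region.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PAGE_SHIFT 12   /* 4 KB pages */

/* The three MMVAD_SHORT fields this sketch needs. */
typedef struct {
    uintptr_t starting_vpn;
    uintptr_t ending_vpn;    /* inclusive */
    int private_memory;      /* u.VadFlags.PrivateMemory */
} sim_vad_t;

/* Same operation as MI_VA_TO_VPN: shift right by PAGE_SHIFT bits. */
uintptr_t sim_va_to_vpn(uintptr_t va)
{
    return va >> SIM_PAGE_SHIFT;
}

/* Builds a descriptor covering the pages touched by [base, base + size). */
sim_vad_t sim_vad_for_range(uintptr_t base, uintptr_t size, int private_memory)
{
    sim_vad_t vad;
    assert(size > 0);
    vad.starting_vpn = sim_va_to_vpn(base);
    vad.ending_vpn = sim_va_to_vpn(base + size - 1);
    vad.private_memory = private_memory;
    return vad;
}

/* Returns nonzero if va lies inside the region described by vad. */
int sim_vad_contains(const sim_vad_t *vad, uintptr_t va)
{
    uintptr_t vpn = sim_va_to_vpn(va);
    return vpn >= vad->starting_vpn && vpn <= vad->ending_vpn;
}
```

A 0x3000-byte private allocation at 0x10000 gets StartingVpn 0x10 and EndingVpn 0x12, so 0x12FFF is inside the region and 0x13000 is not.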

The function MiAllocateVad allocates a VAD for the process, reserving the addresses passed to it. The function MiCheckForConflictingVad (actually a macro that expands into a call to MiCheckForConflictingNode) checks whether the process has VADs whose memory region overlaps the given virtual addresses. If so, it returns the first conflicting VAD, otherwise NULL. It is also used when committing memory in NtAllocateVirtualMemory, to find the VAD that corresponds to the given address. The function MiInsertVad adds a VAD to the tree of virtual address descriptors of the process and reorganizes the tree if necessary. The function MiRemoveVad, correspondingly, removes and frees a VAD. We now turn to the functions available to user code and device drivers for memory management.
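
The conflict check is an ordinary walk down a binary search tree keyed by page ranges. The following user-mode sketch uses hypothetical names (sim_vad_node_t, sim_check_for_conflicting_vad, sim_insert_vad), treats EndingVpn as inclusive, and inserts without any rebalancing, so it only illustrates the search logic and makes no claim about how the kernel keeps its tree organized.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sim_vad_node {
    unsigned long starting_vpn;
    unsigned long ending_vpn;          /* inclusive */
    struct sim_vad_node *left_child;
    struct sim_vad_node *right_child;
} sim_vad_node_t;

/* Returns the first node met on the way down that overlaps the range, or NULL. */
const sim_vad_node_t *sim_check_for_conflicting_vad(const sim_vad_node_t *root,
                                                    unsigned long start_vpn,
                                                    unsigned long end_vpn)
{
    const sim_vad_node_t *node = root;

    assert(start_vpn <= end_vpn);
    while (node != NULL) {
        if (start_vpn > node->ending_vpn)
            node = node->right_child;   /* range lies entirely above this node */
        else if (end_vpn < node->starting_vpn)
            node = node->left_child;    /* range lies entirely below this node */
        else
            return node;                /* ranges overlap */
    }
    return NULL;
}

/* Plain BST insert; the caller has already checked that nothing conflicts. */
void sim_insert_vad(sim_vad_node_t **root, sim_vad_node_t *vad)
{
    sim_vad_node_t **link = root;

    vad->left_child = NULL;
    vad->right_child = NULL;
    while (*link != NULL) {
        if (vad->starting_vpn < (*link)->starting_vpn)
            link = &(*link)->left_child;
        else
            link = &(*link)->right_child;
    }
    *link = vad;
}
```

Because the descriptors in the tree never overlap each other, comparing the requested range against one node is enough to decide whether to go left, go right or report a conflict.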

The NtAllocateVirtualMemory function does the following:

  1. To reserve addresses, it calls MiCheckForConflictingVad to verify that this region, or any part of it, has not previously been reserved or used by other memory functions (for example, by a section mapping).
    • If it has, STATUS_CONFLICTING_ADDRESSES is returned. Otherwise a VAD is allocated with MiAllocateVad, its fields are filled in and the VAD is added to the tree with MiInsertVad. If it describes an AWE view or has WriteWatch enabled, MiPhysicalViewInserter is called as well.
  2. To commit memory, it also calls MiCheckForConflictingVad, this time to find the corresponding VAD created earlier. The relevant pages in the page table are then marked as demand-zero, and the protection attributes are changed if necessary.

NtFreeVirtualMemory performs the reverse actions.
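
The two-step reserve and commit logic can be tried out with a flat array in place of the VAD tree. This is a sketch under my own assumptions: sim_process_t, sim_reserve and sim_commit are hypothetical names, the status codes other than the conflict case are invented for the example, and commit is recorded per region rather than per page.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_VADS 8

/* Hypothetical status codes; only the conflict case is named in the text above. */
enum {
    SIM_SUCCESS = 0,
    SIM_CONFLICTING_ADDRESSES = -1,
    SIM_NOT_RESERVED = -2,
    SIM_NO_MEMORY = -3
};

typedef struct {
    unsigned long starting_vpn;
    unsigned long ending_vpn;   /* inclusive */
    int mem_commit;             /* set once any part has been committed */
} sim_region_t;

typedef struct {
    sim_region_t regions[SIM_MAX_VADS];
    size_t count;
} sim_process_t;

/* Linear stand-in for the conflict search. */
static sim_region_t *sim_find_overlap(sim_process_t *p,
                                      unsigned long start, unsigned long end)
{
    size_t i;
    for (i = 0; i < p->count; i++)
        if (start <= p->regions[i].ending_vpn && end >= p->regions[i].starting_vpn)
            return &p->regions[i];
    return NULL;
}

/* Reserve: fails if any page of the range is already described. */
int sim_reserve(sim_process_t *p, unsigned long start, unsigned long end)
{
    assert(start <= end);
    if (sim_find_overlap(p, start, end) != NULL)
        return SIM_CONFLICTING_ADDRESSES;
    if (p->count == SIM_MAX_VADS)
        return SIM_NO_MEMORY;
    p->regions[p->count].starting_vpn = start;
    p->regions[p->count].ending_vpn = end;
    p->regions[p->count].mem_commit = 0;
    p->count++;
    return SIM_SUCCESS;
}

/* Commit: the same search, but now an existing reservation must cover the range. */
int sim_commit(sim_process_t *p, unsigned long start, unsigned long end)
{
    sim_region_t *r;

    assert(start <= end);
    r = sim_find_overlap(p, start, end);
    if (r == NULL || start < r->starting_vpn || end > r->ending_vpn)
        return SIM_NOT_RESERVED;
    r->mem_commit = 1;
    return SIM_SUCCESS;
}
```

The same lookup serves both steps with opposite expectations: for a reservation a hit is an error, for a commit a miss is an error.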

At this point I think the article can, finally (!), be brought to a close.

Appendix

The appendix to this article contains:

  1. The Working Sets program, which demonstrates trimming of working sets.
  2. Functions for manually paging pages out to and back in from the paging file. Note: very raw! Since the page is neither added to the working set nor removed from it, a MEMORY_MANAGEMENT or PFN_LIST_CORRUPT blue screen is possible (on page-out and page-in, respectively), so I do not advise experimenting on a real system. It is better to run only code that inspects and analyzes without changing any system state. These are the functions MiPageOut and MiLoadBack (I added the Mi prefixes myself, for beauty :))
  3. A function that prints the contents of an MMPFN with DbgPrint. This is MiPrintPfnForPage.
  4. The function MmIsAddressValidEx for extended checks of address accessibility at low IRQL. It returns the check status, a member of the VALIDITY_CHECK_STATUS enumeration shown below. The result can also be treated as a BOOLEAN, since the status of an invalid page is 0 and all the others are greater than zero.
  5. A combined sample driver that demonstrates all of these functions (manual page-out and page-in are commented out).
          enum VALIDITY_CHECK_STATUS {
              VCS_INVALID = 0,     //  = 0 (FALSE)
              VCS_VALID,           //-|
              VCS_TRANSITION,      // |
              VCS_PAGEDOUT,        // |-   > 0
              VCS_DEMANDZERO,      // |
              VCS_PROTOTYPE,       //-|
          };
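
To show how such a status could be derived from a raw PTE, here is a user-mode sketch. It is not the MmIsAddressValidEx from the appendix: sim_classify_pte and sim_pagefile_page are hypothetical names, and the bit positions assume the non-PAE x86 software PTE (bit 0 Valid, bit 10 Prototype, bit 11 Transition, bits 12-31 PageFileHigh). Treating a non-zero PTE with PageFileHigh == 0 as demand-zero and an all-zero PTE as invalid is likewise my assumption.

```c
#include <assert.h>
#include <stdint.h>

enum VALIDITY_CHECK_STATUS {
    VCS_INVALID = 0,
    VCS_VALID,
    VCS_TRANSITION,
    VCS_PAGEDOUT,
    VCS_DEMANDZERO,
    VCS_PROTOTYPE
};

/* Classifies a 32-bit PTE value (assumed non-PAE x86 layout). */
enum VALIDITY_CHECK_STATUS sim_classify_pte(uint32_t pte)
{
    if (pte & 0x1u)
        return VCS_VALID;
    /* Prototype is tested before Transition: in a prototype-format PTE
       the bits above bit 10 are assumed to be part of the address. */
    if (pte & (1u << 10))
        return VCS_PROTOTYPE;
    if (pte & (1u << 11))
        return VCS_TRANSITION;
    if ((pte >> 12) != 0)
        return VCS_PAGEDOUT;      /* PageFileHigh holds a paging file page */
    if (pte != 0)
        return VCS_DEMANDZERO;    /* only protection bits are set */
    return VCS_INVALID;
}

/* Page index inside the paging file for a paged-out PTE. */
uint32_t sim_pagefile_page(uint32_t pte)
{
    assert(sim_classify_pte(pte) == VCS_PAGEDOUT);
    return pte >> 12;
}
```

Since VCS_INVALID is 0 and every other status is positive, the result can be used directly in a condition, exactly as item 4 of the list above describes.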

References

Memory management in the Windows XP kernel - Original article in Russian