Recently I ran into a problem with our production servers: most of them, though not all, showed very high memory usage.
Of course, the first culprit would be our Java process. However, it turned out that our Java process was not consuming much memory. We validated this by looking at the heap metrics after the garbage collector ran. Our Java process was healthy, and it did not account for the high memory usage on the server. We suspected that some other process was consuming most of the memory, but we could not find anything with high memory usage.
As it turned out, we were looking at the wrong metric. This was an example of the meminfo output:
    MemTotal:        1882064 kB
    MemFree:         1376380 kB
    MemAvailable:    1535676 kB
    Buffers:            2088 kB
    Cached:           292324 kB
    SwapCached:            0 kB
    Active:           152944 kB
    Inactive:         252628 kB
    Active(anon):     111328 kB
    Inactive(anon):    16508 kB
    Active(file):      41616 kB
    Inactive(file):   236120 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:       2097148 kB
    SwapFree:        2097148 kB
    Dirty:                40 kB
    Writeback:             0 kB
    AnonPages:        111180 kB
    Mapped:            56396 kB
    Shmem:             16676 kB
    Slab:              54508 kB
    SReclaimable:      25456 kB
    SUnreclaim:        29052 kB
    KernelStack:        2608 kB
    PageTables:         5056 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:     3038180 kB
    Committed_AS:     577664 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:       14664 kB
    VmallocChunk:   34359717628 kB
    HardwareCorrupted:     0 kB
    AnonHugePages:     24576 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:       69632 kB
    DirectMap2M:     2027520 kB
Originally, we were tracking the ratio Active / MemTotal and treated anything above 0.8 as a problem. However, Active is not the true amount of memory in use on the system. It is something a little more complicated.
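To make the original check concrete, here is a minimal sketch of how that ratio can be computed from meminfo-style text. The function names and the shortened SAMPLE string are my own for illustration; on a real server the text would come from reading /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB or plain counts)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            # The first token is the number; an optional "kB" unit follows.
            fields[name.strip()] = int(parts[0])
    return fields


def active_ratio(fields):
    """The metric we originally alerted on: Active / MemTotal."""
    return fields["Active"] / fields["MemTotal"]


# A few lines taken from the sample output above (illustrative subset).
SAMPLE = """\
MemTotal:        1882064 kB
MemFree:         1376380 kB
MemAvailable:    1535676 kB
Active:           152944 kB
Inactive:         252628 kB
"""

if __name__ == "__main__":
    fields = parse_meminfo(SAMPLE)
    print(f"Active / MemTotal = {active_ratio(fields):.3f}")
    # On a live system, replace SAMPLE with open("/proc/meminfo").read().
```

The parser keeps every field, so the same dict can later be used to read MemAvailable instead of Active.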
According to the documentation:
Active: Memory that has been used more recently and usually not reclaimed unless absolutely necessary.
So Active memory is memory that is in use but still freeable; because it was used recently, it will not be reclaimed unless necessary. This contrasts with Inactive:
Inactive: Memory which has been less recently used. It is more eligible to be reclaimed for other purposes
Inactive memory is the first candidate to be repurposed. The relationship between the two, and how Linux manages them, is documented in the kernel source:
Double CLOCK lists
Per node, two clock lists are maintained for file pages: the inactive and the active list. Freshly faulted pages start out at the head of the inactive list and page reclaim scans pages from the tail. Pages that are accessed multiple times on the inactive list are promoted to the active list, to protect them from reclaim, whereas active pages are demoted to the inactive list when the active list grows too big.
    fault ------------------------+
                                  |
               +--------------+   |            +-------------+
    reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
               +--------------+                +-------------+    |
                      |                                           |
                      +-------------- promotion ------------------+

Access frequency and refault distance
A workload is thrashing when its pages are frequently used but they are evicted from the inactive list every time before another access would have promoted them to the active list.
In cases where the average access distance between thrashing pages is bigger than the size of memory there is nothing that can be done - the thrashing set could never fit into memory under any circumstance.
However, the average access distance could be bigger than the inactive list, yet smaller than the size of memory. In this case, the set could fit into memory if it weren’t for the currently active pages - which may be used more, hopefully less frequently:
        +-memory available to cache-+
        |                           |
        +-inactive------+-active----+
    a b | c d e f g h i | J K L M N |
        +---------------+-----------+
It is prohibitively expensive to accurately track access frequency of pages. But a reasonable approximation can be made to measure thrashing on the inactive list, after which refaulting pages can be activated optimistically to compete with the existing active pages.
Approximating inactive page access frequency - Observations:
- When a page is accessed for the first time, it is added to the head of the inactive list, slides every existing inactive page towards the tail by one slot, and pushes the current tail page out of memory.
- When a page is accessed for the second time, it is promoted to the active list, shrinking the inactive list by one slot. This also slides all inactive pages that were faulted into the cache more recently than the activated page towards the tail of the inactive list.
- The sum of evictions and activations between any two points in time indicate the minimum number of inactive pages accessed in between.
- Moving one inactive page N page slots towards the tail of the list requires at least N inactive page accesses.
- When a page is finally evicted from memory, the number of inactive pages accessed while the page was in cache is at least the number of page slots on the inactive list.
- In addition, measuring the sum of evictions and activations (E) at the time of a page’s eviction, and comparing it to another reading (R) at the time the page faults back into memory tells the minimum number of accesses while the page was not cached. This is called the refault distance.
Because the first access of the page was the fault and the second access the refault, we combine the in-cache distance with the out-of-cache distance to get the complete minimum access distance of this page:
NR_inactive + (R - E)
And knowing the minimum access distance of a page, we can easily tell if the page would be able to stay in cache assuming all page slots in the cache were available:
NR_inactive + (R - E) <= NR_inactive + NR_active
which can be further simplified to
(R - E) <= NR_active
Put into words, the refault distance (out-of-cache) can be seen as a deficit in inactive list space (in-cache). If the inactive list had (R - E) more page slots, the page would not have been evicted in between accesses, but activated instead. And on a full system, the only thing eating into inactive list space is active pages.
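To see the inequality with numbers, here is a small worked sketch. The list sizes match the diagram above (7 inactive slots, 5 active slots); the counter readings E and R are made-up values for illustration, and the function names are my own.

```python
def minimum_access_distance(nr_inactive, e, r):
    """In-cache distance (NR_inactive) plus out-of-cache refault distance (R - E)."""
    return nr_inactive + (r - e)


def would_fit_in_cache(nr_active, e, r):
    """Simplified check from the text: (R - E) <= NR_active."""
    return (r - e) <= nr_active


if __name__ == "__main__":
    nr_inactive, nr_active = 7, 5  # sizes taken from the diagram

    # Hypothetical readings: the page was evicted when the counter was 100.
    e = 100
    for r in (104, 110):  # counter value when the page faults back in
        distance = minimum_access_distance(nr_inactive, e, r)
        fits = would_fit_in_cache(nr_active, e, r)
        print(f"R={r}: refault distance={r - e}, "
              f"minimum access distance={distance}, fits={fits}")
```

With a refault distance of 4 the page would have survived if the 5 active slots had been available to it; with a distance of 10 it would not, even with the whole cache to itself.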
Refaulting inactive pages
All that is known about the active list is that the pages have been accessed more than once in the past. This means that at any given time there is actually a good chance that pages on the active list are no longer in active use.
So when a refault distance of (R - E) is observed and there are at least (R - E) active pages, the refaulting page is activated optimistically in the hope that (R - E) active pages are actually used less frequently than the refaulting page - or even not used at all anymore.
That means if inactive cache is refaulting with a suitable refault distance, we assume the cache workingset is transitioning and put pressure on the current active list.
If this is wrong and demotion kicks in, the pages which are truly used more frequently will be reactivated while the less frequently used once will be evicted from memory.
But if this is right, the stale pages will be pushed out of memory and the used pages get to stay in cache.
Refaulting active pages
If on the other hand the refaulting pages have recently been deactivated, it means that the active list is no longer protecting actively used cache from reclaim. The cache is NOT transitioning to a different workingset; the existing workingset is thrashing in the space allocated to the page cache.
For each node’s file LRU lists, a counter for inactive evictions and activations is maintained (node->inactive_age).
On eviction, a snapshot of this counter (along with some bits to identify the node) is stored in the now empty page cache slot of the evicted page. This is called a shadow entry.
On cache misses for which there are shadow entries, an eligible refault distance will immediately activate the refaulting page.
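The mechanism described above can be approximated with a toy model. This is only a sketch to illustrate the idea, not the kernel's implementation: the ToyPageCache class, its method names, and the "active list may hold at most half of the cache" demotion rule are all assumptions of mine.

```python
from collections import OrderedDict


class ToyPageCache:
    """Toy two-list page cache with shadow entries (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # In both lists the first item is the tail (oldest), the last is the head.
        self.inactive = OrderedDict()
        self.active = OrderedDict()
        self.inactive_age = 0  # counts inactive evictions + activations
        self.shadows = {}      # page -> counter snapshot taken at eviction

    def _balance(self):
        # Assumed demotion rule: the active list may hold at most half the cache.
        while len(self.active) > self.capacity // 2:
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = True  # demoted page enters at the inactive head
        # Reclaim from the inactive tail and leave a shadow entry behind.
        while len(self.inactive) + len(self.active) > self.capacity:
            page, _ = self.inactive.popitem(last=False)
            self.inactive_age += 1
            self.shadows[page] = self.inactive_age

    def _activate(self, page):
        self.active[page] = True
        self.inactive_age += 1

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)
            return "hit"
        if page in self.inactive:
            # Second access while on the inactive list: promote.
            del self.inactive[page]
            self._activate(page)
            self._balance()
            return "promote"
        # Cache miss: check for a shadow entry left by an earlier eviction.
        snapshot = self.shadows.pop(page, None)
        if snapshot is not None:
            refault_distance = self.inactive_age - snapshot
            if refault_distance <= len(self.active):
                self._activate(page)
                self._balance()
                return "refault-activate"
        self.inactive[page] = True
        self._balance()
        return "fault"


if __name__ == "__main__":
    cache = ToyPageCache(capacity=4)
    for page in ["a", "b", "a", "c", "d", "e", "b"]:
        print(page, cache.access(page))
```

In this run, page b is evicted when e is faulted in and then comes back with a refault distance of 0, so it goes straight to the active list instead of starting over on the inactive list.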
From my understanding, this means that the OS keeps recently used pages on the active list to optimize performance, and memory needed by a new application is reclaimed from the inactive list first. When active pages are unlikely to be used again soon, or the active list grows too large, they are demoted to inactive. So a high Active value does not mean that the memory is being used all the time. Rather, it is held for optimization purposes and will be demoted and reclaimed when needed.
Because measuring the actually usable memory in Linux requires some knowledge of kernel internals, the value MemAvailable was added in this commit:
Many load balancing and workload placing programs check /proc/meminfo to estimate how much free memory is available. They generally do this by adding up “free” and “cached”, which was fine ten years ago, but is pretty much guaranteed to be wrong today.
It is wrong because Cached includes memory that is not freeable as page cache, for example shared memory segments, tmpfs, and ramfs, and it does not include reclaimable slab memory, which can take up a large fraction of system memory on mostly idle systems with lots of files.
Currently, the amount of memory that is available for a new workload, without pushing the system into swap, can be estimated from MemFree, Active(file), Inactive(file), and SReclaimable, as well as the “low” watermarks from /proc/zoneinfo.
However, this may change in the future, and user space really should not be expected to know kernel internals to come up with an estimate for the amount of free memory.
It is more convenient to provide such an estimate in /proc/meminfo. If things change in the future, we only have to change it in one place.
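As a rough illustration of the estimate the commit message describes, here is a sketch that combines MemFree, Active(file), Inactive(file), and SReclaimable with a low watermark. The function name and the low_watermark_kb parameter are my own; the halving of page cache and slab reflects my reading of the kernel's calculation around the time of that commit and may not match current kernels. The watermark would have to be derived from /proc/zoneinfo, which is not shown here.

```python
def estimate_available_kb(fields, low_watermark_kb):
    """Approximate MemAvailable from meminfo fields (all values in kB).

    `fields` is a dict such as {"MemFree": ..., "Active(file)": ..., ...}.
    `low_watermark_kb` is assumed to be the summed "low" zone watermarks,
    already converted from pages to kB by the caller.
    """
    available = fields["MemFree"] - low_watermark_kb

    # Not all page cache can be dropped: keep at least half of it,
    # or the low watermark, whichever is smaller.
    pagecache = fields["Active(file)"] + fields["Inactive(file)"]
    pagecache -= min(pagecache // 2, low_watermark_kb)
    available += pagecache

    # Same idea for reclaimable slab memory.
    slab = fields["SReclaimable"]
    available += slab - min(slab // 2, low_watermark_kb)

    return max(available, 0)


if __name__ == "__main__":
    # Values from the sample output above; the watermark is a made-up number.
    sample = {
        "MemFree": 1376380,
        "Active(file)": 41616,
        "Inactive(file)": 236120,
        "SReclaimable": 25456,
    }
    print(estimate_available_kb(sample, low_watermark_kb=16000), "kB")
```

On kernels that already report MemAvailable there is no need for this calculation; the sketch only shows why user space should not have to do it.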
Thus, to estimate the memory available on the server, use