Page Life Expectancy (PLE), where to start?

Posted on

Question :

I have inherited a SQL server {2012 (SP3), but this question is intended to be generic} we are using SCOM to monitor it. Previously I was getting an alert once or twice a month for PLE < 300. Now I am getting sometimes 2 or 3 a day.

There are multiple blog posts about PLE, a few tools you can get to monitor it, and many differing opinions about what is good, bad or indifferent. In the end there are a lot of variables. No solutions are one size fits all. Low PLE is not a problem so much as it is a symptom, with lots of potential causes, and related measures to consider.

{this paragraph might not add value to the question, I am open to removing it} I think everyone can agree that PLE falling to 299 once a month during an overnight report creation, is a symptom that does not need to be addressed (assuming the report completes before business hours). Most can also agree that PLE consistently at 350, is not good. There are a handful of cause to look at before making hardware change, with queries and index being near the top.

After reading about a dozen blog posts about PLE. I have tried to narrow down the key symptoms to get a good picture of what is going on. The query below is what I came up with. It gives values for 4 Buffer Manager items that interconnect with PLE

  • ‘Page life expectancy’
  • ‘Free list stalls/sec’
  • ‘Lazy writes/sec’
  • ‘Buffer cache hit ratio’

SELECT [object_name],
[counter_name],
[cntr_value] FROM sys.dm_os_performance_counters -- https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-performance-counters-transact-sql
WHERE [counter_name] = 'Page life expectancy' --if multiple NUMA on a server should return multiple Nodes, 
OR [counter_name] = 'Free list stalls/sec'  -- Number of requests per second that had to wait for a free page https://docs.microsoft.com/en-us/sql/relational-databases/performance-monitor/sql-server-buffer-manager-object
OR [counter_name] = 'Lazy writes/sec' --Flushes of dirty pages before a checkpoint runs.  
OR [counter_name] = 'Buffer cache hit ratio' --percentage of pages found in the buffer cache without having to read from disk you want this ratio to be high
Order by [counter_name] DESC, [object_name];

Additionally if you are looking at Lazy Writes on an inherited server you should check Recovery Interval

EXEC sp_configure @configname='recovery interval (min)';  --The  'config_value' default 0 indicates SQL is applying Checkpoints completely automatically https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/configure-the-recovery-interval-server-configuration-option

If this first query does not return values:

SELECT COUNT(*) FROM sys.dm_os_performance_counters;  --If no values from the firs query, an value of 0 here indicates a seperate issue  https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-performance-counters-transact-sql

I have a pretty good idea what all these values represent, and how they work together. I have included comments and sources in my code above.

My question is two part

  1. Is my list of buffer items/values above adequate for a starting place when examining PLE? (i.e. values that will always be helpful to consider together, should something excluded or included)

  2. How to put the values in good context with each other? (i.e. there is a good answer here saying “Check the Free List Stalls/sec value as well. If above 2, consider adding memory to the server” while the body of the answer is helpful, I don’t think a value of 2 for ‘Free List Stalls/sec’ is a problem on most instances)

NOTE: This question is not about solving the PLE issue, it is about how/where to start looking when assessing the symptoms. Your Doctor checks your Pules, Blood Pressure, Respirations, and Temperature at the start of every exam.

Edit 4/13/2018; Attempt to Clarify
This is not about knee jerk reactions like checking indexes or waits. This is about identifying other native SQL performance data that should always be examined with PLE. PLE is one of the Buffer Management Objects, what other Buffer Management Objects or Performance Counters should or should not always be part of queries when you really do want to look at buffer management?

Answer :

You basically asked, “What should I do when Page Life Expectancy changes?”

My answer: nothing. I don’t start by looking at Page Life Expectancy. That metric made sense in the SQL Server 7/2000 days when it was all that we had, but today, in 2018, we can do better.

Start by looking at wait stats – that tells you what SQL Server is waiting on.

I don’t care whether PLE is 300 or 3,000 – tell me what you’re WAITING on, SQL Server, and then I’ll go troubleshoot that metric.

My personal favorite way to check waits is to use the open source sp_BlitzFirst (disclaimer: I wrote it.) By default, it takes a 5-second sample of your server’s metrics and gives you a few guesses as to why it’s slow right now.

Because you like writing long questions, you’ll probably also like these:

sp_BlitzFirst @SinceStartup = 1;

The first result set gives you your waits since startup, and:

sp_Blitz @ExpertMode = 1, @Seconds = 60;

Takes a longer sample, and tells your waits over that time range.

Wait stats can be kind of cryptic, so next to every wait type, I link to the SQLskills wait stats repository for that wait type. You can just copy/paste out the name of your top wait type, go to their site, and learn more about what causes that wait and how to fix it.

If PLE is dropping due to queries reading a lot of data pages from disk, for example, you might see PAGEIOLATCH% wait types. If it’s dropping due to queries getting huge memory grants, you might see RESOURCE_SEMAPHORE. If PLE isn’t the problem, then you’ll see different wait types altogether.

It has been a while since I asked this question, I have learned a lot since then.

As Brent points out in his answer PLE alerts by themselves don’t really tell you anything. By design, these pages should come and go, if they don’t stay long when no longer needed, that is fine.

Nevertheless, I have one specific instance throwing PLE alerts several times per day, I have been looking at it with several tools including query store, and not finding anything that needs attention. Even if I added memory, it does not look like the PLE alerts would stop. I went looking for a way to “Prove” if more memory was needed or not.

On small SQL instances with 4GB of available RAM, 75% or 3GB can be devoted to the plan cache. Normally this is NOT purged out with data pages, that PLE alerts on. I found a couple of ways of looking what was happening with memory and the plan cache.

Ultimately I developed (leveraging on links above) the query below that shows the life expectancy (in minutes) for cache plans.

    --plan cache Life expectancy
    SELECT sys.dm_exec_cached_plans.objtype AS [CacheType] 
    ,    COUNT_BIG(*) AS [Total Plans]
    ,    SUM(CAST(sys.dm_exec_cached_plans.size_in_bytes AS DECIMAL(18, 2))) / 1024 / 1024 AS [Total MBs]
    ,   AVG(sys.dm_exec_cached_plans.usecounts) AS [Avg Use Count]
    ,   AVG (DATEDIFF(MINUTE, PH_Time.creation_time, (GETDATE()))) AS [Avg Age in Minutes]
    FROM sys.dm_exec_cached_plans
    left join (
                Select  plan_handle
                , Min (creation_time) as creation_time --A plan can have several unique related quiries, this gets just one time per plan
                from sys.dm_exec_query_stats
                group by plan_handle
                ) as PH_Time On sys.dm_exec_cached_plans.plan_handle = PH_Time.plan_handle
    --left join sys.dm_exec_query_stats On sys.dm_exec_cached_plans.plan_handle = sys.dm_exec_query_stats.plan_handle 
    GROUP BY objtype
    ORDER BY [Total MBs] DESC
    GO

While no single item by itself is conclusive, a strong argument can be made that if the average life on plans in cache is longer than, the time between queries being re-run, no additional memory is needed. The specific time will very by use case.

There are a lot of reasons that plans are recompiled, see related Why is Query Store missing details? Early on, I focused a lot on high recompile with PLE, and did not find a helpful correlation.

TL:DR Memory is intended to have things come and go, low PLE is not a problem. BUT by design, plans used often should stay in the memory long enough to be reused. If you can show that plans are staying in memory long enough to be reused, it is difficult to justify adding memory without some other indicator.

Leave a Reply

Your email address will not be published.