Srivenu Kadiyala's Blog

January 30, 2012

SCHED_NOAGE and latch contention

Filed under: Uncategorized — srivenu @ 6:29 am

This was a performance issue I faced on the billing database of one of our telecom clients. The server was having sporadic CPU spikes for 2-3 minutes before things came back to normal. The environment was 64-bit Oracle 10.2.0.4 on HP-UX ia64 B.11.31. CPU usage was normally around 40-50%; during the issue, the CPU run queue doubled or trebled and usage peaked at 100%. As per the vendor recommendation, the memory comprised a 100GB buffer cache and a 60GB shared pool!

I did an ASH analysis for the reported issue time and found that the waits were mostly for “cursor: pin S”, “latch: row cache objects”, “log file sync” and “latch: library cache”. I also had the chance to monitor the system during the issue. The “cursor: pin S” waits were spread across several SQLs. Many of the SQLs showing this wait were simple, with only a single child cursor. This raised my curiosity, as it isn’t normal.
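
For reference, ASH data like this can be pulled with a query along the following lines (a minimal sketch, not necessarily the exact query used; :t1 and :t2 are placeholder bind variables marking the issue window):

select event, count(*) samples
from   v$active_session_history
where  sample_time between :t1 and :t2
and    session_state = 'WAITING'
group  by event
order  by samples desc;
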
V$EVENT_HISTOGRAM had the following data (this data is from after a recent instance bounce and after the first occurrence of the issue).
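
The histograms below came from a straightforward query against V$EVENT_HISTOGRAM, something like:

select event#, event, wait_time_milli, wait_count
from   v$event_histogram
where  event in ('cursor: pin S', 'latch: library cache')
order  by event, wait_time_milli;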



           Event
    EVENT# Name                                     WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
       617 cursor: pin S                                          1    7052795
       617 cursor: pin S                                          2      16564
       617 cursor: pin S                                          4       1357
       617 cursor: pin S                                          8        438
       617 cursor: pin S                                         16        377
       617 cursor: pin S                                         32        478
       617 cursor: pin S                                         64        436
       617 cursor: pin S                                        128        910
       617 cursor: pin S                                        256        322
       617 cursor: pin S                                        512        118
       617 cursor: pin S                                       1024         81
       617 cursor: pin S                                       2048         52
       617 cursor: pin S                                       4096         30
       617 cursor: pin S                                       8192          8



           Event
    EVENT# Name                                     WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
       214 latch: library cache                                   1       3903
       214 latch: library cache                                   2        409
       214 latch: library cache                                   4        436
       214 latch: library cache                                   8        238
       214 latch: library cache                                  16        211
       214 latch: library cache                                  32        185
       214 latch: library cache                                  64        202
       214 latch: library cache                                 128        204
       214 latch: library cache                                 256        232
       214 latch: library cache                                 512        232
       214 latch: library cache                                1024        129
       214 latch: library cache                                2048         59
       214 latch: library cache                                4096          2

A quick search on the net turned up Tanel Poder's blog entry “cursor: pin S waits, sporadic CPU spikes and systematic troubleshooting”, which matched exactly the problem this instance was facing.

The Oracle processes had different scheduling priorities at the OS level:


priority 	count of oracle processes
		having that priority
--------	--------------------
134 		1
154 		1250
179 		1
188 		1
195 		1
196 		1
199 		1
200 		1
201 		1
207 		1
209 		2
213 		1
222 		1
225 		1
227 		1
229 		1
230 		1
232 		1
239 		3
240 		1
241 		4

There were around 50-100 logins per second during the issue. The huge SGA in this instance also made process creation even costlier.

After implementing HPUX_SCHED_NOAGE, the CPU spikes were gone.
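
For anyone who wants to make the same change, this is roughly what it involves (a sketch based on Oracle's HP-UX documentation, assuming the oracle software owner's OSDBA group is dba; the OS commands run as root, and the parameter is static, so the instance needs a restart):

-- give the dba group real-time scheduling privileges at the OS level (as root):
--   /usr/sbin/setprivgrp dba RTSCHED RTPRIO
--   echo "dba RTSCHED RTPRIO" >> /etc/privgroup
-- then set the init parameter; 178 is the strongest SCHED_NOAGE priority (valid range 178-255)
alter system set hpux_sched_noage = 178 scope=spfile;
-- bounce the instance for the new setting to take effect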

Following is the data after the change:


priority 	count of oracle processes
		having that priority
--------	--------------------
154 		2
178 		794
999999 		1




           Event
    EVENT# Name                                     WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
       617 cursor: pin S                                          1    5657072
       617 cursor: pin S                                          2       2889
       617 cursor: pin S                                          4       1177
       617 cursor: pin S                                          8        571
       617 cursor: pin S                                         16        228
       617 cursor: pin S                                         32         60
       617 cursor: pin S                                         64         32
       617 cursor: pin S                                        128         26



           Event
    EVENT# Name                                     WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
       214 latch: library cache                                   1     315419
       214 latch: library cache                                   2      45189
       214 latch: library cache                                   4      61475
       214 latch: library cache                                   8      40559
       214 latch: library cache                                  16      22104
       214 latch: library cache                                  32      14884
       214 latch: library cache                                  64      11129
       214 latch: library cache                                 128       7770
       214 latch: library cache                                 256       3419
       214 latch: library cache                                 512       3475
       214 latch: library cache                                1024       5525
       214 latch: library cache                                2048      12220
       214 latch: library cache                                4096      15814
       214 latch: library cache                                8192        829
       214 latch: library cache                               16384         61

They have yet to reduce their shared pool substantially from the existing value of 60GB.

I really have no words to express my admiration for the genius, work and contribution to the DBA community of the gurus: Jonathan Lewis, Tanel Poder and Tom Kyte.

9 Comments

  1. […] Update: Srivenu Kadiyala has experienced the same problem and has written about it here. […]

    Pingback by cursor: pin s waits, high CPU usage troubleshooting | Tanel Poder's blog: IT & Mobile for Geeks and Pros — February 2, 2012 @ 12:39 pm

  2. Hi,

    Good one. But your event histogram listing for ‘latch: library cache’ shows more waits in the buckets ranging from 1 to 16 seconds…

    Thanks

    Comment by antony — October 11, 2012 @ 12:38 pm

    • Yes it does. They do have a huge library cache. Unfortunately I didn't have access to that system long enough to analyze further.

      Comment by srivenu — October 12, 2012 @ 1:29 am

  3. […] Once you have the top database cpu consumer you can connect to it and diagnose more in depth (Example of cpu spike on this post). […]

    Pingback by Real-Time CPU utilization per oracle database « bdt's oracle blog — December 4, 2012 @ 11:22 am

  4. Hi Srivenu
    Excellent post. Can you share the command to list the Oracle process priorities as shown in your analysis below?

    The Oracle processes had different scheduling priorities at the OS level:

    priority count of oracle processes
    having that priority
    -------- --------------------
    134 1
    154 1250
    179 1
    188 1
    195 1
    196 1
    199 1
    200 1
    201 1
    207 1
    209 2
    213 1
    222 1
    225 1
    227 1
    229 1
    230 1
    232 1
    239 3
    240 1
    241 4

    Thank you
    -Haris

    Comment by Haris Ali — January 6, 2013 @ 11:12 pm

    • Sorry for the long delay.
      I don't have access to a Unix system right now, but I think this is the command I used:

      ps -u oracle -U oracle -o pri|
      	sort|
      	awk '
      	BEGIN 	{
      	 	prev=0;
      		sum=1
      		}
      	{
      		if ($1==prev) {
      			sum+=$2 
      			}
      		else 	{
      			print prev, sum; 
      			prev=$1; 
      			sum=0
      			} 
      	}
      	END 	{
      		print prev, sum
      		}
      	'
      

      Comment by srivenu — January 23, 2013 @ 8:44 am

  5. sum+=1 instead of sum+=$2, I assume

    Comment by dik pater — November 1, 2017 @ 4:42 am

    • You are right. I made a mistake.

      Comment by srivenu — November 1, 2017 @ 7:12 am

  6. ps -u oracle -U oracle -o pri|
     sort|
     awk '
     BEGIN {
             prev="";
             sum=0
             }
     {
             if ($1==prev) {
                     sum+=1
                     }
             else {
                     if (prev != "") print prev, sum;
                     prev=$1;
                     sum=1
                     }
     }
     END {
             print prev, sum
             }
     '

    Comment by srivenu — November 1, 2017 @ 7:14 am

