Oracle

November 22, 2010

IO issue

Filed under: Uncategorized — srivenu @ 10:27 am

Sometime at the end of October, I was asked to look into a performance issue in one of the new databases. It was Oracle 10.2.0.3.0 on SunOS 5.10. The application was currently processing around 175 to 260 records/sec. They were worried because the performance was highly erratic, they were barely meeting the SLA, and they expected an increase in the arrival rate (a term I took from Craig Shallahamer’s book). Since it was the month-end, they needed some immediate relief.
A first glance at the session waits showed numerous sessions waiting on “free buffer waits” and some on “db file sequential read”. The System Event stats also showed a huge time accumulation on “db file parallel write”.
It seems the db had been set up as per the application vendor’s specification. It was configured with a 10GB buffer cache, and the instance had 10 DBWR processes as specified by the vendor (more about this in another post).
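
For anyone wanting to reproduce that first check, queries along these lines against V$SESSION_WAIT and V$SYSTEM_EVENT are enough (just a sketch, not the exact report I pulled):

-- Sessions currently stuck on the suspect wait events
select event, count(*) sessions
from   v$session_wait
where  event in ('free buffer waits', 'db file sequential read',
                 'db file parallel write')
group  by event
order  by sessions desc;

-- Cumulative time per event since startup (time_waited is in centiseconds)
select event, total_waits, time_waited/100 time_waited_secs
from   v$system_event
where  event in ('free buffer waits', 'db file sequential read',
                 'db file parallel write')
order  by time_waited desc;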

I trussed the DBWR0 process and saw some abnormal values for file 260 (the first datafile in the UNDO tablespace):

 260: S_IFREG mode:0640 dev:329,9003 ino:9 uid:105 gid:501 size:68719460352
      O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
      /disk2/oradata/undotbs01.dbf

……..Part of the truss output………..

/193:	6530916.0239	 0.0013	pwrite(416, " ", 16384, 0x)   = 16384
/193:	6530916.0251	 0.0012	kaio(AIONOTIFY, 4406879552)   = 0
/1:	6530916.0251	 0.0023	kaio(AIOWAIT, 0x )            = 1
/204:	6530916.0762	53.8136	pwrite(260, " ", 16384, 0x)   = 16384
/204:	6530916.0782	 0.0020	kaio(AIONOTIFY, 4406837216)   = 0
/1:	6530916.0782	 0.0531	kaio(AIOWAIT, 0x )            = 1
/202:	6530916.0874	53.8139	pwrite(260, "  '", 16384, 0x) = 16384
/202:	6530916.0896	 0.0022	kaio(AIONOTIFY, 4406612544)   = 0
/1:	6530916.0896	 0.0114	kaio(AIOWAIT, 0x )            = 1
/203:	6530916.1181	53.8200	pwrite(260, " ", 16384, 0x)   = 16384
/203:	6530916.1205	 0.0024	kaio(AIONOTIFY, 4406417440)   = 0
/1:	6530916.1205	 0.0309	kaio(AIOWAIT, 0x)             = 1
/205:	6530916.1289	53.8266	pwrite(260, " ", 49152, 0x)   = 49152
/205:	6530916.1325	 0.0036	kaio(AIONOTIFY, 4406938016)   = 0
/1:	6530916.1325	 0.0120	kaio(AIOWAIT, 0x )            = 1
/206:	6530916.1526	53.7966	pwrite(260, " ", 16384, 0x)   = 16384
/206:	6530916.1552	 0.0026	kaio(AIONOTIFY, 4406243616)   = 0
/1:	6530916.1552	 0.0227	kaio(AIOWAIT, 0x)             = 1
/207:	6530916.1848	53.7976	pwrite(260, " ", 16384, 0x)   = 16384
/207:	6530916.1877	 0.0029	kaio(AIONOTIFY, 4407051808)   = 0
/1:	6530916.1878	 0.0326	kaio(AIOWAIT, 0x)             = 1
/220:	6530916.2522	53.8301	pwrite(260, " ", 16384, 0x)   = 16384
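
(On the database side, the datafile behind descriptor 260 can be cross-checked against V$DATAFILE; a quick sketch, using the path from the truss output above:)

-- Which FILE# corresponds to the undo datafile seen in the truss output
select file#, name, bytes/1024/1024/1024 size_gb
from   v$datafile
where  name like '%undotbs01%';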

I also verified the IO stats recorded in V$FILESTAT.

                                                Total           Avg   Write
                                                write         write     I/O
                                 Phy             time          time    time
Datafile          File#       writes           (secs)          (ms)  (secs)
----------------- ----- ------------ ---------------- ------------- -------
undotbs01.dbf         2   21,632,569    1,017,710,807     47,045.31   1,164
undotbs05.dbf         6    3,536,387       75,307,326     21,294.99     878
undotbs06.dbf         7    3,532,369       74,229,652     21,014.13     838
undotbs02.dbf         3    3,652,466       72,533,656     19,858.82   1,023
undotbs04.dbf         5    3,537,340       72,053,123     20,369.30   1,028
undotbs03.dbf         4    3,597,883       70,619,202     19,627.99   1,019
.....................
ALARM_COMMENT.dbf   527        1,299                8          6.28       0
ALARM.dbf           526        1,299                8          6.07       0
SUMMARY25_01.dbf    479        1,299                7          5.65       0
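
The report above is built from V$FILESTAT joined to V$DATAFILE. A simplified version of that kind of query (a sketch only, the column aliases are mine; V$FILESTAT.WRITETIM is exposed in centiseconds):

-- Write statistics per datafile, worst cumulative write time first
select d.name                                        datafile,
       f.file#,
       f.phywrts                                     phy_writes,
       round(f.writetim/100)                         total_write_time_secs,
       round(f.writetim*10 / nullif(f.phywrts,0), 2) avg_write_time_ms
from   v$filestat f, v$datafile d
where  f.file# = d.file#
order  by f.writetim desc;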

After calculating the IOPS on UNDO, I asked the storage admins to move the existing UNDO datafiles from RAID 5 to RAID 10. After making this change, they were able to consistently achieve an application throughput of 400 records/sec.
I will be analyzing the IO system on that server in a month or so.
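
The write IOPS on the UNDO files can be estimated from the write counts in V$FILESTAT sampled over a known interval. A rough sketch of that kind of measurement (not the script actually used; the 60-second window and DBMS_LOCK.SLEEP are my choices):

-- Sample the UNDO write count, wait a fixed interval, then divide the delta
var undo_writes number
begin
  select sum(f.phywrts) into :undo_writes
  from   v$filestat f, v$datafile d
  where  f.file# = d.file#
  and    d.name like '%undotbs%';
  dbms_lock.sleep(60);            -- measurement window in seconds
end;
/
select round((sum(f.phywrts) - :undo_writes) / 60, 1) undo_write_iops
from   v$filestat f, v$datafile d
where  f.file# = d.file#
and    d.name like '%undotbs%';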


RAC system hang due to “global cache cr request”

Filed under: Uncategorized — srivenu @ 5:49 am

We have an Oracle CRM application running on a 3-node RAC. The system was scheduled for an OS upgrade over the last weekend. As they faced some issues, they decided to roll back the OS upgrade.

On coming to the office on Monday morning, I saw that fire-fighting was going on. The RAC database had serious performance issues impacting all users.
On first observation I could see numerous session waits on “global cache cr request”. I cross-checked the network interface used for the interconnect using oradebug ipc and verified the network using netstat. The next step was to check the average timings for the global cache events. These showed some abnormally high values.

Global Cache Service - Workload Characteristics
-----------------------------------------------
Ave global cache get time (ms):                           41.9
Ave global cache convert time (ms):                      162.2

Ave build time for CR block (ms):                          0.0
Ave flush time for CR block (ms):                          0.2
Ave send time for CR block (ms):                           0.2
Ave time to process CR block request (ms):                 0.5
Ave receive time for CR block (ms):                      119.0

Ave pin time for current block (ms):                       7.8
Ave flush time for current block (ms):                     0.0
Ave send time for current block (ms):                      0.2
Ave time to process current block request (ms):            8.0
Ave receive time for current block (ms):                  97.1
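
These averages are derived from the cumulative global cache statistics. As a sketch, the CR and current block receive times can be recomputed per instance from GV$SYSSTAT roughly like this (9i-style statistic names matching the report above; on 10g the statistics are prefixed “gc” instead of “global cache”):

-- Average block receive times in ms per instance (time statistics are in centiseconds)
select rt.inst_id,
       round(10 * rt.value / nullif(rb.value, 0), 1) avg_cr_receive_ms,
       round(10 * ct.value / nullif(cb.value, 0), 1) avg_cur_receive_ms
from   gv$sysstat rt, gv$sysstat rb, gv$sysstat ct, gv$sysstat cb
where  rt.name = 'global cache cr block receive time'
and    rb.name = 'global cache cr blocks received'
and    ct.name = 'global cache current block receive time'
and    cb.name = 'global cache current blocks received'
and    rb.inst_id = rt.inst_id
and    ct.inst_id = rt.inst_id
and    cb.inst_id = rt.inst_id
order  by rt.inst_id;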

I thought of verifying the OS scheduler priorities of the LMS processes. As I was checking them on Node 2 and Node 3, I was immediately informed that Node 1 had only 2 LMS processes! Node 2 and Node 3 had 8 each. CPU_COUNT was showing 8 on Node 1 while it was 56 on Nodes 2 and 3. It looks like the CPU boards were hot-plugged on server 1 after the database startup. To avoid bouncing Node 1, we tried setting “_cpu_count” to 56 to see if more LMS processes would start up automatically. Bad luck, that didn’t work out, so we had to restart the database on Node 1.
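
For the record, the LMS count and CPU_COUNT comparison across the nodes can be pulled from any one instance; a minimal sketch:

-- Number of LMS processes started on each instance
select inst_id, count(*) lms_count
from   gv$process
where  program like '%(LMS%'
group  by inst_id
order  by inst_id;

-- cpu_count seen by each instance
select inst_id, value cpu_count
from   gv$parameter
where  name = 'cpu_count'
order  by inst_id;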
