
Observing Oracle Exadata SmartScan internals, part 1


In order to look at how Exadata smartscans are different, let's first have a peek at the evolution of Oracle full segment/multiblock reads, as briefly as possible:

a) Traditional multiblock reads, visible via the event ‘db file scattered read’
The essence is: multiple adjacent blocks are read from disk and put in the buffercache. Because every read is processed sequentially, IO latency is a performance penalty for every physical read. This works roughly as follows: get a set of adjacent blocks from the segment header, fetch these blocks from disk, process these blocks, then get the next set of adjacent blocks, fetch these blocks from disk, process these blocks, etc.

b) Direct path multiblock reads, visible via the event ‘direct path read’
The essence is: multiple IOs are done asynchronously, one or more IOs are reaped and processed, after which the number of IOs is brought back to the number of IOs the process wants to keep in flight. Blocks are read into the process' PGA (which means the IO result is not shared with other processes). Because of the asynchronous way of issuing multiple requests, the process does not suffer the IO latency penalty for every single IO. This works roughly as follows: get a set of adjacent blocks from the segment header, issue an asynchronous IO request for these, get the next set of adjacent blocks from the segment header, issue another asynchronous IO request, process one or more of the IO requests which are ready, issue IO requests for the number of requests reaped, process one or more of the IO requests that are ready, etc. During processing, Oracle measures CPU and IO times, and can decide to add one or more concurrent requests to the two IOs which it initially tries to keep in flight.
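Which of the two mechanisms was used for a scan can be verified afterwards in the session statistics (and, of course, via the wait events named above in a SQL trace). A minimal sketch, assuming an 11.2 database; these are regular v$statname entries:

select s.name, m.value
from v$statname s, v$mystat m
where m.statistic# = s.statistic#
and s.name in ('table scans (direct read)', 'physical reads direct', 'physical reads cache');

A buffered scan increases 'physical reads cache', while a direct path scan increases 'table scans (direct read)' and 'physical reads direct'.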

Back to smartscans

It’s not very hard to understand that direct path multiblock reads can perform much better than traditional multiblock reads. Probably at this time you think: yes, I know, but what does this have to do with Exadata? This is all information about regular processing! Well, Exadata uses the regular Oracle database executable. This means that part of the codepath of smartscans is shared with the normal/non-Exadata Oracle database. Obviously, there is a part that is unique to Exadata.

This is best viewed with a backtrace of the call that submits an IO request. This is a full backtrace of the submit of an IO request for a full table scan on Linux, against a database on ASM:

io_submit
skgfqio
ksfd_skgfqio
ksfdgo
ksfdaio
kfk_ufs_async_io
kfk_submit_ufs_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

Of course the functions used internally in the executable are not documented. But it's very useful to look at them to gain a better understanding of what is happening. First look at the function kcbgtcr (Kernel Cache Buffers GeT Consistent Read). This is the function that performs a logical IO. One line up is the function kcbldrget (Kernel Cache Buffers direct path LoaDeR GET). This function indicates that the execution chose the direct path read code path; in fact, the kcbl-prefixed functions are believed to belong to the Oracle direct path read codepath. Then roughly the ksfd, kfio, kfk and (again) ksfd layers are crossed, and lastly skgfqio is executed, which performs the actual submit of an IO using io_submit().

Now let's look at how the equivalent submit of an IO request looks on Exadata with smartscan turned off:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_issue_read
ossdisk_read
oss_read
kfk_submit_one_oss_io
kfk_submit_oss_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

First locate the kcbgtcr function (remember that backtraces are read from the bottom to the top). One line up is the function kcbldrget again. Okay, that looks the same. If we read the backtrace upwards, it's easy to spot the same layers, in fact the same functions: ksfd and kfio, up to the kfk layer.

In the kfk layer there is a slight difference, which is understandable: in the non-Exadata backtrace we see the function kfk_submit_ufs_io, while on Exadata the equivalent function is kfk_submit_oss_io. I think this deserves a little explanation. Exadata is the marketing name of the database machine, which internally was called 'Sage' at Oracle, and this name still surfaces sometimes, for example in Exadata function naming or in patch descriptions. Quite probably OSS means 'Oracle Sage Software'. So this means the process is aware it needs to read something from an Exadata storage server, and chooses a function that is meant to set that up.

The next two functions (kfk_ufs_async_io and kfk_submit_one_oss_io) probably do logically the same thing, but are different because the infrastructure is different.

The next function up in both backtraces is where it gets really interesting, because now the code has to do something entirely different: on the non-Exadata system the ksfd layer is entered again, in order to get to the function skgfqio, which submits the IO request using the io_submit call. On the Exadata system, we see a call which I have not encountered outside of Exadata: oss_read. With the knowledge gained above, we can tell this quite probably is an Exadata specific call, which sits inside an entire layer: oss. From the function names we can guess it prepares the IO request, and then issues it. Once the oss layer is crossed, we enter another layer: skgxp. The skgxp (System Kernel Generic inter-process Communication (Xfer?) Protocol) layer is NOT unique to Exadata; it's the communication layer which is used by an Oracle RAC database for inter-process communication. So it seems Oracle re-used the knowledge gained with RAC inter-process communication for the communication with the (Exadata) storage server.

Can we now please get to smartscans?

Indeed, we haven't encountered a smartscan yet; I showed a backtrace of a regular direct path read on Exadata. The reason is to show the difference between a traditional system (meaning disks which are presented to the system as local disks) and an Exadata system, which has to use infiniband and has to fetch the information it needs from the Exadata storage servers. This is how the backtrace of a smartscan read call looks:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_cread
oss_cread
kcfis_read
kcbl_predpush_get
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010010kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

The first thing that struck me is that the number of functions decreased. On the other hand, this does not say much (you can make functions as long or as short as you wish). Also, a smartscan is done in a number of steps, some of which have already been done and are not visible in this backtrace. After the submit of a smartscan there is another series of steps; of course reaping the submitted scan requests, but also verification of the reaped requests.

It's now interesting to see how different the codepath looks. The kcbgtcr function is still present. One line up there's the kcbldrget function, which reveals the process chose the direct path read codepath during execution. If we go up one more line we see a function in the kcbl layer which is Exadata specific as far as I know: kcbl_predpush_get. Again: everything has been set up for doing a smartscan prior to the point where the process arrives at this backtrace: submitting a request to an Exadata storage server. This means the process has the information it needs (what to ask from which storage server) prior to arriving at this point.

One layer up is a call into the kcfis (Kernel Cache File Intelligent Storage is my guess) layer, then approximately the same calls in the oss layer, but there's a difference: it's oss_cread and ossdisk_cread instead of oss_read and ossdisk_read, which makes this a smartscan request instead of a request for database blocks. After the oss layer, there's the skgxp layer again, with exactly the same calls for both the non-smartscan and the smartscan case.

Conclusion

The purpose of this blogpost is to show the differences between submitting an IO request on a traditional system, on an Exadata system with smartscans turned off, and with smartscans turned on.

There is a lot of ground to cover on this. It’s not doable to cover this all in one blogpost. That’s the reason I try to pick a specific part, and work from that. Any comments are welcome.

From the backtraces it's quite clearly visible that regular IOs are processed the same way on non-Exadata and Exadata; on Exadata the request is submitted via the skgxp layer to a storage server, instead of via an io_submit call. This makes it very probable that single block IOs are done the same way too, which means no 'magic' performance enhancement is possible there: the processing is the same on Exadata as on non-Exadata, the only difference being how the IO request is issued.

If we look at the comparison between smartscan and non-smartscan requests, it becomes evident that something different is happening. But the basic processing is the same, by which I mean the process is doing exactly the same as non-Exadata processes. From the smartscan backtrace it becomes visible that a process has to travel through the direct path code layer (kcbl) in order to get a smartscan, because otherwise it would be impossible to call kcbl_predpush_get, which is the call for issuing a smartscan.

Disclaimer: I am not an Oracle employee. This information has all been obtained by profiling execution (using perf/nm/gdb). I have never seen any Oracle database and Exadata source code. There is a chance some assumptions are wrong.


Tagged: oracle exadata smartscan IO, stacktrace backtrace internals

RMOUG and Hotsos


Recently I've spoken at the RMOUG Training Days 2013 in Denver (the mile high city). It was the first time for me speaking at RMOUG and being in Denver. Thanks to the "sequestration" (federal budget cuts) the lines piled up at immigration at Minneapolis (Minneapolis and Saint Paul, the twin cities) airport, and because my plane left more than one hour too late and my layover time was one hour and fifteen minutes, I tried to rebook my flight from Minneapolis to Denver. But this flight turned out to be delayed too, which meant I was still able to get on it!

At Denver airport Tim Gorman volunteered to fetch me and bring me to my hotel. Tim fetched multiple people, which meant we gathered at a bar where Debra Lilley, Mogens Norgaard, Claes (the tank commander) and Martin Widlake, amongst others, were already waiting.

The RMOUG Training Days were held in the Denver conference centre, which has a very iconic blue bear at the front:

(photo: the big blue bear at the front of the conference centre)

This is the same conference centre where Collaborate 2013 will be held!

I delivered 3 talks: one about multiblock reads (which I will be presenting at Collaborate too, but named "How Oracle secretly changed multiblock reads"), one on Exadata OLTP (a study on processing single block reads on Exadata and non-Exadata configurations using Kevin Closson's SLOB (Silly Little Oracle Benchmark)), and a live Oracle function call tracing hacking session. The idea for the last presentation was conceived just prior to the conference, when I learned some people didn't make it to Denver and there were session slots to be filled. All three of these presentations will be presented at the Oracle Usergroup Norway Vårseminar 2013 too!

After the conference we went skiing for a few days in Breckenridge with a group of friends, organised by Kellyn Pot'vin and Tim Gorman. Thank you Kellyn and Tim for all the work at both RMOUG and the famous 'Faux Table' event, I really appreciate all the effort you put into these events! I also want to mention the many other people who put effort into making things happen at the Faux Table: cooking, washing, driving, etc.

A few weeks further down the line was the annual HotSOS conference in Dallas. I travelled with Toon Koppelaars. Once again we found a long line for immigration, but eventually we were through quickly enough to fetch a beer at a bar near the gate of the connecting flight, where each and every seat was equipped with an iPad which you had to use to order something. Remarkable. At the conference I delivered my 'About multiblock reads' presentation. I was very pleased to be invited to speak at a conference which is dedicated to Oracle and performance. Once again this was a terrific meet-up with a huge number of friends. An impression from the game night: Karl Arao from Enkitec tuning a pile of wood blocks (Jenga), overlooked by Gwen Shapira from Pythian and Mark Farnham.
There was a large number of Dutch people at this year's HotSOS conference: Toon Koppelaars, with whom I travelled, Marco Gralike, Gerwin Hendriksen and Jacco Landlust.

I would like to thank my employer VX Company for their support in making this possible.


Exadata and the db_block_checksum parameter.


With Exadata version 11.2.3.2.0 came the Unbreakable Linux Kernel for Exadata; prior to this version it had been the stock EL5 Red Hat kernel (2.6.18). With the unbreakable kernel came the opportunity to run the perf utility. This utility makes it possible to see which functions are active inside an executable, provided there's a symbol table. And the oracle database executable has a symbol table! One reason to do this is to get a more granular overview of what the Oracle database is doing than the wait interface provides, especially to get a more detailed breakdown of what the database is doing during the time the wait interface reports as 'on cpu'.

Right after the Exadata upgrade, I ran a simple query (which probably doesn't reflect any real customer case) to get an idea. I have run 'select count(*) from bigtable' on Exadata before, and saw most of the time being CPU, and a minor part of it being 'cell smart table scan'. Now with perf I have the opportunity to get more details on what that CPU time is spent on!

These are the top 5 functions from that session:

    31.50%     476423   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    30.20%     456774   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
     7.48%     113083   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     6.96%     105301   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.94%      44475   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

So the top function used during the processing of the SQL is a function called 'sxorchk'. Sadly, Oracle doesn't provide information about its internal functions. Anyway, this function executes an XOR to check information gotten from storage; the name probably stands for System XOR CHecK.

Alright, once we know this, we can look at the parameters of the instance in which I am executing the SQL, which are taken from the default Exadata parameter template (!!):

SYS@t11203 AS SYSDBA> show parameter db_block

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_buffers		     integer	 0
db_block_checking		     string	 FALSE
db_block_checksum		     string	 TYPICAL
db_block_size			     integer	 8192

So this should be related to db_block_checksum; db_block_checking is off (false). Well, since this is a test instance, let's turn it off and redo the scan:
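Turning it off boils down to a simple, dynamic parameter change (a sketch; on a test instance only):

alter system set db_block_checksum = false;

These are the top 5 functions after redoing the scan: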

    49.35%     480911   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
    11.01%     107299   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     6.56%      63885   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     3.97%      38712   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     3.49%      33970   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdst_fetch

The sxorchk function is gone now! This is also reflected in the response time: with db_block_checksum set to typical it is 00:01:02.44, and with db_block_checksum set to false it is 00:00:41.68 on my system. That is a difference of roughly 20 seconds, or, as we can see from the first perf function list, 31.5% of the time. That is a significant amount of time!

When I discussed this with Tuomas Pystynen, he asked me a very valid question: if this is a smartscan, the database does not get blocks, it gets result-sets. So blocks cannot be checked on the database layer. Is this offloaded to the storage/cell server?

Well, if it is offloaded to the storage, it will not be reflected in a function on the database layer. In other words: something is XOR'ed by the database foreground process, and that something is switched on and off with db_block_checksum! The logical conclusion from these facts would be that blocks are involved in some way…

There is a way to know what is actually happening: looking at the backtrace of the sxorchk function! Let's do that!

    31.50%   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.76%-- kcbhxoro
                |          kcbhvbo
                |          |          
                |          |--100.00%-- kcbhvb_corruptcheck
                |          |          kcbl_sage_block_check
                |          |          kcfis_oss_block_verify
                |          |          kcfis_validate_translation
                |          |          kcfis_process_completed_req
                |          |          kcfis_process_completed_buffer
                |          |          kcfis_process_reaped_io
                |          |          kcfis_read
                |          |          kcbl_predpush_get
                |          |          kcbldrget
                |          |          kcbgtcr
                |          |          ktrget3
                |          |          ktrget2
                |          |          kdst_fetch

What we see here is that the function is called from the kcfis (Kernel Cache File Intelligent Storage is my guess) layer, from a function called 'kcfis_oss_block_verify'. In other words: a block, which actually contains the result-set sent by the cell server, is validated/checked. That is understandable, but the usage of the parameter 'db_block_checksum' for setting this is misleading, to put it in a nice way.

The next question I asked myself is: how about a normal/non-Exadata database? Well, I can mimic a non-Exadata database by setting the parameter 'cell_offload_processing' to false to disable smartscans!
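Disabling the offloading is a session-level parameter change; a sketch:

alter session set cell_offload_processing = false;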

This is how the top-5 functions look with db_block_checksum set to true, without smartscan:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    18.53%     353741   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.05%     191847   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.35%     102161   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.73%      52103   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

And this is how the top-5 functions look with db_block_checksum set to false, without smartscan:

    36.51%     706798   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.47%     202645   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.58%     107941   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     3.57%      69044   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     2.38%      46036   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] ktrgcm

If we get the backtrace of the sxorchk function without smartscan enabled:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.79%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch

We can see the sxorchk function is called from kcbldrget (the direct path load function), with more or less the same helper functions to prepare for the XOR, and no other layers in between (like kcfis in the backtrace with smartscan enabled).

My conclusion is: db_block_checksum is a parameter which enables checking of the data gotten from "storage", which apparently is done via an XOR function. On a regular/non-Exadata system this checking is done on blocks, as the parameter name suggests, but on Exadata with a smartscan the checking is still done, regardless of the fact that no blocks but result-sets are gotten from "storage".

The checking takes a significant portion of time during the processing of my (very simple!) query. Other functions can probably make the overall processing more CPU intensive, which means the relative portion of time spent on checking gets smaller.

With Exadata/smartscan the relative amount of time spent on sxorchk with my simple SQL is 32%; without smartscan the relative amount of time drops to 21%. This is still a significant amount of time (and in both cases it is the function the most time is spent in!).

The final conclusion is that you should think about the setting of this parameter if you are doing a lot of physical IO, and set it according to the needs of the database.

One addition: this is done on a V2 half rack version of Exadata (Nehalem CPU), so anyone with X2-[28] or X3-[28] is welcome to profile a ‘select count(*) from bigtable’ and post it here to see if there’s any difference. The database version is 11.2.0.3 BP15.


Tagged: oracle exadata performance xor perf profiling

Watching the “CopyBack” progress of a new disk on an Exadata compute node


This is just a very small post on how to watch the progress of the "CopyBack" state of a freshly inserted disk in an Exadata compute (database) node. A disk failed in the (LSI hardware) RAID5 set, and the hotspare disk was automatically used. The failed disk was replaced, and we are now waiting for the intermediate "CopyBack" phase to complete.

The current state of the disks is visible using the following command:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Copyback
Device Firmware Level: 0D70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: 0B70

But what is the progress? Some googling came up with (almost) the answer (http://trac.camsentry.com/wordpress/tag/ldinfo/); I modified it a tiny bit to make it refresh:

while $(true); do /opt/MegaRAID/MegaCli/MegaCli64 adpeventlog getlatest 200 -f ~/adpeventlog.txt a0; awk '/^Time/{TIME=$0};/Seconds/{SECS=$5}/^Event Desc/{printf("%25.25s %5.5s %s\n",TIME,SECS,$0);TIME=" ";SECS=""}' ~/adpeventlog.txt|grep -v fan|tac; sleep 5; done

This will refresh every 5 seconds and show the progress of the CopyBack, and the state changes after that phase once it gets there.


When does an Oracle process know it’s on Exadata?


When an Oracle process starts executing a query and needs to do a full segment scan, it needs to decide whether it is going to use 'block mode', which is the normal way of working on non-Exadata Oracle databases, where blocks are read from disk and processed by the Oracle foreground process, either "cached" (read from disk and put in the database buffercache) or "direct" (read from disk and put in the process' PGA), or 'offloaded mode', where part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is ‘kcfis’; an educated guess is Kernel Cache File Intelligent Storage. Does a “normal” alias non-Exadata database ever use the ‘kcfis’ layer? My first guess would be ‘no’, but we all know guessing takes you nowhere (right?). Let’s see if a “normal” database uses the ‘kcfis’ functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way to profile kernel functions that I am aware of is using ‘gdb’ and breaking on functions in the Oracle executable:
(the process id shown below ought to be the process id of the oracle database process in which you are going to execute the query. Do not, I repeat: do NOT do this with other processes, especially ones that do important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we got the debugger set, now let’s execute a simple query (doing a full scan) to see if kcfis is touched on a “normal” server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me state upfront that I do not have source code of any kind, so my statements about what a function does are actually guesses. Anyway: from the name of the function (kcfis_tablespace_is_on_sage) it looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (Exadata's internal name is/was 'sage'). Nice, so we hit the layer that determines whether the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function 'kcfis_tablespace_is_on_sage', and a call to 'kcfis_capability_tab_get'. The latter function probably probes the table (but it could be the tablespace) to get its capabilities. This could be the function which checks the requirements for hybrid columnar compression, but I am not sure about that.

At what point during the processing of the full segment scan does kcfis_tablespace_is_on_sage occur? One way of investigating this is profiling some functions we know a (direct path) full scan executes, and seeing where kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A ‘SQL*Net message to client’ wait
- A ‘db file sequential read’ wait (for reading the segment header)
- Potentially an 'asynch descriptor resize' wait
- The full scan is done asynchronously (potentially revealing some ‘direct path read’ waits)

So if we profile on start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that:

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Line 1: the wait ending for 'SQL*Net message from client'.
Lines 2-3: the 'SQL*Net message to client' wait.
Lines 4-7: the begin wait, the pread64 for reading the segment header, and the end wait.
Lines 8-9: the session probes for Exadata.
Lines 10-11: start and end of a wait, 'asynch descriptor resize' (verified via the trace file).
Line 12 onwards: here the full scan takes off.

So… a query starts (I've made it unique by adding a dummy hint, so it's parsed again), the SQL*Net roundtrip occurs, the segment header is read, then the process checks if it's on Exadata, which it isn't here, and starts an asynchronous full scan.

So let's repeat this check on a database which is on Exadata! In order to do so, we need to be aware that several calls are not done on Exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue the IOs over infiniband. So I set breakpoints on kslwtbctx, kslwtectx and kcfis_tablespace_is_on_sage. This is how that looks:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The ‘SQL*Net message from client’ wait ends, and before any other wait occurs, the existence of exadata is checked. This is different from the non-exadata case. Let’s take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see is frame #21: kksParseCursor. So the detection of the storage servers actually already happens during parsing, not at the point where the full segment scan starts and the decision to do a smartscan or not needs to be made.


Direct path read and fast full index scans


This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table are already in the buffercache.
- Less than 25% of the blocks in the buffercache are dirty.

Also (thanks to Freek d'Hooge, who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size used for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the "code path", so during execution.
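As an illustration of that last point only (a hypothetical sketch, not something from the tests in this post): the block count in the dictionary statistics can be set manually with dbms_stats, which then feeds the runtime decision described above.

exec dbms_stats.set_table_stats(ownname => user, tabname => 'T2', numblks => 100000);

Please mind this also changes what the optimizer sees, so treat it as an experiment on a test system.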

So let's take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM).

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011
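_small_table_threshold is an undocumented parameter, so it does not show up in v$parameter; a value like the one above can be obtained as SYSDBA from the x$ fixed tables, along these lines (a sketch):

select a.ksppinm name, b.ksppstvl value
from x$ksppi a, x$ksppcv b
where a.indx = b.indx
and a.ksppinm = '_small_table_threshold';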

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take the small table threshold and multiply it by five, we get 5055. This means the size of table T2 is more than enough, so it should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I’ve created a relatively big table and created a (normal) index on it in the same database. The index is created on a single column, called ‘id’. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which reads in the order of allocated adjacent blocks).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we look at the small table threshold times five (5055), this index is much bigger than that. It is also bigger than table T2. Let's execute 'select count(id) from bigtable', and look at the statistic 'index fast full scans (direct read)':

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn't been a direct path read! This means that this read has been done in the "traditional way". This is a bit… counter-intuitive. I've traced the session, and indeed it's doing traditional multiblock reads, visible via the scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter "_very_large_object_threshold". The information found on the internet reports this value is in megabytes. A quick stroll through a number of different databases (all on 11.2.0.3) shows this parameter is quite probably statically set at 500.

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.
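Changing an undocumented parameter like this should only ever be done on a test system (or after consulting Oracle support). The change itself would look something like the sketch below; I am showing it with spfile scope and a restart because I am not certain it can be applied dynamically on every version:

alter system set "_very_large_object_threshold" = 100 scope=spfile;
-- restart the instance for the new value to take effect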

This is interesting, because it looks like this parameter does the same for full scans on index segments as "_small_table_threshold" does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times the parameter value.

There are also differences: small table threshold is set in blocks, while (apparently) very large object threshold is set in megabytes. Also, small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), while very large object threshold seems to be fixed at 500. If my finding is correct, it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It's unknown to me whether the 50% limit for blocks in the cache and the 25% limit for dirty blocks apply here too.
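To put the numbers of this test next to that rule: the index I_BIGTABLE is 72704 blocks * 8kB, roughly 568MB. With the default value of 500, five times the threshold is 2500MB, which is bigger than 568MB, so no direct path reads; with the value set to 100, five times the threshold is 500MB, which is smaller than 568MB, and direct path reads indeed kicked in.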


Tagged: oracle tune direct path IO

Installation of vmware tools on OL6u4 plus vmware shared folders


This post is just a reminder to myself how to install the vmware (fusion) stuff on a freshly installed OL6u4 system. To be honest, it isn't entirely "fresh": I've 'yum update'-ed the installation, and I've already installed the 11gR2 preinstall package (yum install oracle-rdbms-server-11gR2-preinstall).

1. Reboot
After the yum update and yum install of the oracle database preinstall package everything is updated, but not yet activated (new kernel). This means you need to reboot your VM first.

2. Install the kernel headers
Now the whole system is installed and set up, we can *almost* install the vmware tools. However: the vmware tools need the kernel headers. If you look for the kernel headers, you *think* you see them:

# rpm -qa | grep headers
glibc-headers-2.12-1.107.el6.x86_64
kernel-uek-headers-2.6.32-400.29.1.el6uek.x86_64

But this is not the case. In fact, the system uses a different kernel version (2.6.39 in my case, which is the UEK2 kernel). But more importantly: the kernel headers are not in the kernel-uek-headers package (!!). In order to install the kernel headers which the vmware tools driver installation needs, you need to install the kernel-uek-devel package:

# yum install kernel-uek-devel

3. Make the vmware tools available to the VM, and extract the installer package
Virtual Machine>Install VMWare Tools.
Inside the linux VM:
# mount /dev/dvd /mnt -t iso9660
# cd /tmp
# tar xzvf /mnt/VMwareTools-9.2.2-893683.tar.gz

4. Install the vmware tools
# cd vmware-tools-distrib
# ./vmware-install.pl -default
(the vmware tools installation starts, and it will choose all the default values)

For the installation of the grid and database software I've got the files set up/extracted outside of my VM, and use the 'shared folders' feature of vmware (fusion) to mount these in my VM:

1. Enable shared folders in the vmware fusion hypervisor, and share a directory
Virtual Machine>Settings>Sharing
enable shared folders (off by default)
Press ‘+’, this will make a window available to select a folder. Select the folder in which the installation media (database and grid directory for Oracle 11.2.0.3 installation) is present, and click ‘add’.
(optional: set permissions in the shared folders window to ‘readonly’)
Take note of the name field, the name of the shared folder is what you need in the next step.
At this point the folder you selected is shared with the VM, and the sharing is enabled.

2. Make the shared folders visible in the Oracle Linux VM
The last step is to create a mountpoint for the shared folder (vmware creates its own directory '/mnt/hgfs', but I like to create my own):

# mkdir /mnt/oracle11203

And make the mount permanent in fstab:

# vi /etc/fstab
.host:/oracle11203        /mnt/oracle11203      vmhgfs  defaults        0 0

All is setup now! Use the newly created static mount:

# mount -a

And make sure it’s there:

# mount 
...
.host:/ on /mnt/hgfs type vmhgfs (rw,ttl=1)
.host:/oracle11203 on /mnt/oracle11203 type vmhgfs (rw,ttl=1)

Tagged: oracle linux UEK UEK2 vmware tools vmware-tools installation install

Extra huge database IOs


It’s been a while since I presented the first incarnation of my ‘about multiblock reads’ presentation. When I did this at the UKOUG TEBS conference in Birmingham in 2011, Christian Antognini chaired my presentation. After my presentation Christian showed me it’s possible to set the parameter ‘db_file_multiblock_read_count’ higher than 1MB/db_block_size (which is 128 if your blocksize is 8kB), and you could benefit from it if your hardware is sufficient. In fact, Christian showed me AWR reports (could also be statspack reports, not sure) which showed the benefit.

My understanding of the parameter db_file_multiblock_read_count at the time was:

The maximum value is the operating system’s maximum I/O size expressed as Oracle blocks ((max I/O size)/DB_BLOCK_SIZE). If you set this parameter to a value greater than the maximum, Oracle uses the maximum.

Which is taken from the Reference guide, in which this text is there for both 10.2 and 11.2, and even for 12.1.

This is interesting, because the statement from the Oracle reference guide conflicts with Christian's statement and measurements!

My test environment: Oracle 11.2.0.3 on Linux OL6u3 X64 with ASM, on my Macbook with SSD.

I’ve got a table created in a bigfile ASSM tablespace, system allocation type. This means objects get extents allocated which are increased progressively in size in steps:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
	15	    8
	16	  128
...
	78	  128
	79	 1024
...
	91	 1024

First let’s take a look at a full scan done “buffered”, alias blocks read to the buffercache, which is visible via the “db file scattered read” wait event. This is the “traditional” multiblock read.

I’ve flushed the buffer cache, set db_file_multiblock_read_count to 128, set event 10949 at level 1 and enabled SQL trace at level 8.
Event 10949 is needed to get a buffered multiblock read of a table which the Oracle database deems 'big' (to avoid non-buffered alias direct path reads).
Another thing is that both 'db file sequential read' and 'db file scattered read' will always show up as waits (even if the action, the physical read, is very fast), so there's no need to change anything (like slowing down the IO) to be sure we get these waits.
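A sketch of such a session setup; the event syntax below is the common 10046/10949 notation, shown as an illustration rather than a literal transcript of my session:

alter system flush buffer_cache;
alter session set db_file_multiblock_read_count = 128;
alter session set events '10949 trace name context forever, level 1';
alter session set events '10046 trace name context forever, level 8';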

The statement executed is 'select count(*) from t2'.

The first section covers the extents of size 8, which shows the typical alternation of 8 and 7 blocks being read (after the initial smaller sizes), caused by the space bitmap block in every other extent:

WAIT #140150651756760: nam='db file sequential read' ela= 272 file#=5 block#=23938 blocks=1 obj#=75579 tim=1372970940124856
WAIT #140150651756760: nam='db file scattered read' ela= 2162 file#=5 block#=23939 blocks=5 obj#=75579 tim=1372970940127442
WAIT #140150651756760: nam='db file scattered read' ela= 725 file#=5 block#=23944 blocks=8 obj#=75579 tim=1372970940128478
WAIT #140150651756760: nam='db file scattered read' ela= 144 file#=5 block#=23953 blocks=7 obj#=75579 tim=1372970940128908
WAIT #140150651756760: nam='db file scattered read' ela= 253 file#=5 block#=23960 blocks=8 obj#=75579 tim=1372970940129636

Then the extents of 128 blocks, which show 126 blocks being read because of 2 space bitmap blocks at the beginning of every extent of 128 blocks:

WAIT #140150651756760: nam='db file scattered read' ela= 3741 file#=5 block#=24066 blocks=126 obj#=75579 tim=1372970940139996
WAIT #140150651756760: nam='db file scattered read' ela= 955 file#=5 block#=24194 blocks=126 obj#=75579 tim=1372970940142894
WAIT #140150651756760: nam='db file scattered read' ela= 2916 file#=5 block#=24322 blocks=126 obj#=75579 tim=1372970940148287
WAIT #140150651756760: nam='db file scattered read' ela= 2111 file#=5 block#=24450 blocks=126 obj#=75579 tim=1372970940152527
WAIT #140150651756760: nam='db file scattered read' ela= 2952 file#=5 block#=24578 blocks=126 obj#=75579 tim=1372970940157492

Then the extents of 1024 blocks, which start off with 4 space bitmap blocks at the beginning of every extent:

WAIT #140150651756760: nam='db file scattered read' ela= 3079 file#=5 block#=36228 blocks=128 obj#=75579 tim=1372970941036955
WAIT #140150651756760: nam='db file scattered read' ela= 2737 file#=5 block#=36356 blocks=128 obj#=75579 tim=1372970941048199
WAIT #140150651756760: nam='db file scattered read' ela= 2531 file#=5 block#=36484 blocks=128 obj#=75579 tim=1372970941059473
WAIT #140150651756760: nam='db file scattered read' ela= 2912 file#=5 block#=36612 blocks=128 obj#=75579 tim=1372970941071015
WAIT #140150651756760: nam='db file scattered read' ela= 2768 file#=5 block#=36740 blocks=128 obj#=75579 tim=1372970941082467
WAIT #140150651756760: nam='db file scattered read' ela= 287 file#=5 block#=36868 blocks=128 obj#=75579 tim=1372970941091378
WAIT #140150651756760: nam='db file scattered read' ela= 2831 file#=5 block#=36996 blocks=128 obj#=75579 tim=1372970941102969
WAIT #140150651756760: nam='db file scattered read' ela= 536 file#=5 block#=37124 blocks=124 obj#=75579 tim=1372970941111811

We see 7 lines of 128 blocks read (7*128=896) and one line of 124 blocks (896+124=1020) which is 4 blocks short of the extent size of 1024, which are the space bitmap blocks.

Okay, nothing new here.

If I raise the parameter db_file_multiblock_read_count to 2000, and reissue the same 'select count(*) from t2', I get exactly the same waits. In other words: nothing different.

So the conclusion is: when using buffered reads, the statement about 1MB being the limit of the size of a single multiblock read indeed is true.

Now let's move on to direct path reads. In short, when a table is deemed 'big' by the database, meaning the number of blocks occupied by the table is more than five times _small_table_threshold (which is 2% of the buffercache by default), the database foregrounds will not use the buffercache to read the table blocks into, but read them into the process' PGA, to avoid discarding useful blocks in the cache and to avoid the overhead of concurrency control (latching). There are more criteria, but that's outside the scope of this blogpost.

First let's trace a session doing direct path reads. This is easier than the above situation where I forced buffered reads, because I can do the same as above but skip setting event 10949. On the other hand it's harder, because in order to see all the IOs using tracing (which means 'get a wait for every IO'), I need to reduce the number of IOs (I use 1 IOPS), because otherwise some IOs will not get a wait, so not all IOs will be visible in the tracefile (look at this article to learn how to reduce IOPS with Linux, and come see me present my About multiblock reads presentation to learn why some waits for direct path read IOs may not show up).

So, here are snippets from the tracefile of a 'select count(*) from t2', with db_file_multiblock_read_count set to 128 and SQL trace at level 8:

First we see the reads in the small extents (8 blocks) from BMB to BMB, not stopping at the extent border:

WAIT #140402935204600: nam='direct path read' ela= 996117 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1373055770533891
WAIT #140402935204600: nam='direct path read' ela= 997014 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1373055771531222
WAIT #140402935204600: nam='direct path read' ela= 1000275 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1373055772531716
WAIT #140402935204600: nam='direct path read' ela= 999111 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1373055773531249
WAIT #140402935204600: nam='direct path read' ela= 999628 file number=5 first dba=24001 block cnt=15 obj#=75579 tim=1373055774531093

Then we see the reads of 126 blocks, of the extents of 128 blocks:

WAIT #140402935204600: nam='direct path read' ela= 1004881 file number=5 first dba=24066 block cnt=126 obj#=75579 tim=1373055778541646
WAIT #140402935204600: nam='direct path read' ela= 989042 file number=5 first dba=24194 block cnt=126 obj#=75579 tim=1373055779531980
WAIT #140402935204600: nam='direct path read' ela= 1002835 file number=5 first dba=24322 block cnt=126 obj#=75579 tim=1373055780541509
WAIT #140402935204600: nam='direct path read' ela= 989545 file number=5 first dba=24450 block cnt=126 obj#=75579 tim=1373055781531621
WAIT #140402935204600: nam='direct path read' ela= 1007730 file number=5 first dba=24578 block cnt=126 obj#=75579 tim=1373055782539986

And then the reads of the extents of the size of 1024 blocks:

WAIT #140402935204600: nam='direct path read' ela= 2000086 file number=5 first dba=36228 block cnt=124 obj#=75579 tim=1373055875542615
WAIT #140402935204600: nam='direct path read' ela= 1995550 file number=5 first dba=36480 block cnt=128 obj#=75579 tim=1373055877539193
WAIT #140402935204600: nam='direct path read' ela= 1999327 file number=5 first dba=36736 block cnt=128 obj#=75579 tim=1373055879539584
WAIT #140402935204600: nam='direct path read' ela= 1998883 file number=5 first dba=36992 block cnt=128 obj#=75579 tim=1373055881539584
WAIT #140402935204600: nam='direct path read' ela= 1998753 file number=5 first dba=37252 block cnt=124 obj#=75579 tim=1373055883539416

If you take a close look, you'll see IOs are still missing (just add cnt to dba and see if it fits the next dba, with the exception of the BMBs, which are not read and thus leave a gap). Nevertheless: this is as we expect.

Now let’s do the same, but set db_file_multiblock_read_count to 2000! In order to reduce length: the IO’s of the 8 blocks and 128 blocks extents are identical. Let’s look at the 1024 blocks extent IOs:

WAIT #140550163651400: nam='direct path read' ela= 7862301 file number=5 first dba=32132 block cnt=1020 obj#=75579 tim=1373057289713781
WAIT #140550163651400: nam='direct path read' ela= 7996018 file number=5 first dba=33156 block cnt=1020 obj#=75579 tim=1373057297713687
WAIT #140550163651400: nam='direct path read' ela= 7996272 file number=5 first dba=34180 block cnt=1020 obj#=75579 tim=1373057305713868
WAIT #140550163651400: nam='direct path read' ela= 7995719 file number=5 first dba=35204 block cnt=1020 obj#=75579 tim=1373057313713547
WAIT #140550163651400: nam='direct path read' ela= 7996310 file number=5 first dba=36228 block cnt=1020 obj#=75579 tim=1373057321713593

Wow! These are HUGE IOs! The entire 1024 block extent is read in one (oracle) IO excluding the 4 BMB’s.

So this means Christian was right, and the Oracle documentation is wrong about the maximum size of db_file_multiblock_read_count for direct path reads.

There’s much more to be said about huge IOs, but I’ll leave that for another blogpost.


Tagged: oracle IO multiblock read MBRC multiblock read count

Extra huge database IOs, part 2


This is part 2 of a number of blogposts about huge Oracle database IO’s.
If you landed on this blogpost and did not read part 1, please read part 1 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads.

But how much bigger can these IOs be? In part 1 I showed Oracle IOs of 1020 blocks. Is that the limit? To investigate this, I need a table with larger extents (table T2 in part 1 had a maximum extent size of 1024 blocks, which meant that 1020 blocks is the biggest IO possible from that table).

For the sake of this investigation I created a much bigger table to get larger extents:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
       263	 8192
       264	 8192
       265	 8192

266 rows selected.

So with my new, big table I got extents of up to 8192 blocks. Let's do the same test on this table as I did in the previous post, which means setting db_file_multiblock_read_count to 2000, with SQL trace at level 8 and throttled IOPS to get waits!

The IOs against the "small" extents of 8 blocks and 128 blocks are done exactly the same as in the previous post. The IOs done against the extents with a size of 1024 blocks are the same too, simply because reading the 1020 blocks in one IO is already as efficient as it can be! But let's take a look at the point where the extents become 8192 blocks:

WAIT #140281084141152: nam='direct path read' ela= 7996378 file number=5 first dba=174468 block cnt=1020 obj#=76227 tim=1373104660882677
WAIT #140281084141152: nam='direct path read' ela= 7995820 file number=5 first dba=175489 block cnt=1023 obj#=76227 tim=1373104668882345
WAIT #140281084141152: nam='direct path read' ela= 7996472 file number=5 first dba=176520 block cnt=632 obj#=76227 tim=1373104676882677
WAIT #140281084141152: nam='direct path read' ela= 7998049 file number=5 first dba=177152 block cnt=1024 obj#=76227 tim=1373104684883512
WAIT #140281084141152: nam='direct path read' ela= 7995472 file number=5 first dba=178176 block cnt=1024 obj#=76227 tim=1373104692882932
WAIT #140281084141152: nam='direct path read' ela= 7993677 file number=5 first dba=179200 block cnt=1024 obj#=76227 tim=1373104700880106
WAIT #140281084141152: nam='direct path read' ela= 7996969 file number=5 first dba=180224 block cnt=1024 obj#=76227 tim=1373104708880891
WAIT #140281084141152: nam='direct path read' ela= 5998630 file number=5 first dba=181248 block cnt=1024 obj#=76227 tim=1373104714882889
WAIT #140281084141152: nam='direct path read' ela= 9996459 file number=5 first dba=182272 block cnt=1024 obj#=76227 tim=1373104724882545

I’ve included one IO of 1020 blocks, after which Oracle issues an IO request of 1023 blocks, then 632 blocks, and then it issues a number of requests for 1024 blocks.

It seems that the maximum number of blocks in a multiblock IO request done by Oracle "unbuffered" alias "direct path" is 1024 blocks, which is 8MB! Please mind I have tested this ONLY with Oracle version 11.2.0.3, 64 bits, on Linux x64 with ASM, with a database and segment blocksize of 8kB.

There’s another peculiarity I want to introduce. Let’s take a look at the dba and block cnt only, together with block_id and blocks (number of blocks in the extent, taken from dba_extents):

                                  BLOCK_ID	  BLOCKS         EXTENT_ID
dba=174468 block cnt=1020           174464          1024               197
dba=175489 block cnt=1023           175488          1024               198
dba=176520 block cnt=632            176512          8192               199
dba=177152 block cnt=1024
dba=178176 block cnt=1024
dba=179200 block cnt=1024
dba=180224 block cnt=1024
dba=181248 block cnt=1024
dba=182272 block cnt=1024
dba=183296 block cnt=1024 

There are two anomalies here. The first one is at dba 175489: it seems that dba 175489/extent id 198 has only one BMB (bitmap block), so Oracle can read 1023 blocks of the extent, instead of the 1020 blocks in the other extents with a size of 1024 blocks.

The second anomaly is much more interesting: extent id 199 starts at block id 176512. This is an extent consisting of 8192 blocks. The first IO starts at block id 176520. This means there are 176520-176512=8 blocks not read, which are probably BMB's. That is in line with my expectation. What is odd is that the first IO issued for data blocks in this extent, which is 8192 blocks in length, is "only" 632 blocks!

After the oddly sized IO request at the beginning of the extent (632), the other IO requests are the full 1024 blocks, which probably is the maximum size of direct path Oracle IO requests.

I wondered about the oddly sized request at the beginning. I looked at the dba's of the requests, and noticed that the dba's of the 1024-block requests looked vaguely familiar. Then I spotted the familiarity, and a possible reason for the "small" request at the beginning of the extent:

SYS@v11203 AS SYSDBA> select mod(176520,1024) from dual;

MOD(176520,1024)
----------------
	     392

SYS@v11203 AS SYSDBA> select mod(177152,1024) from dual;

MOD(177152,1024)
----------------
	       0

SYS@v11203 AS SYSDBA> select mod(178176,1024) from dual;

MOD(178176,1024)
----------------
	       0

ONLY when the direct path read requests can potentially read "the full" 1024 blocks in one IO does Oracle first issue an IO to align the later requests to a block id that is a multiple of 1024, after which it issues IO requests of 1024 blocks.

I don’t know why the alignment takes place. I can imagine this alignment can line up with ASM, thus avoiding a single Oracle IO request touching more than one ASM allocation unit. But I am open to other people’s ideas and suggestions about potential reasons!
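To make the alignment arithmetic tangible, here is a tiny sketch in C; this is my interpretation of the observed behaviour, not Oracle code, and the dba and the 1024 block maximum are simply taken from the trace above:

#include <stdio.h>

/* Sketch of the observed alignment behaviour, NOT Oracle source:
 * when an extent is large enough to allow 1024 block IOs, the first
 * request only reads up to the next multiple of 1024, after which
 * full 1024 block requests follow. */
int main(void)
{
        const int max_io = 1024;        /* maximum observed direct path IO in blocks */
        int first_dba = 176520;         /* first data block of the 8192 block extent */

        int lead = max_io - (first_dba % max_io);   /* blocks up to the next 1024 boundary */
        if (lead == max_io)
                lead = 0;               /* already aligned: no leading request needed */

        printf("leading request: %d blocks\n", lead);                 /* 632    */
        printf("next request starts at dba %d\n", first_dba + lead);  /* 177152 */
        return 0;
}

It prints 632, which is exactly the oddly sized first request, and the next request then starts at dba 177152, a multiple of 1024, just like the trace shows.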

Stay tuned for further updates about extra huge database IOs!


Tagged: oracle IO direct path IO huge IO large wait internals

The Oracle ‘db file parallel read’ wait event


This is a small note describing how Oracle implemented the situation which is covered by the db file parallel read wait event. This event happens if Oracle knows it must read multiple blocks which are not adjacent (thus from different random files and locations), and cannot continue processing with the result of a single block. In other words: it cannot process something after reading a single block (otherwise Oracle would read a single block, visible by the wait 'db file sequential read').

This is how it shows up if you enable sql trace:

WAIT #139658359011504: nam='db file parallel read' ela= 69997140 files=1 blocks=70 requests=70 obj#=76227 tim=1373200106669612

What this shows is Oracle issuing a request for 70 blocks. This has an interesting implication for monitoring and looking at the time spent on the event 'db file parallel read': if you don't know the number of blocks for which an IO request is issued, it's impossible to say something about the time. So just monitoring or looking at the cumulative time spent in 'db file parallel read' doesn't say anything about IO latency, it only tells you where the Oracle process spent its time.

How did Oracle implement this? This is obviously port specific (which means the implementation will be different on different operating systems). My test environment is Oracle Linux 6u3 X64, Oracle 11.2.0.3 64 bit using ASM.

This is how the requests are asynchronously submitted to the operating system:

Breakpoint 2, io_submit (ctx=0x7f04c0c8d000, nr=70, iocbs=0x7fff86d965f0) at io_submit.c:23

So all the IO requests are submitted in one go!

The submission of the IO requests is not covered by a wait, which makes sense, because the io_submit call is (supposed to be) non-blocking.

Next Oracle waits for ALL the IOs to finish, covered by the ‘db file parallel read’ wait event:

Breakpoint 13, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, io_getevents_0_4 (ctx=0x7f04c0c8d000, min_nr=70, nr=128, events=0x7fff86d9c798, timeout=0x7fff86d9d7a0) at io_getevents.c:46
Breakpoint 14, 0x0000000008fa1334 in kslwtectx ()

We see kslwtbctx, which indicates the start of a wait event, then an io_getevents call:

‘ctx’ is the IO context. This is how Linux keeps track of groups of asynchronous IO requests.
‘min_nr’ is the minimal number of requests that must be ready for this call to succeed, this call will wait until ‘timeout’ is reached. io_getevents will just peek if ‘timeout’ is set to zero.
‘nr’ is the maximal number of requests that io_getevents will “fetch”.
‘events’ is a struct (table like structure) that holds the information about the iocb’s (IO control blocks) of the requests.
‘timeout’ is a struct that sets the timeout of this call. For Oracle IO I see timeout being 600 (seconds) most of the time.

The last line shows kslwtectx, indicating that the wait has ended.
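To make the call pattern concrete, here is a minimal sketch in C using the libaio userspace wrapper (compile with -laio). The datafile name and the block offsets are made up, but the shape of the calls follows what is shown above: one io_submit() for all 70 single block reads, then one blocking io_getevents() with min_nr set to 70 and the timeout at 600 seconds:

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NREQ  70     /* the 70 non-adjacent blocks from the wait line above */
#define BLKSZ 8192   /* database block size */

int main(void)
{
        io_context_t ctx = 0;
        struct iocb iocbs[NREQ], *iocbp[NREQ];
        struct io_event events[128];            /* nr=128, as in the gdb output */
        struct timespec timeout = { 600, 0 };   /* the 600 second timeout */
        int fd, i;

        /* hypothetical datafile; O_DIRECT mimics Oracle's unbuffered IO */
        fd = open("/tmp/users01.dbf", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        for (i = 0; i < NREQ; i++) {
                void *buf;
                posix_memalign(&buf, 512, BLKSZ);
                /* made-up scattered offsets: the blocks are NOT adjacent */
                io_prep_pread(&iocbs[i], fd, buf, BLKSZ, (long long)i * 100 * BLKSZ);
                iocbp[i] = &iocbs[i];
        }

        io_submit(ctx, NREQ, iocbp);            /* all 70 requests in one system call */

        /* 'db file parallel read': block until ALL submitted IOs are ready,
         * which is why min_nr equals the number of submitted requests */
        int got = io_getevents(ctx, NREQ, 128, events, &timeout);
        printf("reaped %d of %d IOs\n", got, NREQ);

        io_destroy(ctx);
        return 0;
}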


Tagged: oracle wait event gdb debuggin trace 10046 howstuffworks linux c call

Extra huge database IOs, part 3


This is part 3 of a number of blogposts about huge Oracle database IO’s.
If you landed on this blogpost and did not read part 1 or part 2, please read part 1 here and part 2 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads (which are recognisable via the ‘db file scattered read’ wait event). In part 2 I showed that with Oracle Linux 6u4 X64, Oracle 11.2.0.3 64 bits and ASM the maximal size of an IO could go up to 1024 blocks if db_file_multiblock_read_count was set high enough, and that once an IO could be 1024, Oracle aligns the IO to block id being a multiplication of 1024.

With all the gained knowledge, the next thing I ask myself is: how did Oracle implement this? What operating system calls are done? How can we do IOs of 8MB whilst the maximal IO size of Linux is supposed to be 1MB? The obvious answer is to take a peek!

In order to see the exact working, it is once again necessary to reduce the number of IOPS, to see both the io_getevents calls which probe the completion queue (which are not timed/captured by a wait event) and the blocking io_getevents call which is timed with the ‘direct path read’ wait event. To throttle IO (reduce IOPS), see this article.

Let’s focus on how direct path IO works first. Direct path IO (on Linux with ASM is what I researched) works roughly in 3 stages:

Before a direct path IO takes place, Oracle does some things which are necessary to make direct path IO's happen. Some of the things are issuing an object checkpoint (visible with the 'enq: KO – fast object checkpoint' waitevent) to make sure the latest version of the blocks involved is on disk, and a rebuild of an operating system asynchronous IO context (visible via the io_destroy() and io_setup() system calls; this is not captured by an Oracle waitevent). The Oracle waitevent 'asynch descriptor resize', which could show up during this phase, actually is about the Oracle asynchronous IO tracking mechanism, which Oracle often refers to as 'slots'.

a) submit/process phase:
After the above mentioned initial operations, Oracle submits IO’s via the io_submit() system call. Oracle issues two IO requests for two distinct ranges of blocks when it starts a direct path read scan, which is the minimal number of IO requests which are issued in parallel. This number could go up (up to 32 concurrent requests). These are two distinct io_submit() calls.

When Oracle enters the submit/process phase again, it does something interesting: it first processes a single (multiblock) IO request, then issues an IO request via io_submit(), then processes another IO request (if available), then issues an IO request (io_submit()), and so on, depending on the number of IOs Oracle needs to submit and needs to process.

It’s important to note that Oracle did not put any waits in this phase. This is all instrumented as cpu time.

b) reap all phase:
Now that Oracle has issued a number of IOs, it does something truly interesting: it issues io_getevents() calls asking for ALL the IOs this process submitted, with the timeout set to 0 (zero). This does two things: it makes the calls non-blocking, and it makes them invisible to strace (the Linux system call trace utility) if they do not succeed (!!).

This phase is not covered by a waitevent. This means that if IOs are fast, and this phase succeeds in reaping all IOs, there is no wait, and the time is instrumented as cpu time.

Also, if all IOs are reaped in this phase, it can go back to the submit/process phase.

c) wait for IO phase:
If Oracle comes to this phase, the process has done everything it can, and can do nothing but wait for IO to arrive. So that is exactly what the process is going to do: it issues an io_getevents() call with the timeout set to 600 seconds, and waits for any SINGLE IO request to arrive. Here Oracle "truly" waits, and this is covered by a wait event: 'direct path read'. Once a single IO request is reaped, this step is ready, and the process goes back to the submit/process phase.
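Summarised in code, this is roughly how I picture the three phases. This is a strongly simplified sketch under my own assumptions (libaio, a made-up file, only two requests kept in flight and no resubmission or block processing), certainly not the actual Oracle implementation:

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define IO_SIZE   (1024 * 1024)   /* 1MB operating system IOs */
#define IN_FLIGHT 2               /* minimal number of outstanding requests */

int main(void)
{
        io_context_t ctx = 0;
        struct io_event events[128];
        struct timespec nowait = { 0, 0 };    /* "reap all" phase: just peek */
        struct timespec block  = { 600, 0 };  /* "wait for IO" phase: block up to 600s */
        long long offset = 0;
        int outstanding = 0, fd, i;

        /* hypothetical segment to scan; O_DIRECT mimics unbuffered IO */
        fd = open("/tmp/segment.dbf", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        /* a) submit/process phase: bring the number of requests in flight up */
        for (i = 0; i < IN_FLIGHT; i++) {
                struct iocb *cb = malloc(sizeof(*cb));
                void *buf;
                posix_memalign(&buf, 512, IO_SIZE);
                io_prep_pread(cb, fd, buf, IO_SIZE, offset);
                offset += IO_SIZE;
                io_submit(ctx, 1, &cb);
                outstanding++;
        }

        while (outstanding > 0) {
                /* b) reap all phase: non-blocking peek for ALL outstanding IOs,
                 * not covered by a wait event */
                int got = io_getevents(ctx, outstanding, 128, events, &nowait);

                /* c) wait for IO phase: block for a SINGLE IO,
                 * covered by 'direct path read' */
                if (got <= 0)
                        got = io_getevents(ctx, 1, 128, events, &block);

                outstanding -= got;
                /* ...process the reaped blocks, submit follow-up requests and
                 * go back to a); left out of this sketch... */
        }

        io_destroy(ctx);
        close(fd);
        return 0;
}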

Okay, now that we are up to date with how Oracle processes asynchronous direct path IO, let’s dig into how this works with huge IOs!

For investigating this issue, the linux strace utility actually works best, because it interprets the pointers in the system calls and shows the content of the structs, instead of just showing the pointers, which is what gdb does. As always, you need to carefully choose the tools you use for the job. When using strace, the only hint about the phase in which an io_getevents() call was done is the timeout value(!).

This is a snippet of the strace output when Oracle was doing 1020 blocks IO requests:

io_submit(140019063115776, 8, {{0x7f58b89fbfb8, 0, 0, 0, 256}, {0x7f58b89fc210, 0, 0, 0, 257}, {0x7f58b89fc468, 0, 0, 0, 256}, {0x7f58b89fc6c0, 0, 0, 0, 257}, {0x7f58b89fc918, 0, 0, 0, 256}, {0x7f58b89fcb70, 0, 0, 0, 257}, {0x7f58b89fcdc8, 0, 0, 0, 256}, {0x7f58b89fd020, 0, 0, 0, 257}}) = 8
io_getevents(140019063115776, 8, 128, {{0x7f58b89f6db0, 0x7f58b89f6db0, 1048576, 0}, {0x7f58b89f7008, 0x7f58b89f7008, 1048576, 0}, {0x7f58b89f66a8, 0x7f58b89f66a8, 1048576, 0}, {0x7f58b89f6450, 0x7f58b89f6450, 1048576, 0}, {0x7f58b89f61f8, 0x7f58b89f61f8, 1048576, 0}, {0x7f58b89f7710, 0x7f58b89f7710, 1048576, 0}, {0x7f58b89f7968, 0x7f58b89f7968, 1048576, 0}, {0x7f58b89f7bc0, 0x7f58b89f7bc0, 1015808, 0}}, {600, 0}) = 8

This is a combination of an io_submit() call and an io_getevents(). There is a lot to say about this.
First, Oracle did not magically increase the maximum IO size of the operating system. As we can see, Oracle submits multiple IO requests in a single io_submit() call. If we peek at the size field in the io_getevents() call, we see the size mostly being 1048576. In other words: the individual IO requests are 1MB, as we expect; Oracle submitted multiple operating system requests in order to fulfil the single (Oracle level) request of 1020 blocks.

If we add up the IO sizes of the individual requests we get: (1048576*7)+1015808=8355840 bytes. 8355840/8192 (=blocksize) = 1020. So the combined size of the IO requests matches the 1020 Oracle blocks of the request.

Next, if we look at the timeout "field" of io_getevents(), it reads {600, 0}. This means this was a blocking call, with the timeout set to 600 seconds. We can deduce from that field that this was an io_getevents() call from the wait for IO phase.

I urge you to take an even closer look. If you look at the pointer to the iocb (the large hexadecimal number), you will see that the IOs submitted are different from the IOs reaped with io_getevents(). This is Oracle taking advantage of asynchronous IO.

I’ve throttled IO to 1 IOPS, so I know all the requests will go up to the wait for IO phase.

Let’s take a look at a strace output if IO is not throttled to 1 IOPS:

io_getevents(140136915922944, 28, 128, {{0x7f7429499710, 0x7f7429499710, 1048576, 0}, {0x7f7429499e18, 0x7f7429499e18, 1048576, 0}, {0x7f7429498900, 0x7f7429498900, 1048576, 0}, {0x7f742949c5f0, 0x7f742949c5f0, 1048576, 0}, {0x7f742949ba38, 0x7f742949ba38, 1048576, 0}, {0x7f7429498db0, 0x7f7429498db0, 1048576, 0}, {0x7f742949ac28, 0x7f742949ac28, 1048576, 0}, {0x7f742949a2c8, 0x7f742949a2c8, 1048576, 0}, {0x7f742949b7e0, 0x7f742949b7e0, 1048576, 0}, {0x7f742949a070, 0x7f742949a070, 1048576, 0}, {0x7f7429499260, 0x7f7429499260, 1048576, 0}, {0x7f74294981f8, 0x7f74294981f8, 1048576, 0}, {0x7f742949c848, 0x7f742949c848, 1048576, 0}, {0x7f742949a778, 0x7f742949a778, 1048576, 0}}, {0, 0}) = 14

The timeout struct is set to {0, 0}. So this is an io_getevents() call from the 'reap all' phase. As we can see, the total number of IOs in this case is 28 (that is a little less than 28MB!). Another interesting thing is that this call returned 14 IO requests, but still succeeded, despite the minimal number of requests ready being set to 28. My guess is that another io_getevents() call or calls reaped the other IOs, and this request completed the 28 requests, which made this call successful.

Because this is in the 'reap all' phase, there is no wait event. So this huge number of IOs being ready is not reflected in the wait events, nor can it be derived from them for this specific request. Well, you could calculate the number of IOs needed to read this segment by looking at DBA_EXTENTS, calculate the total time taken for reading the segment, and calculate an average, but that's all.

Because the 'direct path read' wait event only shows up when we need to wait, which will be for a single request, a wait will always show at most 1024 blocks, because that is the maximum (which I've found up to now). If all the IOs are ready before the 'wait for IO' phase, it means there are more IO requests ready than a single Oracle IO.


Tagged: oracle IO wait waits waitevent event performance direct path io_submit io_getevents

Oracle IO on linux: log writer IO and wait events


This post is about log writer (lgwr) IO.
It’s good to point out the environment on which I do my testing:
Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM.

In order to look at what the logwriter is doing, a 10046 trace of the lgwr at level 8 gives an overview.
A way of doing so is using oradebug. Be very careful about using oradebug on production environments: it can cause the instance to crash.

This is how I did it:

SYS@v11203 AS SYSDBA> oradebug setospid 2491
Oracle pid: 11, Unix process pid: 2491, image: oracle@ol63-oracle.local (LGWR)
SYS@v11203 AS SYSDBA> oradebug unlimit
Statement processed.
SYS@v11203 AS SYSDBA> oradebug event 10046 trace name context forever, level 8
Statement processed.

Of course 2491 is the Linux process id of the log writer, as is visible with “image”.

When you look at the tracefile, the logwriter trace of an idle database will look like this:

*** 2013-08-29 21:10:45.593
WAIT #0: nam='rdbms ipc message' ela= 3000575 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803445592696

*** 2013-08-29 21:10:48.595
WAIT #0: nam='rdbms ipc message' ela= 3000566 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803448594695

*** 2013-08-29 21:10:51.597
WAIT #0: nam='rdbms ipc message' ela= 3000812 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803451596813

*** 2013-08-29 21:10:54.599
WAIT #0: nam='rdbms ipc message' ela= 3000733 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803454598806

*** 2013-08-29 21:10:57.601
WAIT #0: nam='rdbms ipc message' ela= 3000460 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803457600717

*** 2013-08-29 21:11:00.603
WAIT #0: nam='rdbms ipc message' ela= 3000526 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803460602699

*** 2013-08-29 21:11:03.605
WAIT #0: nam='rdbms ipc message' ela= 3000640 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803463604715

*** 2013-08-29 21:11:04.218
WAIT #0: nam='rdbms ipc message' ela= 612031 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803464218054

What we see here is the logwriter spending 3 seconds (ela= 3000812) on the event 'rdbms ipc message', simply because the timeout (the timeout column, obviously) was set so (timeout=300; the timeout is in centiseconds, Oracle's old "fine" measurement of time). To understand what this means from the perspective of the operating system (Linux), there's a neat "trick": use strace with verbose output of writes. That way the extended SQL trace "documents" the system calls. This is how that looks:

[root@ol63-oracle ~]# strace -e write=all -e all -p 2491
Process 2491 attached - interrupt to quit
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
write(32, "\n*** 2013-08-29 21:35:50.740\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 30  2e 37 34 30 0a           21:35:50 .740.    |
lseek(32, 0, SEEK_CUR)                  = 118213
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 32 30  38 37 34 35 32 20 74 69   ela= 20 87452 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 30 37 34 30 33 37 33  13778049 50740373 |
write(33, "J?Jqz7~UW1\n", 11)           = 11
 | 00000  4a 3f 4a 71 7a 37 7e 55  57 31 0a                 J?Jqz7~U W1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
semtimedop(327683, {{15, -1, 0}}, 1, {0, 910000000}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
write(32, "\n*** 2013-08-29 21:35:51.651\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 31  2e 36 35 31 0a           21:35:51 .651.    |
lseek(32, 0, SEEK_CUR)                  = 118339
write(32, "WAIT #0: nam='rdbms ipc message'"..., 94) = 94
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 39 31  30 36 37 37 20 74 69 6d   ela= 91 0677 tim |
 | 00030  65 6f 75 74 3d 39 31 20  70 32 3d 30 20 70 33 3d  eout=91  p2=0 p3= |
 | 00040  30 20 6f 62 6a 23 3d 2d  31 20 74 69 6d 3d 31 33  0 obj#=- 1 tim=13 |
 | 00050  37 37 38 30 34 39 35 31  36 35 31 37 32 34        77804951 651724   |
write(33, "J?HRU3~UU1\n", 11)           = 11
 | 00000  4a 3f 48 52 55 33 7e 55  55 31 0a                 J?HRU3~U U1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
write(32, "\n*** 2013-08-29 21:35:54.652\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 34  2e 36 35 32 0a           21:35:54 .652.    |
lseek(32, 0, SEEK_CUR)                  = 118463
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 33 30  30 30 33 33 32 20 74 69   ela= 30 00332 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 34 36 35 32 38 32 36  13778049 54652826 |
write(33, "J?MZSB~UW1\n", 11)           = 11
 | 00000  4a 3f 4d 5a 53 42 7e 55  57 31 0a                 J?MZSB~U W1.      |
lseek(32, 0, SEEK_CUR)                  = 118559
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(33, "3?uG~01\n", 8)               = 8
 | 00000  33 3f 75 47 7e 30 31 0a                           3?uG~01.          |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037

First thing to notice (not relevant to this text about logwriter IOs, but nice to know): any time there's a timestamp in the Oracle trace file (like *** 2013-08-29 21:38:58.721), Oracle issued a times() syscall just prior to that. If you want to know what that does, just issue "man 2 times" on the linux prompt. As a summary: this syscall fetches the CPU time spent in user mode and system/kernel mode, for the current process and for its children.

Back to what this post is about: we see rdbms ipc message waits. This is the instrumentation of the time the process actually spends sleeping on a semaphore, with the timeout mostly set to 3 seconds:

semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

Use the manpage of semtimedop to understand it all, but the last struct ({3, 0}) is a struct timespec. The first number is seconds, the second number is nanoseconds. In other words: if you see the logwriter spending its time on 'rdbms ipc message', it's waiting for work. The semaphore construction is made so the process can be asked (by another process) to do something, and can react to that immediately.
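To see this mechanism outside of Oracle, here is a minimal sketch of the same construction: a System V semaphore plus semtimedop() with a {3, 0} timespec. Unlike the logwriter's semaphore (which lives in an existing set, semaphore number 15 in the trace above), the semaphore here is private and freshly created just for the demonstration:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

union semun { int val; };                /* glibc makes us define this ourselves */

int main(void)
{
        /* private semaphore set with one semaphore, explicitly set to 0:
         * nobody has posted us yet */
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid < 0) { perror("semget"); return 1; }
        union semun arg = { 0 };
        semctl(semid, 0, SETVAL, arg);

        struct sembuf sop = { 0, -1, 0 };    /* "take" the semaphore when posted */
        struct timespec timeout = { 3, 0 };  /* the 3 second timeout seen in the trace */

        /* returns 0 when another process posts the semaphore (work to do),
         * or -1 with EAGAIN when 3 seconds pass without being posted */
        if (semtimedop(semid, &sop, 1, &timeout) == -1 && errno == EAGAIN)
                printf("timed out, nothing to do (the 'rdbms ipc message' case)\n");
        else
                printf("posted: there is work to do\n");

        semctl(semid, 0, IPC_RMID);          /* clean up the semaphore set */
        return 0;
}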

I guess most people reading this blog, and coming this far, are aware what the function of the logwriter is (writing the log buffer to disk), and the logwriter does that if it finds something to write, or if some other process asks the logwriter to do so. The process which asks the logwriter to write waits for a response from the logwriter on the event ‘log file sync’.

If you use the extended SQL trace with strace “trick” you see this:

io_submit(140612782833664, 1, {{0x7fe2f5369450, 0, 1, 0, 257}}) = 1
io_getevents(140612782833664, 1, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 1024, 0}}, {600, 0}) = 1
lseek(26, 0, SEEK_CUR)                  = 5190
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 30 37 20  write' e la= 107  |
 | 00030  66 69 6c 65 73 3d 31 20  62 6c 6f 63 6b 73 3d 32  files=1  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 31 20 6f 62 6a 23   request s=1 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 37 30  =-1 tim= 13778070 |
 | 00060  32 34 39 35 31 33 31 38                           24951318          |

What we see is an asynchronously submitted IO request (one IO request; io_submit), which is reaped after that (io_getevents), after which the logwriter writes a wait event to the tracefile indicating this has happened. The log file parallel write event statistics tell us it wrote to 1 file (equalling one IO, as can be seen from the IO system calls), with a length of 2 blocks. As we know, the log block size is and has been 512 bytes on Linux, which can be seen in the io_getevents call (1024 equals 2*512). The elapsed time (ela) is 107 microseconds.

The log blocksize is and has been 512 bytes on Linux, but can also be 4kB with more recent versions of Oracle, for newer SSD devices.

It would be logical to think that in this example the IO time is 107 microseconds. But hold that thought, I'll come to that. Let's add a logfile member to every logfile, and switch to a new logfile, so that writes go to both logfile members. Let's look at the logwriter writing something now:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2
lseek(26, 0, SEEK_CUR)                  = 101469
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 31 37 20  write' e la= 117  |
 | 00030  66 69 6c 65 73 3d 32 20  62 6c 6f 63 6b 73 3d 32  files=2  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 32 20 6f 62 6a 23   request s=2 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 38 36  =-1 tim= 13778086 |
 | 00060  33 32 30 36 37 31 32 35                           32067125          |

We see here that Oracle submits two IOs at the same time, using one call, and reaps both of them (read the manpage of io_getevents; the minimal number of IOs is set to 2). So indeed log file parallel write writes in parallel. The number of blocks is a bit puzzling: in total there are two (log) blocks written (blocks=2). The earlier logwriter IO to a single online redologfile also had blocks set to 2, which meant it wrote 1024 bytes (2 blocks of 512 bytes); here blocks=2 seems to mean the logwriter wrote 2 blocks in total (512 bytes to each member, as the io_getevents call shows).

How would Oracle write when asynchronous IO is not enabled? Well, let's set disk_asynch_io to false, re-enable the extended sql trace, and strace it:

pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 1415701504) = 512
pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 6560015360) = 512
lseek(25, 0, SEEK_CUR)                  = 4600
write(25, "WAIT #0: nam='log file parallel "..., 103) = 103
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 35 20 66  write' e la= 15 f |
 | 00030  69 6c 65 73 3d 32 20 62  6c 6f 63 6b 73 3d 32 20  iles=2 b locks=2  |
 | 00040  72 65 71 75 65 73 74 73  3d 32 20 6f 62 6a 23 3d  requests =2 obj#= |
 | 00050  2d 31 20 74 69 6d 3d 31  33 37 37 38 30 39 33 39  -1 tim=1 37780939 |
 | 00060  37 38 38 31 38 31 39                              7881819           |

Ah! So when there’s no asynchronous IO, Oracle issues two pwrite() calls sequentially. Not really parallel!

Now think about what the wait means in both situations. In the first situation, with asynchronous IO (which should be your setting, if you have a modern system) the timing is about two IOs, and it seems to indicate the time both IOs took. Since they have been issued in parallel, if you have enough storage bandwidth left, this seems to indicate the time of the slowest IO. In the second situation, the IOs are issued serially, so the wait time seems to indicate the total latency time of the two IOs. This means you have to divide the wait time by the number of requests to get the average IO time of the IOs. Again: if the IOs are done synchronously instead of asynchronously.
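A small timing sketch of the synchronous case; scratch files, no O_DIRECT and no fsync, so it only illustrates the arithmetic, not real log write latency. The point is that with serial pwrite() calls the elapsed time is the sum of both write latencies, while with a single io_submit() of both requests the elapsed time would roughly be the latency of the slowest of the two:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define LOGBLK 512                       /* redo log block size on Linux */

static double now(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        char buf[LOGBLK];
        memset(buf, 0, sizeof(buf));

        /* two hypothetical log file members (scratch files here) */
        int fd1 = open("/tmp/redo01a.log", O_WRONLY | O_CREAT, 0600);
        int fd2 = open("/tmp/redo01b.log", O_WRONLY | O_CREAT, 0600);
        if (fd1 < 0 || fd2 < 0) { perror("open"); return 1; }

        double t0 = now();
        pwrite(fd1, buf, LOGBLK, 0);     /* first member */
        pwrite(fd2, buf, LOGBLK, 0);     /* second member, started only after the first returned */
        double serial = now() - t0;

        /* The two latencies ADD UP here; an average per IO only comes from
         * dividing by the number of requests. With io_submit() both writes
         * would be in flight at the same time. */
        printf("two serial pwrite() calls took %.6f seconds\n", serial);
        return 0;
}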

But is this true?

Let’s dig a little deeper using gdb! First synchronous IO. Attach to the logwriter using gdb (gdb -p PID), and issue the following debugger commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 3 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) 

Now insert something into a table, and commit.

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Hey! Do you see what I see? First I see the logwriter nicely sleeping like a little baby. With that I mean: I see kslwtbctx() indicating the start of the timing of a wait event, then semtimedop() issued for the sake of sleeping, and after semtimedop() has timed out, kslwtectx() to end the event. But then we start writing (which is the function of the log writer; the pwrite64 calls), and there seems to be a kslwtbctx() missing??? Actually, kslwtbctx() is there, but AFTER the pwrite64() calls!

Am I missing something? Is Oracle doing something incredibly smart? Well, let's start up an extended SQL trace and throttle the IO. By throttling the write IO for the logwriter I can (severely) increase the IO time, so the IO time should increase drastically (see the throttling blogpost on this blog; for the logwriter you must limit write IOPS instead of read IOPS). I've done that, and this is what the extended SQL trace shows:

*** 2013-08-29 23:11:54.509
WAIT #0: nam='log file parallel write' ela= 512 files=2 blocks=4 requests=2 obj#=-1 tim=1377810714509122

*** 2013-08-29 23:11:54.509
Warning: log write elapsed time 2004ms, size 1KB

*** 2013-08-29 23:11:56.509
WAIT #0: nam='log file parallel write' ela= 541 files=2 blocks=2 requests=2 obj#=-1 tim=1377810716509653

*** 2013-08-29 23:11:56.509
Warning: log write elapsed time 2000ms, size 0KB

This is what I call being caught with the fingers in the cookie jar! The wait event shows an (extremely) low timing for the IOs (the ela=), and after that we get a warning that the log writes were terribly slow! Oracle doesn't time its log write events correctly if we do synchronous IO in this situation!!!

Now let’s have a look at asynchronous IO. I do the same: attach to lgwr with gdb, and now break and continue on the asynchronous IO system calls: io_submit and io_getevents (please mind this requires breaking on io_getevents_0_4 with linux):

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 3 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should now see a stream of kslwtbctx and kslwtectx coming, which is the timing of the semtimedop() system calls. If you insert something in a table you’ll see:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001750) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40078f8, timeout=0x7fffc4008900) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Here's something worth seeing! The first line shows the ending of a wait (kslwtectx), then the logwriter submitting an IO request (io_submit). Notice there isn't any wait event timing started at this point, because kslwtbctx is called AFTER the IOs are submitted. This means that the wait event times the reaping of the submitted IOs, instead of measuring the full IO latency. I know this is and should be a tiny fragment of time, but nevertheless the wait only shows the waiting for the reaping of the IO, not the timing of the entire IO.

Now look at the io_submit/io_getevents strace output again shown earlier in the post:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2

Look at the struct timespec in the io_getevents call: {0, 0}. This is a non-blocking reap call just to peek at the O/S completion queue. In other words, if the submitted IOs take a long time/are not ready, this call will just be unsuccessful and execution continues. This looks like the asynchronous read implementation of direct path reads (see this blogpost about non-blocking io_getevents calls).

Let’s throttle the IO of the lgwr, and print out the timespec struct at the break on io_getevents:

(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>print *timeout
>c
>end

Now redo the insertion of data into a table, and commit to put the logwriter to work. Please mind IO is throttled!

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001640) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077e8, timeout=0x7fffc40087f0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$6 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077d8, timeout=0x7fffc40087e0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$7 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Ah! We see the io_submit call issuing two IOs (nr=2), then the start of the timing of the log file parallel write event (kslwtbctx), and then two io_getevents calls: the non-blocking one first, which almost certainly does not succeed (and is NOT shown by strace!), after which a blocking io_getevents call is issued (because seconds in the timespec struct is set to 600).

I haven't got an idea why the non-blocking io_getevents call is issued before the blocking one.

Summary
This is the description taken from Oracle 11.2 reference guide of log file parallel write:

‘Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.’

This is actually a very good description, surprisingly accurate for asynchronous IO.
As we've seen, this is not true for synchronous IO: there Oracle just issues the I/Os it needs to do sequentially, and they are not timed at all, yet a wait time is displayed, which is wrong.

It turns out that for asynchronous log writes, the log file parallel write event times the waiting for the writes to be flushed to disk. This means waiting for the slowest write is what is made visible by the event timing; all the fast(er) writes can have any time below the timing of the slowest write. This means the wait is timed from the perspective of the process, not as IO latency. It gives a fair hint of the IO time of the slowest IO, so for the way most people use it (measuring the impact of the time the logwriter spends on IO) it's fairly accurate, but for actual IO latency measurement it's weak. If you want to go beyond measuring the time the logwriter spends on IO, you should be very well aware of the implementation. For example, you cannot divide the log file parallel write time by the number of IOs; the resulting time does not reflect anything.

Correction: Chris Buckel pointed out I had ‘db file parallel write’ instead of ‘log file parallel write’ in the summary. This is corrected.
Correction: Kevin Closson pointed out that the log block size can be 4kB too with newer versions of Oracle.


Tagged: oracle IO performance gdb debug internal internals

Oracle IO on linux: database writer IO and wait events


This post is about database writer (dbwr, mostly seen as dbw0 nowadays) IO.
The test environment in which I made the measurements for this post: Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM. The test environment is a (VMWare Fusion) VM with 2 CPUs.

It might be a good idea to read my previous blog about logwriter IO.

The number of database writers depends on the number of CPUs visible to the instance (when not explicitly set with the DB_WRITER_PROCESSES parameter), and seems mostly to be CEIL(CPU_COUNT/8). There might be other things which could influence the number (NUMA comes to mind). In my case, I've got 2 CPUs visible, which means I get one database writer (dbw0).
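As a tiny illustration of that observation (this is what I see on my systems, not a documented formula):

#include <stdio.h>

/* Observed default only, not a documented formula:
 * roughly one database writer per 8 CPUs, rounded up. */
static int db_writer_processes(int cpu_count)
{
        return (cpu_count + 7) / 8;      /* integer CEIL(cpu_count/8) */
}

int main(void)
{
        printf("%d\n", db_writer_processes(2));    /* 1, as in this VM with 2 CPUs */
        printf("%d\n", db_writer_processes(24));   /* 3 */
        return 0;
}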

Let’s start the investigation! I enable a 10046 level 8 trace on the database writer. When you look at an idle Oracle database, you probably see the database writer process behaving the same as described in the first part of the log writer IO blog, which is sleeping at the wait event ‘rdbms ipc message’. This is the database writer waiting for work:

*** 2013-08-30 12:42:00.291
WAIT #0: nam='rdbms ipc message' ela= 3001013 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859320291675

*** 2013-08-30 12:42:03.292
WAIT #0: nam='rdbms ipc message' ela= 3000870 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859323292672

*** 2013-08-30 12:42:06.293
WAIT #0: nam='rdbms ipc message' ela= 3000806 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859326293581

There’s a description about this wait in the previously mentioned log writer blog. The short version is: ‘rdbms ipc message’ means: waiting to be posted for work.

It helps to make a little table, insert one or a few rows into it, and then commit. This dirties blocks in the buffercache. After that, issue 'alter system flush buffer_cache' to make Oracle write all the dirty blocks currently in the buffer cache. This is what is shown in the database writer trace file:

WAIT #0: nam='rdbms ipc message' ela= 328100 timeout=300 p2=0 p3=0 obj#=-1 tim=1377868410713860
WAIT #0: nam='db file async I/O submit' ela= 1 requests=14 interrupt=0 timeout=0 obj#=-1 tim=1377868410715372
WAIT #0: nam='db file parallel write' ela= 26 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1377868410715427

The first line is an 'rdbms ipc message' with an elapsed time (ela=) shorter than 3 seconds, because we triggered the database writer with 'alter system flush buffer_cache'. The database writer is notified to do something, and stops waiting.

Then we see two waits: 'db file async I/O submit' and 'db file parallel write'. 'db file parallel write' is a wait I suspected: Oracle must write these dirty blocks, IO costs time, and it's logical to have a wait event for that. I did see the wait event 'db file async I/O submit' in some AWR reports, but haven't encountered it at the top end of the waits, which means I haven't looked/needed to look at it yet…

Well, let's dive one layer deeper, and combine strace with a 10046 level 8 tracefile, using 'strace -e write=all -e all -p PID':

semtimedop(851971, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
write(18, "\n*** 2013-08-30 15:32:44.063\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 35 3a 33 32 3a 34 34  2e 30 36 33 0a           15:32:44 .063.    |
lseek(18, 0, SEEK_CUR)                  = 682279
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 32  38 31 32 31 31 20 74 69   ela= 12 81211 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 36 39 35  36 34 30 36 33 36 30 31  13778695 64063601 |
write(19, "J?2-u4~UW1\n", 11)           = 11
 | 00000  4a 3f 32 2d 75 34 7e 55  57 31 0a                 J?2-u4~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
io_submit(139931766530048, 24, {{0x7f44671a7a50, 0, 1, 0, 256}, {0x7f44671a7ca8, 0, 1, 0, 257}, {0x7f44671a7f00, 0, 1, 0, 256}, {0x7f44671a8158, 0, 1, 0, 256}, {0x7f44671a83b0, 0, 1, 0, 256}, {0x7f44671a8608, 0, 1, 0, 257}, {0x7f44671a8860, 0, 1, 0, 257}, {0x7f44671a8ab8, 0, 1, 0, 257}, {0x7f44671a8d10, 0, 1, 0, 257}, {0x7f44671a8f68, 0, 1, 0, 257}, {0x7f44671a91c0, 0, 1, 0, 257}, {0x7f44671a9418, 0, 1, 0, 257}, {0x7f44671a9670, 0, 1, 0, 257}, {0x7f446719e1f8, 0, 1, 0, 256}, {0x7f44671a77f8, 0, 1, 0, 257}, {0x7f44671a75a0, 0, 1, 0, 257}, {0x7f44671a7348, 0, 1, 0, 256}, {0x7f44671a70f0, 0, 1, 0, 256}, {0x7f44671a6e98, 0, 1, 0, 256}, {0x7f44671a6c40, 0, 1, 0, 256}, {0x7f44671a69e8, 0, 1, 0, 256}, {0x7f44671a6790, 0, 1, 0, 256}, {0x7f44671a6538, 0, 1, 0, 256}, {0x7f44671a62e0, 0, 1, 0, 256}}) = 24
lseek(18, 0, SEEK_CUR)                  = 682376
write(18, "WAIT #0: nam='db file async I/O "..., 109) = 109
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 61 73  79 6e 63 20 49 2f 4f 20   file as ync I/O  |
 | 00020  73 75 62 6d 69 74 27 20  65 6c 61 3d 20 31 20 72  submit'  ela= 1 r |
 | 00030  65 71 75 65 73 74 73 3d  32 34 20 69 6e 74 65 72  equests= 24 inter |
 | 00040  72 75 70 74 3d 30 20 74  69 6d 65 6f 75 74 3d 30  rupt=0 t imeout=0 |
 | 00050  20 6f 62 6a 23 3d 2d 31  20 74 69 6d 3d 31 33 37   obj#=-1  tim=137 |
 | 00060  37 38 36 39 35 36 34 30  36 36 31 36 37           78695640 66167    |
write(19, "J?JV~1j1\n", 9)              = 9
 | 00000  4a 3f 4a 56 7e 31 6a 31  0a                       J?JV~1j1 .        |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
io_getevents(139931766530048, 2, 128, {{0x7f44671a7a50, 0x7f44671a7a50, 8192, 0}, {0x7f44671a7ca8, 0x7f44671a7ca8, 8192, 0}, {0x7f44671a7f00, 0x7f44671a7f00, 8192, 0}, {0x7f44671a8158, 0x7f44671a8158, 8192, 0}, {0x7f44671a83b0, 0x7f44671a83b0, 8192, 0}, {0x7f44671a8608, 0x7f44671a8608, 8192, 0}, {0x7f44671a8860, 0x7f44671a8860, 8192, 0}, {0x7f44671a8ab8, 0x7f44671a8ab8, 8192, 0}, {0x7f44671a8d10, 0x7f44671a8d10, 8192, 0}, {0x7f44671a8f68, 0x7f44671a8f68, 8192, 0}, {0x7f44671a91c0, 0x7f44671a91c0, 8192, 0}, {0x7f44671a9418, 0x7f44671a9418, 8192, 0}, {0x7f44671a9670, 0x7f44671a9670, 8192, 0}, {0x7f446719e1f8, 0x7f446719e1f8, 8192, 0}, {0x7f44671a77f8, 0x7f44671a77f8, 8192, 0}, {0x7f44671a75a0, 0x7f44671a75a0, 8192, 0}, {0x7f44671a7348, 0x7f44671a7348, 16384, 0}, {0x7f44671a70f0, 0x7f44671a70f0, 8192, 0}, {0x7f44671a6e98, 0x7f44671a6e98, 8192, 0}, {0x7f44671a6c40, 0x7f44671a6c40, 16384, 0}, {0x7f44671a69e8, 0x7f44671a69e8, 8192, 0}, {0x7f44671a6790, 0x7f44671a6790, 8192, 0}, {0x7f44671a6538, 0x7f44671a6538, 8192, 0}, {0x7f44671a62e0, 0x7f44671a62e0, 8192, 0}}, {600, 0}) = 24
write(18, "WAIT #0: nam='db file parallel w"..., 117) = 117
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 35 33 20 72  rite' el a= 153 r |
 | 00030  65 71 75 65 73 74 73 3d  32 20 69 6e 74 65 72 72  equests= 2 interr |
 | 00040  75 70 74 3d 30 20 74 69  6d 65 6f 75 74 3d 32 31  upt=0 ti meout=21 |
 | 00050  34 37 34 38 33 36 34 37  20 6f 62 6a 23 3d 2d 31  47483647  obj#=-1 |
 | 00060  20 74 69 6d 3d 31 33 37  37 38 36 39 35 36 34 30   tim=137 78695640 |
 | 00070  36 36 35 35 30                                    66550             |

I started this snippet from strace with the semtimedop() system call. If you look at the extended sql trace (10046 level 8) output, you see that the database writer stopped sleeping on the semaphore after 1.28 seconds. Then some writes are done (to the .trc and .trm files), and a times() system call is issued to fetch usermode and kernelmode times, after which an important system call is shown: io_submit(). io_submit() submits 24 IOs here.
After the io_submit() system call, the process writes a wait event to the trace: ‘db file async I/O submit’, with an elapsed time (ela=) of 1 microsecond.

This looks like Oracle instrumented the submitting of IOs for the database writer! This is different from the logwriter and normal database foreground processes, where the io_submit() system call is not instrumented!

This wait event also shows the number of IOs (requests=24) which are submitted by the io_submit() call.

Then we see the call to reap the writes, io_getevents(). The io_getevents() system call needed at least 2 IOs ready to stop waiting (!). But instead, it was able to reap all 24 submitted IOs (you could count all the iocb’s in the io_getevents() call, or look at the returncode of the call, which is 24). The wait event of the reap is ‘db file parallel write’, as you can see from the write() system call. This is where it gets truly interesting, because when you look at the information in the wait line, you see requests=2, which probably has a relationship with the min_nr parameter in the io_getevents() call. We also see the io_getevents() call has a timeout of 600 seconds. This seems to be a general timeout for asynchronous IO calls done by Oracle on linux. What stands out for me is that the wait line does not contain any information about how many requests it did reap; two I/O’s ready was enough for the system call to succeed.

Now let's switch to synchronous IO. I probably have mentioned this a couple of times, but if you don't do asynchronous IO, you are doing it seriously wrong. Many of the newer IO functions in Oracle have been purpose-built for asynchronous IO; direct path reads are a famous example, for which asynchronous IO gives you the opportunity to overcome the IO latency penalty of every distinct IO.

This is how writing by the database writer looks with the combination of extended SQL trace and strace when disk_asynch_io is set to false, and the database writer is triggered by 'alter system flush buffer_cache':

semtimedop(983043, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
write(18, "\n*** 2013-08-30 16:34:19.162\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 36 3a 33 34 3a 31 39  2e 31 36 32 0a           16:34:19 .162.    |
lseek(18, 0, SEEK_CUR)                  = 9915
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 36  35 34 35 38 36 20 74 69   ela= 16 54586 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 37 33 32  35 39 31 36 32 37 36 32  13778732 59162762 |
write(19, "J?e7K6~UW1\n", 11)           = 11
 | 00000  4a 3f 65 37 4b 36 7e 55  57 31 0a                 J?e7K6~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
pwrite(256, "\6\242\0\0\207 \f\0\214\255O\0\0\0\n\6\236\316\0\0\1\0\0\0%1\1\0X\231O\0"..., 8192, 6221258752) = 8192
pwrite(257, "\2\242\0\0\326\22\300\0\214\255O\0\0\0\t\4\10\207\0\0\1\0\16\0\371\4\0\0005\2\32\32"..., 8192, 3414867968) = 8192
pwrite(256, "\2\242\0\0*\2\300\0\205\255O\0\0\0\1\4\236W\0\0\10\0\33\0=\6\0\0\305\2\v\v"..., 8192, 1359298560) = 8192
pwrite(256, "\2\242\0\0d\1\300\0\207\255O\0\0\0\1\4\321e\0\0\4\0\33\0#\5\0\0\35\3\20\20"..., 8192, 1358725120) = 8192
pwrite(257, "&\242\0\0\360\0\300\0\206\255O\0\0\0\1\4[\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1368260608) = 8192
pwrite(257, "&\242\0\0\260\0\300\0\210\255O\0\0\0\1\4\320\351\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367736320) = 8192
pwrite(257, "&\242\0\0\200\0\300\0\214\255O\0\0\0\2\4.)\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367343104) = 8192
pwrite(257, "\6\242\0\0\6\t@\0\210\255O\0\0\0\1\6Z\37\0\0\1\0\0\0N&\1\0\21\254O\0"..., 8192, 749780992) = 8192
pwrite(256, "\6\242\0\0\231\2@\0\206\255O\0\0\0\1\6q\207\0\0\1\0\0\0J\0\0\0\253\254O\0"..., 8192, 734208000) = 8192
lseek(18, 0, SEEK_CUR)                  = 10012
write(18, "WAIT #0: nam='db file parallel w"..., 107) = 107
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 32 34 20 72 65  rite' el a= 24 re |
 | 00030  71 75 65 73 74 73 3d 39  20 69 6e 74 65 72 72 75  quests=9  interru |
 | 00040  70 74 3d 30 20 74 69 6d  65 6f 75 74 3d 30 20 6f  pt=0 tim eout=0 o |
 | 00050  62 6a 23 3d 2d 31 20 74  69 6d 3d 31 33 37 37 38  bj#=-1 t im=13778 |
 | 00060  37 33 32 35 39 31 38 33  32 32 37                 73259183 227      |
write(19, "J?ny4~1h1\n", 10)            = 10
 | 00000  4a 3f 6e 79 34 7e 31 68  31 0a                    J?ny4~1h 1.       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(18, "WAIT #0: nam='db file parallel w"..., 106) = 106
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 20 72 65 71  rite' el a= 1 req |
 | 00030  75 65 73 74 73 3d 39 20  69 6e 74 65 72 72 75 70  uests=9  interrup |
 | 00040  74 3d 30 20 74 69 6d 65  6f 75 74 3d 30 20 6f 62  t=0 time out=0 ob |
 | 00050  6a 23 3d 2d 31 20 74 69  6d 3d 31 33 37 37 38 37  j#=-1 ti m=137787 |
 | 00060  33 32 35 39 31 38 33 34  32 30                    32591834 20       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semctl(983043, 16, SETVAL, 0x1)         = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semtimedop(983043, {{14, -1, 0}}, 1, {1, 320000000}) = -1 EAGAIN (Resource temporarily unavailable)

Mmmh...I see a number of very suspicious things. I started this snippet with the sleep on the semaphore, after which the process writes the wait event information about it. The next nine system calls are pwrite() system calls, which is the database writer writing dirty blocks to the datafiles. As you can see, this happens in a sequential way (one by one). After the writes, the process writes the wait event information. We get no 'db file async I/O submit' wait, because there is no asynchronous IO; it's directly the wait 'db file parallel write'. The number of requests in the wait is 9, which is the same as the number of pwrite() system calls.

When I look further (still at the first wait line!) I see ela=24. This is very suspicious...can I do 9 write calls to the O/S in 24 microseconds? But wait, it gets weirder!

I see yet another wait event 'db file parallel write' after the first one, without any pwrite()'s happening in between. That is strange. My guess is this wait represents the same 9 pwrite() requests, and it was timed at 1 microsecond (ela=1). I've verified this wait event in this situation (two wait lines for the same event): it happens at least with every 'alter system flush buffer_cache'. Also, an elapsed time of 1 microsecond is very suspicious.

Let's dig one layer deeper and look at the start and end of the wait event timing. In order to do this, I use gdb and the following commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) command
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) command
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 4 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

After gdb continues, I issue a ‘alter system flush buffer_cache’, and look at the gdb output:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

I've included the two semaphore waits. It is easy to spot that the pwrite64() calls are not timed. That is the reason the elapsed time in the 'db file parallel write' wait is so short…neither wait times any actual IO calls! Unless someone can prove me wrong, it looks like the wait event implementation for synchronous IO in the database writer is seriously broken.

But, as I’ve said: you really should have a platform which has the ability to do asynchronous IO.
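For completeness, asynchronous IO for database files on a filesystem is typically governed by the instance parameters below. This is a minimal sketch, not necessarily the exact settings of my test system; both parameters are static, so the instance needs to be restarted for them to take effect:

SQL> alter system set filesystemio_options=setall scope=spfile;
SQL> alter system set disk_asynch_io=true scope=spfile;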

Let’s switch to asynchronous IO again, and look at it at the same layer. Here are the gdb commands I’ve used:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660: file io_submit.c, line 23.
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 5 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should see the semtimedop() system calls with their accompanying kslwt[be]ctx calls scrolling over the screen every 3 seconds. Insert a few rows into a dummy table, commit, and issue an 'alter system flush buffer_cache' (a minimal example of these statements is shown directly below), and then look at the gdb output that follows:
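This is roughly what I execute in a second sqlplus session to generate a few dirty buffers and trigger the database writer (a sketch; the dummy table t is assumed to exist, for example created with 'create table t (n number)'):

SQL> insert into t values (1);
SQL> commit;
SQL> alter system flush buffer_cache;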

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

I included a semaphore wait at the beginning and at the end of this snippet.
The next remarkable thing is that the io_submit() system call does not seem to be timed by a wait event, because it is not enclosed in a kslwtbctx() and kslwtectx() combination. That's a pity, because the wait event 'db file async I/O submit' strongly hints at being the timing of submitting IO to the operating system.
After this, we see kslwtbctx(), io_getevents_0_4(), kslwtectx(). This is Oracle timing the reap of the writes submitted earlier.

To be absolutely sure there isn't some kind of magic happening with io_submit(), let's make this call last longer.
In order to do so, you can change the commands attached to the breakpoint on io_submit():

(gdb) commands 4
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end
(gdb) 

The inclusion of 'shell sleep 1' means that when the debugger breaks execution of the database writer on encountering io_submit(), it executes 'sleep 1' (wait one second), after which the 'c' (continue) command continues execution of the database writer process. The result is that, as far as the database writer is concerned, the io_submit() system call now takes at least 1 second.

However, having done so, a 10046/8 trace (still) looks like this on my test system:

*** 2013-09-06 11:12:16.337
WAIT #0: nam='db file async I/O submit' ela= 470 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458736337487
WAIT #0: nam='db file parallel write' ela= 1386 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458736340152

*** 2013-09-06 11:12:17.344
WAIT #0: nam='db file async I/O submit' ela= 545 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458737344837
WAIT #0: nam='db file parallel write' ela= 1554 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458737346981

What we see is that after making the io_submit() call take at least 1 second, the event ‘db file async I/O submit’ still is timed at approximately 0.5 milliseconds. So at this point it’s safe to say ‘db file async I/O submit’ is NOT the timing of submitting IO requests to the operating system.

Now take a look at the event 'db file parallel write' again in the 10046/8 output. We've already established that "requests" with the event 'db file parallel write' does not reflect the number of IO requests it reaped, but instead is the minimum number of requests that this io_getevents() call should reap from the operating system completion queue before the system call is ready. The actual number of IO requests it reaped is not shown. Another oddity is timeout. The number is recognisable as 2^31-1 by any person working in IT. If the measure of timeout is in hundredths of a second (according to the Oracle reference guide version 11.2), it means the wait times out after 21,474,836 seconds, which is roughly 248 days. That seems a bit lengthy to me.
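A quick sanity check on that number, using plain shell arithmetic and interpreting the value as hundredths of a second:

$ echo $(( 2147483647 / 100 / 86400 ))
248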

The nice thing is we can actually see the timeout of the operating system call, io_getevents(). This is extremely easy with strace (the timeout is the last argument, shown enclosed in curly brackets). Another way to do so is asking the debugger to print the struct:

(gdb) commands 5
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) c
Continuing.

This is the result:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$1 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

So, actually the timeout of io_getevents() is set to 600 seconds ($1 = {tv_sec = 600, tv_nsec = 0}), and not the enormous value the timeout field in the wait event suggests. Of course this conclusion is about the system call the database writer executes; there is a possibility that the Oracle code retries the call, but I strongly doubt that.

Up to this point the database writer only needed to flush a few blocks to disk. Because of the low number of blocks to write, and because the system on which I perform my tests is a VM whose IOs are likely to be cached, the latency of these IOs is extremely low. What if we limit the number of IOs the database writer can do? This blogpost describes how that can be done on Linux.
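For reference, this is roughly what such throttling looks like with the cgroups (v1) blkio controller. This is a minimal sketch: it assumes the blkio controller is mounted at /sys/fs/cgroup/blkio, and the cgroup name, the device major:minor numbers (8:16) and the pid (12345) are examples for illustration only; the limit of 1 IOPS matches the aggressive throttling used below.

# create a cgroup and limit it to 1 read and 1 write IO per second on the example device 8:16
mkdir /sys/fs/cgroup/blkio/dbwthrottle
echo "8:16 1" > /sys/fs/cgroup/blkio/dbwthrottle/blkio.throttle.read_iops_device
echo "8:16 1" > /sys/fs/cgroup/blkio/dbwthrottle/blkio.throttle.write_iops_device
# put the database writer in the cgroup's task list (replace 12345 with the actual dbw0 pid)
echo 12345 > /sys/fs/cgroup/blkio/dbwthrottle/tasks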

At this point there are a few million rows in my test table (I tried generating more dirty blocks for writing using, for example, 'insert into t select 1+(level-1) from dual connect by level < 100000;'). I set up IO throttling using cgroups, and put the database writer in the task list to be throttled. After that, I started deleting all rows from my dummy table t using 'delete from t'.

Please mind the database writer still has event 10046/8 active, so I can follow what the database writer is doing at the Oracle level. Also, I modified the breakpoints in gdb a bit, so I can detect the number of IOs io_getevents() was able to reap: I use the 'finish' command, which waits until the current function ends, and prints its return value. io_getevents() returns the number of IOs it was able to reap. The downside of using 'finish' is that execution is stopped, and must be manually continued. A sketch of the modified breakpoint commands is shown directly below, followed by the result:
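(The breakpoint number and the combination with the timeout print are assumptions, based on the breakpoints set up earlier; the essential part is the 'finish' at the end of the command list.)

(gdb) commands 6
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>print *timeout
>finish
>end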

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=153, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=38, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$12 = {tv_sec = 600, tv_nsec = 0}
c
0x0000000002cfa74e in skgfrliopo ()
Value returned is $13 = 38
(gdb) c
Continuing.

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$14 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $15 = 0
(gdb) c
Continuing.

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$16 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $17 = 18
(gdb) 

At the top we see kslwtectx(), and then io_submit() submitting 153 IO requests. After that, we see kslwtbctx()-kslwtectx() for the event ‘db file async I/O submit’.
Then we see the start of a wait (kslwtbctx()) for 'db file parallel write', and io_getevents(). This reap request must reap at least 38 IO requests, with a timeout set to 600 seconds. Because both read and write IOs are aggressively throttled to 1 IOPS, this request only reaps the minimum number of IO requests: 38.

I was able to see this because of the ‘finish’ command in gdb. It returns me to the calling function (skgfrliopo()), which is an Oracle function.

After continuing the process, we see something unexpected (to me at least): the wait is ended (kslwtectx()), and Oracle tries to reap the leftover IOs using two non-blocking calls (the so-called 'snacking'). Once both calls are done, either new IOs are submitted (which starts the process I am describing here over again), or Oracle registers another 'db file parallel write' event, and issues a blocking io_getevents() call for a number of IOs (which is lower than the total number of outstanding requests) to reap. If the number of active IO requests for this process grows, the number of io_getevents() calls in 'db file parallel write' can increase to two io_getevents() calls.

This is how it looks when the number of active requests gets higher:

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=685, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=114, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 0, tv_nsec = 0}

Summary
The database writer related events are kind of strange, and what is visible is absolutely not obvious.

The naming 'db file parallel write' is only true for asynchronous IO; with synchronous IO, the writes are done sequentially.

The database writer has a unique event named 'db file async I/O submit', which, despite its naming, does NOT time submitting asynchronous IOs to the operating system. This event is not shown with synchronous IO, obviously.

The 'db file parallel write' event does not time the write calls when synchronous IO is used, and seems to be printed twice for a single set of write IOs. I consider this wait implementation severely broken.

The event ‘db file async I/O submit’ shows the accurate number of IO requests which are submitted to the operating system using io_submit().
The event 'db file parallel write' shows the minimal number of IO requests the process needs to reap using io_getevents(), which can be lower than the number submitted with io_submit(). The timeout shown in this event does not seem to relate to anything: the actual (operating system) timeout is 600 seconds, not the hundreds of days the documented unit would suggest.

The event 'db file parallel write' waits for a lower number of IO requests than are submitted; after the wait, the database writer tries to reap any leftover IOs using up to two non-blocking io_getevents() calls, after which it is possible new IO requests are submitted.

At this moment my conclusion is that 'db file parallel write' is an "indication of database writer busyness", but absolutely nothing more than that, and certainly not a measurement of IO latency.


Tagged: database writer, dbwr, oracle linux performance io aio dio wait events

Oracle Linux 6 and debuginfo packages


This is a quick post about how you can get the debuginfo packages on your Oracle Linux system in the easiest way thinkable: via yum.

I guess most people reading this are familiar with Oracle Linux, and do know how to install it, and how to use the public yum server to install and update packages on Linux right from Oracle’s free internet repository. If you do not know, follow the link and learn.

As a side note to the purpose of this blog article: during the ACE Director briefing prior to Oracle Openworld, Wim Coekaerts announced that the public-yum repository is now hosted on Akamai instead of somewhere "unofficial" in the Oracle infrastructure. This is really, really noticeable when using it now: previously I could not get beyond a speed of approximately 500K/s, now I get speeds of 10M/s.

Anyways, in some cases you want to look deep inside your Linux system (for example using gdb), and want gdb to understand the functions it encounters as much as possible. In order for gdb to understand functions, the debuginfo package of the executable you are running must be installed. Please mind some things simply do not have a corresponding debuginfo package, or even an rpm package; for example the Oracle database.

Finding debuginfo packages has been a painful process, at least it has been for me. For Oracle Linux, the debuginfo packages are at http://oss.oracle.com/ol6/debuginfo. As far as I know, this is not widely known (but that could be my ignorance).

But it gets even better! You can use this location as a yum repository, and install the debuginfo packages via yum, if you add this piece of text in /etc/yum.repos.d as the file 'debuginfo.repo':

[ol6_debuginfo]
name=Oracle Linux 6 debuginfo
baseurl=http://oss.oracle.com/ol6/debuginfo
gpgkey=https://oss.oracle.com/ol6/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

A word of caution: I've found the yum metadata to be outdated with respect to the available packages, which means that I've had to install a package "directly" (via rpm) from the site instead of via yum, because the metadata didn't contain that new package. So if you update your kernel (for example) and can't find the corresponding debuginfo package, take a look at the package list on the site yourself; it might be there, just not yet in the metadata.
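With the repo file in place, installing a debuginfo package is a regular yum operation. A minimal sketch (glibc-debuginfo is just an example; pick the debuginfo package matching the exact version of the binary you want to debug, and the package/version placeholders in the rpm line are hypothetical):

yum install glibc-debuginfo
# when the yum metadata lags behind, fetch the rpm directly from the repository, for example:
# rpm -ivh http://oss.oracle.com/ol6/debuginfo/<package>-debuginfo-<version>.rpm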


Tagged: oracle linux debug debugging gdb debuginfo rpm repo yum

The Exadata wait event ‘cell smart table scan’


The purpose of this post is to show what the wait event ‘cell smart table scan’ means, based on reproducible investigation methods.

First of all, if you see the 'cell smart table scan' event: congratulations! This means you are using your Exadata the way it's intended to be used, which means your full table scan is offloaded to the cells (storage layer), and potentially all kinds of optimisations are happening, like column filtering, predicate filtering, storage indexing, etc.

But what exactly is happening when you see the wait event 'cell smart table scan'? Can we say anything based on this wait, like we can with other wait events?

If you have access to an exadata database machine, and have a table that is big enough to be smartscanned by the database, issue the following statements:
(Please mind that I’ve found these statements to be harmless, but execute these statements at your own risk. As with any execution of SQL: this uses resources on the Exadata, and could influence the performance)

SQL> alter session set tracefile_identifier='frits'; --make it easy to find the tracefile
SQL> alter session set events 'trace[libcell.client_library.*]';
SQL> alter session set events 'sql_trace level 8';
SQL> --- now issue a full table scan on a table 
SQL> select /*+ noparallel */ count(*) from bigtable1;

  COUNT(*)
----------
  63938560
SQL> exit

Now take the tracefile and look inside!

WAIT #139739525441360: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752114437651

*** 2013-10-14 07:02:01.516
WAIT #139739525441360: nam='SQL*Net message from client' ela= 7078360 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121516483
CLOSE #139739525441360:c=0,e=13,dep=0,type=1,tim=1381752121516628
=====================
PARSING IN CURSOR #139739525440176 len=57 dep=0 uid=0 oct=3 lid=0 tim=1381752121516762 hv=2246504030 ad='85b3e090' sqlid='gmm4k0y2ydvky'
select /*+ noparallel */ count(*) from bigtable1
END OF STMT
PARSE #139739525440176:c=0,e=83,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516760
EXEC #139739525440176:c=0,e=87,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3875318932,tim=1381752121516992
WAIT #139739525440176: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1381752121517053
WAIT #139739525440176: nam='reliable message' ela= 2053 channel context=2700762248 channel handle=2700440152 broadcast message=2669087352 obj#=-1 tim=1381752121519402
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 504 name|mode=1263468550 2=65571 0=1 obj#=-1 tim=1381752121520099
WAIT #139739525440176: nam='enq: KO - fast object checkpoint' ela= 330 name|mode=1263468545 2=65571 0=2 obj#=-1 tim=1381752121520594
WAIT #139739525440176: nam='Disk file operations I/O' ela= 4 FileOperation=2 fileno=6 filetype=2 obj#=-1 tim=1381752121520664

*** 2013-10-14 07:02:01.525
initializing OSS
ossnet_optip_trace: Host is not NUMA. Returning no IP preference

This is the start of my tracefile (but yours could be a bit different). First the usual SQL*Net waits, then the query which gets parsed and executed, then the wait for ‘enq: KO – fast object checkpoint’, which is an indication we are going to do direct path IO, because the KO enqueue indicates this object is checkpointed to make sure that the version of the blocks on disk is current.

Then there's 'initializing OSS'. This is where the communication with the Exadata storage servers is started. OSS probably means 'Oracle Storage Server' in this context. The next line tells us something nice: apparently, the code in the Oracle database that acts as a client of the Exadata storage server is NUMA (Non Uniform Memory Access) aware. However, on Exadata versions "dash two" (X2-2, X3-2), NUMA is turned off on the compute layer (the kernel boot line in /etc/grub.conf explicitly disables NUMA with "numa=off").

Now let’s look in the tracefile, but to get a better idea of what is happening, let’s filter the tracefile information. Issue the following command on the tracefile:

grep -e ^oss tracefile.trc | grep -v done | less

The first part looks something like this:

ossnet_optip_trace: Host is not NUMA. Returning no IP preference
oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
oss_open called for o/192.168.12.3/DATA_CD_09_enkcel01 with guid 217e470e-d3f5298-1ffed28e-a00133d2
oss_open called for o/192.168.12.5/DATA_CD_00_enkcel03 with guid 78804872-173010d6-2c12c6d2-b7bc6780
oss_open called for o/192.168.12.5/DATA_CD_08_enkcel03 with guid fd1f4fce-15180efd-83fbde2f-8465401c
oss_open called for o/192.168.12.4/DATA_CD_00_enkcel02 with guid ead44d03-e64725e4-c25a79b5-bff05885
oss_open called for o/192.168.12.4//box/predicate
oss_open called for o/192.168.12.5/DATA_CD_05_enkcel03 with guid 20c344ef-53a0387b-c2dbd306-ad971476
oss_open called for o/192.168.12.4/DATA_CD_08_enkcel02 with guid e2294076-cce51fa-edaecc3f-a966ca8e
oss_open called for o/192.168.12.4/DATA_CD_09_enkcel02 with guid 3073459f-8f6928a0-c70f77dc-9ed9e840
oss_open called for o/192.168.12.5/DATA_CD_10_enkcel03 with guid 73b44c5a-a44544b2-82933066-8cf0410a

I've included the NUMA line again. Next we see 'oss_open' lines. For every griddisk there's a line to indicate that the 'disk' is identified. Also, when a storage server is encountered for the first time during a smart scan, we see a line with "/box/predicate", which probably sets up a communication channel to the storage server to push the predicate information, so the storage server can use it. It seems logical to me that information about the storage environment is gathered first by the foreground process that is going to use it.

The next phase looks like this:

oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 4, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_ioctl open file: 2, opcode: 99
oss_ioctl open file: 4, opcode: 99
oss_ioctl open file: 9, opcode: 99
oss_wait called for request: 0xdca3030
oss_wait called for request: 0xdcb52e0
oss_wait called for request: 0xdcb59f0

This phase seems to execute "oss_ioctl", and after every ioctl there seems to be a corresponding "oss_wait" line. "ioctl" probably means Input Output Control. I am not sure what happens during the ioctl phase. The number of ioctl requests (19 in my case) is too low for there to be an ioctl per griddisk, and the number of extents is much higher than that, too.

The next phase is executing “oss_cread”:

oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38

My guess (again) is that "cread" means compressed read, and it seems to be the indication that the database is reading the results of an offloaded query. When a non-offloaded query is executed, the OSS read function is "oss_read", which is referred to as "BREAD" in the tracefile (block(mode) read). The offset of oss_cread seems to always be "0", and the size seems to be "1048576" (1MB) in all cases. The line with 'failed' does not seem to indicate a problem.

The ioctl and cread calls seem to be mixed for a little while, after which only the cread calls are left.

So, an offloaded query has an open, an ioctl and a cread phase. Now let's add in the waits:

grep -e ^oss -e ^WAIT tracefile.trc | grep -v done | less

This is an example with different snippets to indicate what the ‘cell smart table scan’ means:

oss_open called for o/192.168.12.5/DATA_CD_04_enkcel03 with guid b1b343d4-a94e4d3d-ac4e3f4a-a1c000b7
oss_open called for o/192.168.12.5//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 384 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121550711
oss_open called for o/192.168.12.3/DATA_CD_01_enkcel01 with guid 205b49ff-c529462a-54735102-80dad762
oss_open called for o/192.168.12.3//box/predicate
WAIT #139739525440176: nam='cell smart table scan' ela= 526 cellhash#=3176594409 p2=0 p3=0 obj#=15909 tim=1381752121558781
oss_ioctl open file: 2, opcode: 99
oss_wait called for request: 0xdca3030
WAIT #139739525440176: nam='cell smart table scan' ela= 898 cellhash#=379339958 p2=0 p3=0 obj#=15909 tim=1381752121631152
oss_cread open file: 2, offset: 0, size: 1048576 
oss_cread open file: 4, offset: 0, size: 1048576 
oss_cread open file: 9, offset: 0, size: 1048576 
oss_wait called for request: (nil)
oss_wait called for request: (nil)
osswait failed: context 0xdc71f40 childctx 0xdc71f40 timeout 0 errorcode 38
WAIT #139739525440176: nam='cell smart table scan' ela= 9249 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122240642
oss_wait called for request: (nil)
oss_wait called for request: (nil)
WAIT #139739525440176: nam='cell smart table scan' ela= 197 cellhash#=2133459483 p2=0 p3=0 obj#=15909 tim=1381752122250860

What we see is that during the entire "life" of a smartscan, across all three phases, the same wait event 'cell smart table scan' is used. This means that there's little that can be derived from the wait event, except for the fact that a smartscan is used.


Tagged: exadata performance trace tracing deep dive, smart table scan, storage server, table scan

Exadata and the passthrough or pushback mode


Exadata gets its performance by letting the storage (the Exadata storage server) participate in query processing, which means part of the processing is done as close as possible to where the data is stored. The participation of the storage servers in query processing means that a storage grid can process a smart scan request massively in parallel (depending on the number of storage servers participating).

However, this also means additional CPU is used on the storage layer. Because there is no real limit on how many queries can use smartscans (and/or hybrid columnar compression, in other words: processing) on the available storage servers, a storage server can get overloaded, which could hurt performance. To overcome this problem, Oracle introduced the 'passthrough' functionality in the storage server. In the Exadata book, it is explained that this functionality came with storage server version 11.2.2.3.0, Oracle database version 11.2.0.3 and Exadata bundle patch 7. It also explains that the 'passthrough' functionality means that the storage server deliberately starts sending non-storage-processed data during the smartscan. So when this happens, you still do a smartscan (!), but your foreground process or parallel query slave gets much more data, and needs to do more processing. The database-side statistic that shows this is happening is "cell physical IO bytes sent directly to DB node to balance CPU usage", which is at the database level in v$sysstat and at the session level in v$sesstat.
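A minimal example of checking this statistic at the instance level (a sketch; for the session level, join v$sesstat to v$statname on statistic# instead):

select name, value
from   v$sysstat
where  name = 'cell physical IO bytes sent directly to DB node to balance CPU usage';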

But how does this work on the storage server?

On the storage server, the passthrough mode properties are governed by a few "underbar" or "undocumented" parameters. In order to get the current values of these parameters on the Exadata storage server, the following command can be used in the "cellcli":

alter cell events="immediate cellsrv.cellsrv_dump('cellparams',0)";

The cell will echo the thread-logfile in which the output of this dump is put:

Dump sequence #1 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc
Cell enkcel01 successfully altered

Now load this tracefile (readonly) in your favourite text manipulation tool (I use ‘less’ because less is more).

The “underbar” parameters which are of interest are the following parameters:

_cell_mpp_cpu_freq = 2
_cell_mpp_threshold = 90
_cell_mpp_max_pushback = 50

The "MPP" function is responsible for the passthrough functionality. I can't find anywhere what "MPP" means; my guess is "Measurement (of) Performance (for) Passthrough". These parameters govern how it works.

_cell_mpp_cpu_freq seems to be the frequency at which the MPP code measures the host CPU, "2" means per "200ms".
_cell_mpp_threshold seems to be the CPU usage threshold after which the passthrough functionality kicks in.
_cell_mpp_max_pushback seems to be the maximum percentage of blocks (unsure what the exact granularity is) which are sent to the database in passthrough mode.

In order to get a good understanding of what MPP does, there is an MPP-specific dump which can be very beneficial for diagnosing MPP-related matters. This dump is taken on the storage server, which means in the cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('mpp_stats',0)";

The cell will once again echo the thread-logfile in which the output of this dump is put:

Dump sequence #8 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
Cell enkcel01 successfully altered

Now peek in the tracefile!

Trace file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
ORACLE_HOME = /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.32-400.11.1.el5uek
Version:        #1 SMP Thu Nov 22 03:29:09 PST 2012
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.2.1_LINUX.X64_130109

*** 2013-10-21 10:54:05.994
UserThread: LWPID: 16114 userId: 22 kernelId: 22 pthreadID: 0x4d1e5940
2013-10-21 14:36:04.910675*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 0 out of 4232 total blocks.  Percent = 0.000000%
2013-10-21 14:36:04.910675*: Dump sequence #3:
[MPP] Current cell cpu utilization: 7
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 8 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
...
[MPP] Mon Oct 21 14:05:57 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0

So, what do we see here? We see a cell tracefile which is in the well-known Oracle trace format. This means it starts off with a header which is specific to the cell server.

Then we see a timestamp with three asterisks in front of it. The time in the timestamp is 10:54:05.994, which is roughly 3 hours and 40 minutes earlier than the timestamp of the next messages, which is 14:36:04.910. The line with the three asterisks is the creation timestamp of the tracefile, which is when the thread which we are using was created. The next line, which lists the LWPID, userId, etc., is also created just after that time.

The line with [MPP] is created because of the mpp_stats dump. The timestamp has a single asterisk after it, which means the time is an approximation. The line provides important information: during the approximately 30 minutes covered by this dump, 4232 blocks were processed by this cell, and 0 blocks were "executed" in "passthru" mode.

Next, the measurements which were taken every 200ms in the recent past are printed, to give an exact overview of the measured CPU busyness and the rate at which "pushback" alias "passthru" was applied.

To see what this means, let's generate CPU busyness on the cell, and see if we can get the storage server to invoke "passthru". There is a simple trick to let a fake process take 100% of a CPU thread with common Linux shell tools: 'yes > /dev/null &'. The storage server which I use has 16 CPU threads, so I start 16 of these processes (see the sketch below) to effectively monopolise every CPU thread.
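A sketch of this (the count of 16 matches the number of CPU threads of this particular storage server; adjust it to your own cell, and don't forget to stop the processes again afterwards):

# start one CPU burner per CPU thread
for i in $(seq 1 16); do yes > /dev/null & done
# ... run the smartscan test ...
# stop all the 'yes' processes again
killall yes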

Next, I started a (sufficiently large; 7GB) scan on a table via sqlplus, and then dumped ‘mpp_stats’ using the method described in this blog.

2013-10-21 16:30:27.977624*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 2728 out of 13287 total blocks.  Percent = 20.531347%
2013-10-21 16:30:27.977624*: Dump sequence #10:
[MPP] Current cell cpu utilization: 8
[MPP] Mon Oct 21 16:30:27 2013 [Cell CPU History] 8 [Pushback Rate] 0
...
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 95 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 95 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 99 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 98 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 98 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 99 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 100 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 0
...

This shows it all! The header shows that during the last 30 minutes, this storage server sent 2728 blocks out of the total of 13287 blocks via passthrough mode. Further, in the lines which contain the historical measurements, the "pushback rate" can be seen climbing up to 50%, because the CPU usage was above 90%.

Please mind the techniques I've described here are applied to a single storage server, while a normal Exadata setup has 3 (eighth/quarter rack), 7 (half rack) or 14 (full rack) storage servers.


Tagged: cell, cell server, exadata, oracle, oss, passthrough, passthru, performance, pushback

Exadata: what kind of IO requests has a cell been receiving?


When you are administering one or more Exadatas, you probably have multiple databases running on different database or "compute" nodes. In order to understand what kind of IO you are doing, you can look at the statistics of your database in the data dictionary, to see what that instance, or those instances in the case of RAC, have been doing. When using Exadata there is a near 100% chance you are using either normal or high redundancy in ASM, and most people know about the "write amplification" that comes with it: the write statistics in the Oracle data dictionary do not reflect the additional writes needed to satisfy normal (#IO times 2) or high (#IO times 3) redundancy. This means there might be a difference between the IOs you measure or think your database is doing, and what is actually done at the storage level.
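As an illustration of the database-side view (a minimal sketch using two standard v$sysstat statistics; remember the extra writes for ASM redundancy are not included in these figures):

select name, value
from   v$sysstat
where  name in ('physical read total IO requests', 'physical write total IO requests');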

But what if you want to know what is happening at the storage level, so at the level of the cell, or actually "cellsrv", which is the process which makes IO flow to your databases? One option is to run "iostat -x", but that gives a list that is quite hard to read (too many disk devices), and it doesn't show you what the reason for the IO was: redo write? controlfile read? archivelog? Knowing the reason is especially helpful if you want to understand what is happening when your IO behaves differently than you expect, and you've ruled out IORM.

Well, it is possible to get an IO overview (cumulative since startup)! Every storage server keeps a table of IO reasons. This table can be dumped into a trace file on the cell. To generate a dump with an overview of what kinds of IOs are done, use "cellcli" locally on a cell, and enter the following command:

alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";

This doesn’t generate anything useful as output on the command line, except for the name of the thread-logfile where we can find the contents of the dump we requested:

Dump sequence #18 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc
Cell enkcel01 successfully altered

As an aid for finding your dump in the thread-logfile: when viewing it with "less", search (using "/") for the sequence number, which in the above example means entering: /sequence\ #18

This is output from a cell in the Enkitec lab which I used for this example:

Cache::dumpReasons           I/O Reason Table
2013-10-23 08:11:06.869047*: Dump sequence #18:
Cache::dumpReasons Reason                  Reads Writes
Cache::dumpReasons ------------------------------------
Cache::dumpReasons UNKNOWN                436784 162942
Cache::dumpReasons RedoLog Write               0  80329
Cache::dumpReasons RedoLog Read              873      0
Cache::dumpReasons ControlFile Read       399993      0
Cache::dumpReasons ControlFile Write           0 473234
Cache::dumpReasons ASM DiskHeader IO        4326      4
Cache::dumpReasons BufferCache Read        27184      0
Cache::dumpReasons DataHeader Read          2627      0
Cache::dumpReasons DataHeader Write            0   1280
Cache::dumpReasons Datafile SeqRead           45      0
Cache::dumpReasons Datafile SeqWrite           0    373
Cache::dumpReasons HighPriority Checkpoint Write      0   6146
Cache::dumpReasons DBWR Aged Write             0    560
Cache::dumpReasons ReuseBlock Write            0    150
Cache::dumpReasons Selftune Checkpoint Write      0 116800
Cache::dumpReasons RequestLit Write            0     25
Cache::dumpReasons Archivelog IO               0    255
Cache::dumpReasons TrackingFile IO          2586   2698
Cache::dumpReasons ASM Relocate IO             0    200
Cache::dumpReasons ASM Replacement IO          0     91
Cache::dumpReasons ASM CacheCleanup IO         0   4514
Cache::dumpReasons ASM UserFile Relocate       0   2461
Cache::dumpReasons ASM Redo IO                 0  10610
Cache::dumpReasons ASM Cache IO             1953      0
Cache::dumpReasons ASM PST IO                  0     44
Cache::dumpReasons ASM Heartbeat IO           26 162984
Cache::dumpReasons ASM BlockFormat IO          0   3704
Cache::dumpReasons ASM StaleFile IO            0    675
Cache::dumpReasons OSD Header IO               0    315
Cache::dumpReasons Smart scan              11840      0

Please mind the numbers here are numbers of IOs; they don't say anything about the size of the IOs. Also please mind these are the numbers of a single cell; you probably have 3, 7 or 14 cells.

In my opinion this IO summary can be of much value during IO performance investigations, but also during proofs of concept.

If the cell has been running for a while, these numbers may grow very big. In order to get an easy baseline, the IO reason numbers can be reset, so you can start your test or proof-of-concept run and measure what actually has happened at the cell layer! In order to reset the IO reason table, enter the following command in the cellcli:

alter cell events = "immediate cellsrv.cellsrv_resetstats(ioreasons)"; 

This will reset the IO reasons table in the cell.

PS1: Thanks to Nikolay Kovachev for pointing out the 'ioreasons' resetstats parameter. Indeed 'all' is way too blunt.
PS2: The IO numbers seem to be the number of IO requests the cell has gotten from its clients (ASM and database) for data, not for metadata. During a smartscan, metadata flows between the database and the cell server before data is actually served.


Tagged: cellcli, cellsrv, exadata, IO, ioreasons, oracle, performance

Exadata: measuring IO latencies experienced by the cell server


Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata brings (part of the) processing potentially closer to the storage media, which will be rotating disks for most (Exadata) users, and optionally can be flash.

But with Exadata, you either do normal, alias regular, IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads hopefully come from the flash cache, which can be seen by looking at the statistic "cell flash cache read hits" in v$sysstat/v$sesstat, not directly by looking at the IO related views. To understand the composition of the response time of a smartscan, there is even less instrumentation in the database (for background, look at this blogpost, where it is shown that the smartscan wait does not detail any of the steps done in a smartscan). In other words: if you experience performance differences on Exadata, and the waits point towards IO, there's not much analysis which can be done to dig deeper.

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

- IO size by number of reads and writes
- IO size versus latency for reads and writes
- IO size versus pending IO count for reads and writes
- IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, the cellcli provides the name of the trace file to which the requested dump has been written. If we look inside this trace file, this is what an IO latencies dump looks like:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd is that the name of the celldisk (CD_02_enkcel01 here) is printed below the first table (IO lengths) about this celldisk(!)

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan, and looked at the iolstats dump again:

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics have been reset (on the local cell/storage server!). This makes diagnosing the physical IO subsystem of Exadata possible!


Tagged: cell, cell server, exadata, IO, IO latency, latency, oracle, performance, response time

Exadata: disk level statistics


This is the fourth post in a series of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kind of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server will keep statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking at the "devio_stats" dump, there are a few other things worth noticing. The lines with statistics do not have a timestamp or any other time indicator; they contain only statistics. The lines are displayed per device, with the newest line on top. The dump indicates it dumps the IO device statistics which the cell keeps for the last 1800 seconds (30 minutes). If you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also mind that the cell measures the differences between two points in time, which means the numbers are averages over a period of 3 seconds.

Two other things are listed in the statistics. The first is 'trigerConfine' (which probably should be "triggerConfine"), which is a mechanism for Oracle to manage underperforming disks.
The other thing is "DMWG". At this moment, all I am aware of is that DMWG means "Disk Media Working Group", and that it works with the concept of peers.

To get a better understanding of the difference between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. You can equate ServiceTime with svctm of iostat, or "storage wait" as Bart calls it, and Latency with await, or "host wait" as Bart calls it.
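For comparison with the operating system's own view of the same devices, iostat on the storage server shows the svctm and await columns directly (a sketch; the interval of 3 seconds matches the apparent sample interval of the cell server):

iostat -x 3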


Tagged: database, exadata, instrumentation, IO, IO latency, latency, measurement, oracle, performance

Compressing sqlplus output using a pipe


Recently I was involved in a project which required a lot of data to be extracted from Oracle. The size of the data was so huge that the filesystems filled up. Compressing the output (using tar j (bzip2) or z (gzip)) is an obvious solution, but this can only be done after the files are created. This is why I proposed compressing the output without it ever existing in uncompressed form.

This solution works with a so-called 'named pipe', something which I know for sure can be done on Linux and unix. A named pipe has the ability to let two processes transfer data between each other. This solution will look familiar to "older" Oracle DBAs: this is how exports were compressed with the "original" export utility (exp).

I've created a small script with sqlplus embedded in it, which executes the sqlplus commands using a "here document":

mknod /tmp/oracle.pipe p

sqlplus / as sysdba << _EOF
set escape on

host nohup gzip -c < /tmp/oracle.pipe > /tmp/out1.gz \&
spool /tmp/oracle.pipe
select * from dual;
spool off

host nohup gzip -c < /tmp/oracle.pipe > /tmp/out2.gz \&
spool /tmp/oracle.pipe
select * from dual;
spool off

_EOF

rm /tmp/oracle.pipe

First a pipe is created (mknod filename p). As far as I know, this command is the same on Linux and the unixes. This pipe is removed as the last step of the script.

Inside the sqlplus script, I issue the gzip operating system command using the 'host' command. The line with the host command starts the gzip command with the pipe as input and a .gz file in /tmp as output. The process is put in the background using '&'.
Next, the sqlplus spool command starts output to the pipe, and I execute a dummy sql (select * from dual).
With 'spool off', the output to the pipe is stopped. This makes the gzip process in the background stop.
Because that gzip process is no longer compressing to the first file, the pipe can be used a second time, and more times of course.

The result is two gzipped files:

zcat -v /tmp/out*.gz
/tmp/out1.gz:	SQL> select * from dual;

D                                                                               
-                                                                               
X                                                                               

SQL> spool off
 84.2%
/tmp/out2.gz:	SQL> select * from dual;

D                                                                               
-                                                                               
X                                                                               

SQL> spool off
 84.2%

Tagged: csv, fifo, linux, oracle, output, pipe, script, scripts, sqlplus