This post is about database writer (dbwr, mostly seen as dbw0 nowadays) IO.
The testenvironment in which I made the measurements in this post: Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM. The test environment is a (VMWare Fusion) VM, with 2 CPU’s.
It might be a good idea to read my previous blog about logwriter IO.
The number of database writers is depended on the number of CPU’s visible to the instance (when not explicitly set with the DB_WRITER_PROCESSES parameter), and seems mostly to be CEIL(CPU_COUNT/8). There might be other things which could influence the number (NUMA comes to mind). In my case, I’ve got 2 CPU’s visible, which means I got one database writer (dbw0).
Let’s start the investigation! I enable a 10046 level 8 trace on the database writer. When you look at an idle Oracle database, you probably see the database writer process behaving the same as described in the first part of the log writer IO blog, which is sleeping at the wait event ‘rdbms ipc message’. This is the database writer waiting for work:
*** 2013-08-30 12:42:00.291
WAIT #0: nam='rdbms ipc message' ela= 3001013 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859320291675
*** 2013-08-30 12:42:03.292
WAIT #0: nam='rdbms ipc message' ela= 3000870 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859323292672
*** 2013-08-30 12:42:06.293
WAIT #0: nam='rdbms ipc message' ela= 3000806 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859326293581
There’s a description about this wait in the previously mentioned log writer blog. The short version is: ‘rdbms ipc message’ means: waiting to be posted for work.
It helps to make a little table and insert one or a few rows into it, and then commit. This dirties blocks in the buffercache. After that, issue ‘alter database flush buffer_cache’ to make Oracle write all the dirty blocks currently in the buffer cache. This is what is shown in the database writer trace file:
WAIT #0: nam='rdbms ipc message' ela= 328100 timeout=300 p2=0 p3=0 obj#=-1 tim=1377868410713860
WAIT #0: nam='db file async I/O submit' ela= 1 requests=14 interrupt=0 timeout=0 obj#=-1 tim=1377868410715372
WAIT #0: nam='db file parallel write' ela= 26 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1377868410715427
The first line is an ‘rdbms ipc message’ with an elapsed time (ela=) shorter than 3 seconds, because we triggered the database writer with the ‘alter database flush buffer_cache’. The database writer is notified to do something, and stops waiting.
Then we see two waits: ‘db file async I/O submit’ and ‘db file parallel write’. ‘db file parallel write’ is a wait I suspected, Oracle must write these dirty blocks, I/O costs time, and it’s logical to have a wait event for that. I did see the wait event ‘db file async I/O submit’ in some AWR reports, but haven’t encountered it at the top end of the waits, which means I haven’t looked/needed to look at it yet…
Well, let’s dive in one layer deeper, and combine strace with a 10046 level 8 tracefile using ‘strace -e write -e all -p PID’
semtimedop(851971, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
write(18, "\n*** 2013-08-30 15:32:44.063\n", 29) = 29
| 00000 0a 2a 2a 2a 20 32 30 31 33 2d 30 38 2d 33 30 20 .*** 201 3-08-30 |
| 00010 31 35 3a 33 32 3a 34 34 2e 30 36 33 0a 15:32:44 .063. |
lseek(18, 0, SEEK_CUR) = 682279
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
| 00000 57 41 49 54 20 23 30 3a 20 6e 61 6d 3d 27 72 64 WAIT #0: nam='rd |
| 00010 62 6d 73 20 69 70 63 20 6d 65 73 73 61 67 65 27 bms ipc message' |
| 00020 20 65 6c 61 3d 20 31 32 38 31 32 31 31 20 74 69 ela= 12 81211 ti |
| 00030 6d 65 6f 75 74 3d 33 30 30 20 70 32 3d 30 20 70 meout=30 0 p2=0 p |
| 00040 33 3d 30 20 6f 62 6a 23 3d 2d 31 20 74 69 6d 3d 3=0 obj# =-1 tim= |
| 00050 31 33 37 37 38 36 39 35 36 34 30 36 33 36 30 31 13778695 64063601 |
write(19, "J?2-u4~UW1\n", 11) = 11
| 00000 4a 3f 32 2d 75 34 7e 55 57 31 0a J?2-u4~U W1. |
write(18, "\n", 1) = 1
| 00000 0a . |
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
io_submit(139931766530048, 24, {{0x7f44671a7a50, 0, 1, 0, 256}, {0x7f44671a7ca8, 0, 1, 0, 257}, {0x7f44671a7f00, 0, 1, 0, 256}, {0x7f44671a8158, 0, 1, 0, 256}, {0x7f44671a83b0, 0, 1, 0, 256}, {0x7f44671a8608, 0, 1, 0, 257}, {0x7f44671a8860, 0, 1, 0, 257}, {0x7f44671a8ab8, 0, 1, 0, 257}, {0x7f44671a8d10, 0, 1, 0, 257}, {0x7f44671a8f68, 0, 1, 0, 257}, {0x7f44671a91c0, 0, 1, 0, 257}, {0x7f44671a9418, 0, 1, 0, 257}, {0x7f44671a9670, 0, 1, 0, 257}, {0x7f446719e1f8, 0, 1, 0, 256}, {0x7f44671a77f8, 0, 1, 0, 257}, {0x7f44671a75a0, 0, 1, 0, 257}, {0x7f44671a7348, 0, 1, 0, 256}, {0x7f44671a70f0, 0, 1, 0, 256}, {0x7f44671a6e98, 0, 1, 0, 256}, {0x7f44671a6c40, 0, 1, 0, 256}, {0x7f44671a69e8, 0, 1, 0, 256}, {0x7f44671a6790, 0, 1, 0, 256}, {0x7f44671a6538, 0, 1, 0, 256}, {0x7f44671a62e0, 0, 1, 0, 256}}) = 24
lseek(18, 0, SEEK_CUR) = 682376
write(18, "WAIT #0: nam='db file async I/O "..., 109) = 109
| 00000 57 41 49 54 20 23 30 3a 20 6e 61 6d 3d 27 64 62 WAIT #0: nam='db |
| 00010 20 66 69 6c 65 20 61 73 79 6e 63 20 49 2f 4f 20 file as ync I/O |
| 00020 73 75 62 6d 69 74 27 20 65 6c 61 3d 20 31 20 72 submit' ela= 1 r |
| 00030 65 71 75 65 73 74 73 3d 32 34 20 69 6e 74 65 72 equests= 24 inter |
| 00040 72 75 70 74 3d 30 20 74 69 6d 65 6f 75 74 3d 30 rupt=0 t imeout=0 |
| 00050 20 6f 62 6a 23 3d 2d 31 20 74 69 6d 3d 31 33 37 obj#=-1 tim=137 |
| 00060 37 38 36 39 35 36 34 30 36 36 31 36 37 78695640 66167 |
write(19, "J?JV~1j1\n", 9) = 9
| 00000 4a 3f 4a 56 7e 31 6a 31 0a J?JV~1j1 . |
write(18, "\n", 1) = 1
| 00000 0a . |
io_getevents(139931766530048, 2, 128, {{0x7f44671a7a50, 0x7f44671a7a50, 8192, 0}, {0x7f44671a7ca8, 0x7f44671a7ca8, 8192, 0}, {0x7f44671a7f00, 0x7f44671a7f00, 8192, 0}, {0x7f44671a8158, 0x7f44671a8158, 8192, 0}, {0x7f44671a83b0, 0x7f44671a83b0, 8192, 0}, {0x7f44671a8608, 0x7f44671a8608, 8192, 0}, {0x7f44671a8860, 0x7f44671a8860, 8192, 0}, {0x7f44671a8ab8, 0x7f44671a8ab8, 8192, 0}, {0x7f44671a8d10, 0x7f44671a8d10, 8192, 0}, {0x7f44671a8f68, 0x7f44671a8f68, 8192, 0}, {0x7f44671a91c0, 0x7f44671a91c0, 8192, 0}, {0x7f44671a9418, 0x7f44671a9418, 8192, 0}, {0x7f44671a9670, 0x7f44671a9670, 8192, 0}, {0x7f446719e1f8, 0x7f446719e1f8, 8192, 0}, {0x7f44671a77f8, 0x7f44671a77f8, 8192, 0}, {0x7f44671a75a0, 0x7f44671a75a0, 8192, 0}, {0x7f44671a7348, 0x7f44671a7348, 16384, 0}, {0x7f44671a70f0, 0x7f44671a70f0, 8192, 0}, {0x7f44671a6e98, 0x7f44671a6e98, 8192, 0}, {0x7f44671a6c40, 0x7f44671a6c40, 16384, 0}, {0x7f44671a69e8, 0x7f44671a69e8, 8192, 0}, {0x7f44671a6790, 0x7f44671a6790, 8192, 0}, {0x7f44671a6538, 0x7f44671a6538, 8192, 0}, {0x7f44671a62e0, 0x7f44671a62e0, 8192, 0}}, {600, 0}) = 24
write(18, "WAIT #0: nam='db file parallel w"..., 117) = 117
| 00000 57 41 49 54 20 23 30 3a 20 6e 61 6d 3d 27 64 62 WAIT #0: nam='db |
| 00010 20 66 69 6c 65 20 70 61 72 61 6c 6c 65 6c 20 77 file pa rallel w |
| 00020 72 69 74 65 27 20 65 6c 61 3d 20 31 35 33 20 72 rite' el a= 153 r |
| 00030 65 71 75 65 73 74 73 3d 32 20 69 6e 74 65 72 72 equests= 2 interr |
| 00040 75 70 74 3d 30 20 74 69 6d 65 6f 75 74 3d 32 31 upt=0 ti meout=21 |
| 00050 34 37 34 38 33 36 34 37 20 6f 62 6a 23 3d 2d 31 47483647 obj#=-1 |
| 00060 20 74 69 6d 3d 31 33 37 37 38 36 39 35 36 34 30 tim=137 78695640 |
| 00070 36 36 35 35 30 66550 |
I started this snippet from strace with the semtimedop() system call. If you look at the extended sql trace (10046 level 8) output, you see that the databasewriter stopped sleeping on the semaphore after 1.28 seconds. Then some writes are done (to the .trc and .trm files), and a times() system call to fetch usermode and kernelmode times, after which an important system call is shown: io_submit(). io_submit() submits 24 IOs here.
After the io_submit() system call, the process writes a wait event to the trace: ‘db file async I/O submit’, with an elapsed time (ela=) of 1 microsecond.
This looks like Oracle instrumented the submitting of IOs for the database writer! This is different from the logwriter and normal database foreground processes, where the io_submit() system call is not instrumented!
This wait event also shows the number of IOs (requests=24) which are submitted by the io_submit() call.
Then we see the call to reap the writes, io_getevents(). The io_getevents() system call needed at least 2 IOs ready to stop waiting (!). But instead, it was able to reap all 24 submitted IOs (you could count all the iocb’s in the io_getevents() call, or look at the returncode of the call, which is 24). The wait event of the reap is ‘db file parallel write’, as you can see from the write() system call. This is where it gets truly interesting, because when you look at the information in the wait line, you see requests=2, which probably has a relationship with the min_nr parameter in the io_getevents() call. We also see the io_getevents() call has a timeout of 600 seconds. This seems to be a general timeout for asynchronous IO calls done by Oracle on linux. What stands out for me is that the wait line does not contain any information about how many requests it did reap; two I/O’s ready was enough for the system call to succeed.
Now let’s switch to synchronous IO. I probably have mentioned this a couple of times, but if you don’t do asynchronous IO, you are doing it seriously wrong. Many of the newer IO functions in Oracle have been purpose-built for asychronous IO; direct path reads is a famous example, for which asynchronous IO gives you the opportunity to overcome the IO latency penalty of every distinct IO.
This is how writing by the database writer looks like with the combination extended SQL trace and strace when disk_asynch_io is set to false, when the databasewriter is triggered by ‘alter system flush buffer_cache’:
semtimedop(983043, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
write(18, "\n*** 2013-08-30 16:34:19.162\n", 29) = 29
| 00000 0a 2a 2a 2a 20 32 30 31 33 2d 30 38 2d 33 30 20 .*** 201 3-08-30 |
| 00010 31 36 3a 33 34 3a 31 39 2e 31 36 32 0a 16:34:19 .162. |
lseek(18, 0, SEEK_CUR) = 9915
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
| 00000 57 41 49 54 20 23 30 3a 20 6e 61 6d 3d 27 72 64 WAIT #0: nam='rd |
| 00010 62 6d 73 20 69 70 63 20 6d 65 73 73 61 67 65 27 bms ipc message' |
| 00020 20 65 6c 61 3d 20 31 36 35 34 35 38 36 20 74 69 ela= 16 54586 ti |
| 00030 6d 65 6f 75 74 3d 33 30 30 20 70 32 3d 30 20 70 meout=30 0 p2=0 p |
| 00040 33 3d 30 20 6f 62 6a 23 3d 2d 31 20 74 69 6d 3d 3=0 obj# =-1 tim= |
| 00050 31 33 37 37 38 37 33 32 35 39 31 36 32 37 36 32 13778732 59162762 |
write(19, "J?e7K6~UW1\n", 11) = 11
| 00000 4a 3f 65 37 4b 36 7e 55 57 31 0a J?e7K6~U W1. |
write(18, "\n", 1) = 1
| 00000 0a . |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
pwrite(256, "\6\242\0\0\207 \f\0\214\255O\0\0\0\n\6\236\316\0\0\1\0\0\0%1\1\0X\231O\0"..., 8192, 6221258752) = 8192
pwrite(257, "\2\242\0\0\326\22\300\0\214\255O\0\0\0\t\4\10\207\0\0\1\0\16\0\371\4\0\0005\2\32\32"..., 8192, 3414867968) = 8192
pwrite(256, "\2\242\0\0*\2\300\0\205\255O\0\0\0\1\4\236W\0\0\10\0\33\0=\6\0\0\305\2\v\v"..., 8192, 1359298560) = 8192
pwrite(256, "\2\242\0\0d\1\300\0\207\255O\0\0\0\1\4\321e\0\0\4\0\33\0#\5\0\0\35\3\20\20"..., 8192, 1358725120) = 8192
pwrite(257, "&\242\0\0\360\0\300\0\206\255O\0\0\0\1\4[\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1368260608) = 8192
pwrite(257, "&\242\0\0\260\0\300\0\210\255O\0\0\0\1\4\320\351\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367736320) = 8192
pwrite(257, "&\242\0\0\200\0\300\0\214\255O\0\0\0\2\4.)\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367343104) = 8192
pwrite(257, "\6\242\0\0\6\t@\0\210\255O\0\0\0\1\6Z\37\0\0\1\0\0\0N&\1\0\21\254O\0"..., 8192, 749780992) = 8192
pwrite(256, "\6\242\0\0\231\2@\0\206\255O\0\0\0\1\6q\207\0\0\1\0\0\0J\0\0\0\253\254O\0"..., 8192, 734208000) = 8192
lseek(18, 0, SEEK_CUR) = 10012
write(18, "WAIT #0: nam='db file parallel w"..., 107) = 107
| 00000 57 41 49 54 20 23 30 3a 20 6e 61 6d 3d 27 64 62 WAIT #0: nam='db |
| 00010 20 66 69 6c 65 20 70 61 72 61 6c 6c 65 6c 20 77 file pa rallel w |
| 00020 72 69 74 65 27 20 65 6c 61 3d 20 32 34 20 72 65 rite' el a= 24 re |
| 00030 71 75 65 73 74 73 3d 39 20 69 6e 74 65 72 72 75 quests=9 interru |
| 00040 70 74 3d 30 20 74 69 6d 65 6f 75 74 3d 30 20 6f pt=0 tim eout=0 o |
| 00050 62 6a 23 3d 2d 31 20 74 69 6d 3d 31 33 37 37 38 bj#=-1 t im=13778 |
| 00060 37 33 32 35 39 31 38 33 32 32 37 73259183 227 |
write(19, "J?ny4~1h1\n", 10) = 10
| 00000 4a 3f 6e 79 34 7e 31 68 31 0a J?ny4~1h 1. |
write(18, "\n", 1) = 1
| 00000 0a . |
write(18, "WAIT #0: nam='db file parallel w"..., 106) = 106
| 00000 57 41 49 54 20 23 30 3a 20 6e 61 6d 3d 27 64 62 WAIT #0: nam='db |
| 00010 20 66 69 6c 65 20 70 61 72 61 6c 6c 65 6c 20 77 file pa rallel w |
| 00020 72 69 74 65 27 20 65 6c 61 3d 20 31 20 72 65 71 rite' el a= 1 req |
| 00030 75 65 73 74 73 3d 39 20 69 6e 74 65 72 72 75 70 uests=9 interrup |
| 00040 74 3d 30 20 74 69 6d 65 6f 75 74 3d 30 20 6f 62 t=0 time out=0 ob |
| 00050 6a 23 3d 2d 31 20 74 69 6d 3d 31 33 37 37 38 37 j#=-1 ti m=137787 |
| 00060 33 32 35 39 31 38 33 34 32 30 32591834 20 |
write(18, "\n", 1) = 1
| 00000 0a . |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semctl(983043, 16, SETVAL, 0x1) = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semtimedop(983043, {{14, -1, 0}}, 1, {1, 320000000}) = -1 EAGAIN (Resource temporarily unavailable)
Mmmh...I see a number of very suspicious things. I started this snippet with the sleep on the semaphore, after which the process writes the wait event information about this. The next couple of system calls (9) are pwrite() system calls, which is the database writer writing dirty blocks to the datafiles. As you can see, this is happening in a sequential way (one by one). After the writes, the process writes the wait event information. We got no 'db file async I/O submit wait, because there is no asynchronous IO, it's directly the wait 'db file parallel write'. The number of requests in the wait is 9, which is the same as the number of write() system calls.
When I look further (still in the first wait line!) I see ela=24. This is very suspicious...can I do 9 write calls to the O/S in 24 microseconds? But wait, it gets more weird!
I see yet another wait event 'db file parallel write' after the first one without any pwrite()'s happening in between. That is strange. My guess is this wait represents the same 9 pwrite() requests, and it was timed at 1 microsecond (ela=1). I've verified this waitevent in this situation (two times a wait event for the same event), this happens at least with every 'alter system flush buffer_cache'. Also an elapsed time of 1 microsecond is very suspicious.
Let's dig one layer deeper and look at the start and end of the wait event timing. In order to do this, I use gdb and the following commands:
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) command
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) command
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 4 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.
After gdb continues, I issue a ‘alter system flush buffer_cache’, and look at the gdb output:
Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
I’ve included the two semaphore waits. It easy to spot the pwrite64() calls are not timed. That is the reason the elapsed time in the ‘db file parallel write’ wait is so short…they both do not time any actual IO calls! Unless someone can prove me wrong, it looks like the synchronous IO implementation of database writer of wait events is seriously broken.
But, as I’ve said: you really should have a platform which has the ability to do asynchronous IO.
Let’s switch to asynchronous IO again, and look at it at the same layer. Here are the gdb commands I’ve used:
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660: file io_submit.c, line 23.
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 5 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.
You should see the semtimedop() system calls with accompanying kslwt[be]ctx stepping over the screen every 3 seconds. Insert a few rows into a dummy table and commit, and issue an ‘alter system flush buffer_cache’, and look what happens:
Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
I included a semaphore wait at the beginning and at the end of this snippet.
The next remarkable thing is the io_submit() system call does not seem to be timed by a wait event, because it is not encosed in a kslwtbctx() and kslwtectx() combination. That’s a pity, because the wait event ‘db file async I/O submit’ strongly hints at being the timing of submitting IO to the operating system.
After this, we see kslwtbctx() io_getevents_0_4() kslwtectx(). This is Oracle timing the reap of the writes submitted earlier.
To be absolutely sure there isn’t some kind of magic happening with io_submit(), let’s make this call last longer.
In order to do so, you can change the commands with the break on io_submit():
(gdb) commands 4
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end
(gdb)
The inclusion of ‘shell sleep 1′ means that when the debugger breaks execution of the database writer when encountering io_submit(), it will execute ‘sleep 1′ (wait/sleep 1 second), after which the ‘c’ (continue) command is executed by the debugger, which continues execution of the database writer process. The result is that for the database process, the io_submit() system call now will take at least 1 second.
However, when done so, a 10046/8 trace (still) looks like this on my test system:
*** 2013-09-06 11:12:16.337
WAIT #0: nam='db file async I/O submit' ela= 470 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458736337487
WAIT #0: nam='db file parallel write' ela= 1386 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458736340152
*** 2013-09-06 11:12:17.344
WAIT #0: nam='db file async I/O submit' ela= 545 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458737344837
WAIT #0: nam='db file parallel write' ela= 1554 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458737346981
What we see is that after making the io_submit() call take at least 1 second, the event ‘db file async I/O submit’ still is timed at approximately 0.5 milliseconds. So at this point it’s safe to say ‘db file async I/O submit’ is NOT the timing of submitting IO requests to the operating system.
Now take a look at the event ‘db file parallel write’ again in the 10046/8 output. We’ve already established that “requests” with the event ‘db file parallel write’ does not reflect the number of IO requests it did reap, but instead is the minimum number of requests that this io_getevents() call should reap from the operating system completion queue before the system call is ready. The actual number of IO requests it did reap is not shown. Another oddity is timeout. The number is recognisable as 2^31-1 by any person working in IT. If the measure of timeout is in hundredths of a second (according to the Oracle reference guide version 11.2), it means the write times out in 2147483 seconds. That should mean a write times out after almost 25 days. That seems a bit lengthy to me.
The nice thing is we can actually see the timeout of the operating system call, io_getevents(). This is extremely easy with strace (the timeout is the last argument showed enclosed in accolades). Another way to do so is asking the debugger to print the struct:
(gdb) commands 5
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) c
Continuing.
This is the result:
Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$1 = {tv_sec = 600, tv_nsec = 0}
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
So, actually the timeout of io_getevents() is set to 600 seconds ($1 = {tv_sec = 600, tv_nsec = 0}), and not 2147483 seconds, as the documentation suggests. Of course this conclusion is about the system call the database writer executes, there is a possibility that the Oracle code retries the call, but I strongly doubt that.
Up to this point the database writer only needed to flush a few blocks to disk. Because of the low number of blocks to write, and because the system on which I perform my tests is a VM for which the IOs are likely to be cached, it means the latency of these IOs is extremely low. What if we limit the number of IOs the database writer can do? This blogpost describes how that can be done on linux.
At this point there are a few million rows in my test table (I tried generating more dirty blocks for writing using ‘insert into t select 1+(level-1) from dual connect by level < 100000;' for example). I did setup IO throttling using cgroups, and did put the database writer in the task-list to be throttled. After that, I started deleting all rows from my dummy table t using 'delete from t'.
Please mind the database writer still has event 10046/8 active, so I can follow what the database writer is doing at the Oracle level. Also I modified the breaks in gdb a bit so I can detect the number of IOs io_getevents() was able to reap: I use the 'finish' command, which waits until the current function ends, and prints out the return code. io_getevents() returns the number of IOs it was able to reap. The downside of using 'finish' is the execution is stopped, and must be manually continued. Nevertheless is this the result:
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=153, iocbs=0x7fff872031a0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=38, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$12 = {tv_sec = 600, tv_nsec = 0}
c
0x0000000002cfa74e in skgfrliopo ()
Value returned is $13 = 38
(gdb) c
Continuing.
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$14 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $15 = 0
(gdb) c
Continuing.
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$16 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $17 = 18
(gdb)
At the top we see kslwtectx(), and then io_submit() submitting 153 IO requests. After that, we see kslwtbctx()-kslwtectx() for the event ‘db file async I/O submit’.
Then we see the start of a wait (kslwtbctx()) for ‘db file parallel write’, and io_getevents(). This reap request must reap at least 38 IO requests, with a timeout set to 600 seconds. Because IOPS for both read and write are aggressively throttled at 1 IOPS, this request only reaps the minimal number of IO requests: 38.
I was able to see this because of the ‘finish’ command in gdb. It returns me to the calling function (skgfrliopo()), which is an Oracle function.
After continuing the process, we see something unexpected (to me at least): the wait is ended (kslwtectx()), and Oracle tries to reap the left over IOs using two non-blocking calls (the so-called ‘snacking’). Once both calls are done, either new IOs are submitted (which starts over the process I am describing here), or Oracle registers another ‘db file parallel write’ event, and issues a blocking io_getevents() call for a number IOs (which is lower than the total number of outstainding requests) to reap. If the number of active IO requests for this process grows, the number of io_getevents() calls in ‘db file parallel write’ can increase to two io_getevents() calls.
This is how it looks like when the number of active requests gets higher:
Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=685, iocbs=0x7fff872031a0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 600, tv_nsec = 0}
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=114, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 600, tv_nsec = 0}
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 0, tv_nsec = 0}
Summary
The database writer related events are kind of strange, and what is visible is absolutely not obvious.
The naming ‘db file parallel write’ is only true for asynchronous IO, with synchronous IO, the writes are done sequentially.
The database writer has a unique event named ‘db file async I/O submit’, which, despite its naming does NOT time submitting asychronous IOs to the operating system. This event is now shown with sychronous IO, obviously.
The ‘db file parallel write’ event does not time the write calls with synchronous IO, and seems to be printed two times for a set of write IOs. I consider this wait implementation severely broken.
The event ‘db file async I/O submit’ shows the accurate number of IO requests which are submitted to the operating system using io_submit().
The event ‘db file parallel write’ shows the minimal number of IO requests the process needs to reap using io_getevents(), which can be a lower number than submitted with io_submit(). The timeout which is shown in this event does not seem to relate to anything. The actual (operating system) timeout is 600 seconds, not 25 days.
The event ‘db file parallel write’ waits for a lower number of IO requests then are submitted, the database writer tries to reap any leftover IOs using up to two non-blocking io_getevents(), after which it’s possible new IO requests are introduced.
At this moment my conclusion is ‘db file parallel write’ is an “indication of database writer business”, but absolutely nothing more than that, certainly not a measurement of IO latency.
Tagged:
database writer,
dbwr,
oracle linux performance io aio dio wait events