Channel: Frits Hoogland Weblog

When the Oracle wait interface isn’t enough

Oracle has done a great job with the wait interface. It has given us the opportunity to profile the time spent in Oracle processes, by keeping track of CPU time and waits (which is time spent not running on CPU). With every new version Oracle has enhanced the wait interface, by making the waits more detailed. Tuning typically means trying to get rid of waits as much as possible.

But what if your execution is optimised to the point that there are (almost) no waits left? Before you think this is theoretical: this is possible, especially with Oracle adaptive direct path reads (IOs which are not cached by Oracle), visible by the wait “direct path read”. Of course I am talking about the omission of waits, which happens with adaptive direct path reads if your system is able to provide the request results fast enough. There isn’t a wait because if the IO request result is returned fast enough, the process doesn’t have to wait. Whilst this sounds very obvious, the “traditional” Oracle IO requests (visible with the waits “db file sequential read” and “db file scattered read”) do always generate a wait, no matter how fast the IO requests were.

Here is a trace excerpt from a full table scan where the IO was fast enough to generate only a few waits:

PARSE #140145843472584:c=0,e=28,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947766
EXEC #140145843472584:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947823
WAIT #140145843472584: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308947947871
WAIT #140145843472584: nam='asynch descriptor resize' ela= 1 outstanding #aio=0 current aio limit=1562 new aio limit=1592 obj#=75579 tim=1385308947947969
WAIT #140145843472584: nam='direct path read' ela= 428 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1385308947989097
FETCH #140145843472584:c=161976,e=174323,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1385308948122218
WAIT #140145843472584: nam='SQL*Net message from client' ela= 249 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122600
FETCH #140145843472584:c=0,e=2,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,plh=3321871023,tim=1385308948122689
WAIT #140145843472584: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122709
WAIT #140145843472584: nam='SQL*Net message from client' ela= 210 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122938
CLOSE #140145843472584:c=0,e=15,dep=0,type=3,tim=1385308948555460

The most interesting part of the raw trace file is between the EXEC line and the first FETCH line. There is first a ‘SQL*Net message to client’ wait, then an ‘asynch descriptor resize’ wait, and then a single ‘direct path read’ wait. This is a single wait line for doing IO, while the fetch line shows that 20941 blocks were read by doing physical IO. The fetch line shows that most of the elapsed time (e) was spent running on CPU (c). This means that details about how those 20941 blocks were read are (except for the single ‘direct path read’ wait) not available.

But what if you want to understand more about what the process is doing here? Except for a few wait lines, all the processing details that waits give are gone. It’s more or less only the PARSE/EXEC/FETCH lines, where the first fetch line contains more than 99% of all the time.
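
To put numbers on this, the database-call lines from the excerpt can be summed with a small script. What follows is a minimal Python sketch, not an Oracle tool: the function name call_times and the regular expression are assumptions, based only on the line layout visible in the excerpt above.

```python
import re

# Database-call lines copied from the trace excerpt above (WAIT lines left out).
TRACE = """\
PARSE #140145843472584:c=0,e=28,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947766
EXEC #140145843472584:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947823
FETCH #140145843472584:c=161976,e=174323,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1385308948122218
FETCH #140145843472584:c=0,e=2,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,plh=3321871023,tim=1385308948122689
CLOSE #140145843472584:c=0,e=15,dep=0,type=3,tim=1385308948555460
"""

# Assumed layout: "<CALL> #<cursor>:c=<cpu time>,e=<elapsed time>,..."
CALL_RE = re.compile(r"^(PARSE|EXEC|FETCH|CLOSE) #\d+:c=(\d+),e=(\d+)")


def call_times(trace_text):
    """Return a list of (call, cpu, elapsed) for every database-call line."""
    calls = []
    for line in trace_text.splitlines():
        match = CALL_RE.match(line)
        if match:
            calls.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return calls


calls = call_times(TRACE)
total_elapsed = sum(elapsed for _, _, elapsed in calls)
_, first_fetch_cpu, first_fetch_elapsed = next(c for c in calls if c[0] == "FETCH")

# Share of all database-call elapsed time that sits in the first FETCH.
fetch_share = first_fetch_elapsed / total_elapsed
# Share of that FETCH's elapsed time that was spent on CPU.
cpu_share = first_fetch_cpu / first_fetch_elapsed

print(f"first FETCH: {fetch_share:.2%} of call elapsed time, {cpu_share:.2%} of it on CPU")
```

For this excerpt the first FETCH holds more than 99% of the elapsed time of the database calls, and roughly 93% of that FETCH is CPU time.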

The answer to that on linux is perf. Perf is a profiler that is embedded in the linux kernel (since 2.6.32). I’ve written more about perf; use the search field on this blog to find articles on how to set up and use perf. Now let’s see what is happening in this situation: what is Oracle doing to execute the above mentioned SQL (select count(*) from t2)?

I ran perf on the session above with ‘perf record -g -p PID’, and the result (with ‘perf report’) is shown below:

    67.58%   oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
             |
             --- _raw_spin_unlock_irqrestore
                |          
                |--99.19%-- mptspi_qcmd
                |          scsi_dispatch_cmd
                |          scsi_request_fn
                |          __blk_run_queue
                |          queue_unplugged
                |          blk_flush_plug_list
                |          blk_finish_plug
                |          generic_file_read_iter
                |          generic_file_aio_read
                |          aio_rw_vect_retry
                |          aio_run_iocb
                |          io_submit_one
                |          do_io_submit
                |          sys_io_submit
                |          system_call_fastpath
                |          io_submit
                |          
                 --0.81%-- __wake_up
                           fsnotify_add_notify_event
                           inotify_handle_event
                           send_to_group
                           fsnotify
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __write_nocancel

     4.40%   oracle  oracle             [.] sxorchk

What is shown here is that for 68% of the time the process ran on CPU, it was spending its time in kernel mode ([k]), in a function called _raw_spin_unlock_irqrestore. This function was called in two different ways, but 99% of the time it came from mptspi_qcmd. This is the device specific kernel driver. What is even more remarkable is that when we follow the backtrace up (by reading down), the process was in fact issuing IOs (the io_submit system call)!

This means that instead of spending time on waiting for IOs to finish, this system is spending time on spinning on a spin lock (comparable to a latch in Oracle) for issuing commands to a SCSI device.

The next function in which the Oracle process spent time is an Oracle function (visible by [.], which means user land function), called sxorchk. This function is an XOR check (governed by the db_block_checking parameter).

As a summary: does this mean the Oracle wait interface is useless? Of course not. But if the wait interface simply does not provide enough information, like when 99% of the time is only visible as CPU time, you need to step to another layer and investigate there. Perf opens up the CPU time, and is able to tell you how the CPU time is composed.



Building vmware-tools in your Oracle Linux UEK3 VM

This is a quick writeup of an oddity I found while trying to install the VMware tools in an Oracle Linux host with the UEK3 kernel enabled (which is the default).

This is what is encountered during the vmware tools installation dialog when running vmwaretools.pl:

Searching for a valid kernel header path...
The path "" is not a valid path to the 3.8.13-16.2.2.el6uek.x86_64 kernel 
headers.
Would you like to change it? [yes]

The build of the VMware tools fails because the kernel headers cannot be found: the installer doesn’t see the kernel headers, even though you probably installed them (they are in the kernel-uek-devel package belonging to the running kernel).

The reason is that vmwaretools.pl is searching for /usr/src/kernels/KERNELVERSION/include/linux/version.h, and that file is not there anymore. The workaround is to symlink the version.h file from /usr/src/kernels/KERNELVERSION/include/generated/uapi/linux/version.h to its old place:

ln -s /usr/src/kernels/3.8.13-16.2.2.el6uek.x86_64/include/generated/uapi/linux/version.h /usr/src/kernels/3.8.13-16.2.2.el6uek.x86_64/include/linux/version.h

Tagged: kernel, linux, UEK, UEK3, vmware, vmware-tools

When the Oracle wait interface isn’t enough, part 2: understanding measurements.

In my blogpost When the Oracle wait interface isn’t enough I showed how a simple asynchronous direct path scan of a table was spending more than 99% of its time on CPU, and that perf showed me that 68% (of the total elapsed time) was spent on a spinlock unlock in the linux kernel which was called by io_submit().

This led to some very helpful comments from Tanel Poder. This blogpost is a materialisation of his comments, and tests to show the difference.

First take a look at what I gathered from ‘perf’ in the first article:

# Samples: 501  of event 'cpu-clock'
# Event count (approx.): 501
#
# Overhead  Command       Shared Object                               Symbol
# ........  .......  ..................  ...................................
#
    52.50%   oracle  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore    
             |
             --- _raw_spin_unlock_irqrestore
                 mptspi_qcmd
                 scsi_dispatch_cmd
                 scsi_request_fn
                 __blk_run_queue
                 queue_unplugged
                 blk_flush_plug_list
                 blk_finish_plug
                |          
                |--99.24%-- do_io_submit
                |          sys_io_submit
                |          system_call_fastpath
                |          io_submit
                |          skgfqio
                |          ksfd_skgfqio
                |          ksfdgo

This shows that while profiling a “select count(*) from t2”, the server process was spending 52.5% of its time on unlocking a spinlock.

This was in the previous blogpost, and Tanel commented the following:

I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – as right after enabling interrupts again on a CPU may be just the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new CPUs) and OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly.

Let’s reiterate what I was doing: I was profiling the execution using Linux’ in-kernel perf functionality. However, because I was running on VMWare Fusion (desktop virtualisation), perf had no access to the CPU’s performance registers, so I was using perf in the following way: perf record -e cpu-clock.

This is a partial list of perf’s triggering events:

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  ...etc...

If no specific event is specified, perf tries to use ‘cpu-cycles’, which has the indication [Hardware event], which means the CPU’s performance registers are used to gather information. If this is not possible (because virtualisation disables access to the performance registers), the software event ‘cpu-clock’ can be used. This is what I used in the previous article.

However, cpu-clock is a software event, and this event depends on the timer interrupt. The function we see we spent most time on (_raw_spin_unlock_irqrestore) is the re-enabling of IRQs for this process when this spinlock is unlocked. So this _could_ mean we did not spend our time in this function, but we cannot tell, because the timing source was disabled while interrupts were off.

However, there was another helpful comment from Tanel:

VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” -> “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”

Indeed, there is such a setting, so let’s enable it and try again in EXACTLY the same way, but now using the ‘cpu-cycles’ method (which is the default).

# Samples: 669  of event 'cycles'
# Event count (approx.): 288603593
#
# Overhead  Command      Shared Object                                   Symbol
# ........  .......  .................  .......................................
#
    11.31%   oracle  oracle             [.] sxorchk                            
             |
             --- sxorchk
                |          
                |--98.50%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch
                |          kdstf00000010000kmP
                |          kdsttgr
                |          qertbFetch
                |          qergsFetch
                |          opifch2
                |          kpoal8
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                |          
                 --1.50%-- kcbhvbo
                           kcbzvb

This is radically different! All of a sudden the top function is not a spinlock in the kernel any more, but an Oracle function!

Let’s look at the top 5 locations where time is spent with exactly the same case, first with -e cycles (the default) and then with -e cpu-clock (non-default/software timer):

# perf record -g -p 2527 
^C
# perf report -n -g none
...
# Samples: 580  of event 'cycles'
# Event count (approx.): 256237297
#
# Overhead      Samples  Command       Shared Object                                   Symbol
# ........  ...........  .......  ..................  .......................................
#
    17.47%          100   oracle  oracle              [.] sxorchk                            
     7.99%           47   oracle  oracle              [.] kdstf00000010000kmP                
     6.01%           35   oracle  oracle              [.] kcbhvbo                            
     3.25%           19   oracle  oracle              [.] kdst_fetch                         
     3.01%           17   oracle  [kernel.kallsyms]   [k] __wake_up_bit        

And now the same execution, but with the software timer:

# perf record -g -p 2527 -e cpu-clock
^C
# perf report -n -g none
...
# Samples: 422  of event 'cpu-clock'
# Event count (approx.): 422
#
# Overhead      Samples  Command      Shared Object                            Symbol
# ........  ...........  .......  .................  ................................
#
    78.67%          332   oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore 
     4.03%           17   oracle  oracle             [.] sxorchk                     
     2.13%            9   oracle  oracle             [.] kcbhvbo                     
     1.90%            8   oracle  oracle             [.] kdstf00000010000kmP         
     0.95%            4   oracle  oracle             [.] qeaeCn1Serial                 

This reveals some information: it seems that when profiling with the software timer, the “_raw_spin_unlock_irqrestore” function “eats” a lot of samples, which are “stolen” from the functions where they are spent:
sxorchk has 100 samples with the hardware timer, and 17 with the software timer.
kcbhvbo has 35 samples with the hardware timer, and has 9 with the software timer.
kdstf00000010000kmP has 47 samples with the hardware timer, and has 8 with the software timer.
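
The shift can be made explicit with a little arithmetic on the sample counts from the two reports. This is a minimal Python sketch; the variable names are made up for the illustration and the numbers are copied from the perf output above. It computes each function’s share of the cpu-clock samples once the samples attributed to _raw_spin_unlock_irqrestore are set aside, which is an assumption about where those samples really belong, not something perf reports.

```python
# Sample counts copied from the two 'perf report -n' listings above.
HW_TOTAL = 580       # samples of event 'cycles'
SW_TOTAL = 422       # samples of event 'cpu-clock'
SW_SPINLOCK = 332    # cpu-clock samples on _raw_spin_unlock_irqrestore

hw_samples = {"sxorchk": 100, "kdstf00000010000kmP": 47, "kcbhvbo": 35}
sw_samples = {"sxorchk": 17, "kdstf00000010000kmP": 8, "kcbhvbo": 9}


def share(samples, total):
    """Fraction of 'total' samples that landed in a function."""
    return samples / total


# cpu-clock samples that were not "eaten" by the spinlock unlock.
sw_without_spinlock = SW_TOTAL - SW_SPINLOCK

for name in hw_samples:
    hw = share(hw_samples[name], HW_TOTAL)
    sw = share(sw_samples[name], SW_TOTAL)
    sw_adjusted = share(sw_samples[name], sw_without_spinlock)
    print(f"{name:22s} cycles {hw:6.2%}  cpu-clock {sw:6.2%}  "
          f"cpu-clock without spinlock {sw_adjusted:6.2%}")
```

With the spinlock samples set aside, sxorchk goes from about 4% to about 19% of the cpu-clock samples, which is close to its share in the ‘cycles’ profile. The percentages computed here are based on sample counts, so for the ‘cycles’ event they differ slightly from the Overhead column, presumably because perf weighs that column by event count.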

So, the general conclusion is that it’s important to understand what you are measuring, and whether the method of measuring has implications for what you are measuring.
Conclusion specific to perf: do not use cpu-clock if you can use the hardware event.


Tagged: debug, internals, kernel, linux, perf, performance, profiling

Printing system call arguments in gdb

This blogpost is about how to print the system call arguments of a system call which is caught with ‘catch’ or ‘break’ in gdb. The reason for this blogpost is that I spent quite some time on searching for this, and working around this, so writing it in a blogpost might help others who spend (some of) their time in the gdb debugger, and encounter the same issue.

When you break on a system call in gdb, it will show you something like this:

Breakpoint 2, semctl () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

All fine, but most of the people who break on something want to know what the arguments of the call are! We are actually breaking on what is called a “syscall wrapper” (explanation in link). This means we do not have the arguments of the system call nicely shown on screen, as we would like (well, I do!).

An example of a system call which does show its arguments nice and dandy on screen is “io_submit ()”:

Breakpoint 3, io_submit (ctx=0x7f42badba000, nr=1, iocbs=0x7fff0cc0f30) at io_submit.c:23
23      io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

This allows me to look into the arguments, like for example:

(gdb) print *iocbs[0]

But now back to the main problem: for some system calls, when I break on them, I end up in the system call wrapper, not showing any arguments. After some time, I found this answer on Stack Overflow, which is exactly what I needed: the system call arguments are passed via registers!

Let me show you how this works. I was investigating the system call “nanosleep()” recently. In order to understand its arguments, first issue “man nanosleep” to understand the arguments:

NAME
      nanosleep - high-resolution sleep

SYNOPSIS
      #include <time.h>

      int nanosleep(const struct timespec *req, struct timespec *rem);

So, when nanosleep is called, it passes the pointers to two structs of the type “timespec”. The first one is a constant (it determines the specifics of the call), the second one is used if the nanosleep was interrupted with a signal, which means the remaining time is written in it. For the sake of understanding what is happening, I really only want to know what is in the first struct (*req).

So this is how I set up a break on nanosleep() in gdb, and print the argument I want to see:

(gdb) break nanosleep
Breakpoint 2 at 0x32e0e0ef10: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s), 2, one per line.
End with a line saying just "end".
>print (struct timespec) *$rdi
>c
>end
(gdb) c
Continuing.

Breakpoint 2, nanosleep () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
$1 = {tv_sec = 0, tv_nsec = 867779000}

There you go: we got the argument of a system call (that I was interested in) that was “hidden” by the linux system call wrapper.
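
What gdb does with that cast can be mimicked outside the debugger: the memory RDI points to is simply interpreted as the two fields of a struct timespec. The sketch below is Python, not gdb, and assumes the x86_64 Linux layout, in which tv_sec and tv_nsec are both 8-byte signed integers. The helper name decode_timespec and the hand-built byte string are illustrative only.

```python
import struct

# Assumption: x86_64 Linux, where struct timespec consists of two 8-byte
# signed integers (tv_sec, tv_nsec) stored in little-endian byte order.
TIMESPEC_FORMAT = "<qq"


def decode_timespec(raw):
    """Interpret 16 raw bytes as (tv_sec, tv_nsec), like the cast in gdb does."""
    tv_sec, tv_nsec = struct.unpack(TIMESPEC_FORMAT, raw)
    return {"tv_sec": tv_sec, "tv_nsec": tv_nsec}


# Hand-built stand-in for the 16 bytes $rdi pointed to in the session above.
raw_memory = struct.pack(TIMESPEC_FORMAT, 0, 867779000)

req = decode_timespec(raw_memory)
sleep_seconds = req["tv_sec"] + req["tv_nsec"] / 1_000_000_000
print(req, f"-> sleep of {sleep_seconds:.6f} seconds")
```

In other words, the nanosleep() call caught above asked for a sleep of a little under 0.87 seconds.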


Tagged: break, debugging, gdb, linux, sys call, system calls

Investigating the wait interface via gdb.

For some time now, I have been using gdb to trace the inner working of the Oracle database. The reason for using gdb instead of systemtap or Oracle’s dtrace is the lack of user-level tracing with Linux. I am using this on Linux because most of my work is happening on Linux.

In order to see the same information on Oracle’s system calls with gdb as with strace, there’s the Oracle debug info repository. This requires a bit of explanation. When strace is used on a process doing IO that Oracle executes asynchronously, the IO calls as seen with strace look something like this:

io_submit(140425370206208, 1, {{0x7fb7516c4bc0, 0, 0, 0, 257}}) = 1
io_getevents(140425370206208,1,128,{{0x7fb7516c45e8,0x7fb7516c45e8,106496,0}}, {600, 0}) = 1

This reveals exactly how Oracle used these calls. In case you wonder how to read these calls: Linux (as well as any other Unix-like operating system) provides man pages (manual pages) not only for the command line tools, but also for system calls, C library functions and device and special files, among others. So if you wonder what the io_submit line means, type ‘man io_submit’, or, to be 100% sure you look in the manual pages of the system calls, type ‘man 2 io_submit’ to specify you want section 2: system calls.

When I use gdb, and break on io_submit and io_getevents, I get this information:

Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/libaio.so.1
Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/libaio.so.1
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()

I think everybody can spot that I got less information now. In fact, I now know the calls have happened, and that’s all, there is no additional information. In order to get part of the information back that was visible with strace, use the debuginfo package of libaio. The debug info package must match 100% the version of the package it is meant to provide debug symbols about, because it provides debug information about the executable or library based on physical code locations.

In order to get information on these specific calls (libaio calls), the libaio-debuginfo package can be installed. Once done, we get a great deal of information which resembles strace:

Breakpoint 1, io_submit (ctx=0x7ff8b626c000, nr=1, iocbs=0x7fffa5c31a80) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 2, io_getevents_0_4 (ctx=0x7ff8b626c000, min_nr=2, nr=128, events=0x7fffa5c37b68, timeout=0x7fffa5c38b70) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

This shows all the arguments which are used by the process which is traced with gdb. Please mind that gdb breaks on entering the call, so it doesn’t give a return code. The return code of io_getevents() is what returns the number of IOs which are ready, so that information is still not visible here, but it is visible with strace, which does provide the return code.

How about the Oracle user land calls? I use breaking on kslwtbctx() and kslwtectx() a lot, which indicate the starting (kslwtbctx()) and stopping (kslwtectx()) of a wait event. When doing so, this is what it looks like:

Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/libaio.so.1
Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/libaio.so.1
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 4, 0x0000000007cf47b6 in kslwtbctx ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 5, 0x0000000007cfb4f2 in kslwtectx ()

Here we see the libaio functions again, together with the Oracle wait event functions. When using these calls this way, we can safely say that there are some calls done outside of a wait, and one call is done inside of a wait. Because this measurement is done on a well known piece of Oracle code (well known to me: executing a full table scan via direct path), I just know the wait is ‘direct path read’. But what if you do not know? Wouldn’t it be nice to know which wait is called here?

The simplest way to get more information on Oracle function calls is to get the debug information for the Oracle database. However, since that makes references to the source code, that will probably never happen. So, does that mean this is all we can get? No.

In order to get more information out of a function call, we need to dive a little deeper into the internals of Linux x86_64. When a function is called, the arguments are passed on via processor registers. This is implementation specific, and differs between 32-bit and 64-bit. An overview of how that works is summarised in this table. The important line is: “The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, and R9, while XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments. For system calls, R10 is used instead of RCX.”
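
The quoted rule is easy to turn into a lookup table. This is a small Python sketch of that rule only (integer and pointer arguments, no floating point); the function name argument_registers is made up for the illustration, and the argument names are those of io_getevents_0_4() as gdb showed them earlier.

```python
# Registers for the first six integer or pointer arguments on Linux x86_64,
# as quoted above. System calls use R10 where function calls use RCX.
FUNCTION_ARG_REGISTERS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")
SYSCALL_ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


def argument_registers(arg_names, is_syscall=False):
    """Map integer/pointer argument names to the register that carries each one."""
    registers = SYSCALL_ARG_REGISTERS if is_syscall else FUNCTION_ARG_REGISTERS
    if len(arg_names) > len(registers):
        raise ValueError("only the first six integer or pointer arguments use registers")
    return dict(zip(arg_names, registers))


# io_getevents_0_4(ctx, min_nr, nr, events, timeout) is a library function call.
mapping = argument_registers(["ctx", "min_nr", "nr", "events", "timeout"])
for name, register in mapping.items():
    print(f"{name:8s} -> ${register}")
```

When breaking on a function in gdb, the printed register names (for example $rdi for the first argument) are the ones to inspect.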

So, this means that if I look at the CPU registers when breaking on a function, there might be something usable. I say “something usable” deliberately, because the Oracle function calls are not publicly documented (I think/hope they are inside Oracle development). I’ve done some investigation, and it turns out that at the END of a wait event, a few functions are called which have some useful information stored in a CPU register:
a) First the function kslwtectx() is called to mark the ending of a wait event.
b) Then a function called kslwtrk_enter_wait_int is called, which stores the time the wait took in the register R13.
c) Next a function called kskthewt is called, which stores the number of the wait event (V$EVENT_NAME.EVENT#) in RSI.

If we combine that information in a little gdb macro, it looks like this:

break kslwtbctx
  commands
    silent
    printf "kslwtbctx\n"
    c
  end
break kslwtectx
  commands
    silent
    printf "kslwtectx -- "
    c
  end
break kslwtrk_enter_wait_int
  commands
    silent
    set $time=$r13
    c
  end
break kskthewt
  commands
    silent
    printf "wait: %d, time: %d\n", $rsi, $time
    c
  end

Put this in a text file, and once attached to a process to trace with gdb, load it using gdb’s ‘source’ command followed by the name of that file.
Here is what it looks like when you put it on a process (I’ve put it on the checkpoint process):

kslwtbctx
kslwtectx -- wait: 7, time: 2999054
kslwtbctx
kslwtectx -- wait: 81, time: 1979
kslwtbctx
kslwtectx -- wait: 81, time: 1050
kslwtbctx
kslwtectx -- wait: 81, time: 1216
kslwtbctx
kslwtectx -- wait: 81, time: 2031
kslwtbctx
kslwtectx -- wait: 83, time: 10443
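
Once the output is captured in a file, it can be summarised per wait event number. This is a minimal Python sketch, assuming the output has exactly the format printed by the macro above; the function name summarise_waits is made up for the illustration.

```python
import re
from collections import defaultdict

# Output copied from the gdb session above.
GDB_OUTPUT = """\
kslwtbctx
kslwtectx -- wait: 7, time: 2999054
kslwtbctx
kslwtectx -- wait: 81, time: 1979
kslwtbctx
kslwtectx -- wait: 81, time: 1050
kslwtbctx
kslwtectx -- wait: 81, time: 1216
kslwtbctx
kslwtectx -- wait: 81, time: 2031
kslwtbctx
kslwtectx -- wait: 83, time: 10443
"""

# Matches the text printed by the kskthewt breakpoint in the macro.
WAIT_RE = re.compile(r"wait: (\d+), time: (\d+)")


def summarise_waits(text):
    """Return {event#: (number of waits, total time)} for all wait lines."""
    totals = defaultdict(lambda: [0, 0])
    for event, elapsed in WAIT_RE.findall(text):
        totals[int(event)][0] += 1
        totals[int(event)][1] += int(elapsed)
    return {event: tuple(values) for event, values in totals.items()}


summary = summarise_waits(GDB_OUTPUT)
for event, (count, total) in sorted(summary.items()):
    print(f"event# {event}: {count} waits, total time {total}")
```

The event numbers can then be looked up in V$EVENT_NAME.EVENT#, as described above.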

If you want to learn more about this stuff, don’t forget I will be doing a hands-on session on using gdb as a pre-conference training during Collaborate 2014 in Las Vegas.


Tagged: advanced, function call, gdb, linux, oracle, system call, tracing

Linux strace doesn’t lie after all.

strace is a linux utility to profile system calls. Using strace you can see the system calls that a process executes, in order to investigate the inner working or performance. In my presentation about multiblock reads I put the text ‘strace lies’. This is NOT correct. My current understanding is that strace does show every system call made by an executable. So…why did I make that statement? (editorial note: this article dives into the inner working of Linux AIO)

During the hotsos symposium in Dallas I was chatting with Tanel Poder, and he asked me to look a little bit more into the linux io_getevents() call and strace, because there might be an optimisation which keeps the call from truly issuing a system call, which means strace could be right. We started thinking about it a bit, and came to the conclusion it should be possible for the linux AIO code to cut the corner and peek at the IOs before executing the io_getevents system call (as a spoiler: because the IO context is in userspace).

So, what to do to investigate this? Well, let’s just look at how it works. The Oracle executable executes io_getevents_0_4() in order to do the system call io_getevents(). The function io_getevents_0_4() comes from libaio (the linux asynchronous IO library). After a small search, it appears libaio has a git source code repository, so we can peek into the source code directly from our browser!

If you browse to the source tree, you see the file io_getevents.c. If you click on it, you see the contents of this file, which has the function io_getevents_0_4() in it. This is a very simple function (actual function source code):

int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event * events, struct timespec * timeout)
{
	struct aio_ring *ring;
	ring = (struct aio_ring*)ctx;
	if (ring==NULL || ring->magic != AIO_RING_MAGIC)
		goto do_syscall;
	if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
		if (ring->head == ring->tail)
			return 0;
	}
	
do_syscall:	
	return __io_getevents_0_4(ctx, min_nr, nr, events, timeout);
}

If you look at line 7 of the function, you see ‘if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0)’. In other words: if timeout (struct) is set to any value (not NULL), and if timeout->tv_sec (the seconds portion of the timeout struct) is set to 0 and if timeout->tv_nsec (the nanoseconds portion of the timeout struct) is set to 0, we enter this if block. Once inside, we look at the struct ring, which is defined as a struct aio_ring from the pointer ctx, the first argument passed to io_getevents_0_4(). If ring->head is the same as ring->tail, in other words: if the ring (buffer) is empty, we cut the corner and return 0, without executing the system call. In any other case, the function __io_getevents_0_4() is executed, which executes the system call.

A way to check if this truly is happening is using the gdb ‘catch syscall’ functionality. In my investigation, I executed ‘break io_getevents_0_4’, which breaks on the userland portion of the io_getevents() function, ‘catch syscall io_getevents’, which breaks when the system call truly is executed, and ‘break io_submit’ to understand which getevents are executed in what number. I set up a test case with a sqlplus session with the server process throttled to 1 IO per second (see my article on using cgroups to throttle IO), attached to the server process with gdb, and executed the following commands:
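
The decision the function makes can be modelled in a few lines. The following is a Python sketch of the logic only, not of libaio itself: the classes Ring and Timespec, the function name io_getevents_path and the value used for AIO_RING_MAGIC are stand-ins for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the constant in the libaio source; the exact value does not
# matter for the model, so treat it as an assumption.
AIO_RING_MAGIC = 0xA10A10A1


@dataclass
class Ring:
    magic: int
    head: int
    tail: int


@dataclass
class Timespec:
    tv_sec: int
    tv_nsec: int


def io_getevents_path(ring: Optional[Ring], timeout: Optional[Timespec]) -> str:
    """Return 'shortcut' if the library returns 0 by itself, else 'syscall'."""
    if ring is None or ring.magic != AIO_RING_MAGIC:
        return "syscall"
    if timeout is not None and timeout.tv_sec == 0 and timeout.tv_nsec == 0:
        if ring.head == ring.tail:  # the ring buffer is empty
            return "shortcut"
    return "syscall"


empty = Ring(AIO_RING_MAGIC, head=3, tail=3)
filled = Ring(AIO_RING_MAGIC, head=3, tail=4)

print(io_getevents_path(empty, Timespec(0, 0)))    # non-blocking poll, nothing ready
print(io_getevents_path(filled, Timespec(0, 0)))   # non-blocking poll, an event is ready
print(io_getevents_path(empty, Timespec(600, 0)))  # blocking call with a timeout
print(io_getevents_path(empty, None))              # no timeout struct at all
```

Only the first case avoids the system call: a zero timeout combined with an empty ring.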

break io_submit
commands
silent
printf "io_submit\n"
c
end
break io_getevents_0_4
commands
silent
printf "io_getevents_0_4-libaio\n"
c 
end
catch syscall io_getevents
commands
silent
printf "io_getevents-syscall\n"
c
end

Next I executed a SQL which did a direct path full table scan. This is the result:

io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall
io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall

If you recall what is in the presentation about multiblock reads: after the io_submit ‘phase’, Oracle executes up to 4 non-blocking io_getevents() calls to look for IO. In this case you see the calls being done in user land, but not making it to the system call, because of the shortcut in the io_getevents_0_4() code. After 4 times, Oracle executes io_getevents() with timeout set to 600 seconds, which makes the call truly execute a system call. Please mind that ‘catch syscall’ triggers twice (as can be seen from the two ‘io_getevents-syscall’ lines in the above example), but this is in reality only 1 system call. This proves the working of the code of the io_getevents_0_4() function we looked into, and explains why I thought the strace utility lied.
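
The counting can be checked mechanically. This is a small Python sketch that tallies one io_submit phase of the output above; the function name tally_phase is made up, and the division by two encodes the observation that ‘catch syscall’ reports every system call twice.

```python
# First io_submit 'phase' copied from the gdb output above.
PHASE = """\
io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall
"""


def tally_phase(text):
    """Count library calls, real system calls and shortcuts in one phase."""
    lines = text.split()
    library_calls = lines.count("io_getevents_0_4-libaio")
    catch_hits = lines.count("io_getevents-syscall")
    # 'catch syscall' triggers twice per system call, so halve the hits.
    system_calls = catch_hits // 2
    return {
        "io_submit": lines.count("io_submit"),
        "library_calls": library_calls,
        "system_calls": system_calls,
        "shortcuts": library_calls - system_calls,
    }


result = tally_phase(PHASE)
print(result)
```

For this phase that gives 5 calls to io_getevents_0_4(), of which 4 took the shortcut and 1 ended up as a system call.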


Tagged: AIO, asynchronous IO, C, internals, IO, libaio, oracle, strace

IOUG Collaborate 2014 presentations are online!

This is a small announcement that the slides of all of my four presentations for IOUG Collaborate 2014 are online in the ‘whitepapers and presentations’ section of this blog.


Systemtap revisited

Some time back, I investigated the options to do profiling of processes in Linux. One of the things I investigated was systemtap. After careful investigation I came to the conclusion that systemtap was not really useful for my investigations, because it worked in kernelspace, but only in a very limited way in userspace. The limitation in userspace was that you had to define your own markers in the source code of the program you wanted to profile with systemtap, and compile that. Since my investigations are mostly around Oracle products, which are closed source, this doesn’t help me at all.

Some time ago, Frank Eigler responded to a blog article I posted on my blog about using gdb (GNU debugger) for doing userspace profiling, indicating that systemtap could do userspace function profiling too. I was quite shocked, because I had carefully investigated that option, and had come to the conclusion that exactly this did not work. After some communication on this, the conclusion was that this indeed did NOT work with the version of systemtap which is included with current versions of RHEL (and therefore Oracle Linux). But in the current source version of systemtap, userspace ‘probing’ is included.

But that is not all… In order to give systemtap the opportunity to do userspace probing, it needs userspace ‘trace hooks’. These are only available with the current stock kernels if the kernel source is patched with the ‘utrace patch’, the option is enabled, and the kernel is compiled. That means a custom compiled kernel. In itself a custom compiled kernel is fine, but in many environments where you work with closed source products, products are certified against stock kernels, and supported only on stock kernels. From a support point of view I very much understand this, and from my viewpoint as a consultant too. To put it in a different way: an enormous red flag is raised if I encounter an environment where people compile their own kernel on Linux.

But there is good news. Since Linux kernel version 3.5, userspace probing support is included in the Linux kernel, which means no patch against the kernel source is needed in order to be able to profile in userspace. If you take a look at the kernels Oracle provides (for Red Hat: I am sorry, there is no way that I know of to obtain RHEL online for free for testing, which for me rules out using it. I know about the merger with CentOS, but haven’t looked at whether that makes it attractive for me again), we can see that Oracle provides UEK (2.6.32), UEK2 (2.6.39) and UEK3 (3.8.13). Yes! That means that I can hook up a yum repo and install a kernel that allows userspace probing!
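The version check can be written down as a small Python sketch. The helper names `parse_kernel_version` and `has_builtin_userspace_probing` are made up for this example, and the check only compares version numbers: it assumes the feature is also enabled in the kernel configuration, which a version number alone cannot tell.

```python
def parse_kernel_version(version):
    """Turn '3.8.13' into (3, 8, 13); anything after the first '-' is ignored."""
    base = version.split("-", 1)[0]
    return tuple(int(part) for part in base.split("."))

def has_builtin_userspace_probing(version, minimum=(3, 5)):
    """True if the kernel is at least version 3.5 (major.minor comparison only)."""
    return parse_kernel_version(version)[:2] >= minimum

# The three UEK base versions mentioned in the text.
uek_kernels = {"UEK": "2.6.32", "UEK2": "2.6.39", "UEK3": "3.8.13"}
for name, version in uek_kernels.items():
    print(name, version, has_builtin_userspace_probing(version))
```

Only UEK3 passes the check, which is why that kernel is the one used in the rest of this article.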

I installed a test machine with Oracle Linux 6.5, installed the UEK3 kernel, and installed systemtap. When testing the primary desired functionality (profiling userland functions without debug symbols), I encountered this problem:

[root@ol6-uekbeta ~]# /usr/bin/stap -e 'probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }'
WARNING: cannot find module /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv debuginfo: No DWARF information found [man warning::debuginfo]
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") { probefunc() }
                      ^

semantic error: no match
Pass 2: analysis failed.  [man error::pass2]

This strongly looks like systemtap does not understand the ‘process’ probe, which is what Frank warned about. So, is this the end of the journey? No!

The userland function probing is documented in the documentation on the systemtap website. This means it should be available. Let’s clone the systemtap source, and build systemtap ourselves. This has a few implications. For starters, this eliminates the usage of systemtap for userland functions on “real” systems. With “real” I mean systems that have a function, and need to be supported and need to be stable. On this kind of system no beta or preview software can or should be installed, no matter how much we want it or need it. But for an investigation system where we can mimic one of the most desired functions of dtrace, this is fine!

So, I have got an x86_64 Oracle Linux 6.5 installation (default install, with the meta-rpm oracle-rdbms-server-11gR2-preinstall.x86_64 installed), installed the UEK3 kernel on it (using the UEKR3 repo on Oracle Linux public yum), and added the git version control executables using ‘yum install git’. Next I cloned the systemtap git repository using ‘git clone git://sourceware.org/git/systemtap.git’. What needs to be done next is to compile and install the stuff. This can be done in a quite standardised way:

./configure
make
make install

If all goes well, you end up with the latest version of systemtap (version 2.5/0.152), which should be able to do userspace probing, and a kernel capable of providing the information for userspace probing.

Now let’s test this, and create a systemtap script to profile the time dbv (db verify) takes just by running it:
(please mind this is a proof of concept script, any additions or remarks are welcome!)

global time, function_times, prev_func, function_count

probe begin {
	printf("Begin.\n");
	time=0
	prev_func="begin"
}

probe process("/u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbv").function("*") {
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	time=gettimeofday_us()
	prev_func=probefunc()
}

probe end {
	printf("End.\n")
	if ( time > 0 ) {
		function_times[prev_func] += gettimeofday_us() - time
		function_count[prev_func] ++
	}
	delete function_times["__do_global_dtors_aux"]
	printf("Function\t\ttime (us)\tcount\tavg (us)\n")
	foreach( tm = [ fn ] in function_times+ ) {
		printf("%s: \t\t%d\t\t%d\t%d\n", fn, tm, function_count[fn],tm/function_count[fn])
		tot_time += tm
	}
	printf("Total time: %d\n", tot_time)
}
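To make clear what this script measures, here is a minimal Python model of the same bookkeeping. It is only a sketch: the function name `attribute_time` and the timestamps in the example are made up, and it mimics nothing more than the logic of the probes above, where the time between two function entry probes is booked to the function that was entered first.

```python
from collections import defaultdict

def attribute_time(entries, end_time):
    """Mimic the accounting of the systemtap script.

    entries:  list of (timestamp_us, function_name), one per function entry probe hit
    end_time: timestamp_us at which the 'end' probe fires
    """
    function_times = defaultdict(int)
    function_count = defaultdict(int)
    prev_time, prev_func = 0, "begin"
    for timestamp, func in entries:
        if prev_time > 0:
            # Time since the previous entry probe goes to the previous function.
            function_times[prev_func] += timestamp - prev_time
            function_count[prev_func] += 1
        prev_time, prev_func = timestamp, func
    if prev_time > 0:
        # The 'end' probe closes the interval of the last function seen.
        function_times[prev_func] += end_time - prev_time
        function_count[prev_func] += 1
    return dict(function_times), dict(function_count)

# Hypothetical timeline: main is entered, then calls helper twice.
times, counts = attribute_time([(100, "main"), (110, "helper"), (115, "helper")], 130)
print(times, counts)
```

Because only function entries are probed, any time spent after a function returns to its caller is still booked to the function that was entered last. In the hypothetical timeline above, the 15 us after the second helper entry is booked to helper, even if part of it was really spent back in main.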

The systemtap script can be run from one (root) session, and dbv from another session. Please wait with running dbv until the systemtap session notifies you it is ready by saying “Begin.”. This is the result:

Function		time (us)	count	avg (us)
frame_dummy: 		3		1	3
lxplget: 		3		1	3
lxpsset: 		3		1	3
call_gmon_start: 		4		1	4
lxplset: 		4		1	4
lxpcset: 		4		1	4
lxptget: 		4		1	4
lxptset: 		4		1	4
lxhLaToId: 		5		1	5
kudbvcCreate: 		5		1	5
_fini: 		6		1	6
__do_global_ctors_aux: 		7		1	7
lxldini: 		7		1	7
lxhenvquery: 		7		1	7
kudbvhlp: 		7		1	7
lxldlbb: 		8		2	4
lxldLoadBoot: 		8		2	4
lxpname: 		12		3	4
kudbvcCreateMsg: 		12		1	12
lxlfOpen: 		13		4	3
lmsapop: 		13		2	6
lxldLoadObject: 		14		4	3
lxpdload: 		14		2	7
lxldlod: 		15		4	3
lxladjobj: 		15		4	3
lxlchkobj: 		15		4	3
__libc_csu_init: 		16		1	16
lxlgsz: 		16		4	4
lxfgnb: 		20		2	10
lxoCnvCase: 		22		2	11
lxhLangEnv: 		24		3	8
_init: 		27		1	27
lxpe2i: 		31		9	3
slmsbfn: 		31		2	15
lxdlobj: 		34		4	8
lxmopen: 		36		5	7
lxlfrd: 		40		4	10
_start: 		41		1	41
lmsagb1: 		46		14	3
lxhchtoid: 		47		6	7
lmsapts: 		47		14	3
lxpcget: 		48		7	6
lxgratio: 		48		14	3
slxldgnv: 		49		11	4
lmsapsb: 		49		14	3
lmsagbcmt: 		50		14	3
lmsapsc: 		50		14	3
lmsapnm: 		51		14	3
lxldalc: 		54		6	9
main: 		63		1	63
kudbvmal: 		63		1	63
lmsaprb: 		67		7	9
kudbvexit: 		68		1	68
lmsapfc: 		71		7	10
slxcfct: 		72		5	14
lxpmclo: 		81		13	6
slmscl: 		88		1	88
slxdfsync: 		91		1	91
lmsapic: 		91		7	13
lxhci2h: 		97		28	3
lxpendian: 		107		13	8
kudbvcml: 		116		1	116
lxgu2t: 		119		16	7
lmsagbf: 		120		14	8
kudbvmai: 		151		1	151
lxdgetobj: 		225		44	5
lxinitc: 		247		6	41
kudbvcpf: 		254		27	9
slmsrd: 		256		9	28
lxhh2ci: 		350		34	10
slxcfot: 		514		5	102
lxlinit: 		688		6	114
kudbvini: 		798		1	798
slmsop: 		1005		2	502
kudbvvpf: 		4102		27	151
Total time: 10993

Of course the result itself is not very useful. The time spent in dbv is measured at 10,993 microseconds (us). The function in which the most time was spent was kudbvvpf(), with 4102 us, but that function was executed 27 times, which makes the time per execution 151 us. The function with the longest time per execution was kudbvini(), with 798 us.
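Picking these numbers out by eye works for a short list, but the same can be done with a small parser. The following Python sketch assumes the output format of the script in this article (function name, colon, then total time, count and average separated by whitespace); the names `parse_profile` and `ROW_RE` are made up for the example, and only the last five rows of the result are used as sample input.

```python
import re

# One result row: "<function>: <time us> <count> <avg us>", whitespace separated.
ROW_RE = re.compile(r"^(\S+):\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

def parse_profile(lines):
    """Return a list of dicts for every line that looks like a result row."""
    rows = []
    for line in lines:
        match = ROW_RE.match(line)
        if match:
            rows.append({
                "function": match.group(1),
                "time_us": int(match.group(2)),
                "count": int(match.group(3)),
                "avg_us": int(match.group(4)),
            })
    return rows

sample = [
    "Function\t\ttime (us)\tcount\tavg (us)",
    "slxcfot: \t\t514\t\t5\t102",
    "lxlinit: \t\t688\t\t6\t114",
    "kudbvini: \t\t798\t\t1\t798",
    "slmsop: \t\t1005\t\t2\t502",
    "kudbvvpf: \t\t4102\t\t27\t151",
    "Total time: 10993",
]

rows = parse_profile(sample)
most_total = max(rows, key=lambda row: row["time_us"])
most_per_call = max(rows, key=lambda row: row["avg_us"])
print(most_total["function"], most_per_call["function"])
```

The header and the ‘Total time’ line do not match the row pattern and are skipped. The average column is an integer division of time by count, which is why 4102 / 27 shows up as 151.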


Tagged: debug, linux, oracle, performance, profiling, stap, systemtap

How Exadata smartscans work

I guess everybody who is working with Oracle databases and has been involved with Oracle Exadata in any way knows about smartscans. It is the smartscan that makes the magic happen of full segment scans with sometimes enormously reduced scan times. The Oracle database does smartscans with something that is referred to as ‘offloading’. This is all generally known information.

But how does that work? I assume more people are like me, and are anxious to understand how that exactly works. But the information on smartscans is extremely scarce. Of course there is the Oracle public material, which looks technical, but is little or nothing more than marketing. On My Oracle Support, I can’t find anything on the inner workings. Even in the ‘Expert Oracle Exadata’ book (which I still regard as the best source of Exadata related information) there is no material on the mechanics of smartscans.

I’ve written a couple of articles on smartscans, of which an earlier article already lays some groundwork: it describes the phases I could see with the information available at that time: oss_open, oss_ioctl followed by oss_wait, and oss_cread followed by oss_wait. This is actually a summary of a smartscan, but a very brief one. In that article I described that a smartscan can only happen with a full segment scan (kdstf* functions, Oracle’s fast full scan routines) and only if the database chooses to use direct path access (kcbld* functions, direct path loader), which is actually mandatory to get smart scans.

The following investigation is done on an Exadata X2-2 Quarter rack, with Image version: 12.1.1.1.0.131219, and database version 12.1.0.1.3.

In order to get a better understanding of smartscans, we can use Oracle’s new debugging syntax. The part we are going to look at is called ‘KXD’:

SYS@db12c2 AS SYSDBA> oradebug doc component kxd

  KXD			       Exadata specific Kernel modules (kxd)
    KXDAM		       Exadata Disk Auto Manage (kxdam)
    KCFIS		       Exadata Predicate Push (kcfis)
    NSMTIO		       Trace Non Smart I/O (nsmtio)
    KXDBIO		       Exadata Block level Intelligent Operations (kxdbio)
    KXDRS		       Exadata Resilvering Layer (kxdrs)
    KXDOFL		       Exadata Offload (kxdofl)
    KXDMISC		       Exadata Misc (kxdmisc)

In order to get the tracing of the Exadata (database-) kernel modules, along with regular sql tracing with waits (to understand when something is happening), use the following events:

FRITS@db12c2 > alter session set events 'trace[kxd.*]';

Session altered.

FRITS@db12c2 > alter session set events 'sql_trace level 8';

Session altered.

Now execute a SQL that does smartscans. I’ve made sure table ‘T’ is big enough to invoke a direct path full table scan:

FRITS@db12c2 > select count(*) from t;

  COUNT(*)
----------
   1000000

Now take a peek in the trace file! The first part is normal looking:

PARSING IN CURSOR #139755290955216 len=22 dep=0 uid=201 oct=3 lid=201 tim=1593707470857 hv=2763161912 ad='224d632b8' sqlid='cyzznbykb509s'
select count(*) from t
END OF STMT
PARSE #139755290955216:c=2000,e=2417,p=0,cr=1,cu=0,mis=1,r=0,dep=0,og=1,plh=2966233522,tim=1593707470856
EXEC #139755290955216:c=0,e=25,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2966233522,tim=1593707470920
WAIT #139755290955216: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14 tim=1593707470968
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 215 name|mode=1263468550 2=131242 0=2 obj#=14 tim=1593707471374
WAIT #139755290955216: nam='reliable message' ela= 1035 channel context=10126085744 channel handle=10164799536 broadcast message=10179010104 obj#=14 tim=1593707472530
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 108 name|mode=1263468550 2=131242 0=1 obj#=14 tim=1593707472684
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 101 name|mode=1263468545 2=131242 0=2 obj#=14 tim=1593707472829

We see the parsing of the simple select statement, and the execution, which yields some waits which are always there (the sqlnet message wait), and then some ‘enq: KO – fast object checkpoint’ waits, indicating a checkpoint, which is a sign of a direct path read.
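The WAIT lines can be summarised with a little Python, and the ‘name|mode’ value of the enqueue wait can be decoded as well. This is a sketch: the names `summarize_waits` and `decode_enqueue` are made up, the regular expression only covers the WAIT line layout shown in this trace, and the decoding uses the usual layout of that value (lock type in the two high bytes, lock mode in the low 16 bits).

```python
import re
from collections import defaultdict

WAIT_RE = re.compile(r"^WAIT #(\d+): nam='([^']+)' ela= *(\d+) ")

def summarize_waits(lines):
    """Return {event name: (number of waits, total elapsed microseconds)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = WAIT_RE.match(line)
        if match:
            entry = totals[match.group(2)]
            entry[0] += 1
            entry[1] += int(match.group(3))
    return {name: tuple(values) for name, values in totals.items()}

def decode_enqueue(name_mode):
    """Split an enqueue 'name|mode' value into (lock type, mode)."""
    lock_type = chr((name_mode >> 24) & 0xFF) + chr((name_mode >> 16) & 0xFF)
    return lock_type, name_mode & 0xFFFF

trace = """\
WAIT #139755290955216: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14 tim=1593707470968
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 215 name|mode=1263468550 2=131242 0=2 obj#=14 tim=1593707471374
WAIT #139755290955216: nam='reliable message' ela= 1035 channel context=10126085744 channel handle=10164799536 broadcast message=10179010104 obj#=14 tim=1593707472530
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 108 name|mode=1263468550 2=131242 0=1 obj#=14 tim=1593707472684
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 101 name|mode=1263468545 2=131242 0=2 obj#=14 tim=1593707472829
""".splitlines()

waits = summarize_waits(trace)
for name, (count, elapsed) in waits.items():
    print(name, count, elapsed)
print(decode_enqueue(1263468550), decode_enqueue(1263468545))
```

Both name|mode values decode to lock type ‘KO’, which matches the name of the wait event; only the mode differs between them.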

The next part is interesting, because this is the smartscan-specific tracing:

Caching: Global context initialized 0x7f1b50ca6d20
kcfis_alloc_so 0x24d8994d8
In kcfis initialize: new init: app_state: 0x7f1b50ca67b0 app_type: 1
kcfis rcv update : op: 7 val: 1 so_numses 1 ovhdmem 0 mdmem 0 bufmem 0
kcfis_reinitialize: initializing queues
Set work des: global_ctx: 0x7f1b50ca6d20 app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
Automem enabled: app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
No match found for mmwds. Allocated wds 0x7f1b50ca6768, mmwds 0x22446a1a8
Cache version is 1 start cache version is 1
kcfis rcv update : op: 1 val: 1496 so_numses 1 ovhdmem 0 mdmem 1496 bufmem 0
kcfis rcv update : op: 1 val: 68 so_numses 1 ovhdmem 0 mdmem 1564 bufmem 0
oss_state->oss_context is 0x109d2db0
kcfis rcv update : op: 3 val: 69656 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 0
kcfis_initialize done

What we see here is the kcfis layer initialising memory. It’s interesting to see where the allocations are done. In general, on this system, the allocations in the 0x7f1bxxxxxxxx range are in the PGA heap, kcfis_alloc_so/0x24d8994d8 is in the SGA, and the mmwds/0x22446a1a8 is in the SGA too.
This information can be obtained by dumping heaps. Dumping the heaps at level 7 will show sga, session, pga, call and uga heaps. Another way to get insight into memory locations is using Tanel Poder’s fcha script (Find Chunk Address). Please mind that if you are reading the kxd trace file and want to look up the addresses in a second window in a sqlplus / as sysdba session, this will allow you to see the SGA chunks, but probably not the PGA chunks, because these are private to the traced session.

Let’s get on to the next chunk of trace lines:

In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=(nil)
kcfis_translate: source: 1
kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 reqid=1 cver=1 source=1
OSSIPC:SKGXP:[109be210.0]{0}: (25797 <- 13487)SKGXPDOAINVALCON: connection 0x109c8c30 admno 0x5ac93f1d scoono 0x321bdad5 acconn 0x63cd0678 getting closed. inactive: 0
OSSIPC:SKGXP:[109be210.9]{obj}: SKGXPCNH: 0x109c8390 SKGXPCON_OPEN (2) sconno 321bdad4 accono 1206dea3 admno 75f4dfcd ospid 13487 ANT
OSSIPC:SKGXP:[109be210.10]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.11]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.12]{obj}:        SSKGXPT 0x109c83d0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 22774 lerr 0
OSSIPC:SKGXP:[109be210.13]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.14]{obj}:        SSKGXPT 0x109c84a0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 51931 lerr 0
OSSIPC:SKGXP:[109be210.15]{obj}:   next seqno 32768 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
OSSIPC:SKGXP:[109be210.16]{obj}: SKGXPCNH: 0x109c8c30 SKGXPCON_CLOSED (1) sconno 321bdad5 accono 63cd0678 admno 7b54ac1c ospid 13487 ANT
OSSIPC:SKGXP:[109be210.17]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.18]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.19]{obj}:        SSKGXPT 0x109c8c70 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 29832 lerr 0
OSSIPC:SKGXP:[109be210.20]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.21]{obj}:        SSKGXPT 0x109c8d40 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 65116 lerr 0
OSSIPC:SKGXP:[109be210.22]{obj}:   next seqno 32765 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188499, 13) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723752448, 106496) Mirr_num: 0 reqid=1 cver=1 source=1
kcfis_get_disk_for_translation: Appliance 192.168.12.8/ does not exist
kcfis rcv update : op: 5 val: 8192 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 8192
kcfis_open_appliance: 
throttle: initialized for appliance 0x7f1b5078b3a8
kcfis_find_appliance_fd: appliance fd not found. appliance 192.168.12.8/ Cached appliance fd count 0
kcfis_open_appliance_fd: 0x7f1b5078b3a8
WAIT #139755290955216: nam='cell smart table scan' ela= 120 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707489569
Predicate device intelligent IO opened. fd 5
kcfis_init_appliance_fd: appliance fd 0x7f1b50da2b88 initialized for appliance 0x7f1b5078b3a8
kcfis_add_disk: Adding disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 under appliance = 0x7f1b5078b3a8
initialize disk for disk o/192.168.12.8/DATA_CD_06_enkcel04
kcfis_initialize_disk_fd: Disk initialized. appliance: 192.168.12.8/ disk: o/192.168.12.8/DATA_CD_06_enkcel04 fd: 4 disknumber: 1 incarnation: 6 prev_disknumber: 0 num_inits: 1 init_cache_ver: 1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188499, numblks = 13, disk_handle = 4, diskoffs = 513723752448, len = 106496, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a000 reqid=1 cver=1 source=1
Default: calc numbufs mem 1048576
Final: calc numbufs mem 1048576 buflen 1048576
Num buffers: 1 buf per appliance: 1 num active appliance: 1
Appliance 0x7f1b5078b3a8 active. Active count 1

There’s a lot to see here. The next step in doing a smartscan is the translation of the data dictionary information on the segment to be smartscanned to cell server and grid disk extents. The data dictionary information is shown in line 3 of the snippet above (‘kcfis_get_new_request’). In line 4 (‘kcfis_get_translation:before’) the session tries to translate the data dictionary information to cell server and grid disk. Because the kcfis context has just been initialised, there is no cell related information yet. For that reason, the information is requested from the cell server (the OSSIPC:SKGXP lines). Please mind this request will send all disk related information to the kcfis context of the process. With this information, the process can make the translation, as can be seen in line 20 (‘kcfis_get_translation:after’): it shows where the extent is located in the well known Exadata notation ‘o/cell ip/grid disk name’, together with offset and chunk length.

Next the disk needs to be initialised (a disk is a combination of cell server or ‘appliance’ and grid disk), and given a hash value, as is indicated by ‘kcfis_get_disk_for_translation’ in line 21. However, in order for a disk to be initialised, the cell server or appliance must be initialised too. That is what the process is indicating in line 21: ‘Appliance 192.168.12.8/ does not exist’. The appliance (cell server) is initialised/opened, which is what is shown in lines 23-29. The wait here is the initialisation of a connection with the cell server, waiting for an acknowledgement. Now that the appliance is initialised, the disk is initialised, as indicated in lines 30-32. Next, line 33 shows that the translation which was started earlier has now finally been done. Lines 34-37 show something about the buffering, which seems to be arranged per appliance, and is 1MB.

Now that the process has initialised the appliance and the disk, the next translation is done. The translations are done per extent of the segment, and the disk depends on the placement of the extent. Please mind the maximum size of the allocation depends on the AU (allocation unit) size of ASM, which is set to 4MB by default with Exadata. If a non-initialised appliance is encountered, it is initialised and opened, and if a non-initialised disk is encountered, it is initialised.

The translation looks like this if everything is initialised:

kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 reqid=2 cver=1 source=1
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188513, 15) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723867136, 122880) Mirr_num: 0 reqid=2 cver=1 source=1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188513, numblks = 15, disk_handle = 4, diskoffs = 513723867136, len = 122880, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a220 reqid=2 cver=1 source=1

Another thing which is important to notice is the ‘reqid’, which obviously means ‘request id’. This process is repeated until the complete segment is translated into requests.
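The numbers in the translation lines can be checked with some simple arithmetic. The Python sketch below parses the two ‘kcfis_get_translation:after’ lines shown so far; the names `parse_translation` and `TRANSLATION_RE` are made up, the regular expression only covers the layout seen in this trace, and the block size of 8192 is an assumption taken from the ‘blksz: 8192’ value that shows up further down in the trace.

```python
import re

BLOCK_SIZE = 8192  # assumption: database block size, as reported by 'blksz: 8192'

TRANSLATION_RE = re.compile(
    r"Translation \(disk,off,len of \((0x[0-9a-f]+), (\d+), (\d+)\) "
    r"to \(([^,]+), (\d+), (\d+)\).*reqid=(\d+)"
)

def parse_translation(line):
    """Parse a 'kcfis_get_translation:after' line into a dict, or return None."""
    match = TRANSLATION_RE.search(line)
    if not match:
        return None
    _, cell, griddisk = match.group(4).split("/", 2)  # 'o/<cell ip>/<grid disk>'
    return {
        "startblk": int(match.group(2)),
        "blkcnt": int(match.group(3)),
        "cell": cell,
        "griddisk": griddisk,
        "offset": int(match.group(5)),
        "length": int(match.group(6)),
        "reqid": int(match.group(7)),
    }

first = parse_translation(
    "kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188499, 13) "
    "to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723752448, 106496) Mirr_num: 0 reqid=1 cver=1 source=1"
)
second = parse_translation(
    "kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188513, 15) "
    "to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723867136, 122880) Mirr_num: 0 reqid=2 cver=1 source=1"
)

for req in (first, second):
    print(req["reqid"], req["cell"], req["griddisk"], req["length"] == req["blkcnt"] * BLOCK_SIZE)
print(second["offset"] - first["offset"], (second["startblk"] - first["startblk"]) * BLOCK_SIZE)
```

Under that assumption the length on the grid disk is simply the block count times the block size, and the difference between the two grid disk offsets equals the difference between the two start blocks times the block size.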

Once the requests are translated, the next step in the smartscan is to send (‘push’) the requests to the appliances. This apparently is called a ‘payload map’.

kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160
kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d0a000
Pushing request : disknumber = 1, offset = 513723752448, len = 106496 rdba: 55714451 version 0 reqid=1 cver=1
disk=0x7f1b507908e0 state=1

The first line is the start of the pushing of the payload maps. The next line shows a specific appliance being chosen. What is shown next is a line showing ‘disk’ and the hash value of the disk. At the start of a maps push, the disk lines are followed by two lines saying ‘trans_req_element’ and ‘Pushing request’. These two lines probably are some kind of state object for the request, and the actual pushing of the request. Here we see the request id again, which we saw in the translation phase.

This is repeated, until some of the disk lines are starting to get followed immediately by another disk line:

disk=0x7f1b50cfdf80 state=1
trans_req_element = 0x7f1b50d12c40
Pushing request : disknumber = 7, offset = 513817985024, len = 1032192 rdba: 55716738 version 0 reqid=67 cver=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
trans_req_element = 0x7f1b50d2ffa0
Pushing request : disknumber = 9, offset = 513890353152, len = 4161536 rdba: 55729668 version 0 reqid=174 cver=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1

Further down in the trace file, the trans_req_element and Pushing request lines are becoming scarce:

disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d31920
Pushing request : disknumber = 1, offset = 498446925824, len = 4161536 rdba: 41208836 version 0 reqid=186 cver=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1

Inspection of these lines shows that the process is going through a strict sequence of the disks of that appliance, and picks up one request per disk which (obviously) belongs to that disk. If the requests are not evenly divided between the disks, some disks will have all their requests already pushed, while other disks still need additional requests. In that case, the disks which already have their requests pushed will not get a request, so there is no trans_req_element/Pushing request combination. The process goes through this until all the requests for that appliance are pushed.
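The pattern in the trace can be modelled with a few lines of Python. This is only a model of the observed behaviour, not Oracle’s code: the function name `push_order` and the disk numbers and request ids in the example are hypothetical.

```python
from collections import deque

def push_order(requests_per_disk):
    """Model of the observed push sequence for one appliance.

    requests_per_disk: dict of disk -> list of request ids, in the order
    in which the disks are visited. Returns the list of (disk, reqid) pushes.
    """
    queues = {disk: deque(reqids) for disk, reqids in requests_per_disk.items()}
    pushed = []
    while any(queues.values()):
        # One pass over all disks, in a fixed order.
        for disk, queue in queues.items():
            if queue:
                pushed.append((disk, queue.popleft()))
            # A disk without requests left is still visited, but nothing is
            # pushed: in the trace this is a 'disk=' line without a
            # 'Pushing request' line following it.
    return pushed

# Hypothetical example: disk 1 has three requests, disk 2 one, disk 3 two.
order = push_order({1: [10, 11, 12], 2: [20], 3: [30, 31]})
print(order)
```

In the first pass every disk gets a request, in the later passes only the disks which still have requests left, which is the thinning out of ‘Pushing request’ lines that is visible further down in the trace.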

After the push of all the requests for that appliance, the following happens:

kcfis_create_maps_payload. alloc_len 4088 num maps 55
throttle: mappayload: maps_to_push 7 iosize_being_pushed 150953984
kcfis_metadata_payload_len: app_state 0x7f1b50ca67b0 appliance 0x7f1b5078b3a8 payload_len 5968 payload_hdr_len 96 sessiondata_payload_len 144
metadata_payload_len 1536 fmetadata_payload_len 100 maps_len 3992 exthdr_len 8 planpayload_len 48 oflgrppayload_plen 40
kcfis_create_metadata_payload. appliance 0x7f1b5078b3a8 payload 0x7f1b50da3190 payload_memlen 5968 maps 0x7f1b50d07fe0 mapslen 3992
kcfis_create_metadata_payload: pushing sessiondata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing capability payload: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: dop: 1kcfis_create_metadata_payload: pushing metadata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing fast metadata: appliance 0x7f1b5078b3a8
kcfis_push: pushing metadata to appliance 0x7f1b5078b3a8. metadata 0x7f1b50da3190
kcfis_issue_ioctl: payload_type 1
WAIT #139755290955216: nam='cell smart table scan' ela= 178 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707571673
Ioctl completed. Payload type 1
Ioctl quarantine response 1 for appliance 0x7f1b5078b3a8
appliance 0x7f1b5078b3a8 : cellsrv pid: 13487: predicate: /box/predicate735745

The first line shows 55 maps have been pushed to the appliance. The other lines are various memory locations which are needed for gathering the reads which will be sent back by the appliances. Another thing which seems important is line 11, which issues an ioctl (IO control) request to the appliance, and waits for acknowledgement. Mind the wait is always ‘cell smart table scan’. Line 15 shows this request gets a predicate, which is ‘/box/predicate735745’.

This is repeated for every appliance.

Then the next thing happens:

kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
throttle: allowing map push: appliance 0x7f1b5078b3a8, maps_to_push 7
disk=0x7f1b50c953f0 state=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
kcfis_create_maps_payload. alloc_len 200 num maps 0

The maps payload push is done again for all the appliances, without any request being pushed. The last line confirms no maps/requests have been pushed: num maps 0. However, there is one line which hasn’t been there before: line 2, ‘throttle’. What is important in this line is ‘allowing map push’; the previous throttle line during ‘kcfis_create_maps_payload’ had the remark ‘mappayload’.

This means that at this point the physical extents to be scanned on the appliances (cell servers) have been identified, translated to appliance, grid disk, offset and size, and the requests for these extents have been sent to the appliances. The last snippet actually means that the appliance is notified to start preparing for sending results back.

After the appliances have been notified, memory is initialised again.

Default: calc numbufs mem 4194304
Final: calc numbufs mem 4194304 buflen 1048576
Alloc buffer: target_freebufs 4 allocated_freebufs 0
Get additional mem: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768
Starting work area: app_state: 0x7f1b50ca67b0 wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Started work area: wds: 0x7f1b50ca6768 mmwds 0x22446a1a8
Get additional mem for pga_aggregate_target: max 4195552 min 1048888, wds 0x7f1b50ca6768 mmwds 0x22446a1a8
cur size 0
Change req: expected size 4196352 cur size 0 max_mem (KB) 4098
Max memory allocation ok: max 4195552, expected 4196352, cur 0
Memlen allowed 4195552 io_buflen 1048576 chunk_len 1048888
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 4212028
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 4244796
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 4212028
Calling oss_cread: appliance 0x7f1b5078b3a8 app_buffer: 0x7f1b50659000 databuf: 0x7f1b50559000 buflen: 1048576 (posted to 192.168.12.8/)
appliance 0x7f1b5078b3a8 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50c94e78 app_buffer: 0x7f1b50549000 databuf: 0x7f1b50449000 buflen: 1048576 (posted to 192.168.12.11/)
appliance 0x7f1b50c94e78 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d059a0 app_buffer: 0x7f1b50439000 databuf: 0x7f1b50339000 buflen: 1048576 (posted to 192.168.12.9/)
appliance 0x7f1b50d059a0 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d02948 app_buffer: 0x7f1b50a7a000 databuf: 0x7f1b5097a000 buflen: 1048576 (posted to 192.168.12.10/)
appliance 0x7f1b50d02948 total creads 1 new creads 1 read seqno 0 pending reads 0
kcfis wait: buf: 0x7f1b50549000 app_state: 0x7f1b50ca67b0 err: (0) Success
kcfis wait: buf: 0x7f1b50659000 app_state: 0x7f1b50ca67b0 err: (0) Success
WAIT #139755290955216: nam='cell smart table scan' ela= 59 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707578411
kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160

First memory areas are initialised (lines 1-24), then we see lines showing ‘Calling oss_cread’. The oss_cread call is the call to the appliances to start sending a resultset back. Please mind that despite the calls addressing specific extents, this is a smartscan, so resultsets are sent back instead of Oracle blocks. Also, since this is Exadata using the iDB/RDS protocol over infiniband, the appliances can use RDMA to send the results back, which means the cells can fill the memory of the server process directly.
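The numbers in the memory initialisation lines fit together in a simple way. The Python sketch below only redoes the arithmetic with values copied from the trace; the function name `read_buffer_plan` and its parameter names are made up, and the relationships are inferred from the numbers, not taken from any documentation.

```python
def read_buffer_plan(total_mem, buflen, chunk_len, accounted_per_chunk):
    """Redo the buffer arithmetic that can be read from the trace.

    total_mem:           'calc numbufs mem' (4194304)
    buflen:              'buflen' / 'io_buflen' (1048576)
    chunk_len:           'chunk_len' / 'kcfis_alloc_readmem_chunk: sz' (1048888)
    accounted_per_chunk: 'incr_kcfis_mem_wds ... size' (1053007)
    """
    numbufs = total_mem // buflen
    return {
        "numbufs": numbufs,
        "max_alloc": numbufs * chunk_len,
        "chunk_overhead": chunk_len - buflen,
        "workarea_size": numbufs * accounted_per_chunk,
    }

plan = read_buffer_plan(4194304, 1048576, 1048888, 1053007)
print(plan)
```

The result matches the trace: 4 buffers (‘target_freebufs 4’), a maximum allocation of 4195552 (‘Memlen allowed 4195552’) and a work area size of 4212028 (‘Set workarea size ... size: 4212028’), with one oss_cread call posted per appliance using one buffer each.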

After oss_cread has been called, the ‘kcfis_create_maps_payload’ routine (shown in the snippet above the last snippet) is executed for every appliance, apparently to indicate all the disks being enabled, possibly trying to throttle activity, and/or to indicate requests will be called from this session. This seems to be repeated for every roundtrip during the entire smartscan for all the appliances that are still needed.

Whenever a result (result set) is ready to be processed, the following sequence happens:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

This shows the reap of a result set returned by an appliance. Lines 5-6 show appliance-specific information. Line 7 shows important information: ‘nelem’ (‘number of elements’) is the number of extents (called ‘elements’ in this context) for which results are returned. In this case it is 1 (the result set of a single element/extent). Line 15 shows the actual extent from which the result set came back, because it exposes the reqid, which was defined during the translation phase. This snippet ends with a WAIT line (again: all the waits are ‘cell smart table scan’). I consider this a cyclic process: first ‘kcfis_create_maps_payload’, then calling oss_cread for one or multiple cells, then a wait, or the above processing of results for one or multiple appliances, and a wait.

I’ve made this snippet as simple as possible; in real life the result sets of multiple appliances can be reaped at once (in my case I had to remove a second result). The processing of the result sets is done in a few stages: a result set is not fully processed per appliance, but each processing stage is done for all the result sets of all the appliances.

Also, this example shows only one request in the reaped result. There can be multiple requests (reqids/extents) returned.

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50439000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50d059a0 read seqno 1 pending reads 1
appliance 0x7f1b50d059a0 total creads 0 re-adjusted: read seqno 1 pending reads 1
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50439000, databuf: 0x7f1b50339000, nelem: 2
Dump of memory from 0x00007F1B5033B000 to 0x00007F1B5033B040
7F1B5033B000 0000A23C 02748C02 5BB31713 0402099A  [<.....t....[....]
7F1B5033B010 00988672 001E0001 0000007E 0000007E  [r.......~...~...]
7F1B5033B020 0000007E 00000000 00000000 00000000  [~...............]
7F1B5033B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50339080
kcfis_validate_translation: request 0x7f1b50d0b980
Req completed is : err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1 block_id=41192450
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b5033b000 afn 9 rdba 41192450 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 41192450 good: 1
Request 0x7f1b50d0b980 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b503390c8
kcfis_validate_translation: request 0x7f1b50d114e0
Req completed is : err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1 block_id=53758466
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b50342e98 afn 12 rdba 53758466 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 53758466 good: 1
Request 0x7f1b50d114e0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1
memptr (nil) len 0, blockid = 41192450
kcfis_get_next_data: elem = 1, err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1
memptr 0x7f1b5033b000 len 32408, blockid = 53758466
kcfis_get_next_data: dptr: 0x7f1b5033b000 len: 64816 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50439000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50439000
WAIT #139755290955216: nam='cell smart table scan' ela= 10 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707616790

(example with ‘nelem’=2)
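
A quick way to see how many elements each reaped buffer carried is to filter the ‘kcfis_process_completed_buffer: App Buffer’ lines out of the trace. The following is a minimal sketch, assuming the trace line layout shown in the snippets above; the function name nelem_per_buffer is made up for this illustration and is not part of any Oracle tooling.

```shell
# Print the application buffer address and nelem for every reaped buffer.
# Reads the trace file(s) given as arguments, or standard input if none.
nelem_per_buffer() {
  awk '/kcfis_process_completed_buffer: App Buffer/ {
         buf = $4; sub(/,$/, "", buf)   # 4th field is the address, strip its comma
         print buf, $NF                 # last field is the nelem value
       }' "$@"
}
```

Run against the snippet above, this should print one line with the buffer address 0x7f1b50439000 followed by 2.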

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b5078b3a8 read seqno 1 pending reads 0
appliance 0x7f1b5078b3a8 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 1
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03522293 5BB316ED 0402099A  [<...."R....[....]
7F1B5055B010 009849CB 001E0001 0000000D 0000000D  [.I..............]
7F1B5055B020 0000000D 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d0a000
Req completed is : err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1 block_id=55714451
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55714451 dlen: 3480 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 55714451 good: 1
Request 0x7f1b50d0a000 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1
memptr (nil) len 0, blockid = 55714451
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 3480 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

(example of result sets returned from two appliances, for which the returned results are processed)

Also, when the extent to be scanned is larger, the result set is processed in multiple steps:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 2 pending reads 1
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 2 pending reads 1
throttle: received: maps_to_push 0 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 3
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03523080 5BB3178D 0402099A  [<....0R....[....]
7F1B5055B010 0098FBC9 001E0001 00000080 00000080  [................]
7F1B5055B020 00000080 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d145c0
Req completed is : err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1 block_id=55718016
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_validate_translation: REQ3: splitting from bottom
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55718016 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 55718016 good: 1
Request 0x7f1b50d145c0 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b505590c8
kcfis_validate_translation: request 0x7f1b50d13d40
Req completed is : err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1 block_id=41195392
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_oss_block_verify: bp: 0x7f1b50563098 afn 9 rdba 41195392 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 41195392 good: 1
Request 0x7f1b50d13d40 done
kcfis_process_completed_buffer: 2 elem: 0x7f1b50559110
kcfis_validate_translation: request 0x7f1b50d13f60
Req completed is : err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1 block_id=59912064
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5056b130 afn 14 rdba 59912064 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 59912064 good: 1
Request 0x7f1b50d13f60 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1
memptr (nil) len 0, blockid = 55718016
kcfis_get_next_data: elem = 1, err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1
memptr 0x7f1b5055b000 len 32920, blockid = 41195392
kcfis_get_next_data: elem = 2, err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1
memptr 0x7f1b5055b000 len 65840, blockid = 59912064
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 98760 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707638819

(example with nelem=3, with bigger extents, for which the result is split)

Once the result is reaped for a certain appliance, a new oss_cread call must be done in order for another result to be pushed to the server process. Requests (reqids or extents) are not sent back in order.
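
To see the order in which requests actually came back, the ‘Req completed is’ lines can be listed in trace order. This is a small sketch that assumes the field layout of those lines as shown in the snippets above; completed_requests is a hypothetical helper name.

```shell
# List completed requests in the order they appear in the trace:
# reqid, disknumber and data_len, one request per line.
completed_requests() {
  awk '/^Req completed is/ {
         reqid = disk = dlen = ""
         for (i = 1; i <= NF; i++) {
           if ($i == "disknumber") { disk = $(i + 2); sub(/,$/, "", disk) }
           if ($i == "data_len")   { dlen = $(i + 2) }
           if ($i ~ /^reqid=/)     { reqid = substr($i, 7) }  # text after "reqid="
         }
         print "reqid=" reqid, "disk=" disk, "data_len=" dlen
       }' "$@"
}
```

For the traces shown here the reqids come out as 13, 56 and later 79, 75, 76, which illustrates that the completion order does not follow the reqid order.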

Once the requests are exhausted for a certain appliance, the appliance is excluded from the ‘kcfis_create_maps_payload’ procedure.

Once the smartscan is complete, the memory areas are cleaned up, and the sessions are closed. Apparently a session or some session state information is kept per grid disk, all of which are closed (the closing of one appliance is shown in this snippet):

kcfis_read DONE - ret NULL
kcfis_finalize: app_state 0x7f1b50ca67b0 permflags 0x9000 clnflags 0x1f
Appliance 0x7f1b5078b3a8 in-active. Active count 3
Appliance 0x7f1b50c94e78 in-active. Active count 2
Appliance 0x7f1b50d059a0 in-active. Active count 1
Appliance 0x7f1b50d02948 in-active. Active count 0
Caching: session at the end of scan 1
Work area gc: 0x7f1b50ca6768, app_state: 0x7f1b50ca67b0
Freeing read buffer chunk (GC) 0x7f1b50448eb0. count 4
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
Freeing read buffer chunk (GC) 0x7f1b50338eb0. count 3
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
Freeing read buffer chunk (GC) 0x7f1b50558eb0. count 2
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
Freeing read buffer chunk (GC) 0x7f1b50979eb0. count 1
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 0
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 32768
Freeing memory to top level heap: num_freed 4
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 0
Work area cleanup start: global_ctx: 0x7f1b50ca6d20
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 app_state: 0x7f1b50ca67b0 refcnt: 0
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 refcnt: 0
Closed work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Deallocating kcfis wds: 0x7f1b50ca6768
Cleanup work area for app_state: 0x7f1b50ca67b0
kcfis_finalize_cached_sessions: global_ctx 0x7f1b50ca6d20
Caching: in kcfis_finalize_cached_sessions global ctx 0x7f1b50ca6d20 total cached 1 cached in ctx 1
Deallocating session. app state 0x7f1b50ca67b0 num cached sessions 0
In kcfis_deallocate_session: app_state: 0x7f1b50ca67b0
KCFIS: [NSMTIO]:SQL for this (non)Smart I/O session is: 
select count(*) from t
kcfis_cache_appliance_fd: Cached appliance fd 0x7f1b50da2b88  for appliance 0x7f1b5078b3a8
Appliance 0x7f1b5078b3a8 was not active. Active count 0
kcfis_deallocate_session: Freeing disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507908e0 name = o/192.168.12.8/DATA_CD_01_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507903f0 name = o/192.168.12.8/DATA_CD_05_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50d04fc0 name = o/192.168.12.8/DATA_CD_10_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfe960 name = o/192.168.12.8/DATA_CD_03_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfdf80 name = o/192.168.12.8/DATA_CD_09_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfc930 name = o/192.168.12.8/DATA_CD_00_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc8118 name = o/192.168.12.8/DATA_CD_11_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc7ae0 name = o/192.168.12.8/DATA_CD_08_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc5ab0 name = o/192.168.12.8/DATA_CD_02_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc3a80 name = o/192.168.12.8/DATA_CD_04_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc30a0 name = o/192.168.12.8/DATA_CD_07_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis rcv update : op: 6 val: 8192 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 24576

And the appliances are notified the smartscan has ended:

kcfis_close_all_appliance_fds. Cache count 4
kcfis_close_appliance_fd: appliance_fd 0x7f1b50da2b88
WAIT #139755290955216: nam='cell smart table scan' ela= 241 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707978094
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5db00
WAIT #139755290955216: nam='cell smart table scan' ela= 237 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707978356
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5df58
WAIT #139755290955216: nam='cell smart table scan' ela= 280 cellhash#=674246789 p2=0 p3=0 obj#=61458 tim=1593707978668
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5e3b0
WAIT #139755290955216: nam='cell smart table scan' ela= 242 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707978935

Summary.
This blogpost tries to summarise (..) the different steps in an Exadata smartscan. One of the most important things it shows is that everything that is done is covered by a single wait event (‘cell smart table scan’). In other words: profiling this wait event tells you little to nothing about what is actually happening, except that a smartscan is being processed. So if you get high ‘cell smart table scan’ waits, either for all the waits or for a few, the only way to pinpoint what the waiting time is spent on is to run this trace again and see which step it is. Of course you can pinpoint whether the waiting is happening for a specific cell/appliance by looking at cellhash# in the wait line.
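
Grouping the waits by cellhash# can be done with a few lines of awk. This is a minimal sketch, assuming the WAIT line format shown in the traces above (with ela being the elapsed time, which I take to be in microseconds); the function name smartscan_waits_per_cell is made up for this example.

```shell
# Count the 'cell smart table scan' waits and sum their elapsed time (ela)
# per cellhash#. Reads the trace file(s) given as arguments, or stdin.
smartscan_waits_per_cell() {
  awk '/nam=.cell smart table scan./ {
         ela = ""; cell = ""
         for (i = 1; i <= NF; i++) {
           if ($i == "ela=")                  { ela = $(i + 1) }
           if (index($i, "cellhash#=") == 1)  { cell = substr($i, 11) }
         }
         total[cell] += ela; waits[cell]++
       }
       END {
         # output columns: cellhash#, number of waits, total ela
         for (c in total) print c, waits[c], total[c]
       }' "$@" | sort
}
```

A cell that stands out in the last column is the one to look at first; which step the time was spent in still has to come from the trace itself.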

Disclaimer:
Please mind I tried to use my knowledge of Oracle and Exadata processing together with the information the trace provided to build this description. If you encounter anything which is incorrect, please comment on this post, and I will try to get it fixed. No bits were harmed during testing.



Using git for revision control for Oracle DBAs

This post is aimed at people working with code, scripts and/or any other kind of text files. I try to give my point of view on revision control, and on git as a revision control system in particular.

The first thing you should ask yourself is: why use revision control in the first place? I think that’s a good and fair question. A lot of people I talk to see revision control as something for developers in projects with multiple people working on the same code, to build history and provide a single point of truth. I think revision control in that situation is indeed needed (in fact, I think it is a necessity).

Let’s now look at the situation of a DBA. Most people I work with have built up their own bundle of scripts to avoid constantly redoing stuff they’ve worked out in the past, and/or scripts they got or borrowed from other people (in fact, if you do not keep a copy of the excellent script bundles of Tanel Poder and Kerry Osborne you’re either brilliant enough to do this all off the top of your head, or stupid enough not to use them or know them, with a fair chance of falling in the second category).

Probably most (if not all) people change scripts, create entirely new ones, experiment, etc. How many times were you searching for that specific version/modification you made *somewhere*, but forgot where it was? In my own situation, when doing research, experiments and investigations, I use a lot of virtual machines (with/without ASM, different operating system versions, different Oracle versions; there are huge differences between 11.2.0.1/2/3/4, and Oracle version 12), and I need my scripts and want to conveniently move changes and use newly created scripts among these machines. I don’t think I even have to go into the area of seeing the history of changes to a certain script or the repository as a whole: this is something you will use when using revision control, or miss sorely if your scripts are not in revision control.

Once you or your team is convinced you need revision control, the immediate second question which always pops up (in the situations I worked in; I am aware there are other revision control systems) is: subversion or git? If you look at the title of this blog you know where I end up, but it’s good to give this some thought. Subversion is a decent revision control system, with which I’ve worked with great pleasure in the past. The history I’ve read on subversion is that it was made to be a successor to the CVS revision control system and to overcome some of its problems/limitations. Subversion is a revision control system that works in a client/server way: there is a central repository, and clients check out the source from that. This is no problem when client and server are on the same machine or in the local network, and even less of a problem if there’s only one user.

But what if there are multiple people working with it? And these people are located at vast distances from each other? And some of them do not have internet access all the time? This gives a lot of potential problems: conflicts between concurrent versions, performance that can be very bad because you need to connect to the central repository, and the need for a connection to the repository in the first place to commit your change. From what I’ve read about the git revision control system, these were some of the problems Linus Torvalds wanted to overcome with git.

A git repository always works locally, on the machine you are actually working on. You can link remote git repositories and push your changes to one or more repositories, or the other way around, pull changes from a remote repository to your repository. I guess most people immediately understand why this easily overcomes a lot of problems that were the result of having a single point as the repository. You always work in your own repository.

I’ve found this (git versus subversion) to be yet another topic on which people easily get into a ‘religious discussion’, which means the discussion isn’t about actual properties and pros and cons of both revision control systems, but about personal preference and sometimes ego. Another thing which is closely related to this is that there is a learning curve if you need to start using git.

Installation
I work on Mac OS X. In order to get git, just install the Xcode package from the App Store (free). On Oracle Linux, yum install the git package. I guess on Red Hat this works the same. On Windows, download the installer from http://msysgit.github.com/ and run it. (Please mind the examples in this blog are on OS X, and applicable to Linux.)

Initial configuration
Once you’ve installed git, the next thing to do is set your identity (the name and email address recorded with your commits):

$ git config --global user.name "John Doe"
$ git config --global user.email johndoe@example.com

I’ve set some aliases (which I’ve found in articles on git), to shorten typing a bit:

$ git config --global alias.co checkout
$ git config --global alias.ci commit
$ git config --global alias.st status
$ git config --global alias.hist 'log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short'

Okay, at this point you are set, but no repository has been created yet.

Create a repository
The next obvious step is to create a repository:

$ git init

This creates a .git directory in the current working directory, which means you are now in the root directory of the newly created repository. Please mind you can initialise a repository in the root directory of an existing project; the repository will still be empty until you add and commit files.

One of the other ways to start using a git repository which might have caught your eyes a few times, is clone ($ git clone git://github.com/username/reponame.git); which creates a local revision controlled copy of a remote git repository.

Let’s say I got a directory which contains a file ‘a’, and a subdirectory ‘tt’ which contains ‘b’ and ‘c’:

$ find .
.
./a
./tt
./tt/b
./tt/c

Now start a git repository in the root:

$ git init
Initialized empty Git repository in /Users/fritshoogland/t/.git/

Now let’s look at what our repository looks like:

$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	a
	tt/

nothing added to commit but untracked files present (use "git add" to track)

Here we see git showing us that there are no files in it, but that it sees our test files, which are currently untracked. Let’s add all the files:

$ git add *

If we look at the status of the repository, we see the files are added, but not yet committed:

$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   a
	new file:   tt/b
	new file:   tt/c

And commit the additions:

$ git commit -m 'initial commit.'
[master (root-commit) 735ed81] initial commit.
 3 files changed, 3 insertions(+)
 create mode 100644 a
 create mode 100644 tt/b
 create mode 100644 tt/c

If you commit something, git wants you to add a remark with this commit. This can be done with the git command by adding ‘-m’ and a comment within quotation marks, or by omitting this, which makes git fire up an editor, in which you can type a comment.

Changes
At this point we have our files committed to the repository, and the files and repository are completely in sync:

$ git status
On branch master
nothing to commit, working directory clean

Now let’s change something. My file ‘a’ has got one line in it, with a single ‘a’. I add a second line to the file ‘a’ reading ‘second line’:

$ cat a
a
second line

Now let’s ask git for the status:

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   a

no changes added to commit (use "git add" and/or "git commit -a")

Here we see that git knows about the modification in file a. The next thing we see is the line ‘Changes not staged for commit’. This is something which is different from subversion. Instead of committing changed files directly, git follows a two-stage approach: a changed file needs to be staged for commit first, after which it can be committed. This allows you to group specific changed files into one commit, instead of committing everything that changed.

Now let’s stage the change as indicated by the help provided with ‘git status’, using ‘git add’. Please mind the ‘add’ command is used both for making files version controlled and adding them to the stage list. I issued a subsequent git status to show the new status:

$ git add a
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   a

We can look at the difference between the last committed version and the staged changes by using git diff --cached:

$ git diff --cached
diff --git a/a b/a
index 7898192..c4fbe32 100644
--- a/a
+++ b/a
@@ -1 +1,2 @@
 a
+second line

Now let’s commit the change, and look at the status:

$ git commit -m 'added line to a.'
[master 17ec0e8] added line to a.
 1 file changed, 1 insertion(+)
$ git status
On branch master
nothing to commit, working directory clean
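
The two-stage approach becomes more visible when more than one file has changed. The following is a small self-contained sketch that uses a throwaway repository in a temporary directory; the function name demo_partial_commit, the file names and the ‘Demo User’ identity are made up for this illustration.

```shell
# Change two files, stage and commit only one of them, and show what is left.
demo_partial_commit() {
  repo=$(mktemp -d) || return 1
  (
    cd "$repo" || exit 1
    git init -q .
    git config user.name "Demo User"          # identity for this throwaway repo only
    git config user.email "demo@example.com"
    printf 'a\n' > a
    printf 'b\n' > b
    git add a b
    git commit -q -m 'initial commit.'
    printf 'second line\n' >> a               # change both files...
    printf 'second line\n' >> b
    git add a                                 # ...but stage only file a
    git commit -q -m 'added line to a.'
    git status --porcelain                    # short status: b is still modified
  )
  rm -rf "$repo"
}
```

The status at the end should list only b as modified and not staged: the change to b stayed out of the commit because it was never staged.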

Now that we have built a tiny bit of history, we can look at it using git log. This command can be used to look at the changes of the entire repository, or at the history of a specific file. I use the alias ‘hist’ created above:

$ git hist a
* c8ec93c 2014-05-25 | added line to a. (HEAD, master) [FritsHoogland]
* 739f11a 2014-05-25 | initial commit. [FritsHoogland]

One common thing you want to do is to see what changes were made to a file. To see the changes between an earlier commit and the current version in the repository (which is called ‘HEAD’, which means ‘last commit in current branch’), use git diff with the hash of the earlier commit (look at the above example of git hist):

$ git diff 739f11a a
diff --git a/a b/a
index 7898192..c4fbe32 100644
--- a/a
+++ b/a
@@ -1 +1,2 @@
 a
+second line

Of course you can also checkout that specific version (that’s the reason for having version control!):

$ git checkout 739f11a a

Now the file a is reverted to the version without the line ‘second line’ in it. Because checking out a file from an earlier commit changes the file and also stages that change, git considers it changed:

$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   a

So after checking out this version you can choose to commit it as a new version, or go back to the latest committed version: use git reset HEAD a to unstage the change (as described by the status command), followed by git checkout -- a to restore the file.

Remote repositories
However, everything described up to now is version control done locally. How about collaboration with a team, like subversion offers with the subversion server? Well, this is where git is inherently different. Git works in a peer-to-peer fashion, instead of the client-server way of subversion.

You can link your repository to a remote git repository, and push changes made in your local repository to a remote repository, or pull changes made in a remote repository to your own local repository. Especially if you work with a team, this is how you can centralise the source code in a very organised way (you can add a web interface for example).

This is how I create a remote repository on my Synology NAS and link my local repository to it:

$ ssh Username@nas.local                                             # log on to remote server
$ mkdir burn.git                                                     # create directory for repo
$ cd $_
$ git init --bare                                                    # create empty git repository
$ git update-server-info                                             # not sure if this is needed
$ exit                                                               # log off; the rest runs in the local repository
$ git remote add origin Username@nas.local:burn.git                  # add remote
$ git push -u origin master                                          # push local master to origin
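
If you want to try the push mechanics without a NAS, a bare repository on the local filesystem behaves the same way as a remote. This is a sketch only: the directory names, the function name demo_push_to_bare and the ‘Demo User’ identity are made up, and everything happens in a temporary directory.

```shell
# Create a bare repository, add it as 'origin' to a local repository and push.
demo_push_to_bare() {
  work=$(mktemp -d) || return 1
  (
    cd "$work" || exit 1
    git init -q --bare remote.git              # stands in for the repository on the NAS
    git init -q local
    cd local || exit 1
    git config user.name "Demo User"
    git config user.email "demo@example.com"
    printf 'a\n' > a
    git add a
    git commit -q -m 'initial commit.'
    branch=$(git symbolic-ref --short HEAD)    # master or main, depending on configuration
    git remote add origin ../remote.git
    git push -q -u origin "$branch" >/dev/null 2>&1
    # show the branches that now exist in the "remote" repository
    git ls-remote origin 'refs/heads/*' | awk '{ print $2 }'
  )
  rm -rf "$work"
}
```

After the push, the bare repository contains the branch that was pushed, which is what the last command lists.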

One of the things I use to keep track of the current development version and the public client version of scripts is tags. This command tags the last commit with “prod-1.0” (git does not accept spaces in tag names):

$ git tag prod-1.0 HEAD

In order to use tags, you need to see which tags exist. This is very simple with the ‘git tag’ command. To look at the tags of a remote git repository, use:

$ git ls-remote --tags Username@nas.local:gdb_macros.git

If you want to get the whole repository for production usage, without the versioning, use the ‘git archive’ command (remove --remote and its argument to archive from a local repository). This version creates a gzipped tarball:

$ git archive master --remote Username@nas.local:gdb_macros.git --format=tar | gzip > gdb_macros.tgz

You can also archive a tag instead of the latest version. This version creates the files in the current directory:

$ git archive prod-1.0 --remote Username@nas.local:gdb_macros.git --format=tar | tar xf -

I hope this blog encouraged you to put your files in a (git) repository.


Reconstructing oratab from the cluster registry

At the Accenture Enkitec Group we have a couple of Exadata racks for Proof of Concepts (PoC), performance validation, research and experimenting. This means the databases on the racks appear and vanish more often than they would (or should) on an average customer Exadata rack (to be honest, most people use a fixed few existing databases rather than creating and removing a database for every test).

Nevertheless, we got into a situation where the /etc/oratab file was not in sync with the databases registered in the cluster registry. This situation can happen for a number of reasons. For example, if you clone a database (RMAN duplicate), you end up with a cloned database (I sincerely hope), but this database needs to be manually registered in the cluster registry. The same applies to creating a standby database (for which one of the most used methods is the clone procedure with a couple of changes).

However, the reasons above are quite obvious. There is another reason which is way less obvious: bug 17172091 (Oracle restart may update oratab incorrectly after a node reboot) can also cause your oratab to get out of sync with the databases/instances in the cluster registry. Additional information: Oracle versions confirmed as affected are 11.2.0.3 and 11.2.0.2. This bug is reported to be fixed with Grid Infra PSU 11.2.0.3.9 and 11.2.0.4.2 (yes, I am aware of the inconsistency between the affected versions and the fixed versions, this is from the Oracle bug information available; thanks to Tanel Põder for finding this bug).

In order to recreate the oratab, you need to go through the cluster registry information, and compare it with your oratab. Especially if you’ve got a lot of databases, and/or single instance databases on different nodes, this can be quite some work. To relieve the work in that situation, I created a little shell script that parses the cluster registry database information, gets the db_unique_name (which is what the first field in the oratab file actually is, thanks to Randy Johnson) and the oracle home path, and outputs this in “oratab format” (db_unique_name:oracle home path:N). Needless to say, this script just outputs to STDOUT. If you want to use this information, redirect it to a file, or copy and paste it into oratab yourself.

# list the database resources that last ran on this node, and print an oratab line for each
node=$(hostname -s)
for resource in $(crsctl status resource -w "((TYPE = ora.database.type) AND (LAST_SERVER = $node))" | grep ^NAME | sed 's/.*=//'); do
	# -f prints the full resource profile, including DB_UNIQUE_NAME and ORACLE_HOME
	full_resource=$(crsctl status resource -w "((NAME = $resource) AND (LAST_SERVER = $node))" -f)
	db_name=$(echo "$full_resource" | grep ^DB_UNIQUE_NAME= | awk -F= '{ print $2 }')
	ora_home=$(echo "$full_resource" | grep ^ORACLE_HOME= | awk -F= '{ print $2 }')
	printf "%s:%s:N\n" "$db_name" "$ora_home"
done
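
Comparing the generated list with the existing /etc/oratab can be scripted too. The following is only a minimal sketch and not part of the script above: the function name oratab_missing and the file names in the usage note are made up for illustration, and it assumes both files use the plain "name:home:flag" oratab layout.

```shell
# Hypothetical helper: print the lines of a generated oratab listing (file $1)
# whose first field does not occur in the existing oratab (file $2).
# Comment lines and empty lines in the existing oratab are ignored.
oratab_missing() {
    awk -F: -v existing="$2" '
        BEGIN {
            # remember the first field of every real entry in the existing oratab
            while ((getline line < existing) > 0) {
                if (line !~ /^[ \t]*(#|$)/) {
                    split(line, f, ":")
                    seen[f[1]] = 1
                }
            }
        }
        # print generated entries that the existing oratab does not know about
        NF > 0 && !($1 in seen)
    ' "$1"
}
```

If the output of the script is redirected to, for example, /tmp/oratab.new, then running oratab_missing /tmp/oratab.new /etc/oratab would list only the entries that still need to be added.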

Tagged: bug, cluster, cluster registry, oracle, oratab, recreate oratab

Getting the database parameters from a spfile

There’s been some debate about how to get the parameters from a spfile. A spfile is a binary version of the parameter file of the Oracle database.

I added to the debate that, in my experience, there is some weirdness with using the strings command on the spfile. The discussion was on twitter, so I didn’t add that most of the time it cost me more time than I saved by using the “shortcut” of running strings on a spfile.

Let me show you what it means.

I’ve got a database with storage on ASM. Among other options, there are two simple methods to get the spfile from ASM.

You can get the spfile by logging on to the database, creating a pfile from the spfile, and creating a spfile again:

SYS@v11204 AS SYSDBA> show parameter spfile

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
spfile				     string	 +DATA/v11204/spfilev11204.ora

Now let’s recreate the spfile on a filesystem:

SYS@v11204 AS SYSDBA> create pfile='/tmp/pfile' from spfile='+DATA/v11204/spfilev11204.ora';

File created.

SYS@v11204 AS SYSDBA> create spfile='/tmp/spfile' from pfile='/tmp/pfile';

File created.

Another option is to copy the spfile out of ASM:
Set the ASM environment and execute asmcmd

[oracle@ol65-oracle11204 [v11204] ~]$ +ASM
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@ol65-oracle11204 [+ASM] ~]$ asmcmd
ASMCMD>

Now go to the DATA disk group, and the directory of the database (my database is called v11204). If you look here, you’ll see an alias for the spfile, which links to its true ASM location:

ASMCMD> cd data/v11204
ASMCMD> ls
CONTROLFILE/
DATAFILE/
ONLINELOG/
PARAMETERFILE/
TEMPFILE/
spfilev11204.ora

If you take a long listing, you see the true ASM place:

ASMCMD> ls -l
Type           Redund  Striped  Time             Sys  Name
                                                 Y    CONTROLFILE/
                                                 Y    DATAFILE/
                                                 Y    ONLINELOG/
                                                 Y    PARAMETERFILE/
                                                 Y    TEMPFILE/
                                                 N    spfilev11204.ora => +DATA/V11204/PARAMETERFILE/spfile.265.847477361

In asmcmd, you can just copy the spfile (the real name, not the “spfilev11204.ora” one, which is a kind of symbolic link):

ASMCMD> cp +DATA/V11204/PARAMETERFILE/spfile.265.847477361 /tmp/spfile.asm
copying +DATA/V11204/PARAMETERFILE/spfile.265.847477361 -> /tmp/spfile.asm

Okay, now back to using strings on the spfile. If I issue strings on the spfile, I get what looks like a complete parameter file:

$ strings spfile.asm
v11204.__db_cache_size=692060160
v11204.__java_pool_size=4194304
v11204.__large_pool_size=8388608
v11204.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from environment
v11204.__pga_aggregate_target=524288000
v11204.__sga_target=943718400
v11204.__shared_io_pool_size=0
v11204.__shared_pool_size=226492416
v11204.__streams_pool_size=0
*.audit_file_dest='/u01/app/oracle/admin/v11204/adump'
*.audit_trail='db'
*.compatible='11.2.0.4.0'
*.control_files='+DATA/v11204/controlfile/current.25
6.847475315'
*.db_block_size=8192
*.db_create_file_dest='+DATA'
*.db_domain=''
*.db_file_multiblock_read_count=1024
*.db_name='v11204'
*.diagnostic_dest='/u01/app/oracle'
*.disk_asynch_io=TRUE
*.dispatchers='(PROTOCOL=TCP) (SERVICE=v11204XDB)'
*.open_cursors=300
*.pga_aggregate_target=524288000
*.processes=150
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=943718400
*.undo_tablespace='UNDOTBS1'

But is it? If you take a detailed look, you’ll see something which is not alright:

*.compatible='11.2.0.4.0'
*.control_files='+DATA/v11204/controlfile/current.25
6.847475315'
*.db_block_size=8192

This is exactly what I meant: for some reason, the line with ‘control_files’ seems to have been broken into two pieces. I think I don’t need to tell most readers of this blog that tracing this kind of oddity costs a lot of time, especially if you’ve got a big spfile. And: most of the time when you are playing around with this, there probably is something wrong, and you simply don’t have the time for this fooling around.

But why is this happening? Let’s look at the strings output of the spfile using the ‘od’ utility (octal dump):

$ strings spfile.asm | od -t d1 -a -A d
0000000  118   49   49   50   48   52   46   95   95  100   98   95   99   97   99  104
           v    1    1    2    0    4    .    _    _    d    b    _    c    a    c    h
0000016  101   95  115  105  122  101   61   54   57   50   48   54   48   49   54   48
           e    _    s    i    z    e    =    6    9    2    0    6    0    1    6    0
0000032   10  118   49   49   50   48   52   46   95   95  106   97  118   97   95  112
          nl    v    1    1    2    0    4    .    _    _    j    a    v    a    _    p
0000048  111  111  108   95  115  105  122  101   61   52   49   57   52   51   48   52
           o    o    l    _    s    i    z    e    =    4    1    9    4    3    0    4

This is the beginning of the spfile, to get an idea what we are looking at.
The numbers on the left side (0000000, 0000016, etc.) are the position numbers in decimal. This shows there are 16 characters per line. The numbers on the line of the position are the ASCII values. The character representation of the ASCII value is BENEATH it. If you now read the lines (looking at the character representation), you see ‘v11204.__db_cach’ on the first line and ‘e_size=692060160’ on the second, followed on the third line by ASCII value 10, which is represented with ‘nl’: newline.
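
To double check a row of the dump, the decimal values can be converted back to characters. This is just a small sketch; the function name ascii_decode is made up for illustration.

```shell
# Hypothetical helper: turn decimal ASCII values (as printed by od -t d1)
# back into characters. Each value is first formatted as a 3-digit octal
# number, which printf then interprets as an octal escape sequence.
ascii_decode() {
    for value in "$@"; do
        printf "\\$(printf '%03o' "$value")"
    done
    printf '\n'
}
```

Feeding it the sixteen values of the first row (118 49 49 50 48 52 46 95 95 100 98 95 99 97 99 104) gives back the text that od prints beneath them.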

Okay, now that we are used to reading this output, let’s look at the problem section with the control_files line:

0000432   46   48   39   10   42   46   99  111  110  116  114  111  108   95  102  105
           .    0    '   nl    *    .    c    o    n    t    r    o    l    _    f    i
0000448  108  101  115   61   39   43   68   65   84   65   47  118   49   49   50   48
           l    e    s    =    '    +    D    A    T    A    /    v    1    1    2    0
0000464   52   47   99  111  110  116  114  111  108  102  105  108  101   47   99  117
           4    /    c    o    n    t    r    o    l    f    i    l    e    /    c    u
0000480  114  114  101  110  116   46   50   53   10   54   46   56   52   55   52   55
           r    r    e    n    t    .    2    5   nl    6    .    8    4    7    4    7
0000496   53   51   49   53   39   10   42   46  100   98   95   98  108  111   99  107
           5    3    1    5    '   nl    *    .    d    b    _    b    l    o    c    k

If you look closely, you can see ‘*.control_fi’ on the first line, after the ASCII value 10, newline.
If we read on, it shows:
*.control_files=’+DATA/v11204/controlfile/current.25 (nl) 6.847475315’ (nl)
In other words, there is an additional newline. But the position (488) seems strange to me.

But when looking at how I generated this, I executed ‘strings’. This means the output is filtered to readable characters. Would there be unreadable characters in a spfile? Let’s look!

$ cat spfile.asm | od -t d1 -a -A d
0000000   67   34    0    0    1    0    0    0    0    0    0    0    0    0    1    4
           C    "  nul  nul  soh  nul  nul  nul  nul  nul  nul  nul  nul  nul  soh  eot
0000016   77   25    0    0    0    0    0    0    0    0    0    0    0    0    0    0
           M   em  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul
0000032    0    0    0    0    0    0    0    0    0    0    0    0    5    0    0    0
         nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  enq  nul  nul  nul
0000048    0    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0
         nul  stx  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul
0000064    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
         nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul  nul

Aha! So strings does filter a lot of stuff! Now let’s look at the control_files parameter again:

0000960   46   48   46   52   46   48   39   10   42   46   99  111  110  116  114  111
           .    0    .    4    .    0    '   nl    *    .    c    o    n    t    r    o
0000976  108   95  102  105  108  101  115   61   39   43   68   65   84   65   47  118
           l    _    f    i    l    e    s    =    '    +    D    A    T    A    /    v
0000992   49   49   50   48   52   47   99  111  110  116  114  111  108  102  105  108
           1    1    2    0    4    /    c    o    n    t    r    o    l    f    i    l
0001008  101   47   99  117  114  114  101  110  116   46   50   53    1   67    0    0
           e    /    c    u    r    r    e    n    t    .    2    5  soh    C  nul  nul
0001024   67   34    0    0    3    0    0    0    0    0    0    0    0    0    1    4
           C    "  nul  nul  etx  nul  nul  nul  nul  nul  nul  nul  nul  nul  soh  eot
0001040   34  121    0    0   54   46   56   52   55   52   55   53   51   49   53   39
           "    y  nul  nul    6    .    8    4    7    4    7    5    3    1    5    '

I think this is quite self explanatory to a lot of people. If not, let me help you: all is as we expect up to position 1019, after which there are non-parameter bytes from position 1020 up to and including position 1043. This means there are 24 character positions which contain something else, after which the parameter file contents continue.

But now look at the position: it’s around position 1024. It’s my guess that the spfile uses a block size of 1024 bytes (1KB). In order to check for consistency of the parameter file blocks, Oracle puts some extra (internal) data on the borders of the block so integrity can be checked. This is like an Oracle database block.
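
The block size guess can be checked with a bit of arithmetic and a targeted dump. The following is a sketch only: it assumes the 1024 byte block size guessed above (this is not a documented value), and the function names are made up for illustration.

```shell
# Hypothetical helpers, assuming a block size of 1024 bytes (a guess, see text).

# Print the number of the block that contains byte offset $1
# (optional $2 overrides the assumed block size).
block_of() {
    echo $(( $1 / ${2:-1024} ))
}

# Dump the bytes around the start of block $2 (1 or higher) in file $1:
# 16 bytes before the boundary and 32 bytes after it.
# The body runs in a subshell so the variable does not leak to the caller.
dump_boundary() (
    blocksize=${3:-1024}
    [ "$2" -ge 1 ] || exit 1
    od -t d1 -a -A d -j "$(( $2 * blocksize - 16 ))" -N 48 "$1"
)
```

With these, block_of 1019 and block_of 1044 report different blocks for the two halves of the control_files value, and dump_boundary /tmp/spfile.asm 1 prints the three od lines starting at positions 1008, 1024 and 1040.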

So, there you have a reason not to use strings on the spfile, unless you like a game of ‘find the random newlines’ in your new pfile.

How should you create a new pfile then?

The preferred method is using ‘create pfile from spfile’. This requires logging on to the instance (nomount is enough).
An alternative is to look at the alert.log file. When starting an Oracle instance, the non-default parameters are printed in the alert.log. This is a very simple, yet useful method of reconstructing the parameter file.
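
Pulling that parameter list out of the alert.log can be scripted. The sketch below assumes the text alert.log layout I know from 11.2, in which the list follows a line reading "System parameters with non-default values:" and every parameter line is indented. The function name is made up, and the values may need their quoting adjusted by hand before the output is usable as a pfile.

```shell
# Hypothetical helper: print the non-default parameters of the most recent
# instance startup found in the alert.log given as $1.
# Assumption: the list follows the header line and consists of indented
# "name = value" lines; the first non-indented line ends the list.
alert_params() {
    awk '
        /^System parameters with non-default values:/ { n = 0; inlist = 1; next }
        inlist && /^[ \t]+[^ \t]/ { sub(/^[ \t]+/, ""); line[++n] = $0; next }
        { inlist = 0 }
        END { for (i = 1; i <= n; i++) print line[i] }
    ' "$1"
}
```

Because the counter is reset at every header line, only the list of the last startup in the file is printed.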

Update: I got an email from Bjoern Rost saying that my statement that the instance needs to be at least in the nomount phase is not true, because the ‘create pfile from spfile’ commands can be used with the instance being down.

I decided to take this up for an additional test. First of all, there are two methods which can be used for pfile/spfile manipulation via sqlplus (as far as I know): logging on with SYSDBA privileges, and starting sqlplus on the local node without logging on to any instance (/nolog).

a) instance is OPEN (which behaves the same as the nomount and mount phases)

With SYSDBA privilege.

$ sqlplus / as sysdba 
...
SYS@v11204 AS SYSDBA> create pfile='/tmp/tt' from spfile;

File created.

With /nolog.

$ sqlplus /nolog
...
@ > create pfile='/tmp/tt' from spfile;
SP2-0640: Not connected

After giving this a little thought, it’s kind of obvious that only saying spfile can’t work with /nolog: there are no settings, we are not connected to any instance. So let’s try specifying a full path for both pfile and spfile:

$ sqlplus /nolog
...
@ > create pfile='/tmp/tt' from spfile='+DATA/v11204/spfilev11204.ora';
SP2-0640: Not connected

No. sqlplus /nolog can’t be used in my situation (Oracle 11.2.0.4, Linux X64 OL 6u5, ASM, instance open).

b) instance down

With SYSDBA privilege:

$ sqlplus / as sysdba
...
Connected to an idle instance.

SYS@v11204 AS SYSDBA> create pfile='/tmp/tt' from spfile='+DATA/v11204/spfilev11204.ora';

File created.

Aha! So we can manipulate the pfile and spfile with the instance being down when we logon as SYSDBA!

Let’s look at another case:

$ sqlplus / as sysdba
...
Connected to an idle instance.

SYS@v11204 AS SYSDBA> create pfile='/tmp/tt' from spfile;
create pfile='/tmp/tt' from spfile
*
ERROR at line 1:
ORA-01565: error in identifying file '?/dbs/spfile@.ora'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3

What is shown here is that when my instance is down, I can’t use ‘create pfile from spfile’ without a specification for the spfile. Apparently, it looks at the default location ($ORACLE_HOME/dbs) for the default spfile name (spfilev11204.ora in my case): ORA-01565: error in identifying file ‘?/dbs/spfile@.ora’. ‘?’ means $ORACLE_HOME and ‘@’ means instance name. Let’s see if this works if we create a spfile in that location:

SYS@v11204 AS SYSDBA> create pfile='/tmp/tt' from spfile='+DATA/v11204/spfilev11204.ora';

File created.

SYS@v11204 AS SYSDBA> create spfile from pfile='/tmp/tt';

File created.

The first step is already done, but redone for completeness’ sake. The second step creates a spfile in the default location from the pfile ‘/tmp/tt’. Now we can ask sqlplus to essentially do the same but in reverse: create ‘/tmp/tt’ from spfile:

SYS@v11204 AS SYSDBA> create pfile='/tmp/tt' from spfile;

File created.

Yes! Now this works. The reason it failed before is that I explicitly want my spfile not to be in the default location ($ORACLE_HOME/dbs), because that is on a local filesystem. A local spfile is not a problem with single instance databases with no shared storage at all, but it is not practical in the case of RAC, or with single instance databases on a cluster with shared storage (think Exadata here!), where having the spfile on shared diskspace means the instance can very easily be started on another node.
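
To see which file the ‘?/dbs/spfile@.ora’ notation from the ORA-01565 message points to, the two placeholders can be expanded by hand. This is a sketch of the substitution as described above, not of what Oracle does internally; the function name is made up for illustration.

```shell
# Hypothetical helper: expand the '?' and '@' placeholders of a path as shown
# in the ORA-01565 message. $1 = path, $2 = ORACLE_HOME, $3 = instance name.
# Assumption: the values contain no '|', '&', '?' or '@' characters, which
# would confuse the simple sed substitutions used here.
expand_oracle_path() {
    printf '%s\n' "$1" | sed -e "s|?|$2|g" -e "s|@|$3|g"
}
```

Calling expand_oracle_path '?/dbs/spfile@.ora' "$ORACLE_HOME" "$ORACLE_SID" prints the full name of the default spfile for the current environment.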

Let’s try sqlplus /nolog again:

$ sqlplus /nolog
...
@ > create pfile='/tmp/tt' from spfile='+DATA/v11204/spfilev11204.ora';
SP2-0640: Not connected

Nope. That doesn’t work. I’ve seen blogposts indicating that /nolog can be used for pfile/spfile manipulation, but this didn’t work in my case. Bjoern was right that pfile/spfile manipulation can be done while the instance is down.


Tagged: oracle, parameter, parameter file, pfile, spfile

Exadata storage indexes and DML

Last week I got a question on how storage indexes (SI) behave when the table for which the SI is holding data is changed. Based on logical reasoning, it can be one of two things: the SI is invalidated because the data it’s holding is changed, or the SI is updated to reflect the change. Think about this for yourself, and make a choice. I would love to hear if you chose the correct one.

First, let’s take a step back and lay some groundwork. The tests done in this blogpost are done on an actual Exadata (V2 hardware), with Oracle version 11.2.0.4.6 (meaning bundle patch 6). The Exadata “cellos” (Cell O/S) version is 11.2.3.3.1.140529.1 on both the compute nodes and the storage nodes.

A storage index is a memory structure used by the cell daemon, which is the storage server process on the storage layer of Exadata. By default (although I’ve never seen it different yet) a SI contains minimum and maximum values for up to eight columns of the table it is describing. The memory structure is transient. It describes a region of one megabyte of a table. A storage index is populated during the fetching of data for a smart scan, based on the filter predicates of the query causing the smart scan. In essence the whole SI management is done automatically by the cell daemon (“essence” means you can play around with some undocumented settings on the database level, and there are some undocumented settings and events you can set on the cell server level).

Okay, back to the original question. Let’s test this. First we need a table large enough to be smartscanned. I’ve got a table called ‘BIGTAB_NOHCC’ (as you probably guessed, there is a table ‘BIGTAB’ in the schema I am using, which is hybrid columnar compressed). This table consists of 2671736 blocks of 8KB, which means it has a total size of 21G. This is big enough for my database instance (big enough is relative to the buffer cache size) to get a smart scan.

For this test, I use the session statistics. In v$sesstat there is a statistic called ‘cell physical IO bytes saved by storage index’, which tells us how many bytes we saved from being scanned because of the use of storage indexes. I also show some output of a (single) storage server, although a normal Exadata rack typically will have 3, 7 or 14 storage servers in use.

First of all, in order to get an idea and be pretty sure my query will populate storage indexes, I stop and start the cell daemon on the storage servers. I use ‘service celld restart’ for that.

Next, in order to get storage index information for the table I use, I need some metadata. The metadata I need are:
a) data object id

select data_object_id from dba_objects where owner = 'MARTIN' and object_name = 'BIGTAB_NOHCC';

b) tablespace number

select ts# from v$tablespace t, dba_segments s, dba_objects o where t.name=s.tablespace_name and s.segment_name=o.object_name and o.data_object_id = 18716;

c) database unique id from the x$ksqdn

select ksqdngunid from x$ksqdn;

(thanks to the progeeking website for this information)

Next, I run the query on my BIGTAB_NOHCC table. Please mind it’s fundamental to have a filter predicate, and that the execution plan shows we are doing a full table scan. Only with a filter predicate the storage server has a reason to build a storage index:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

Now let’s see if the storage layer built a storage index:

CellCLI> alter cell events = "immediate cellsrv.cellsrv_storidx('dumpridx','all',18716,5,1126366144)";
Dump sequence #1 has been written to /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Cell enkcel01 successfully altered

This is why it’s handy to have the aforementioned data on the table: you can dump the specific storage indexes for a single table. The three numbers after ‘all’ are data_object_id, tablespace number and the ksqdngunid from the x$ksqdn table.

Let’s look in the file which the cell event produced:

# less /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Trace file /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
ORACLE_HOME = /opt/oracle/cell
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.39-400.128.17.el5uek
Version:        #1 SMP Tue May 27 13:20:24 PDT 2014
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.3.1_LINUX.X64_140529.1

*** 2014-07-12 13:16:17.030

*** 2014-07-12 13:16:17.030
UserThread: LWPID: 27455 userId: 58 kernelId: 58 pthreadID: 0x7f2863fe7940
2014-07-12 13:21:36.499123 :0005C9DE: $$$ Dumping storage idx summary for griddisk DATA_CD_08_enkcel01:
2014-07-12 13:21:36.499217 :0005C9E0: Dump sequence #1:

*** 2014-07-12 13:21:36.499
2014-07-12 13:21:36.499212 :0005C9DF: 
***************************
2014-07-12 13:21:36.499249 :0005C9E1: Dumping RIDX summary for objd 18716, tsn 5, dbid 1126366144

2014-07-12 13:21:36.499249*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-12 13:21:36.499249*: RIDX: strt 0 end 2048 offset 6215958528 size 1048576 rgnIdx 5928 RgnOffset 0 scn: 0x099a.5cc294b8 hist: 2
2014-07-12 13:21:36.499249*: RIDX validation history: 
2014-07-12 13:21:36.499249*: 0:FullRead 1:Undef 2:Undef 3:Undef 4:Undef 5:Undef 6:Undef 7:Undef 8:Undef 9:Undef
2014-07-12 13:21:36.499249*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-12 13:21:36.499249*: lo: c3 c 34 11 0 0 0 0
2014-07-12 13:21:36.499249*: hi: c3 5a 5 18 0 0 0 0

Here you see the typical heading of an Oracle trace file, followed by the announcement of the dump (“Dumping storage idx summary”, “Dumping RIDX summary”). The real storage index information starts with “RIDX(0x7f27b4ddab64)”. Starting with that and until the “lo” and “hi” values, you are looking at an actual storage index which holds data for a single column. You can see which column by looking at the “Col id” in the square brackets: 1. It’s interesting to note that there is a scn (system change number) included. The storage index shows if there are NULL values in the column (in this case it says HASNONNULLVALUES, so we don’t have any NULL values in the 1MB chunk in the column this storage index describes), and, of course, the low and high values in the Oracle internal data format.

So, despite the lack of any indication on the database layer, the query built storage indexes! That should mean that executing the same query again will result in actually using the storage indexes which were just built:

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

Yes! This run used the storage index we caused to be built previously!

With this being established, we can try to get an answer to our question: how does the storage index react to DML on the table it is describing?

We built the storage index, and used it. Now let’s update the ‘id’ field for which the storage index was built, and redo our query test:

MARTIN:dbm011> update bigtab_nohcc set id = id + 1;

16000000 rows updated.

MARTIN:dbm011> commit;

Commit complete.

Okay, now let’s redo the select again, and take stats before and after on storage index usage!

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

What we see here with the storage index session statistic is that the update statement didn’t use the storage index (which is obvious: we updated the data, which happened on the database layer, so we didn’t use a smartscan), because this number is the same as the last time we looked at it before the update statement.
When we executed the select query on the table with a filter on the id column again, there was no storage index usage either, because the storage index session statistic didn’t increase.

Actually, this query built new storage indexes. When this query is executed again, we can use these:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				   14534426624

(actually, it sometimes can take more than one execution for the storage indexes to be created again, my guess would be some heuristics are used to try to come up with the best candidates for storage indexes)
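
Because the session statistic is cumulative, what matters for each query is the difference between the value before and the value after it. A trivial sketch of that arithmetic, with a made-up function name:

```shell
# Hypothetical helper: bytes saved by storage indexes during one query,
# computed from the statistic value before ($1) and after ($2) the query.
si_saved() {
    echo $(( $2 - $1 ))
}
```

For the last run above, si_saved 5746368512 14534426624 prints 8788058112, while the run directly after the update gives 0.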

Let’s try another case: update a column in the table for which storage indexes are created on another column.
First make sure storage indexes are built:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

For completeness’ sake, I verified the storage indexes for this table by using the dump shown previously. A simple grep on the dump shows this query created storage indexes for only column 1, the id column:

grep -e RIDX_VALID -e 'Col id' /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_4.trc
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddac40) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddad1c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddadf8) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddaed4) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddb08c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
...etc.

Now let’s update another field:

MARTIN@dbm011 > update bigtab_nohcc set spcol = spcol + 1;

16000000 rows updated.

MARTIN@dbm011 > commit;

Commit complete.

And query the storage index use, and do our query with filter predicate again:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

So…this strongly suggests the update invalidated our storage index, even though the updated column was not in the storage index.
Just for completeness’ sake, let’s issue the select statement again to see if the storage index was built up again:

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    9808936960

Yes, there it is again: the previous statement didn’t benefit from the storage indexes, but built them, and now we take advantage of them again.

Conclusion.
I ran these tests on a quarter rack without any other usage. The method I used for measuring how storage indexes behave was to execute on the database layer, and see the result on the storage layer and in database layer session statistics. It’s important to realise that despite this being quite strong evidence, there isn’t definite proof on my conclusion. Also, the behaviour described can change in future storage server versions.

My conclusion is that the storage server uses a system change number to validate if the storage indexes are valid. If not, storage indexes are not used. If blocks covered by a storage index progress their system change number (in other words: are updated), the storage index is invalidated right away (during my tests, when the update starts, no or few storage indexes remain).

So, the ones that picked storage indexes being invalidated were right. As far as I could detect, there is no updating of storage indexes.
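
The behaviour concluded above can be pictured with a toy model. This is only a sketch of the observed behaviour, not of the storage server's actual implementation: the class `Region`, its fields and the single-column min/max summary are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    """One storage region with a toy storage index (min/max of one column)."""
    values: List[int]                         # column values stored in the region
    scn: int = 1                              # SCN of the last change to the region
    index: Optional[Tuple[int, int]] = None   # (min, max) once the index is built
    index_scn: int = 0                        # region SCN when the index was built

    def update(self) -> None:
        """A change progresses the SCN; the index is dropped, not maintained."""
        self.scn += 1
        self.index = None

    def scan(self, wanted: int) -> Tuple[int, bool]:
        """Return (matching rows, io_saved) for an equality predicate."""
        valid = self.index is not None and self.index_scn == self.scn
        if valid and not (self.index[0] <= wanted <= self.index[1]):
            return 0, True                    # index proves no match: IO is skipped
        # Index missing, invalid or inconclusive: read the region and (re)build it.
        self.index = (min(self.values), max(self.values))
        self.index_scn = self.scn
        return self.values.count(wanted), False


region = Region(values=[10, 20, 30])
first = region.scan(99)     # builds the index, nothing saved yet
second = region.scan(99)    # index is valid: IO saved
region.update()             # SCN progresses, index is invalidated
third = region.scan(99)     # no saving, but the index is rebuilt
fourth = region.scan(99)    # saving again
print(first, second, third, fourth)
```

The third and fourth scan mirror the two selects after the update: the first one saves no IO but rebuilds the index, the second one benefits from it again.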


Tagged: exadata, internals, oracle, performance, storage index

Oracle IO wait events: db file sequential read


(the details are investigated and specific to Oracle’s database implementation on Linux x86_64)

Exadata IO: This event is not used with Exadata storage, ‘cell single block physical read’ is used instead.
Parameters:
p1: file#
p2: block#
p3: blocks

Despite p3 listing the number of blocks, I haven’t seen a db file sequential read event that read more than one block ever. Of course this could change in a newer release.

Implementation:
One of the important things to realise here is that regardless of asynchronous IO settings (disk_asynch_io, filesystemio_options), Oracle always uses the pread() system call, so synchronous IO, for reading blocks which are covered by this event. Consider the purpose of fetching the single block in most cases: the process needs the contents of that one database block in order to continue processing, so issuing a synchronous IO call makes sense. This is also the reason the V$IOSTAT* views list both SMALL_READ_REQS and SMALL_SYNC_READ_REQS, and both SMALL_READ_SERVICETIME and SMALL_SYNC_READ_LATENCY: they make a distinction between SYNC (pread()) reads and non-sync (thus asynchronous) reads, which use the io_submit()-io_getevents() call combination.

IO done under the event ‘db file sequential read’ means a single block is read into the buffer cache in the SGA via the system call pread(). Regardless of physical IO speed, this wait is always recorded; in other words: there is a strict relation between the event and the physical IO. Just to be complete: if a block needed is already in the Oracle database buffer cache, no wait event is triggered and the block is read. This is called a logical IO. When the wait event ‘db file sequential read’ is shown, both a physical and a logical IO are executed.
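
To make the synchronous single block read concrete, here is a minimal sketch that reads one block with pread() and times the call, which is roughly what the wait event measures. The 8192 byte block size, the function name `read_single_block` and the temporary demo file are assumptions for illustration; this is not Oracle code, and without O_DIRECT the read goes through the Linux page cache.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # assumed database block size in bytes


def read_single_block(path, block_no, block_size=BLOCK_SIZE):
    """Synchronously read one block with pread(); return (data, elapsed seconds)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        # pread() blocks until the data is available: the caller cannot continue.
        data = os.pread(fd, block_size, block_no * block_size)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return data, elapsed


# Demo file of four blocks, each filled with its own block number.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for n in range(4):
        f.write(bytes([n]) * BLOCK_SIZE)
    demo_path = f.name

block, seconds = read_single_block(demo_path, 2)
os.unlink(demo_path)
print(len(block), block[0], f"{seconds * 1e6:.0f} us")
```

The file offset is simply block# times the block size, which matches the p2 parameter of the event for a single datafile.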

This event means a block is not found in the database buffer cache. It does not mean the block is really read from a physical disk. If DIO (direct IO) is not used, the block could very well be coming from the filesystem cache of Linux. DIO is not used when filesystemio_options is set to ‘none’ or ‘async’ on a filesystem. ASM (alias “Oracle managed raw devices”) inherently does direct IO, except when the ASM “disks” are on a filesystem (when ASM is used with NFS (!)); then filesystemio_options is obeyed. In fact, without DIO a phenomenon known as ‘double buffering’ takes place, which means the IO doesn’t go to its visible disk devices directly, but takes a mandatory step in between, done at the kernel level, which puts the data in the filesystem cache of Linux too. It should be obvious that this extra work comes at the cost of extra CPU cycles being used, and is in almost all cases unnecessary.

If you take a step back you should realise this event should take place for a limited number of blocks during execution. Because of the inherent single block IO nature of db file sequential read, every physical read (when it needs to read from a physical disk device) takes the IO latency penalty. Even with solid state disk devices, which have an inherently lower latency time because there are no rotating parts and disk heads to be moved, chopping up an operation into tiny parts when a full table scan or fast full index scan could be done means a lot of CPU time is used whilst it could be done more efficiently.

The time spent on ‘db file sequential read’ quite accurately times single block IO. This means a direct relationship should exist between ‘db file sequential read’ timings and counts and the IO statistics measured by the operating system (iostat, sar and more).


Tagged: internals, IO, latency, linux, oracle, performance

Using Ansible for executing Oracle DBA tasks.


This post may look like I am jumping on the bandwagon of IT orchestration, like a lot of people are doing. Maybe I should say ‘except for (die hard) Oracle DBAs’. Or maybe not, it’s up to you to decide.

Most people who are interested in IT in general will have noticed IT orchestration has gotten attention, especially in the form of Puppet and/or Chef. I _think_ IT orchestration has become important with the rise of “web scale” (scaling applications up and down by adding virtual machines to horizontally scale resource intensive tasks), in order to provision/configure the newly added machines without manual intervention, and people are now starting to use it for more tasks than provisioning virtual machines for web applications.

I am surprised by that. I am not surprised that people want boring tasks like making settings in configuration files and restarting daemons, installing software with all the correct options, etc. to be automated. Instead, I am surprised that people are only picking this up now, after it has been around for so long.

A little history.
As far as I know, IT orchestration started with cfengine, which was really a configuration engine (hence the name). Despite having a slightly different purpose (configuration vs. orchestration), this tool is the parent of all the orchestration/configuration tools which exist nowadays. cfengine started off as a study in 1993 by Mark Burgess at the University of Oslo, with the creation of the cfengine software as a result. As far as I can see, it has been available as open source software since the beginning.

Now back to what I am surprised at: with cfengine, there has been a way to configure linux/unix systems in a structured way, and install and configure software on systems since the mid-nineties. Yet, this wasn’t picked up (of course with a few exceptions). Fast forward to today, we see it is being adopted. And that is a good thing.

I created a setup with cfengine for a client a long time ago, which had the ability to install the Oracle software and different PSUs in different homes, and remove them, by adding machines to or removing machines from groups in a cfengine configuration file. It wasn’t picked up by the client; it’s way more fun running X to install the software, and make the choices by hand, and redo this over and over on every machine, right?

I had almost forgotten about my work with cfengine, until I spoke with Alex Gorbatchev at a conference, at which he pointed me to Ansible. At first I didn’t do a lot with it, but lately I’ve given it a go, and I am very happy with it.

Another redo of cfengine?
From what I read, most of the configuration/orchestration engines created after cfengine are created to circumvent all kinds of difficulties with cfengine. I can understand that. It took me a while to learn cfengine, and indeed it forces you to think in a different way.

The Ansible project decided to do it radically differently from all the other engines. It is different in the sense that it advertises itself as simple, agentless and powerful.

Simple.
Simple is a terrific goal. For those of you that have worked with configuration/orchestration engines: there is a steep learning curve. It is just hard to get the basic principles in your head. To be honest, Ansible took me a while too, to grasp the basic principles and get the picture correctly in my head. Yet, having worked with cfengine and comparing it with Ansible’s playbooks, which are the scripts to do things on the targets, it is a breath of fresh air. Playbooks are so clean they can (almost) be read and understood as plain English.

Agentless.
This is where Ansible is truly different from any of the other configuration/orchestration tools. Ansible does not require any agent installation on the targets. The obvious next question then is: how can this work? Well, quite simple: Ansible uses ssh to connect to the host, and executes commands via the shell. That said, it requires a little more detail: Ansible uses python on the remote host for its normal execution. However, you can use it without python, for example to set the host up for Ansible’s normal usage mode, which requires python and the simple-json module.

This is truly important, and makes it an excellent fit for my daily work as an IT consultant.

Powerful.
Ansible is powerful in the way that you can do the configuration and orchestration tasks in a simple clean way.

Summary on the introduction.
Above was a brief personal history, and some of the “marketed” features of Ansible. I think being agentless is the true “killer feature” here. All the other configuration/orchestration engines require you to set up and configure a fixed client-server connection, and install a daemon and a central server process. In case you wondered, yes, authentication is important, and it’s simply brilliant that the ssh password authentication or public key infrastructure can be used.

Because there’s no daemon to install, you can run the playbooks you created everywhere. So instead of a fixed client configuration, you can create playbooks to do routine tasks, and repeat them at multiple sites.

Okay, how does this work?

Installation: add EPEL and install ansible.
If you are on one of the clones of RedHat Enterprise Linux (I use Oracle Linux), you simply need to add the EPEL repository to your yum source list, and run:

# yum install ansible

First steps.
One of the first things I do, is create a directory for a typical ansible ‘project’. Project means a set of tasks you want to do to a set of hosts here. Next, I create a file called ‘hosts’ which is the list of hosts you want to use for executing tasks on. By default, Ansible looks in /etc/ansible/hosts. In this case, I put a single machine in it (a test VM), but it can be a list of machines (ip addresses or hostnames).

$ cat hosts
192.168.101.2

In fact, you can create groups in the hosts file in the “ini style”. But I just put one host in for this example.
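
To illustrate what “ini style” groups look like, here is a minimal sketch that parses a grouped hosts file into a dictionary. The group names `dbservers` and `appservers` and the extra addresses are hypothetical, and the parser is deliberately simplified: it is not Ansible's own inventory parser and ignores things like group variables and child groups.

```python
def parse_inventory(text):
    """Parse a minimal ini-style inventory into {group: [hosts]}."""
    groups = {"ungrouped": []}
    current = "ungrouped"
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()          # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]                     # a [group] header starts a group
            groups.setdefault(current, [])
        else:
            groups[current].append(line.split()[0])  # ignore per-host variables
    return groups


# Hypothetical hosts file with two groups; only 192.168.101.2 is from this post.
example = """
192.168.101.2

[dbservers]
192.168.101.10
192.168.101.11

[appservers]
192.168.101.20
"""

inventory = parse_inventory(example)
print(inventory)
```

With groups in place, a group name can be used instead of ‘all’ as the host pattern on the command line.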
The next thing is to check if Ansible reads the file correctly. This is done in the following way:

$ ansible all -i hosts --list-hosts
    192.168.101.2

Okay, this means Ansible will operate on this one host if invoked. The next logical thing (typically done when you are in a new client environment to check if you can reach the hosts):

$ ansible all -i hosts -m ping
192.168.101.2 | FAILED => FAILED: Authentication failed.

Ping might be a bit misleading for some people. What ping does here (-m means module) is try to connect to the host over ssh, and log in. Because I didn’t specify a user, it used the username of the current user on the machine, which is ‘ansible’. A user ‘ansible’ typically doesn’t exist on a normal server (it is not necessary, nor should it be created), and also not on my test server. So it failed, as the message said, on authentication.

My test VM is a basic installed (OL) linux 6 server. This means there’s only one user: root.

So, let’s specify the user root as user:

$ ansible all -i hosts -m ping -u root
192.168.101.2 | FAILED => FAILED: Authentication failed.

The authentication failed again. And it should! What this is doing is trying to log on as root, and we haven’t given any password, nor have I put my local user’s public key in the remote authorized_keys file. So there is no way this could work. This is typically also the state when you want to do stuff with a “fresh” client system. Let’s add the ‘-k’ option (ask ssh password), and run again:

$ ansible all -i hosts -m ping -u root -k
SSH password:
192.168.101.2 | success >> {
    "changed": false,
    "ping": "pong"
}

To walk you through the output: it now asks for a password, which I’ve filled out, then lists the host and the state: success. During this execution, nothing was changed on the remote host, and the ping command resulted in a pong (like the ICMP ping response).

With what we have learned now, we can do things like this:

$ ansible all -i hosts -u root -k -a "ifconfig"
SSH password:
192.168.101.2 | success | rc=0 >>
eth0      Link encap:Ethernet  HWaddr 00:0C:29:14:65:ED
          inet addr:192.168.39.145  Bcast:192.168.39.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe14:65ed/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:47 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6293 (6.1 KiB)  TX bytes:2594 (2.5 KiB)

eth1      Link encap:Ethernet  HWaddr 00:0C:29:14:65:F7
          inet addr:192.168.101.2  Bcast:192.168.101.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe14:65f7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:188 errors:0 dropped:0 overruns:0 frame:0
          TX packets:112 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:142146 (138.8 KiB)  TX bytes:15545 (15.1 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Does this look familiar to you Exadata DBAs? Yes, this replicates some of the functionality of dcli (although dcli is aimed at executing simple tasks on a group of hosts, whilst Ansible is aimed at enterprise configuration and orchestration).

One step beyond! Playbooks.
Now let’s progress to playbooks. Playbooks are where the true strength of Ansible lies. A playbook allows you to specify tasks to execute on the remote hosts, create sequences of tasks, and make decisions for further execution based on the outcome of a task. Let me show you a simple playbook, and guide you through it:

---
- hosts: all
  gather_facts: no
  remote_user: root
  tasks:

  - name: upgrade all packages
    yum: name=* state=latest

  - name: install python-selinux
    yum: name=libselinux-python state=installed

  - name: add public key to authorized_key file of root
    authorized_key: user=root state=present key="{{ lookup('file','/home/ansible/.ssh/id_rsa.pub') }}"

As you can see, this is a playbook with three tasks: upgrade all packages, install libselinux-python and adding my (local) public key to the authorised key file of root (to allow passwordless access).

Line 1 shows three dashes, which means the start of a YAML document.
Line 2 starts with a single dash, which indicates a list. There is one dash at this indentation level, so it’s a list of one. The fields of this member are hosts, gather_facts, remote_user and tasks. Tasks has its own list (mind the indentation level, that is important). The fields are key/value pairs, with the separation indicated by the colon (:). The first field is ‘hosts’, with the value ‘all’. This means that all hosts in the hosts file are used for this playbook. I don’t think it’s hard to imagine how useful it can be to specify a group/kind of servers the playbook can run on. The next one is ‘gather_facts’. A normal playbook execution first gathers a lot of information from all the hosts it is going to run on before execution. These facts can be used during playbook execution; here gathering is switched off (‘no’). Next is ‘remote_user’. This indicates with which user ansible is going to log on, so we don’t have to specify it on the command line. Then we see ‘tasks’ to indicate the list of tasks to be executed on the hosts.

It’s easy to spot we have three tasks. What is extremely important is the indentation of this list (it’s got a dash, so it’s a list!). Name is not mandatory, but giving the tasks useful names makes the playbook easy to read, and these names are shown when the playbook is executed. The first task has the name ‘upgrade all packages’. The next field shows the key is ‘yum’, indicating it is making use of the yum module. This key has two arguments: name=*, which means ‘all packages’, and state=latest, which means we want all packages to be at the latest version. This means this task is the equivalent of ‘yum update’.

The second task is called ‘install python-selinux’. It makes use of the yum module again, and is self-explanatory: it installs the libselinux-python package. This package is necessary for Ansible to work on things that are protected by selinux on a host which has selinux enabled.

The next task is called ‘add public key to authorized_key file of root’. It is making use of the authorized_key module. This module requires a parameter ‘key’, for which we use the lookup function to look up the local (!) public key of the user with which I execute ansible, which is ‘ansible’. ‘state=present’ means we want this key to be present; ‘present’ is the default value, so it wasn’t necessary to put this in. Next is ‘user=root’: we want the public key to be added to the authorized_keys file of the user root.
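
The structure described above (a list of one play, holding key/value pairs, one of which is a list of tasks) can be written out as plain Python data. This is only a sketch of how the YAML maps to lists and dictionaries; the helper `module_args` is a hypothetical illustration of the key=value arguments and is not how Ansible itself parses them.

```python
import shlex

# The playbook as data: a list with one play, which holds a list of three tasks.
playbook = [
    {
        "hosts": "all",
        "gather_facts": False,          # YAML 'no' is a boolean false
        "remote_user": "root",
        "tasks": [
            {"name": "upgrade all packages",
             "yum": "name=* state=latest"},
            {"name": "install python-selinux",
             "yum": "name=libselinux-python state=installed"},
            {"name": "add public key to authorized_key file of root",
             "authorized_key": "user=root state=present "
                               "key=\"{{ lookup('file','/home/ansible/.ssh/id_rsa.pub') }}\""},
        ],
    }
]


def module_args(arg_string):
    """Split 'key=value key=value' module arguments into a dict."""
    return dict(pair.split("=", 1) for pair in shlex.split(arg_string))


play = playbook[0]
task_names = [task["name"] for task in play["tasks"]]
yum_args = module_args(play["tasks"][0]["yum"])
print(len(playbook), sorted(play), task_names, yum_args)
```

Seeing it this way shows why indentation matters: it decides whether a key belongs to the play or to one of its tasks.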

Of course these tasks could be executed using the ‘ansible’ executable as single tasks. To show the importance of the installation of the libselinux-python package on a host with selinux enabled (which is the state of selinux on a freshly installed Oracle Linux machine), let’s execute the task using the authorized_key module:

$ ansible all -i hosts -k -u root -m authorized_key -a "user=root state=present key=\"{{ lookup('file','/home/ansible/.ssh/id_rsa.pub') }}\""
SSH password:
192.168.101.2 | FAILED >> {
    "failed": true,
    "msg": "Aborting, target uses selinux but python bindings (libselinux-python) aren't installed!"
}

Clear, right? The host is selinux protected. Now, let’s execute the installation of the libselinux-python package as a single task, and then add our public key to the authorized_keys file of root:

$ ansible all -i hosts -k -u root -m yum -a "name=libselinux-python state=installed"
SSH password:
192.168.101.2 | success >> {
    "changed": true,
    "msg": "",
    "rc": 0,
    "results": [
        "Loaded plugins: security\nSetting up Install Process\nResolving Dependencies\n--> Running transaction check\n---> Package libselinux-python.x86_64 0:2.0.94-5.3.el6_4.1 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package             Arch     Version                 Repository           Size\n================================================================================\nInstalling:\n libselinux-python   x86_64   2.0.94-5.3.el6_4.1      public_ol6_latest   201 k\n\nTransaction Summary\n================================================================================\nInstall       1 Package(s)\n\nTotal download size: 201 k\nInstalled size: 653 k\nDownloading Packages:\nRunning rpm_check_debug\nRunning Transaction Test\nTransaction Test Succeeded\nRunning Transaction\n\r  Installing : libselinux-python-2.0.94-5.3.el6_4.1.x86_64                  1/1 \n\r  Verifying  : libselinux-python-2.0.94-5.3.el6_4.1.x86_64                  1/1 \n\nInstalled:\n  libselinux-python.x86_64 0:2.0.94-5.3.el6_4.1                                 \n\nComplete!\n"
    ]
}

$ ansible all -i hosts -k -u root -m authorized_key -a "user=root state=present key=\"{{ lookup('file','/home/ansible/.ssh/id_rsa.pub') }}\""
SSH password:
192.168.101.2 | success >> {
    "changed": true,
    "key": "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAliR905hxLnsOCRlOGnmN0H9dGH4NPV88ySC6GMv0KNnU7FfCXYE51Bkk97p2IWFsPhYO9qJDyAFxRm/lia1IZRDpCFcKKKMh5eXmEJC5XSrHWFdmGZRlFcS3VQ3rpCIyU3qFM6xMazh3JHKKEtE1J6nvw/hW3slY9G/6VoJ8CzpfeQMLDOdVXUIcZXqtCPuIEDBQ7yjfMzTGz+hEmz7ImbLaUyB4MDGrDnl33L8mkBEVYu8RrwgBcagDQSiQKnIca/EL45eX/74NG1e/6vxZkHZJz/W0ak4KD+o9vF4ikz0bdrGPMZ5gRYXWoSSHrVA+Rqk8A93qBXNKUUkzGoQYTQ== ansible@ansiblevm.local",
    "key_options": null,
    "keyfile": "/root/.ssh/authorized_keys",
    "manage_dir": true,
    "path": null,
    "state": "present",
    "unique": false,
    "user": "root"
}

Maybe your customer doesn’t want you to store your keys on their servers. It’s easy to do the reverse, and remove your key from the authorized_keys file:

$ ansible all -i hosts -u root -m authorized_key -a "user=root state=absent key=\"{{ lookup('file','/home/ansible/.ssh/id_rsa.pub') }}\""
192.168.101.2 | success >> {
    "changed": true,
    "key": "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAliR905hxLnsOCRlOGnmN0H9dGH4NPV88ySC6GMv0KNnU7FfCXYE51Bkk97p2IWFsPhYO9qJDyAFxRm/lia1IZRDpCFcKKKMh5eXmEJC5XSrHWFdmGZRlFcS3VQ3rpCIyU3qFM6xMazh3JHKKEtE1J6nvw/hW3slY9G/6VoJ8CzpfeQMLDOdVXUIcZXqtCPuIEDBQ7yjfMzTGz+hEmz7ImbLaUyB4MDGrDnl33L8mkBEVYu8RrwgBcagDQSiQKnIca/EL45eX/74NG1e/6vxZkHZJz/W0ak4KD+o9vF4ikz0bdrGPMZ5gRYXWoSSHrVA+Rqk8A93qBXNKUUkzGoQYTQ== ansible@ansiblevm.local",
    "key_options": null,
    "keyfile": "/root/.ssh/authorized_keys",
    "manage_dir": true,
    "path": null,
    "state": "absent",
    "unique": false,
    "user": "root"
}

Please mind I didn’t specify ‘-k’ on the command line to send a password: in the previous step we added our key, so we can access our host using our public key. Another extremely important thing is ‘changed’. ‘changed’ indicates if the task did actually change something on the destination server.
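
The meaning of ‘changed’ can be sketched with a small function that only reports a change when it really had to modify something. The function `ensure_key` and the placeholder key string are hypothetical; this illustrates the idea, it is not the authorized_key module's code.

```python
def ensure_key(authorized_keys, key, state="present"):
    """Return (new_key_list, changed) for the requested state of one key."""
    present = key in authorized_keys
    if state == "present" and not present:
        return authorized_keys + [key], True
    if state == "absent" and present:
        return [k for k in authorized_keys if k != key], True
    # The requested state is already there: nothing to do, nothing changed.
    return list(authorized_keys), False


my_key = "ssh-rsa PLACEHOLDER ansible@ansiblevm.local"  # placeholder, not a real key

keys, changed_first = ensure_key([], my_key, "present")     # key gets added
keys, changed_second = ensure_key(keys, my_key, "present")  # already there
keys, changed_third = ensure_key(keys, my_key, "absent")    # key gets removed
print(changed_first, changed_second, changed_third, keys)
```

Running the same task twice is therefore safe: the second run reports ‘changed’ as false and leaves the file alone.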

I have run single tasks until now. I changed the state of my test VM back to its state before I started changing it with ansible, by removing the libselinux-python package using ‘ansible all -i hosts -k -u root -m yum -a "name=libselinux-python state=absent"’.

Let’s run the above described playbook:

$ ansible-playbook -i hosts -k linux_setup_example.yml
 [WARNING]: The version of gmp you have installed has a known issue regarding
timing vulnerabilities when used with pycrypto. If possible, you should update
it (ie. yum update gmp).

SSH password:

PLAY [all] ********************************************************************

TASK: [upgrade all packages] **************************************************
changed: [192.168.101.2]

TASK: [install python-selinux] ************************************************
changed: [192.168.101.2]

TASK: [add public key to authorized_key file of root] *************************
changed: [192.168.101.2]

PLAY RECAP ********************************************************************
192.168.101.2              : ok=3    changed=3    unreachable=0    failed=0

Now at this point you might think: I get it, but these are all pretty simple tasks, it’s not special at all. Well, let me show you an actual task which shows the importance of using this, even on a single machine, but even more when you have a large group of servers to administer.

The next example is a playbook created to apply PSU3 to an Oracle 11.2.0.4 home. It’s still quite simple, it just applies PSU3 to the Oracle home, but totally automatically. The point I am trying to make is that having this automated is already nice for a single home, but with many homes it saves a lot of hours (read: a lot of money), and saves you from human error.

---
- hosts: all
  vars:
    u01_size_gb: 1
    tmp_size_gb: 1
    oracle_base: /u01/app/oracle
    oracle_home: /u01/app/oracle/product/11.2.0.4/dbhome_1
    patch_dir: /u01/install
  remote_user: oracle
  tasks:

  - name: check u01 free disk space
    action: shell df -P /u01 | awk 'END { print $4 }'
    register: u01size
    failed_when: u01size.stdout|int < {{ u01_size_gb }} * 1024 * 1024

  - name: check tmp free disk space
    action: shell df -P /tmp | awk 'END { print $4 }'
    register: tmpsize
    failed_when: tmpsize.stdout|int < {{ tmp_size_gb }} * 1024 * 1024

  - name: create directory for installation files
    action: file dest={{ patch_dir }} state=directory owner=oracle group=oinstall

  - name: copy opatch and psu
    copy: src=files/{{ item }} dest={{ patch_dir }} owner=oracle group=oinstall mode=0644
    with_items:
     - p6880880_112000_Linux-x86-64.zip
     - p18522509_112040_Linux-x86-64.zip
     - ocm.rsp

  - name: install opatch in database home
    action: shell unzip -oq {{ patch_dir }}/p6880880_112000_Linux-x86-64.zip -d {{ oracle_home }}

  - name: unzip psu patch
    action: shell unzip -oq {{ patch_dir }}/p18522509_112040_Linux-x86-64.zip -d {{ patch_dir }}

  - name: patch conflict detection
    action: shell export ORACLE_HOME={{ oracle_home }}; cd {{ patch_dir }}/18522509; $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    register: conflict_detection
    failed_when: "'Prereq \"checkConflictAgainstOHWithDetail\" passed.' not in conflict_detection.stdout"

  - name: apply psu
    action: shell export ORACLE_HOME={{ oracle_home}}; cd {{ patch_dir }}/18522509; $ORACLE_HOME/OPatch/opatch apply -silent -ocmrf {{ patch_dir }}/ocm.rsp
    register: apply_psu
    failed_when: "'Composite patch 18522509 successfully applied.' not in apply_psu.stdout"

  - name: clean up install directory
    file: path={{ patch_dir }} state=absent

Let me run you through this playbook! It starts off with the indication of a YAML document: ‘---’. Next is hosts: all again. I just put all the hosts in the hosts file; I did not create all kinds of groups of hosts (which would be fitting when you use it in a fixed environment, but I use it for various customers). Then vars, with a list of variables. As you can see, I can use the variables in the playbook as {{ variable }}. Then remote_user: oracle and tasks.

The first and second task use variables, and use the argument ‘register’ to save the complete response in a named variable. I also use ‘failed_when’ to make the playbook stop executing when the expression after ‘failed_when’ is true. The expression uses the named variable, of which the standard output is taken (.stdout). Then a filter is used to cast the output to an integer, which is compared with a calculation on the variable.
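
The arithmetic behind these two ‘failed_when’ checks can be sketched as follows. The function name and the sample df values are hypothetical; the sketch assumes, like the playbook does, that df reports the available space in 1024 byte blocks, so a size in GB has to be multiplied by 1024 * 1024.

```python
def free_space_check_fails(df_stdout, required_gb):
    """Mirror of: failed_when: stdout|int < required_gb * 1024 * 1024."""
    available_kb = int(df_stdout.strip())        # the '|int' filter
    required_kb = required_gb * 1024 * 1024      # GB expressed in KB
    return available_kb < required_kb


# Hypothetical df outputs (available KB), checked against the 1 GB from vars.
too_small = free_space_check_fails("524288\n", 1)    # 0.5 GB free
big_enough = free_space_check_fails("2097152\n", 1)  # 2 GB free
print(too_small, big_enough)
```

When the check returns true the task is marked as failed, and the playbook stops for that host before anything is copied or patched.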

The third task is using the file module to create a directory. The fourth task is using the copy module. The copy module copies a file, or files in this case, from the machine from which the playbook is run onto the destination host or hosts. Another trick is used here, to process the task with a list of items. As you can see, the copy line contains a variable {{ item }}, and the task is executed for all the items in the list ‘with_items’. I found this is fine for smaller files (up to a few hundred megabytes), but too slow for bigger files. For those I use http (the get_url module) to speed up file transfer.

The fifth and sixth tasks execute a shell command, unzip, to extract the contents of a zip file into a specific place.

The seventh task is executing a small list of shell commands, in order to be able to run the conflict detection option of opatch. The same trick as with the first two tasks is used: register a name for the output of the conflict detection. Here I check if the stdout contains what I would check for if I ran it by hand. The eighth task is the main task of the whole playbook: the actual patch. It uses the same technique as task seven. The last task simply removes a directory, in order to remove the files we used for this patch.

Summary
I hope this shows what a tremendous help Ansible can be for a consultant. This kind of tool is simply mandatory if you have an environment with more than approximately ten to twenty servers to administer. Ansible can be used even if the organisation does not want to spend time on the implementation of a configuration tool.


Tagged: ansible, configuration, installation, linux, orchestration

How to setup git as a daemon


This is a quick post on using git on a server. I use my Synology NAS as a fileserver, but also as a git repository server. The default git package for Synology enables git usage on the command line, which means via ssh, or via WebDAV. Both require a logon to do anything with the repository. That is not very handy if you want to clone and pull from the repository in an automated way. Of course there are ways around that (basically setting up password-less authentication, probably via certificates), but I wanted simple, read-only access without authentication. If you install git on a linux or unix server you get the binaries, but no daemon, which means you can only use ssh if you want to use that server for central git repositories.

Running git via inetd
What I did was use the inetd daemon to launch the git daemon. On any linux or unix server with the inetd daemon, and on Synology too, because it uses linux under the covers, it’s easy to set up git as a server.

First, check /etc/services for the following lines:

git               9418/tcp                   # git pack transfer service
git               9418/udp                   # git pack transfer service

Next, add the following line in the inetd.conf (which is /etc/inetd.conf on my synology):

git stream tcp nowait gituser /usr/bin/git git daemon --inetd --verbose --export-all --base-path=/volume1/homes/gituser

What you should look for in your setup is:
- gituser: this is the user which is used to run the daemon. I created a user ‘gituser’ for this.
- /usr/bin/git: of course your git binary should be at that fully specified path, otherwise inetd can’t find it.
- git daemon:
-- --inetd: notify the git executable that it is running under inetd
-- --export-all: all git repositories underneath the base directory will be available
-- --base-path: this makes the git root directory be set to this directory. In my case, I wanted to have all the repositories in the home directory of the gituser, which is /volume1/homes/gituser.

And make the inetd daemon reload its configuration by sending it a HUP signal:

# killall -HUP inetd

Please mind this is a simple and limited setup. If you want to set it up with more granular security, you should look into gitolite, for example.
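
Because of --base-path, a client addresses a repository by its path relative to the base directory. The sketch below shows that mapping; the host name `nas.local` and the repository `scripts.git` are hypothetical, and the function `clone_url` is only an illustration.

```python
import posixpath

GIT_PORT = 9418  # the port listed for git in /etc/services


def clone_url(host, base_path, repo_path, port=GIT_PORT):
    """Map a repository directory below base_path to its git:// URL."""
    rel = posixpath.relpath(repo_path, base_path)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("repository is not below the base path")
    netloc = host if port == GIT_PORT else f"{host}:{port}"
    return f"git://{netloc}/{rel}"


# Hypothetical repository in the home directory of gituser.
url = clone_url("nas.local", "/volume1/homes/gituser",
                "/volume1/homes/gituser/scripts.git")
print(url)  # git://nas.local/scripts.git
```

The resulting URL is what would be passed to git clone; no username or password is involved, which is the point of this setup.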


Tagged: daemon, git, howto, linux, synology

Physical IO on Linux


I posted a fair amount of stuff on how Oracle is generating IOs, and especially large IOs, meaning more than one Oracle block, so > 8KB. This is typically what is happening when the Oracle database is executing a row source which does a full segment scan. Let’s start off with a quiz: what do you think is the maximum IO size the Oracle engine is capable of requesting of the operating system (so the IO size as can be seen at the SCI (system call interface) layer)? If you made up your answer, remember it, and read on!

The real intention of this blogpost is to describe what is going on in the Oracle database kernel, but also what is being done in the Linux kernel. Being a performance specialised Oracle DBA means you have to understand what the operating system does. I often see that it’s of the utmost importance to understand how an IO ends up as a request at the NAS or SAN head, so you understand what a storage admin is talking about.

Many people (including myself in the past) would state that the maximum IO size on Linux is 1MB. For the Linux 2.6 kernel and higher this statement is incorrect because there is no such thing as a single maximum IO size on Linux. There used to be one in the Linux 2.4 era, which was set with the maxphys parameter, but that time is long gone. In order to find out what it is now, let’s test and see!

First let’s get a Linux system and a big table!
The system I got is a VMware Fusion VM, running Linux 3.8.13-44.1.5.el6uek.x86_64 (UEK3) on Oracle Linux 6u6. On top of that I am using the Oracle database and grid infrastructure version 12.1.0.2. This system is using udev for providing disk access (as opposed to asmlib).
The redundancy mode of ASM is external, although for reading (what I will be covering) this doesn’t matter.

Now let’s get a normal database session, and use a combination of sql_trace with waits (10046/8) and strace to see how the Oracle database interfaces with the kernel. Please mind I’ve prepared a heap table with no indexes on it, so a count(*) on it always will result in a full table scan. Also, the buffercache is sized small enough (or the table is created large enough, it depends on how you look at it) to have the session make the decision to do a direct path read, instead of a buffered read. If you don’t know what that means: please search this blog on direct path reads, or even better, download my presentation ‘about multiblock reads’.

The direct path read decision is visible via the ‘direct path read’ wait event. If you get a full table scan operation and see ‘db file scattered read’ waits, you are doing a buffered scan.

In most cases, you will get a maximum value of 1MB if possible, which seems to support the generally assumed 1MB maximum operating system induced IO size. Why? Well, because you probably set the DB_FILE_MULTIBLOCK_READ_COUNT parameter to 128, which means you have explicitly set the Oracle process not to do IO with a size more than 1MB (8192*128). In this blogpost, I explain that Oracle can request IOs bigger than 1MB.

In the blogpost series on extra huge database IOs, I show that Oracle can do huge (1MB+) IOs, but the physical request size (what Oracle actually requests at the SCI layer, visible with the pread/pwrite/io_submit/io_getevents functions) still is 1MB. This limit is imposed by the physical storage structure which the database uses with ASM, called allocation unit (often called ‘AU’). The default size of an allocation unit is 1MB. The allocation unit can be seen in both the database and the ASM instance with the following query:

SYS@+ASM AS SYSASM> select name, allocation_unit_size from v$asm_diskgroup;

NAME			       ALLOCATION_UNIT_SIZE
------------------------------ --------------------
DATA					    1048576

How about doing an unbuffered read on a filesystem? I’ve created a database on an (XFS, but this doesn’t matter AFAIK) filesystem, and tried to set DB_FILE_MULTIBLOCK_READ_COUNT to its maximum value. I’ve done this by setting DB_FILE_MULTIBLOCK_READ_COUNT to 10000 (ten thousand), and then bouncing the database to see what the number has become. In my case, the value became 4096. I think this is the limit for Oracle 12.1.0.2 on Linux x86_64, but I would love to hear if you have gotten different results:

I set 10000:

SYS@fv12102 AS SYSDBA> select name, value from v$spparameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   10000

But Oracle limits this to 4096:

SYS@fv12102 AS SYSDBA> select name, value from v$parameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   4096

Okay. Let’s start our investigation at that point: a database which is set up with a DB_FILE_MULTIBLOCK_READ_COUNT set to 4096, alias 32MB (with a block size of 8KB), and a table which got extents large enough to accommodate huge (32MB) IOs.
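The relation between DB_FILE_MULTIBLOCK_READ_COUNT and the largest IO a session can request is plain multiplication. A minimal sketch, assuming the 8KB block size used in this post (the function name is made up for illustration):

```python
def max_multiblock_io_bytes(mbrc: int, block_size: int = 8192) -> int:
    """Largest single multiblock read in bytes for a given
    DB_FILE_MULTIBLOCK_READ_COUNT (mbrc) and database block size."""
    return mbrc * block_size


# 128 is the commonly used setting, 4096 is the value Oracle accepted above.
for mbrc in (128, 4096):
    size = max_multiblock_io_bytes(mbrc)
    print(f"MBRC {mbrc:>4} -> {size} bytes = {size // (1024 * 1024)} MB")
```

This only gives the upper bound the database will ask for; whether a single request of that size really reaches the operating system depends on extent sizes and, with ASM, on the allocation unit size.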

Fire up a regular database session, and enable sql trace at level 8:

$ sqlplus ts/ts@//localhost/v11204
...
SQL> alter session set events 'sql_trace level 8';

Now start another session as root on the database server, and find the PID of the server process of the sqlplus process we just created above. Issue strace with verbose writing setting:

# strace -e write=all -e all -p PID
Process PID attached - interrupt to quit
read(14,

Okay, we are set up and ready to go, but there is one additional thing: the way direct path reads work, they would probably give few waits with fast IO capabilities. One way to get the waits back is to limit the IO capabilities of the process. Doing so is documented in this article.

Now issue the full table scan on a large table in sqlplus while strace is attached:

SQL> select count(*) from bigtab;

Now take a peek at the strace output!
The output first shows IOs as we expect:

io_getevents(139717184229376, 1, 128, {{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1
times(NULL)                             = 431386800
write(7, "\n*** 2014-11-24 13:09:28.028\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  34 2d 31 31 2d 32 34 20  .*** 201 4-11-24  |
 | 00010  31 33 3a 30 39 3a 32 38  2e 30 32 38 0a           13:09:28 .028.    |
lseek(7, 0, SEEK_CUR)                   = 31181
write(7, "WAIT #139717129509840: nam='dire"..., 130) = 130
 | 00000  57 41 49 54 20 23 31 33  39 37 31 37 31 32 39 35  WAIT #13 97171295 |
 | 00010  30 39 38 34 30 3a 20 6e  61 6d 3d 27 64 69 72 65  09840: n am='dire |
 | 00020  63 74 20 70 61 74 68 20  72 65 61 64 27 20 65 6c  ct path  read' el |
 | 00030  61 3d 20 33 39 30 37 33  30 20 66 69 6c 65 20 6e  a= 39073 0 file n |
 | 00040  75 6d 62 65 72 3d 34 20  66 69 72 73 74 20 64 62  umber=4  first db |
 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |
 | 00070  37 34 20 74 69 6d 3d 31  39 32 30 30 37 31 30 31  74 tim=1 92007101 |
 | 00080  39 39                                             99                |

What is visible here, is first the reap of an I/O request (with asynchronous IO on Linux this is typically the io_getevents() call). If you take a close look at the arguments of the io_getevents() call (taken from the manpage of io_getevents):

int io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);

And then focus on the struct io_event:

struct io_event {
         __u64           data;           /* the data field from the iocb */
         __u64           obj;            /* what iocb this event came from */
         __s64           res;            /* result code for this event */
         __s64           res2;           /* secondary result */
};

The above description is taken from the annotated Linux kernel source, as available here: http://lxr.free-electrons.com/source/include/uapi/linux/aio_abi.h#L58 I use this site for navigating the Linux kernel source. What is important, is that the third field (io_event.res) contains the size of the IO request. Having learned this, now look again in the io_getevents call. The size of the IO reaped above is 33554432, which is 33554432/1024/1024=32 MB. Yes, that’s a single IO of 32MB! Also, this is consistent with the wait line a little lower:

 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |

Block count = 4096 * 8192 (block size) = 33554432
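Picking the IO size out of the strace output by eye gets tedious with a long trace. A small sketch that extracts io_event.res from io_getevents lines, assuming strace renders each reaped event as {data, obj, res, res2} as in the output above (the regular expression and function name are illustrative, not part of any tool):

```python
import re
from typing import List

# One reaped event as strace prints it: {data, obj, res, res2}.
# The trailing timeout struct {600, 0} has only two fields and does not match.
EVENT_RE = re.compile(r"\{(0x[0-9a-f]+|\d+), (0x[0-9a-f]+|\d+), (-?\d+), (-?\d+)\}")


def reaped_io_sizes(strace_line: str) -> List[int]:
    """Return io_event.res (the IO result size in bytes) for every event
    in one io_getevents() line; other lines give an empty list."""
    if not strace_line.startswith("io_getevents("):
        return []
    return [int(m.group(3)) for m in EVENT_RE.finditer(strace_line)]


line = ("io_getevents(139717184229376, 1, 128, "
        "{{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1")

for size in reaped_io_sizes(line):
    print(size, "bytes =", size // 8192, "blocks of 8KB")
```

Dividing the result by the block size gives the block count that should show up as 'block cnt' in the matching wait line.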

So, I wonder what you thought was possible. The correct answer on my operating system (Linux x86_64) with Oracle 12.1.0.2 is 32MB. It turned out the big IOs in the ASM case were limited by the allocation unit size of 1MB.

The next thing I’ve wondered is how this matches with the maximum IO size of the disk devices as visible by the Operating System. You can request 32MB, but a normal SCSI disk doesn’t do 32MB IOs. Of course in my case the SCSI disk really is a VMWare virtual disk device.

Let’s keep the 32MB IO in mind, now dive from the top layer, the SCI (system call interface) where an IO enters the kernel to the bottom of the kernel from an IO perspective, to the block device. The block device settings are found in /sys/block/DEVICE/queue. The maximum IO size the device is capable of is found in max_hw_sectors_kb. This is in kilobytes, and read only (can’t change hardware, right?). In my case this is:

[root@bigmachine queue]# cat max_hw_sectors_kb
4096

My disk supports a maximum of 4M for an IO size! But this is not what is used, the actual setting is in max_sectors_kb:

[root@bigmachine queue]# cat max_sectors_kb
512

That’s half a megabyte!
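The two limits can also be read programmatically. A minimal sketch, assuming the sysfs layout described above; the device name 'sdb' and the throwaway directory are only there so the example runs anywhere, on a real system the default root /sys/block would be used:

```python
import tempfile
from pathlib import Path


def queue_limits_kb(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Read the hardware limit and the effective limit (both in KB)
    for the size of a single IO request of a block device."""
    queue = Path(sysfs_root) / device / "queue"
    return {
        name: int((queue / name).read_text().strip())
        for name in ("max_hw_sectors_kb", "max_sectors_kb")
    }


# Demo against a temporary directory that mimics sysfs, filled with the
# values shown in this post. 'sdb' is an assumed device name.
with tempfile.TemporaryDirectory() as fake_sysfs:
    queue_dir = Path(fake_sysfs) / "sdb" / "queue"
    queue_dir.mkdir(parents=True)
    (queue_dir / "max_hw_sectors_kb").write_text("4096\n")
    (queue_dir / "max_sectors_kb").write_text("512\n")
    demo_limits = queue_limits_kb("sdb", fake_sysfs)

print(demo_limits)
```

On a Linux machine, calling queue_limits_kb with an existing device name and no second argument reads the real values.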

So…we got (up to) 32MB sized IO requests coming in, and a device that is set to 512KB IOs. This means that somewhere between the SCI and the device, there is a mechanism to scatter the request size to the device’s maximum IO size, and once the IO requests are done, going back to gather the IO results to the original request.
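To get a feeling for the numbers, here is a rough sketch of the scatter step as pure arithmetic. It assumes the request is cut into consecutive pieces of exactly the device limit; the real kernel code also has to respect alignment and segment limits, so treat this as an idealised model:

```python
def split_request(offset: int, length: int, max_io_bytes: int):
    """Split one large request into consecutive (offset, length) pieces
    of at most max_io_bytes each."""
    pieces = []
    end = offset + length
    while offset < end:
        size = min(max_io_bytes, end - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


# One 32MB request against a device limited to 512KB per IO.
pieces = split_request(0, 32 * 1024 * 1024, 512 * 1024)
print(len(pieces), "device-sized IOs, first:", pieces[0], "last:", pieces[-1])
```

In this model a single 32MB request turns into 64 IOs of 512KB, all of which have to complete before the original request can be reaped.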

There are a couple of layers in the Linux kernel through which the call travels (including common functions):

-SCI/system call interface: system_call, sys_io_submit…. (io_submit, do_io_submit, io_submit_one; these seem to be in the VFS layer)
-VFS/virtual filesystem: aio_run_iocb, do_aio_read, xfs_file_read_iter, generic_file_read_iter, xfs_vm_direct_IO, bio_*, kiocb_batch_refill
-Block layer: blk_finish_plug, blk_flush_plug_list, queue_unplugged, __blk_run_queue, blk_run_queue
-SCSI layer: scsi_*
-Device driver: mptspi_qcmd, mptscsih_qcmd, mpt_put_msg_frame

(note: there seems to be consensus the above mentioned layers exist, although there is different wording and different numbers by different sources. Also, there doesn’t seem to be a very clear description of what is done by which layer, and what typically defines a kernel layer. For some functions it is clear they belong to a certain layer (for example aio_run_iocb in Linux/fs/aio.c, blk_finish_plug in Linux/block/blk-core.c, etc.), for some layers, like the SCI layer, it seems there isn’t a clear layer definition by looking at where the function is defined. Also please mind the SCSI layer is implemented as a driver, just like the actual device driver for the hardware. This is very understandable, but makes it a bit harder to see it in a layered way)

System Call Interface (SCI)
The request enters kernel space via the SCI. The function of the SCI is to elevate a process to system priority to perform a kernel mode task, like (but not limited to) doing I/O. The system call implementation on Linux makes use of a wrapper function in glibc, which executes the system call on behalf of the user systemcall request. The reason for mentioning this, is that sometimes the glibc wrapper “hides” the real system call, for example calling the semtimedop() function:

(gdb) break semtimedop
Breakpoint 1 at 0x3bb38eb090: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) c
Continuing.

Breakpoint 1, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Above is a gdb (GNU debugger) session which attaches to an Oracle background process, which I know is sleeping in the system call semtimedop() when idle. A breakpoint is set on the semtimedop function, and the execution of the attached process is resumed. It then breaks on the function, showing the source code at which the break happened. Instead of showing the actual semtimedop function, it shows the pseudo function in glibc which wraps this system call. This hides the arguments of calling the semtimedop() function. My current workaround is to read the CPU registers which “carry” the arguments (RDI, RSI, RDX, RCX, R8, R9 for the first 6 arguments in most cases).

Virtual File System (VFS)
The next layer is the virtual filesystem. Here we see functions specific to asynchronous IO or synchronous IO, and doing direct IO or not, and also actual filesystem specific functions (in my case XFS; when ext4 is used, you will see specific functions for that. I highly recommend XFS!). This layer also uses a structure called ‘request_queue’, which keeps track of the actual IO requests for a block device. Each individual request is a struct ‘request’, which contains one or more structs called ‘bio’ holding a description of the request; a ‘bio’ points to a structure called ‘bio_vec’, which points to pages for storing the disk request contents. This is all set up and created in kernel memory by the user process in system mode. It’s my assumption that the properties of the disk device (=maximum advertised IO size) are taken into account when the VFS filesystem implementation creates requests and all necessary structs and memory areas. Please mind it’s important that enough memory is available to set up the necessary structures, and enough CPU to make this happen. Also some of the crucial structures for doing IO (request, bio, bio_vec) seem to be set up in this layer. An IO can’t be done without a memory area for the IO request to hold the data for sending it to the device (alias a write), or a memory area for the IO request to hold the data which is fetched from the device (alias a read).

The funny thing is that when you use ASM (the simple version 11.2 ASM with a local ASM instance and local disk devices), you will still see some functions of the VFS layer, because you use a disk device which is opened using the local filesystem. Examples of these functions are: aio_run_iocb, do_aio_read.

Block Layer
The next layer is the block layer. Here the request queue is handled, and I/O scheduling is done. Oracle advises the deadline scheduler in all cases. The scheduler works by plugging a request queue, much like a plug in your bathtub, letting the requests enter the queue. Having multiple requests in a queue means it can be optimised by reordering the requests, and merging adjacent requests up to the device’s advertised maximum IO size. Once a request’s timeout expires, or the requesting process finishes submitting IO, the queue is unplugged.
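The reorder-and-merge idea can be shown with a toy model. This is a sketch under simplifying assumptions, not the deadline scheduler's algorithm: requests are plain (start sector, sector count) pairs, reordering is a sort, and two requests are merged only when they are exactly back-to-back and the result stays within the device limit:

```python
def merge_adjacent(requests, max_sectors):
    """Sort (start_sector, n_sectors) requests and merge neighbours that
    are contiguous on disk, as long as the merged request does not
    exceed max_sectors."""
    merged = []
    for start, count in sorted(requests):
        if merged:
            prev_start, prev_count = merged[-1]
            contiguous = prev_start + prev_count == start
            if contiguous and prev_count + count <= max_sectors:
                merged[-1] = (prev_start, prev_count + count)
                continue
        merged.append((start, count))
    return merged


# A 512KB device limit expressed in 512-byte sectors.
MAX_SECTORS = 1024
plugged_queue = [(2048, 8), (0, 512), (512, 512), (1024, 8), (2056, 8)]
print(merge_adjacent(plugged_queue, MAX_SECTORS))
```

The request at sector 1024 is contiguous with the merged request before it, but it stays separate because merging it would go over the limit.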

SCSI layer
The SCSI layer is responsible for communicating with SCSI devices to do IOs.

Device driver
The device driver layer is the layer that truly physically communicates with a device, and implements the device specific communication. In my case the functions start with mpt, which is the driver for LSI PCI adapters.

To see how IO flows through the block layer, there is a tool called blktrace. Actually this is a mini-suite of tools consisting of blktrace (tracing the IO requests through the block layer), blkparse (parsing the output of blktrace to make it human readable), btrace (a script to combine blktrace and blkparse), and btt (a blktrace output post processing tool), among others.

In order to use blktrace, the debug file system of the Linux kernel needs to be mounted. Here is how that is done:

# mount -t debugfs debugfs /sys/kernel/debug

If the kernel debugfs is not mounted, you get the following message:

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
Invalid debug path /sys/kernel/debug: 0/Success

I use blktrace in this article for looking at the IO requests to understand what is going on. The workflow for this use of blktrace is:
– create a trace file of the block flow using blktrace
– make the trace file human readable via blkparse or analyse via btt (block trace times)

Actually, you can parse the output of blktrace directly via blkparse using ‘blktrace -d DEVICE -o - | blkparse -i -’. To make that even simpler, the script ‘btrace’ is created, to do exactly that.

Here’s what that looks like (depending on the number of processes using it, the output can be huge, this is only a snippet):

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
...
  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       58     0.260672502  2421  G  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       59     0.260673231  2421  P   N [asm_gmon_+asm]
  8,16   0       60     0.260674895  2421  I  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       61     0.260675745  2421  U   N [asm_gmon_+asm] 1
  8,16   0       62     0.260677119  2421  D  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       63     0.260882884     0  C  WS 4088 + 8 [0]
...

What is shown here, is the typical flow of an IO in the block layer:
Q – Queue. A request starts off sending a notification on the intent to queue at the given location.
G – Get request. A struct request is allocated.
P – Plug. When the block device queue is empty, the queue is plugged in order to receive further IOs and have the ability to optimise (merge and/or reorder) them before the data is sent to the device.
I – Insert. A request is sent to the IO scheduler for addition to the internal queue and later service by the driver. The request is fully allocated at this time.
U – Unplug. The start of sending requests to the driver.
D – Driver. A request has been sent to the driver and removed from the queue.
C – Complete. A previously issued request to the driver has been completed.
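For longer traces it helps to turn the lines into records. A minimal parsing sketch, assuming the default btrace/blkparse column order shown in the snippet above (device, CPU, sequence, timestamp, PID, action, RWBS, then optionally sector + count); the Event type and function name are made up for illustration:

```python
from collections import namedtuple

Event = namedtuple("Event", "device cpu seq time pid action rwbs sector nsectors")


def parse_blkparse_line(line):
    """Parse one event line of default blkparse output.
    Returns None for lines that are not events (for example '...')."""
    f = line.split()
    if len(f) < 7 or "," not in f[0]:
        return None
    sector = nsectors = None
    # Events that describe an actual IO carry '<sector> + <count>'.
    if len(f) >= 10 and f[8] == "+":
        sector, nsectors = int(f[7]), int(f[9])
    return Event(f[0], int(f[1]), int(f[2]), float(f[3]), int(f[4]),
                 f[5], f[6], sector, nsectors)


trace = """\
  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       58     0.260672502  2421  G  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       59     0.260673231  2421  P   N [asm_gmon_+asm]
  8,16   0       60     0.260674895  2421  I  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       61     0.260675745  2421  U   N [asm_gmon_+asm] 1
  8,16   0       62     0.260677119  2421  D  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       63     0.260882884     0  C  WS 4088 + 8 [0]
"""

events = [e for e in map(parse_blkparse_line, trace.splitlines()) if e]
print("".join(e.action for e in events))
print(events[0].nsectors * 512, "bytes written at sector", events[0].sector)
```

The sector count is in units of 512 bytes, so the '+ 8' in the snippet is a 4KB write.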

The main point is here, that you can truly see how the IO requests flow through the block layer and are issued to the storage device, in other words, you can see how the block layer receives the IOs, and what is exactly submitted to the driver as request for the physical storage layer.

This is a microscopic view of the disk IOs. In most cases, when you want to gain information on block layer IO processing, another view on it is provided by processing blktrace output with btt. This is an example output of btt:

First capture IO events using blktrace:

[root@bigmachine ~]# blktrace -w 60 -d /dev/oracleasm/disk1 -o - | blkparse -d sdb.blkparse -i -

In this example I captured IOs for 60 seconds. You can exclude ‘-w 60’, and press interrupt (ctrl-c) when you deem IO recording is enough. This produces a binary file ‘sdb.blkparse’, which can be used with btt:

This is the first part, the flow through the block layer until IO completion:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.239795347   3.002829973         238
Q2G               0.000000001   0.159337842   3.011192142         264
G2I               0.000000679   0.000001724   0.000011618         264
I2D               0.000000764   0.000007633   0.000153436         264
D2C               0.000000001   0.103328167   3.012509148         233
Q2C               0.000000001   0.270961298   3.012516496         233

Note: time is in seconds (238 requests with an average Q2Q of 0.24 add up to roughly the 60 seconds of the capture).
Q2Q – Time between IO requests.
Q2G – Time it takes for a request struct to be allocated.
G2I – Time it takes for the request to be inserted in the device’s queue.
I2D – Time spent in the device queue waiting to be issued to the driver.
D2C – Time spent between issuing to the driver and completion of the request. This includes controller, storage. This is the same figure as the ‘svctm’ column with iostat -x.
Q2C – Total time spent in block layer and physical IO. This is the same figure as the ‘await’ column with iostat -x.
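The same phases can be computed by hand for a single IO. A small sketch using the timestamps of the asm_gmon write from the btrace snippet earlier in this post; the function name is illustrative, and the durations come out in the same unit as the trace timestamps:

```python
from decimal import Decimal


def phase_times(stamps):
    """Given one timestamp per blktrace action (Q, G, I, D, C),
    return the btt-style phase durations for that single IO."""
    phases = {"Q2G": ("Q", "G"), "G2I": ("G", "I"), "I2D": ("I", "D"),
              "D2C": ("D", "C"), "Q2C": ("Q", "C")}
    return {name: stamps[end] - stamps[begin]
            for name, (begin, end) in phases.items()}


# Timestamps taken from the btrace output shown earlier.
stamps = {action: Decimal(ts) for action, ts in {
    "Q": "0.260669503", "G": "0.260672502", "I": "0.260674895",
    "D": "0.260677119", "C": "0.260882884"}.items()}

for name, duration in phase_times(stamps).items():
    print(name, duration)
```

For this IO nearly all of Q2C is D2C, meaning the time was spent below the block layer, in the driver and the (virtual) device.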

The second part is the device overhead section:

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%
---------- | --------- --------- --------- --------- ---------
   Overall |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%

This is partly the same as the IO flow table above. This is expressed as a percentage of where the total time of the IO is spent.
Q2G – Request struct allocation.
G2I – Insertion in the device queue.
Q2M – Total time until merge.
I2D – Time spent in the queue until it was dispatched to the driver.
D2C – Time spent on doing the IO after submitting the request to the driver.


Tagged: blkparse, blktrace, btt, internals, IO, IO size, linux, oracle

Oracle database operating system memory allocation management for PGA


This post is about memory management on the operating system level of an Oracle database. The first question that might pop in your head is: isn’t this a solved problem? The answer is: yes, if you use Oracle’s AMM (Automatic Memory Management) feature, which lets you set a limit for the Oracle database’s two main memory areas: SGA and PGA. But in my opinion any serious, real life, usage of an Oracle database on Linux will be (severely) constrained in performance because of the lack of huge pages with AMM, and I personally witnessed very strange behaviour and process deaths with the AMM feature and high demand for memory.

This means that I strongly advise customers to use Oracle’s ASMM (Automatic Shared Memory Management) feature. In the newer versions of 11.2 I found this to be working very well. Earlier versions like 10.2 could suffer from an ever growing shared pool (which also means an ever shrinking buffer cache), especially when bind variables weren’t used. This still could happen, but it seems the SGA memory management feature in 11.2 handles this well in most cases. The ASMM feature means a fixed memory area is allocated for the SGA. SGA allocation has always been fixed outside of the AMM feature, as far as I know.

When ASMM doesn’t work, meaning the memory areas are getting sized wrong and performance is influenced by that, the last option is to size the memory areas yourself. However, since version 11.2.0.2 Oracle will resize when the memory manager thinks it’s feasible. See Kurt van Meerbeek’s article about that.

That leaves the PGA (Process Global Area) as a memory area on its own. Most databases are using the automatic PGA memory management, which is enabled once the PGA_AGGREGATE_TARGET parameter is set to a non zero value. A common misunderstanding is that this setting actually limits the overall PGA usage of an instance. The truth is automatic PGA memory management will make attempts to adhere to the PGA_AGGREGATE_TARGET value. These are the actual words in the official Oracle documentation: ‘attempts to adhere’!

This means sort memory, hash memory and bitmap memory will be actively limited in size per process by automatic PGA memory management, any attempt to allocate more than automatic PGA memory management allows will result in moving some contents of these memory areas to the assigned temporary tablespace of the database user, to make room for new data.

However, there are more memory areas allocatable per process, which are never swapped to disk, thus always will stay in memory, and these could not be limited in an officially supported way prior to Oracle version 12. Two structures which are allocated in PGA and never swapped to disk are PL/SQL collections and PL/SQL tables. Creating and filling these requires the usage of PL/SQL (hence their names); the reason for mentioning this is that if your database is only used with SQL and not with PL/SQL, you almost certainly will not run into the problem I describe below.

You might be thinking: wait a minute! Does this mean a developer can just create such a structure, and allocate whatever he/she likes, with all the consequences that it can have, like the operating system starting to swap, and can do that for every single process? Yes, this is what this means. This is why Oracle introduced a parameter called PGA_AGGREGATE_LIMIT with Oracle 12, to effectively limit the overall PGA heap size.

In case you wonder what this means, or even doubt my words, I have written a little program to demonstrate this behaviour.

This is the source code to create my test table T2:

exec dbms_random.seed('abracadabra');
create table t2
as
with generator as (
    select      rownum      id
    from        dual
    connect by
                rownum <= 1000
)
select
    rownum                                                id,
    trunc((rownum-1)/50)                            clustered,
    mod(rownum,20000)                               scattered,
    trunc(dbms_random.value(0,20000))               randomized,
    trunc(sysdate) + dbms_random.value(-180, 180)   random_date,
    dbms_random.string('l',6)                       random_string,
    lpad(rownum,10,0)                               vc_small,
    rpad('x',100,'x')                               vc_padding
from
    generator   g1,
    generator   g2
where
    rownum <= 1000000
;
exec dbms_stats.gather_table_stats(null,'T2');

This is a very smart way to generate a table. I actually borrowed this from Jonathan Lewis.

Next up, I created a small anonymous PL/SQL block to take the contents from the T2 table, and store them in a collection until I hit the limit in the variable ‘grow_until’.

declare
	type sourcetab is table of t2%ROWTYPE;
	c_tmp		sourcetab;
	c_def		sourcetab	:= sourcetab();
	v_b_p		number		:= 0;
	v_c_p		number		:= 0;
	v_b_u		number		:= 0;
	v_c_u		number		:= 0;
	grow_until	number		:= 700000000;
	p_a_t		number;
begin
	select value into v_b_p from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session pga memory max';
	select value into v_b_u from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session uga memory max';
	select value into p_a_t from v$parameter where name = 'pga_aggregate_target';
	select * bulk collect into c_tmp from t2;
	while v_c_p < grow_until loop
		for c in c_tmp.first .. c_tmp.last loop
			c_def.extend(1);
			c_def(c_def.last) := c_tmp(c);
			select value into v_c_p from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session pga memory max';
			select value into v_c_u from v$mystat m, v$statname n where m.statistic#=n.statistic# and name = 'session uga memory max';
			if v_c_p >= grow_until then
				exit;
			end if;
		end loop;
	end loop;
	dbms_output.put_line('vbp : '||v_b_p);
	dbms_output.put_line('vcp : '||v_c_p);
	dbms_output.put_line('vbu : '||v_b_u);
	dbms_output.put_line('vcu : '||v_c_u);
	dbms_output.put_line('pat : '||p_a_t);
end;
/

Please mind the session needs to have create table, create session granted, enough quota in the default tablespace and select on v_$mystat, v_$parameter and v_$statname granted.

This is run on an Oracle 12.1.0.2 database:

TS@v12102 > @pga_filler
vbp : 3535368
vcp : 700051976
vbu : 1103192
vcu : 4755704
pat : 524288000

PL/SQL procedure successfully completed.

The begin sizes of the UGA (vbu) and PGA (vbp) are 1’103’192 and 3’535’368. The PGA_AGGREGATE_TARGET size is set to 524’288’000 (500MB). I did set the grow_until variable to 700’000’000 (roughly 700MB), which is more than PGA_AGGREGATE_TARGET. After running this, it’s easy to spot the values of vcu (UGA allocation) and vcp (PGA allocation). vcu grew to 4’755’704 during the run, however vcp grew to 700’051’976, a little more than 700MB! This shows that the collection is stored in the PGA, and that the collection grew beyond the value set with PGA_AGGREGATE_TARGET.
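To put the overshoot in numbers, the values printed by the run can be compared directly. This is only arithmetic on the output above; the function name is made up for illustration:

```python
MB = 1024 * 1024


def overshoot_bytes(used: int, target: int) -> int:
    """How far one session's PGA went beyond the instance wide target."""
    return used - target


pga_aggregate_target = 524_288_000   # 'pat' in the output above
session_pga_max = 700_051_976        # 'vcp' in the output above

over = overshoot_bytes(session_pga_max, pga_aggregate_target)
print(f"target          : {pga_aggregate_target / MB:.0f} MB")
print(f"one session used: {session_pga_max / MB:.1f} MB")
print(f"over the target : {over} bytes ({over / MB:.1f} MB)")
```

Keep in mind this is a single session compared against a target that is meant for the whole instance.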

This behaviour is consistent in versions 12.1.0.1, 11.2.0.4, 11.2.0.3, 11.2.0.2 and 11.2.0.1.

Let me emphasise once again that the above proof of concept code managed to allocate more memory than was set for the overall PGA usage of the entire instance. This can have an enormous, devastating impact on a consolidated database setup (meaning having multiple instances running on a single machine). Typically, once memory consumption of all the processes exceeds physically available memory, the operating system tries to use the swap device, to which it will swap memory pages in and out depending on memory usage of active (=on CPU) processes. Mild swapping shows as severely slowed-down processing (because a number of memory pages for processing need to be read from the swap device and placed in memory, from which the former contents need to be written to the swap device), heavy swapping shows as the machine coming down to a standstill.

Please mind that a diagnosis on the state of memory usage (alias swapping), just by looking at the amount of used swap (as can be seen in the ‘top’ output, or ‘swapon -s’) could be misleading. It’s also important to look at actual swapping in and out, as can be seen with ‘vmstat 1’ (si/so columns) or ‘sar -W’. I’ve found several systems which had been running for some time (approximately longer than a month) that had swap usage, sometimes up to 40%, while no ‘active swapping’, so memory pages being transferred to and from the swap device, was happening.
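The difference between 'swap in use' and 'actively swapping' can be checked mechanically from vmstat output. A minimal sketch, assuming the usual vmstat layout with a header line naming the columns; both samples below are invented for illustration and are not measurements from any system in this post:

```python
def swap_activity(vmstat_output: str):
    """Return one (si, so) tuple per sample line of vmstat output."""
    samples = []
    columns = None
    for line in vmstat_output.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            columns = fields                  # header line naming the columns
        elif columns and len(fields) == len(columns) and fields[0].isdigit():
            row = dict(zip(columns, fields))
            samples.append((int(row["si"]), int(row["so"])))
    return samples


def is_actively_swapping(vmstat_output: str) -> bool:
    """True if any sample shows pages going to or coming from swap."""
    return any(si or so for si, so in swap_activity(vmstat_output))


header = (
    "procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----\n"
    " r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st\n"
)
# Swap is in use (swpd > 0) but nothing moves: not actively swapping.
idle_sample = header + " 1  0 812340  95012   1204 402112    0    0    12    30  110  220  2  1 97  0  0\n"
# Pages are being read from and written to swap: actively swapping.
busy_sample = header + " 3  1 812340  10212   1204 402112  128 2048   640  2100  900 1500 10 20 40 30  0\n"

print(is_actively_swapping(idle_sample), is_actively_swapping(busy_sample))
```

When feeding real 'vmstat 1' output into this, remember that the first sample line reports averages since boot, so it is better to judge on the lines after it.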

Luckily, starting with Oracle 12 you can actually limit overall PGA usage using the parameter PGA_AGGREGATE_LIMIT. The default value is the greater of (list from Oracle documentation):
a) 2GB
b) 200% of PGA_AGGREGATE_TARGET parameter (or lower if 200% > (90% of physical memory – total SGA size) but not below 100%)
c) 3MB * PROCESSES parameter
The parameter cannot be set below its default value, except when set in a pfile or spfile.
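The default can be sketched as a simple maximum of the three values. This is a simplified model based only on the list above: the physical memory cap mentioned under b) is left out, and the PROCESSES values used in the example are assumptions, not settings from my test instance:

```python
MB = 1024 ** 2
GB = 1024 ** 3


def default_pga_aggregate_limit(pga_aggregate_target: int, processes: int) -> int:
    """Greater of 2GB, 200% of PGA_AGGREGATE_TARGET and 3MB * PROCESSES,
    in bytes. Simplification: the '90% of physical memory minus SGA'
    cap on the 200% rule is ignored."""
    return max(2 * GB, 2 * pga_aggregate_target, 3 * MB * processes)


# PGA_AGGREGATE_TARGET of 500MB as in this post; processes=300 is assumed.
print(default_pga_aggregate_limit(500 * MB, 300) // MB, "MB")
# A larger, hypothetical instance where the 200% rule wins.
print(default_pga_aggregate_limit(4 * GB, 1000) // MB, "MB")
```

With a small PGA_AGGREGATE_TARGET the 2GB floor decides the default, which is why a value like 600MB has to be forced via the spfile.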

Let’s set the PGA_AGGREGATE_LIMIT to 600MB and see what happens when we start doing a large allocation again:

SQL> alter system set pga_aggregate_limit=600m scope=spfile;

System altered.

SQL> startup force;

Okay, let’s run the pga_filler.sql script again, and try to allocate 900MB. This means the “grow_until” variable must be set to 900000000.
PLEASE MIND this is done as a regular user, the SYS user and background processes other than job queue processes are not subject to the limiting.

TS@v12102 > @pga_filler
declare
*
ERROR at line 1:
ORA-01423: error encountered while checking for extra rows in exact fetch
ORA-00039: error during periodic action
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
ORA-06512: at line 21

Great! Exactly like we expect, right?
Well…yes, but let’s look at the alert.log

Sat Dec 13 15:08:57 2014
Errors in file /u01/app/oracle/diag/rdbms/v12102/v12102/trace/v12102_ora_4147.trc  (incident=46599):
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
Incident details in: /u01/app/oracle/diag/rdbms/v12102/v12102/incident/incdir_46599/v12102_ora_4147_i46599.trc
Sat Dec 13 15:09:07 2014
Dumping diagnostic data in directory=[cdmp_20141213150907], requested by (instance=1, osid=4147), summary=[incident=46599].
Sat Dec 13 15:09:09 2014
Sweep [inc][46599]: completed
Sweep [inc2][46599]: completed

Okay, essentially, this tells us nothing interesting, except for the tracefile. Let’s look in /u01/app/oracle/diag/rdbms/v12102/v12102/trace/v12102_ora_4147.trc, being the tracefile as indicated in the above alert.log snippet:

*** 2014-12-13 15:08:57.351
Process may have gone over pga_aggregate_limit
Just allocated 65536 bytes
Dumping short stack in preparation for potential ORA-4036
----- Abridged Call Stack Trace -----
ksedsts()+244<-ksm_pga_limit_short_stack()+1016<-ksm_check_over_limit()+469<-ksmarfg()+574<-kghgex()+1376<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghfnd()+361<-kghalo()+4422<-kghgex()+414<-kghalf()+1003<-klmalf()+103
<-kllcqas()+194<-kcblasm()+108<-kxhfNewBuffer()+607<-qerhjSplitBuild()+632
----- End of Abridged Call Stack Trace -----
=======================================
PRIVATE MEMORY SUMMARY FOR THIS PROCESS
---------------------------------------
******************************************************
PRIVATE HEAP SUMMARY DUMP
781 MB total:
   781 MB commented, 646 KB permanent
   208 KB free (0 KB in empty extents),
     779 MB,   2 heaps:   "koh-kghu call  "            57 KB free held
------------------------------------------------------
Summary of subheaps at depth 1
779 MB total:
   778 MB commented, 110 KB permanent
    63 KB free (0 KB in empty extents),
     667 MB, 42786 chunks:  "pmuccst: adt/record       "
      83 MB, 5333 chunks:  "pl/sql vc2                "

Actually, this is the end of the tracefile. It seems that the pga limit dump (the text in between “Process may have gone over pga_aggregate_limit” to the private memory summary heap dumps) occurs several times before an actual ORA-4036 is triggered. In my private test instance, where I am obviously the only user process doing something, I get a pga limit dump approximately 20 times before the ORA-4036 is actually triggered:

sending 4036 interrupt
Incident 46599 created, dump file: /u01/app/oracle/diag/rdbms/v12102/v12102/incident/incdir_46599/v12102_ora_4147_i46599.trc
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT

Did you actually spot the oddity here?

Remember the PGA_AGGREGATE_LIMIT was set to 600M. Now look at the process’ PGA/Private heap summary dump above: it says 781M. Please mind the 781M is the PGA heap of a SINGLE process! When looking at the total PGA allocated for the entire instance, it’s even more:

SYS@v12102 AS SYSDBA> select value/power(1024,2) "MB" from v$pgastat where name = 'maximum PGA allocated';

	MB
----------
1041.16699

So…despite PGA_AGGREGATE_LIMIT set to 600M, according to the v$pgastat view, there’s 1041MB allocated for PGA. Please mind I haven’t looked into how accurate v$pgastat is, but I tend to believe this.
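
To see which allocations make up the process’ private heap, the “chunks” lines of the heap summary dump can be pulled apart with a few lines of Python. This is only a sketch: the regular expression is based purely on the two lines visible in the excerpt above, and the function name parse_chunk_lines is made up for this example. Other dump formats or units are not handled.

```python
import re

# Excerpt copied from the private heap summary shown earlier in this post.
DUMP = '''\
Summary of subheaps at depth 1
779 MB total:
   778 MB commented, 110 KB permanent
    63 KB free (0 KB in empty extents),
     667 MB, 42786 chunks:  "pmuccst: adt/record       "
      83 MB, 5333 chunks:  "pl/sql vc2                "
'''

# Matches lines such as:  667 MB, 42786 chunks:  "pmuccst: adt/record       "
# (assumption: only KB and MB occur, as in the excerpt)
CHUNK_LINE = re.compile(r'^\s*(\d+)\s+(KB|MB),\s+(\d+)\s+chunks:\s+"([^"]*)"')

def parse_chunk_lines(text):
    """Return (name, size_in_mb, chunk_count) for every 'chunks' line."""
    rows = []
    for line in text.splitlines():
        m = CHUNK_LINE.match(line)
        if m:
            size, unit, chunks, name = m.groups()
            size_mb = int(size) if unit == "MB" else int(size) / 1024
            rows.append((name.strip(), size_mb, int(chunks)))
    return rows

rows = parse_chunk_lines(DUMP)
for name, size_mb, chunks in rows:
    print(f"{name:<25} {size_mb:>6} MB in {chunks} chunks")
print("listed allocations:", sum(r[1] for r in rows), "MB")
```

For the excerpt above, the two listed allocation reasons together account for 750 MB of the 779 MB subheap.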

Summary.
I’ve seen PGA_AGGREGATE_TARGET being used as a calculation value for actual PGA usage of an instance. This is simply wrong. The actual amount of PGA memory allocated by the instance is highly dependent on what is done, and can be less than PGA_AGGREGATE_TARGET, or more. Automatic PGA can control three per-process memory areas: the sort, hash and bitmap memory areas. These are sized based on the setting of PGA_AGGREGATE_TARGET and the actual PGA memory usage instance wide. If more memory is needed for sort, hash or bitmap operations than is made available by the memory manager, the excess is allocated in the temporary tablespace. Any other PGA memory allocation is always done, regardless of the setting of PGA_AGGREGATE_TARGET.

Starting with Oracle 12, it seems the actual PGA allocation can now be limited with the new parameter PGA_AGGREGATE_LIMIT. However, some simple testing shows that more memory is actually allocated than the limit set with PGA_AGGREGATE_LIMIT. I haven’t tested it in more situations; this post is meant to raise awareness that the limit set by PGA_AGGREGATE_LIMIT might not be as hard as you would expect.

Please mind, PGA_AGGREGATE_LIMIT seems to truly limit PGA usage instance wide, not limit the PGA heap per process, as event 10261 (the way to limit PGA usage for Oracle 11.2) does. However, once again: PGA_AGGREGATE_LIMIT seems to try to be smart and does not limit at the exact size set, but beyond that.

The next post will introduce a way to limit PGA usage in Oracle 11.2. Stay tuned!


Tagged: consolidation, database, exadata, linux, memory usage, oracle, pga, pga_aggregate_limit, pga_aggregate_target, swap, swapping, uga

Oracle database operating system memory allocation management for PGA – part 2: Oracle 11.2

This is the second part of a series of blogposts on Oracle database PGA usage. See the first part here. The first part described SGA and PGA usage, their distinction (SGA being static, PGA being variable), the problem (no limitation for PGA allocations outside of sort, hash and bitmap memory), a resolution for Oracle 12 (PGA_AGGREGATE_LIMIT), and some specifics about that (it doesn’t look like a very hard limit).

But this leaves out Oracle version 11.2. In reality, the vast majority of the databases that I deal with at the time of writing are at version 11.2, and my guess is that this is not just the databases I deal with, but a general tendency. This could change in the near future with the desupport of Oracle 11.2; however, I suspect the installed base of Oracle version 12 will increase gradually and smoothly instead of in a big bang.

With version 11.2 there’s no PGA_AGGREGATE_LIMIT. This simply means there is no official way to limit the PGA. Full stop. However, there is an undocumented event to limit PGA usage: event 10261. This means that if you want to use this in a production database, you should ask Oracle support to bless the usage of it. On the other hand, Oracle corporation made this event public in an official white paper: Exadata consolidation best practices.

Let’s test event 10261! I’ve got the same table (T2) set up; a description of how to set this up, and the anonymous PL/SQL code to allocate PGA using a collection, can be found in the first part. I am using database version 11.2.0.4 with PSU 4 applied. The reason for choosing this version is that if you run a serious business on Oracle 11.2, THAT is the version you should be running!
(disclaimer: everything shown in this blogpost is purely for educational purposes. Do test everything thoroughly before applying this to a production system. Behaviour can or may be different in your specific situation)
The reason for this disclaimer: Bernhard (@bdcbuning_gridit) tweeted that he was warned that setting the event at the instance level could crash the instance. I am not sure whether this refers to setting it at runtime; the event is always evaluated at the instance level.

Okay, let’s more or less replicate the test done on Oracle version 12.1.0.2 in the first part. In this database PGA_AGGREGATE_TARGET is set to 500M; now let’s set the event to 600M, which means we set the PGA limit to 600M.
This is setting the event at runtime:

SYS@v11204 AS SYSDBA> alter system set events = '10261 trace name context forever, level 600000';

System altered.

This is setting the event in the spfile (which means you need a restart of the instance to activate this event, or the above syntax to set it at runtime):

SYS@v11204 AS SYSDBA> alter system set event = '10261 trace name context forever, level 600000' scope=spfile;

System altered.

The level is the amount of memory to which the PGA must be limited, in kilobytes.
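
Because the level is expressed in kilobytes, a level of 600000 is not exactly the same as 600M when a megabyte is counted as 1024 kilobytes (which is how the v$pgastat queries in this post calculate MB). A minimal Python sketch of the conversion follows. The function names are made up for this example, the 1024-based units are an assumption on my part, and the statement text simply reuses the runtime syntax shown above.

```python
def level_for_mb(limit_mb):
    """Event level (kilobytes) for a limit given in MB of 1024 KB (assumption)."""
    return limit_mb * 1024

def level_to_mb(level_kb):
    """Convert an event level (kilobytes) back to MB of 1024 KB."""
    return level_kb / 1024

def event_statement(level_kb):
    """Build the runtime statement, reusing the syntax shown above."""
    return ("alter system set events = "
            f"'10261 trace name context forever, level {level_kb}';")

print(level_to_mb(600000))   # the level used in this post, expressed in MB
print(level_for_mb(600))     # the level for exactly 600 MB
print(event_statement(level_for_mb(600)))
```

With these assumptions, the level of 600000 used here corresponds to a little under 586 MB, and a limit of exactly 600 MB would need level 614400.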

Now start the anonymous PL/SQL block to fill up the PGA with a collection, again set to 900M:

TS@v11204 > @pga_filler
declare
*
ERROR at line 1:
ORA-10260: limit size (600000) of the PGA heap set by event 10261 exceeded
ORA-06512: at line 20

That’s nice! There’s actually a meaningful, descriptive error message which explains why this PL/SQL block ended!

Let’s look at the actual PGA memory used, as reported by v$pgastat:

SYS@v11204 AS SYSDBA> select value/power(1024,2) from v$pgastat where name = 'maximum PGA allocated';

VALUE/POWER(1024,2)
-------------------
	 676.078125

This differs from setting PGA_AGGREGATE_LIMIT: there is still more memory allocated than the limit (600000KB), but the excess is smaller (676M in 11.2.0.4 versus 1041M in 12.1.0.2). The outside visibility of the limiting is different too: there is NO notice of a process hitting the PGA limit in the alert.log file, nor in the process’ trace file(!). Another difference is that even SYS is limited: a test with the procedure running as SYS got me the ORA-10260 too, whereas PGA_AGGREGATE_LIMIT does not limit SYS.
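
To put a number on how far each mechanism overshoots, here is a small Python sketch using only the figures from this series. The assumptions are mine: the event level is taken as kilobytes of 1024 bytes, PGA_AGGREGATE_LIMIT=600M is taken as 600 MB of 1024 KB, and the observed values are the instance-wide “maximum PGA allocated” figures from v$pgastat, so this is a rough comparison, not an exact measurement.

```python
def overshoot(limit_mb, observed_mb):
    """Return (excess in MB, excess as a percentage of the limit)."""
    excess = observed_mb - limit_mb
    return excess, 100.0 * excess / limit_mb

# Figures from this series; the 11.2.0.4 limit is the event level in KB
# converted to MB of 1024 KB (assumption).
cases = {
    "11.2.0.4, event 10261": (600000 / 1024, 676.078125),
    "12.1.0.2, PGA_AGGREGATE_LIMIT": (600.0, 1041.16699),
}

for label, (limit_mb, observed_mb) in cases.items():
    excess, pct = overshoot(limit_mb, observed_mb)
    print(f"{label}: {excess:.1f} MB over the limit ({pct:.1f}%)")
```

Under these assumptions the event overshoots by roughly 90 MB (about 15%), while PGA_AGGREGATE_LIMIT overshoots by roughly 441 MB (about 74%) in my test.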

Event 10261 has had the same description at least as far back as version 11.2.0.1. Here’s a test with event 10261 set to 600M at version 11.2.0.3:

TS@v11203 > @pga_filler
declare
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [723], [123552], [top uga heap], [], [], [], [], [], [], [], [], []
ORA-06512: at line 20

As has been detailed in the Oracle white paper, prior to version 11.2.0.4 an ORA-600 [723] is signalled when event 10261 is set and more PGA memory is allocated than has been specified as the limit. The total amount of allocated PGA is 677M, so roughly the same as with version 11.2.0.4.

Because this is a genuine ORA-600 (internal error, ‘OERI’), this gives messages in the alert.log file:

Tue Dec 16 10:40:09 2014
Errors in file /u01/app/oracle/diag/rdbms/v11203/v11203/trace/v11203_ora_8963.trc  (incident=9279):
ORA-00600: internal error code, arguments: [723], [123552], [top uga heap], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/v11203/v11203/incident/incdir_9279/v11203_ora_8963_i9279.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

The process’ trace file in the trace directory only points to the incident file, no further details are available there.
The incident trace file contains a complete diagnostics dump.

The behaviour is identical with Oracle 11.2.0.2.

Summary
Prior to Oracle version 12, limiting the total amount of PGA memory used must be done using an undocumented event. The event is 10261. The event is made known in an official white paper. Still, I would open a service request with Oracle to ask for their blessing for setting this. This does not mean this functionality is not needed; I would deem it highly important in almost any environment, even when running a single database: this setting, when done appropriately, protects your system from over allocating memory, which could mean entering the swapping death-spiral. The protection means a process gets an ORA message, and the PGA allocation is aborted and the memory deallocated.

With version 11.2.0.4, hitting the limit set with event 10261 is not reported anywhere, other than the process getting the ORA-10260.

With versions prior to 11.2.0.4 (11.2.0.3 and 11.2.0.2 verified) processes do get an ORA-600 [723], which is also visible in the alert.log, and incidents are created accordingly.

When a limit has been set using event 10261, it still means more memory is allocated than set as limit (approximately 677M when 600M is set), but this is way less than with the PGA_AGGREGATE_LIMIT (1041M when 600M is set) in my specific situation. Test this in your own environment when you start using this.

Important addendum:
A very good comment by Alexander Sidorov emphasises an important aspect of the behaviour of event 10261: this event sets a limit per process, not for the entire instance!! (tested with 11.2.0.4 and 11.2.0.3)


Tagged: consolidation, event, internals, oracle, performance, pga_aggregate_target, tuning

Oracle database operating system memory allocation management for PGA – part 3: Oracle 11.2.0.4 and AMM: Quiz

This is a series of blogposts on how the Oracle database makes use of PGA. Earlier posts can be found here (PGA limiting for Oracle 12) and here (PGA limiting for Oracle 11.2).

Today a little Wednesday fun: a quiz.

What do you think will happen in the following situation (leave a response as comment please!):

-Oracle Linux x86_64 6u6.
-Oracle database 11.2.0.4 PSU 4
-Oracle database (single instance) with the following parameter set: memory_target=1G. No other memory related parameters set.

Run the pga_filler script (which can be found here (PGA limiting for Oracle 12)), with grow_until set to 2100000000 (approximately 2.1G).

I’ll try to create a blogpost on the outcome and an explanation on short notice!


Tagged: amm, fun, memory, oracle, pga, quiz