

Recently I patched an 11.2.0.2 grid infrastructure to a higher version. After the patching I started the grid infrastructure on that host, and ASM was unable to start. Looking in the alert.log file of the ASM instance, it turned out that upon starting ASM, even before the contents of the pfile/spfile were displayed, the instance crashed with an ORA-00600 error:
Sat Oct 13 14:35:07 2012
NOTE: No asm libraries found in the system
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 30960 free: 30960)
DFLT Huge Pages allocation successful (allocated: 67107111)
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
CELL communication is configured to use the following interface(s) for this instance
1.1.1.1
CELL interconnect IPC version: Oracle RDS/IP (generic)
IPC Vendor 1 Protocol 3
Version 4.1
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support enabled
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_20405.trc (incident=66599):
ORA-00600: internal error code, arguments: [kmgs_component_init_3], [60], [65], [17], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_66599/+ASM1_ora_20405_i66599.trc
Sweep [inc][66599]: completed
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
A quick Google search didn't result in any hits, nor did searching the MOS knowledge base and bug database.
I decided to start the ASM instance manually, and got an interesting message on screen just prior to the instance crashing:
ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance
Upon investigation of the contents of the spfile, it turned out the parameter 'remote_os_authent' was set for the ASM instance. Removing this parameter restored normal, expected behaviour, in other words: the ASM instance no longer crashed.
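For reference, a minimal sketch of the general approach when a spfile parameter prevents an instance from starting (paths are placeholders, and this is not necessarily how I did it back then; on clustered ASM the spfile usually lives in a diskgroup, so check the My Oracle Support notes for your exact configuration first):
$ sqlplus / as sysasm
SQL> create pfile='/tmp/asm_init.ora' from spfile;
(edit /tmp/asm_init.ora and remove the remote_os_authent line)
SQL> startup pfile='/tmp/asm_init.ora'
SQL> create spfile from pfile='/tmp/asm_init.ora';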
I tried to replay/redo this failure in a VM on my laptop (using Oracle Grid Infrastructure 11.2.0.3 in non-clustered mode), but was UNABLE to crash ASM by setting 'remote_os_authent' to either true or false. So it could either be a 'cluster feature' or a feature specific to version 11.2.0.2.
The reason for this blogpost is to have a little bit of documentation on the OERI [kmgs_component_init_3], so if anybody encounters this problem, there is at least some clue to be found, or even a resolution.
ps. kmgs_component_init is in the VOS layer (operating system dependent layer)
If you are administering an Oracle Exadata database machine whose base operating system image (the operating system version with which the system came) is Linux and version 11.2.3.1.0 (the current version is viewable with the command ‘imageinfo’, which needs root account privileges) or higher, and multiple users are accessing the system with password authentication, this blogpost might be an interesting read. Also, if you have witnessed a temporary lockout of the oracle user, or other users: this blogpost describes the reason and a potential resolution.
I administer several Exadata database machines, which were not all delivered at the same time, so the base image versions differ. Because I also administer the Linux operating system on the computing nodes in the Exadata database machines, I noticed the Linux settings slightly differ among the different Exadata database machine computing nodes. There is a positive side to this: apparently the team that maintains the base image does not only renew the packages on the image, but also gets feedback about the settings, and changes things to improve something. There is also a negative side to this: these changes are not documented anywhere (that I am aware of), so getting a new system is always a bit exciting, because things might have been changed. Or not…
I speak about ‘base image’ deliberately. After a system is delivered with a certain ‘base image’ (collection of kernel, executables, os-scripts and settings), the kernel, executables and os-scripts are renewed with an upgrade, but the settings remain the same. This blogpost is about a PAM (pluggable authentication modules) setting, which I encountered on base-image 11.2.3.1.0.
I witnessed the ‘oracle’ account being locked out temporarily on a system. The reason was a series of unsuccessful logon attempts. This could be something which complies with somebody’s security standards (but whose?). I think the ‘oracle’ account on an Oracle database system being locked out (albeit temporarily) is highly undesirable on most systems. As I’ve described earlier: this is a setting which the Exadata image engineering team decided to implement with pluggable authentication modules.
Meet pam_tally2.so…
In the directory /etc/pam.d there are two files which configure the temporary lockout when a series of unsuccessful logon attempts have been made: ‘login’ and ‘sshd’. The actual line responsible for the lockout is:
auth required pam_tally2.so deny=5 onerr=fail lock_time=600
In most environments where no actual security compliance rules are known, I disable this behaviour by commenting the line out with a hash (‘#’) sign as the first character. Of course you could read the manpage of pam_tally2 (search on the internet for ‘man pam_tally2’) and configure it to your liking or to comply with your security rules.
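As a sketch of both actions, commenting out the line and resetting the failure count for a locked-out user could look like this (the reset flags are from the pam_tally2 manpage; double-check before running this on a production system):
# sed -i.bak '/pam_tally2/s/^/#/' /etc/pam.d/login /etc/pam.d/sshd
# pam_tally2 --user oracle --reset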
Recently we upgraded an Exadata to the latest version at the time of writing, 11.2.3.2.0. The Exadata software itself consists of an image for the storage servers (the storage servers are essentially re-imaged), and a set of updates for the database/computing nodes, including firmware for the ILOM (lights out adapter), BIOS, LSI RAID adapter and Infiniband adapter, the linux kernel, drivers and mandatory packages, to name some.
One of the exceptional things this upgrade does, is remove the hot-spare out of the RAID set on the database/compute nodes. This is documented in MOS note: 1468877.1, as ‘known issue 5: hotspare removed for compute nodes’. For some sites, this actually can be a good thing, if they are really tight on disk space on the compute nodes of Exadata. I must say that we have not encountered this situation. What this means, is that the actual HDD configuration on the compute node is left to the customer, instead of having one mandatory configuration (having 3 disks in a RAID-5 configuration, and one hot-spare).
So if you decide to use the former hot-spare disk as an active part of the RAID configuration, you are effectively trading availability for diskspace. Please mind the RAID set itself already provides redundancy, even without the hot-spare!
On the other hand, I think in most configurations, it makes sense to convert the disk back to being hot-spare.
This is done in the following way:
a) Get an overview of the current disk configuration:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B70
This shows the disk in slot number 3 being left "unconfigured", but in "good" state (if a disk has gone bad because of errors, it will be removed from the RAID set, and will show up as "unconfigured(bad)"!)
This is the state the upgrade to 11.2.3.2.0 leaves your system in.
Now let's make the disk hot-spare again!
b) Get the enclosure id:
/opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Device ID : 252
This means we know the enclosure id (252) and the slot number (3), which is the information needed for the MegaCli utility to revert the unconfigured disk to hot-spare again!
c) Revert the unconfigured disk back to hot-spare
/opt/MegaRAID/MegaCli/MegaCli64 -PdHsp -set -EnclAffinity -PhysDrv[252:3] -a0
Adapter: 0: Set Physical Drive at EnclId-252 SlotId-3 as Hot Spare Success.
Exit Code: 0x00
d) Check the disk configuration again:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Hotspare, Spun Up
Device Firmware Level: 0B70
It appears that the disk having been removed from the RAID set by the update to 11.2.3.2.0 generates an ASR message. At least it did at our site, despite this being an undocumented bug (7161023, 'ASR generating false errors in relation to disks') which is marked resolved in 11.2.3.1.0 (?). Most sites I encounter have ASR set up, but do not have all messages sent additionally to local, onsite monitoring. I want to stress it's very important to have the ASR messages sent to your own monitoring too!
Oracle Support does not list all the specifications from an ASR message it has received. Instead, a Service Request is created with enough information for Oracle itself (!!). In our case, the exact error message was NOT specified, only 'compute server hard disk predictive failure' and the node name.
Where do you look on an Exadata for that information? The first logical point is the ASR daemon. I didn't spend too much time on it, but it seems that it's more a proxy for messages than a database. I wasn't able to find useful information about the systems which were using this daemon.
What are the sources for ASR with an Exadata? These are:
Computing node:
- "compmon daemon" / Linux level monitoring
- ILOM
Storage node:
- "cell daemon" / Linux level monitoring
- ILOM
For the computing node, it's quite easy to see if there are any detected failed devices from the viewpoint of the ILOM:
(please mind ipmitool -I open only works on the local system)
# ipmitool -I open sunoem cli "show /SP/logs/event/list Severity==(Major,Critical,Down)"
Connected. Use ^D to exit.
-> show /SP/logs/event/list Severity==(Major,Critical,Down)
ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------
-> Session closed
Disconnected
This shows no messages with the severity Major, Critical or Down are in the eventlog in the ILOM. Please mind that the logons to the ILOM have severity "Minor". On most systems these are the vast majority of the messages, and they are not of interest for this investigation. If you want to know if something has failed, there is an even simpler command:
# ipmitool -I open sunoem cli "show faulty"
For the "compmon daemon", grep the processlist for "compmon":
# ps -ef | grep compmon
root 12812 1 0 Oct22 ? 00:00:11 /usr/bin/perl -w /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -server
The most important part here is the directory: /opt/oracle.cellos/compmon
If you navigate to that directory, you will see a number of "state files": asrs.state, traps.state and disks.state.
The disks.state file lists the disk status as shown under a), including the firmware state.
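Since these are plain text files, you can simply look at them (the exact contents will differ per system and image version):
# cat /opt/oracle.cellos/compmon/disks.state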
The most important file for the ASR message investigation is the traps.state file. This file lists traps it has sent to ASR. In our case:
1 ; Mon Oct 22 14:39:10 2012 ; 86425886-b359-4587-8d46-f31ff2ecb135 ; Physicaldisk : Make Model: is at status predictive failure. Raised fault id: HALRT-02008 ; Physical disk should be replaced. Exadata Compute Server: Disk Serial Number:
Yes, this is pasted correctly: it is missing the Physicaldisk, Make Model and Disk Serial Number information. This has not been omitted for safety; it just is not listed.
So, the failure which was sent was HALRT-02008 in our case.
For completeness, the ILOM layer can be investigated identically to the ILOM handling described for the computing layer. The Linux layer messages can be investigated with:
# cellcli -e list alerthistory
32 2012-10-17T02:00:27+02:00 info "HDD disk controller battery on disk contoller at adapter 0 is going into a learn cycle. This is a normal maintenance activity that occurs quarterly and runs for approximately 1 to 12 hours. The disk controller cache might go into WriteThrough caching mode during the learn cycle. Disk write throughput might be temporarily lower during this time. The message is informational only, no action is required."
33 2012-10-22T11:43:21+02:00 info "Factory defaults restored for Adapter 0"
34 2012-10-22T11:43:23+02:00 info "Factory defaults restored for Adapter 0"
A substantial part of the people I encounter present using OSX on a Macbook. I am not sure how many of these people use Apple’s Keynote for presenting, but I like Keynote very much for various reasons, like a cleaner interface. This blogpost is about some handy tips and tricks I learned during a few years of presenting around the world. If you don’t use OSX, this blogpost is probably not for you.
1. Setup a private network with your iPhone/clicker
This first step has two important reasons. The first reason is extremely obvious: in order to communicate with your iPhone/clicker, you need a connection. The second reason is a little less obvious: if the conference you are attending as a speaker has wireless access, you probably joined that wireless network. In order to make your computer NOT respond to any kind of signal from the internet (growl, notification center, updates, etc.), you really should disconnect first. When you set up a private network with your iPhone/clicker, you are not connected to the internet anymore. (Obviously you need to disconnect any wired internet connections too!)
This is done on the Macbook using the wifi signal strength indicator on the upper right side: choose ‘Create Network’. Choose 40-bit WEP (this isn’t the safest encryption in the world, but you are going to use it for a relatively short time), and choose a 5 character password.
Next go to the Settings on your iPhone, choose ‘Wi-Fi’, and select the network you just set up on your Macbook. The default name of the local network is the name of the computer. If it’s the first time, or you’ve changed the password, enter the 5 character password you chose when setting up the local network.
What is lesser known, is that you DO NOT HAVE A CONNECTION AT THIS MOMENT. The simple reason is there is no DHCP server which gives both your Macbook and your iPhone an IP address. You need to wait a little while; then both your Macbook and your iPhone will self-assign an IP address. On your Macbook, go to System Preferences>Network, and click on “Wi-Fi”. It has an orange colour, not a green colour as you might expect. Once you have clicked on “Wi-Fi”, the description will say something like:
Wi-Fi has the self-assigned IP address 169.254.111.111 and will not be able to connect to the Internet.
Your IP address will be different. Now go to your iPhone, go to Settings>Wi-Fi, and look at what network is selected. It should be the network with the name of your Macbook. If your iPhone went into powersave, it has probably gone back to the wireless of the conference; more on that later. (Re)select the network with the name of your Macbook, and click on the blue circle with the greater-than sign in it on the right side of the network. It shows you an IP address and the subnet mask. If you just re-set the Macbook network, you probably must wait a little while before it assigns an IP address to itself.
In order to perform a test if a connection is possible, open a terminal on your Macbook, and ping the (self assigned) IP address of the iPhone. If the network connection can be used, ping will show response from the iPhone.
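For example, assuming the iPhone reported 169.254.222.222 (a made-up address for illustration; substitute the address your iPhone shows):
$ ping -c 3 169.254.222.222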
2. Disable powersave on your Macbook
You do not want your Macbook to go into powersave while you are setting it up, talking to people, when presenting, or when you go out of the presentation to show something and you are discussing that. There is an extremely simple way to prevent that: Caffeine. Search for this little piece of software on the internet, or, even simpler: go to the OSX app store, and search for Caffeine. It’s a free application. If you fire it up, it shows an empty cup in the top bar on the right side. When you do not want your computer to go into powersave at any time, click on the cup: it will show a full cup of coffee. That’s simple, right?
3. Disable powersave on your iPhone
Probably you have set your iPhone up to powersave too. This is done in Settings>General>Auto-lock; set it to ‘Never’. As you probably know or learned, once your iPhone goes into powersave, it turns off wireless. So if you enable your iPhone, wireless will turn on again, and just search for any network it can autoconnect to. This is the reason it will connect to the conference wireless again: the local network is not saved by default, but the conference wireless is.
4. Use your iPhone as a clicker
There are two ways I’ve used the iPhone as a clicker: the ‘Remotepad’ app (which needs an application of the same name on OSX too, and turns your iPhone into a mouse), or the Keynote Remote app. If you are serious about presenting, and want to use your iPhone as a remote, my opinion is to buy the Keynote Remote app. The strong point is its simplicity: swipe right to left to go forward (‘click’) or swipe left to right to go backward. The other two functions it’s got are go to beginning and go to end. That’s all.
If you didn’t have the Keynote Remote app yet, and have installed it on your iPhone and set up the network, there’s one additional thing you should do: link Keynote with the app. Start up or select Keynote, select Keynote>Preferences and go to the ‘Remote’ tab/icon. Now select ‘Enable iPhone and iPod touch remotes’, and link the two together.
Keynote has to be started on your Macbook, and the presentation you want to use needs to be loaded, but does not have to be set in presenting mode already; if you start the Keynote Remote app on your iPhone, it will put Keynote in presentation mode with the current slide.
Happy presenting!
Profiling PL/SQL has been an Oracle feature since Oracle 9i (2001). Yet I haven’t seen anyone actually use the profiler (besides myself, which is the reason for this post). On the other hand, I have seen shiploads of people guess what the time profile of their PL/SQL looks like. I have also seen people instrument their code, which ranges from bad (too little instrumentation, too much instrumentation, the wrong timescale (measuring things that take less than a millisecond with a granularity of seconds, for example)) to good (using the correct granularity, using Method-R’s ILO).
Profiling and instrumentation go hand-in-hand; they absolutely don’t rule each other out. Instrumentation points you to the chunk(s) of code which take the most time, and should always be on. Profiling takes timing to the source code line level, and is the logical next step when you’ve been pointed to a piece of code that takes a long time. Profiling should be off by default.
I know the basic form of profiling has been described by a number of websites, including the Oracle documentation and Tim’s Oracle-base. To be honest, I haven’t got very much to add. (Just like all the other websites which seem to have taken a copy of Oracle-base.)
The description of dbms_profiler in this article is about using dbms_profiler on an Oracle 11.2 database. If the profiler package is not loaded (check with ‘desc dbms_profiler’ as a DBA granted user), load it as SYSDBA with ‘@?/rdbms/admin/profload.sql’. The only thing you have to do then is add a few tables and a sequence to the schema which is executing the profiler package. The simplest way to do this is to log on to the database server, start sqlplus as this owner, and execute:
@?/rdbms/admin/proftab.sql
The next thing to do, is add start and stop commands for the profiler in the piece of ‘trouble code’. The code can be either SQL or PL/SQL. In PL/SQL add:
dbms_profiler.start_profiler ( 'profiler run ' || to_char ( sysdate, 'yyyymmddhh24mi' ) ); --start
dbms_profiler.stop_profiler; --stop
In SQL add ‘exec’ in front of calling dbms_profiler.
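In a SQL*Plus script that could look like this (the script name in the middle is just a placeholder for your ‘trouble code’):
exec dbms_profiler.start_profiler ( 'profiler run ' || to_char ( sysdate, 'yyyymmddhh24mi' ) );
@your_trouble_code.sql
exec dbms_profiler.stop_profiler;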
There’s a gotcha you really need to know: the profiler records the source line number. What you can do, and most scripts found on the internet do, is join the profiler output with the ALL_SOURCE view to see what happens at that line. Anonymous PL/SQL blocks are not recorded in ALL_SOURCE, so when you profile anonymous PL/SQL blocks, you get ‘anonymous’ in the unit_owner and unit_name fields.
Let’s look at a complete example:
a) Please mind this is an extremely simple example! I use the SYS user for the sake of simplicity. Don't do this in real environments!
b) I’ve checked the existence of dbms_profiler. In fact, I needed to load the profiler package in this database.
c) I ran the proftab script to generate the needed tables for the profiler.
d) This is the script I’ve used:
-- script: gtt_and_collection.sql
-- setup
set serveroutput on;
create global temporary table gtt ( col1 number, col2 varchar2(10) ) on commit preserve rows;
create or replace procedure insert_example is
  nr_to_loop number:=1000000;
  nr number:=0;
  type col_type is table of gtt%ROWTYPE;
  col col_type:=col_type();
begin
  --dbms_profiler.start_profiler('profiler run '||to_char(sysdate,'yyyymmddhh24mi'));
  dbms_output.put_line('Start: '||to_char(sysdate,'hh24:mi:sssss'));
  for nr in 1..nr_to_loop loop
    /* gtt */
    insert into gtt (col1, col2) values ( nr, 'AABBCCDDEE' );
    /* collection */
    col.extend;
    col(nr).col1 := nr;
    col(nr).col2 := 'AABBCCDDEE';
  end loop;
  commit;
  dbms_output.put_line('Stop: '||to_char(sysdate,'hh24:mi:sssss'));
  --dbms_profiler.stop_profiler;
end;
/
-- execute test
exec insert_example;
-- clean up
--drop procedure insert_example;
truncate table gtt;
drop table gtt;
Let's run it:
SQL> @gtt_and_collection
Table created.
Procedure created.
Start: 06:57:25074
Stop: 06:58:25110
PL/SQL procedure successfully completed.
Procedure dropped.
Table truncated.
Okay, the procedure ran for approximately 1 minute. Let's suppose this takes too long. Now what are you going to do? Can this be sped up? Is the time taken by the insert into the global temporary table? Some would argue this is a temporary table, so it probably exists in memory. Or is it the collection which eats up the time? Or maybe even the commit after the loop?
Well, enable the profiler by removing the '--' in front of the dbms_profiler calls in the procedure, and run it again! (Please mind flashback database is an incredibly handy feature to redo tests with absolutely the same database. The downside is that you need to stop and mount the database to flash it back, which means other users or developers have downtime.)
When it runs with the profiler, nothing extraordinary is seen, it just runs. You have to make a report of the profiler data yourself. This is the script I use:
col owner format a10
col name format a20
col line format 9999
col source format a40 word_wrap
col occurances format 9999999
col tot_time_s format 999,999,990.999
col run_comment format a40
col pct format 999
select runid, run_date, run_comment from plsql_profiler_runs;
select u.unit_owner OWNER, u.unit_name NAME, d.line# LINE,
       (select text from all_source
        where type in ('PACKAGE BODY','FUNCTION','PROCEDURE','TRIGGER')
        and name = u.unit_name and line = d.line# and owner = u.unit_owner
        and type = u.unit_type) SOURCE,
       d.total_occur OCCURANCES,
       (d.total_time/1000000000) TOT_TIME_S,
       d.total_time/r.run_total_time*100 "PCT"
from plsql_profiler_runs r, plsql_profiler_units u, plsql_profiler_data d
where r.runid=&&runid
and r.runid=u.runid
and r.runid=d.runid
and d.unit_number=u.unit_number
and (d.total_time/1000000000) > 1 --more than 1 second
order by tot_time_s
/
select (run_total_time/1000000000) tot_time_s from plsql_profiler_runs where runid=&runid;
undefine runid
The most notable difference from scripts found on the internet is that this script orders the output by time spent, which is why I use it, and why you probably will too. If you profile a real-life piece of code, there are probably more lines of code involved than fit on your screen, so having the output in the order of the source code doesn't make sense to me.
What is the output?
RUNID RUN_DATE RUN_COMMENT
---------- --------- ----------------------------------------
1 26-NOV-12 profiler run 201211260638
2 26-NOV-12 profiler run 201211260643
Enter value for runid: 2
old 3: where r.runid=&runid
new 3: where r.runid=2
OWNER NAME LINE SOURCE OCCURANCES TOT_TIME_S PCT
---------- -------------------- ----- ---------------------------------------- ---------- ---------------- ----
SYS INSERT_EXAMPLE 15 col.extend; 1000000 2.386 6
SYS INSERT_EXAMPLE 12 insert into gtt (col1, col2) values ( 1000000 32.836 88
nr, 'AABBCCDDEE' );
old 1: select (run_total_time/1000000000) tot_time_s from plsql_profiler_runs where runid=&runid
new 1: select (run_total_time/1000000000) tot_time_s from plsql_profiler_runs where runid=2
TOT_TIME_S
----------------
37.130
Well, this makes it all obvious, doesn't it? Only two lines of code have spent more than 1 second in total, which are the two shown. The collection management (adding a row) is responsible for 6% of the time, and the insert is responsible for 88%.
I guess the first question which comes to mind when reading this title is ‘why?’. For a database, but I guess for any IO dependent application, we want IO’s to be faster, not to throttle them, in other words make them slower. Well, the ‘why’ is: if you want to investigate IO’s, you sometimes want to slow them down, so it’s easier to see them. Also, (not so) recent improvements in the Oracle database have made great progress in using the available bandwidth by doing IO in parallel, which can strip away much of the ability to see them in Oracle’s popular SQL trace.
I use VMWare Fusion on my MacBook and use Linux in the VM’s to run the Oracle database. Desktop virtualisation like VMWare Fusion (and Virtualbox and VMWare Workstation; I think all desktop virtualisation products) uses the operating system IO subsystem. This introduces a funny effect: if you stress the IO subsystem in the VM and measure throughput, it looks like the disk or disks are getting faster and faster every run. The reason for this effect is that the blocks in the file, which is the disk from the perspective of the VM, are getting touched (read and/or written) more and more, and thus are increasingly better candidates for caching from the perspective of the underlying operating system.
I think that if you combine the ‘disk getting faster’ effect with the need to investigate IO’s, you understand that it can be beneficial to throttle IO’s in certain cases.
The mechanism which can be used to control and throttle resources is ‘cgroups’. Cgroups is a Linux kernel feature, the name being an abbreviation of ‘control groups’, which has the function to limit, account and isolate resource usage (see the Wikipedia article on cgroups). Cgroups have been a Linux kernel feature since kernel version 2.6.24. This means there are no cgroups in RedHat and Oracle Linux version 5, but there are in version 6.
The idea behind cgroups is to have control over resources in a system, which becomes more and more important with today’s systems getting bigger. Cgroups have been created to function from single processes to complete (virtualised) systems.
(please mind all commands are either executed as root (indicated by ‘#’), or as a regular user (oracle in my case, indicated by ‘$’))
First, we need to make sure the ‘blkio’ controller is available:
# grep blkio /proc/mounts || { mkdir -p /cgroup/blkio ; mount -t cgroup -o blkio none /cgroup/blkio ; }
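A quick sanity check (not strictly necessary): the kernel lists the controllers it knows about in /proc/cgroups, so blkio should show up there:
# grep blkio /proc/cgroups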
Next, we create a cgroup called ‘iothrottle’:
# cgcreate -g blkio:/iothrottle
In order to throttle IO on the device, we need to find the major and minor number of the block device. If you use ASM, you can list the PATH field in the V$ASM_DISK view, and generate a long listing of it on linux:
$ ls -ls /dev/oracleasm/disk1
0 brw-rw----. 1 oracle dba 8, 16 Dec 15 13:22 /dev/oracleasm/disk1
This shows that the major and minor number of the block device are 8 and 16.
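An alternative way to get these numbers is stat; please mind it prints them in hexadecimal, so minor number 16 shows up as 10:
$ stat -c '%t %T' /dev/oracleasm/disk1
8 10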
The next step is to use the 'read_iops_device' configuration option of the blkio controller to apply throttling to the 'iothrottle' cgroup. The 'read_iops_device' configuration option uses the following format: major_number:minor_number nr_IO_per_second (major:minor, space, maximum number of read IO's per second)
# cgset -r blkio.throttle.read_iops_device="8:16 10" iothrottle
Okay, we now have a cgroup called 'iothrottle' setup, and used the 'read_iops_device' option of the 'blkio' controller. Please mind there are no processes assigned to the cgroup yet. The next steps are to use an IO generation and measurement tool to first measure uncapped IO performance, then assign the process to the 'iothrottle' cgroup, and rerun performance measurement.
For the IO tests I use 'fio'. This tool gives you the opportunity to investigate your system's IO subsystem and IO devices performance. This is my fio.run file:
$ cat fio.run
[global]
ioengine=libaio
bs=8k
direct=1
numjobs=4
rw=read
[simple]
size=1g
filename=/dev/oracleasm/disk1
Now run it! Please mind I've snipped a large part of the output, because fio gives a great deal of output, which is extremely interesting, but not really relevant to this blog:
$ fio --section=simple fio.run
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
fio 1.57
Starting 4 processes
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
Jobs: 2 (f=2): [_R_R] [100.0% done] [123.3M/0K /s] [15.4K/0 iops] [eta 00m:00s]
...
So, we did an average of 15.4K read IOPS. Now let's put the process which runs fio in the 'iothrottle' cgroup!
Get the PID of the shell from which we just ran 'fio':
$ echo $$
5994
And assign the 'iothrottle' cgroup to it:
# echo 5994 > /cgroup/blkio/iothrottle/tasks
You can see if your process is assigned a cgroup by reading the 'cgroup' file in 'proc':
$ cat /proc/self/cgroup
1:blkio:/iothrottle
Okay, we are assigned the 'iothrottle' cgroup! Now rerun the 'simple' fio benchmark:
$ fio --section=simple fio.run
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
fio 1.57
Starting 4 processes
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
Jobs: 4 (f=4): [RRRR] [0.3% done] [81K/0K /s] [9 /0 iops] [eta 14h:37m:42s]
To be honest, I cancelled this fio run after a little while, because the time it takes to run is very long (approximately 14 hours and 30 minutes, as can be seen above).
I think this example shows the cgroup 'iothrottle' in action very clearly!
I can't imagine anybody wants to echo all the process IDs into the cgroup's 'tasks' file in order to get these processes in a certain cgroup. With my DBA background, I would love to have control over an entire (all processes belonging to a) database. Setting up cgroups as done above (manually) also means you have to set it up again every time the server reboots. Luckily, there is a way to automate cgroup creation and assignment!
cgconfig service
In order to create cgroups, there is a service called 'cgconfig', which reads the file /etc/cgconfig.conf. To get the 'iothrottle' cgroup and the disk throttling configuration created automatically, use this configuration:
mount {
blkio = /cgroup/blkio;
}
group iothrottle {
blkio {
blkio.throttle.read_iops_device="8:16 10";
}
}
In order to use this configuration, restart the cgconfig service using 'service cgconfig restart'. Optionally, you can enable automatic starting of this service on startup using 'chkconfig --level 2345 cgconfig on' (optionally check when this service is started with 'chkconfig --list cgconfig'). Now the cgroup is created.
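To check that the group and the throttle setting were actually created, the libcgroup toolset also has a read command (a sketch; check 'man cgget' for the exact flags):
# cgget -r blkio.throttle.read_iops_device iothrottle
But how do we assign processes to it?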
cgred service
This is what the cgred service is for. This daemon uses a simple configuration file: /etc/cgrules.conf. Once configured and active, this service assigns cgroups to users, groups or processes. For the purpose of limiting IO from an Oracle database, I created this simple line:
oracle blkio /iothrottle
Now the cgred service can be started using 'service cgred restart'. Optionally, you can enable automatic starting of this service using 'chkconfig --level 2345 cgred on'.
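To verify the rule is applied, start a fresh session as the oracle user and check its cgroup file in proc, just like we did earlier; it should show the iothrottle group:
# su - oracle
$ cat /proc/self/cgroup
1:blkio:/iothrottle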
The purpose of this blogpost was to introduce cgroups, and let you understand why I chose the IO throttling functionality. Next it showed how to set up cgroups manually, and a simple test to prove it works, with enough information to let you repeat the test yourself. The last part showed how to automate cgroup creation and assignment.
A word of caution is in order. It's a fairly new feature at the time of writing, which means it could break or not work as expected. So use it at your own risk! In my limited tests it worked like a charm.
Recently I was discussing some IO related waits with some friends. The wait I was discussing was ‘kfk: async disk IO’. This wait was always visible in Oracle version 11.2.0.1 and seems to be gone in version 11.2.0.2 and above. Here is the result of some investigation into that.
First: the wait is not gone with version 11.2.0.2 and above, which is very simple to prove (this is a database version 11.2.0.3):
SYS@v11203 AS SYSDBA> select name, parameter1, parameter2, parameter3, wait_class from v$event_name where name like 'kfk: async disk IO';
NAME PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
-------------------- ---------- ---------- ---------- ----------------------------------------------------------------
kfk: async disk IO count intr timeout System I/O
What is interesting, is that the wait class is ‘System I/O’. I don’t know the official definition of the wait class ‘System I/O’, but it tells me that it is something a background process is waiting for, not my foreground process. But I could be wrong…
Let’s look at an excerpt of a tracefile of an asynchronous, full segment scan in Oracle version 11.2.0.1. This host is running in a VM on OL6.3 x64:
PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356620409310181 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=120,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310180
EXEC #3:c=0,e=63,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310297
WAIT #3: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1356620409310351
WAIT #3: nam='Disk file operations I/O' ela= 9 FileOperation=2 fileno=5 filetype=2 obj#=-1 tim=1356620409311203
WAIT #3: nam='Disk file operations I/O' ela= 238 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409312218
WAIT #3: nam='kfk: async disk IO' ela= 14 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409313703
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409314449
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315169
WAIT #3: nam='kfk: async disk IO' ela= 5 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315850
WAIT #3: nam='Disk file operations I/O' ela= 42 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409316451
WAIT #3: nam='kfk: async disk IO' ela= 33 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409317903
WAIT #3: nam='kfk: async disk IO' ela= 403 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409320529
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409321950
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409323627
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409325424
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409327121
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409329153
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409330861
WAIT #3: nam='kfk: async disk IO' ela= 8 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409332534
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409334179
What is quite clear to see is that the database is doing some stuff which is instrumented in the wait interface (‘Disk file operations I/O’), but also that the event ‘kfk: async disk IO’ occurs very frequently. It is NOT taking an enormous amount of time (ela < 10 microseconds most of the time).
Let's dig a little deeper. Most events are instrumentation for calls to the operating system to do something. Let's take the Linux 'strace' utility, and list the waits together with the system calls. This is done with 'strace -e write=all -p PID'; here is a fragment:
io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
io_getevents(139892671860736, 2, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 122880, 0}}, {0, 0}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 107) = 107
| 00000 57 41 49 54 20 23 33 3a 20 6e 61 6d 3d 27 6b 66 WAIT #3: nam='kf |
| 00010 6b 3a 20 61 73 79 6e 63 20 64 69 73 6b 20 49 4f k: async disk IO |
| 00020 27 20 65 6c 61 3d 20 35 33 20 63 6f 75 6e 74 3d ' ela= 5 3 count= |
| 00030 31 20 69 6e 74 72 3d 30 20 74 69 6d 65 6f 75 74 1 intr=0 timeout |
| 00040 3d 34 32 39 34 39 36 37 32 39 35 20 6f 62 6a 23 =4294967 295 obj# |
| 00050 3d 37 33 34 32 36 20 74 69 6d 3d 31 33 35 36 36 =73426 t im=13566 |
| 00060 32 31 31 31 30 32 35 38 30 31 34 21110258 014 |
write(9, "\n", 1) = 1
| 00000 0a . |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b585420, 0x7f3b4b585420, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
| 00000 57 41 49 54 20 23 33 3a 20 6e 61 6d 3d 27 6b 66 WAIT #3: nam='kf |
| 00010 6b 3a 20 61 73 79 6e 63 20 64 69 73 6b 20 49 4f k: async disk IO |
| 00020 27 20 65 6c 61 3d 20 36 20 63 6f 75 6e 74 3d 31 ' ela= 6 count=1 |
| 00030 20 69 6e 74 72 3d 30 20 74 69 6d 65 6f 75 74 3d intr=0 timeout= |
| 00040 34 32 39 34 39 36 37 32 39 35 20 6f 62 6a 23 3d 42949672 95 obj#= |
| 00050 37 33 34 32 36 20 74 69 6d 3d 31 33 35 36 36 32 73426 ti m=135662 |
| 00060 31 31 31 30 32 35 39 37 35 39 11102597 59 |
write(9, "\n", 1) = 1
| 00000 0a . |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
| 00000 57 41 49 54 20 23 33 3a 20 6e 61 6d 3d 27 6b 66 WAIT #3: nam='kf |
| 00010 6b 3a 20 61 73 79 6e 63 20 64 69 73 6b 20 49 4f k: async disk IO |
| 00020 27 20 65 6c 61 3d 20 39 20 63 6f 75 6e 74 3d 31 ' ela= 9 count=1 |
| 00030 20 69 6e 74 72 3d 30 20 74 69 6d 65 6f 75 74 3d intr=0 timeout= |
| 00040 34 32 39 34 39 36 37 32 39 35 20 6f 62 6a 23 3d 42949672 95 obj#= |
| 00050 37 33 34 32 36 20 74 69 6d 3d 31 33 35 36 36 32 73426 ti m=135662 |
| 00060 31 31 31 30 32 36 32 31 37 38 11102621 78 |
What is interesting here is that the event ‘kfk: async disk IO’ is reported after both the io_getevents() and the io_submit() call. (With the io_submit() call, one or more IO requests are dispatched to the operating system; with the io_getevents() call, the operating system (AIO) completion queue is examined to see if any requests are ready.) So we can conclude that ‘System I/O’ is not appropriate (by my definition): this is something my foreground process apparently is waiting for. We can also conclude that this event is related to IO. But what does it mean?
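As an aside: if you just want to count these calls rather than read the full call trace, strace has a counting mode (run as root against the foreground process during the scan, just like before):
# strace -c -e trace=io_submit,io_getevents -p PID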
In order to dig deeper, we need to know what the database is doing internally. The most obvious way is to read the source. But it’s not possible to get the Oracle database source if you’re not working for Oracle (and probably have been explicitly granted access). There are a number of ways to see more. I have written about ‘perf’, but the gdb (GNU debugger) can be used too.
In order to debug a running Oracle database process, you can attach to it with the debugger (as root) with ‘gdb -p PID’. An equivalent of using strace and looking for io_submit() and io_getevents() calls is:
# gdb -p 14122
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
...
0x0000003f38a0e530 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb)
Now enter the following commands (commands have a bold typeface):
(gdb) set pagination off
(gdb) break io_submit
Breakpoint 1 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) break 'io_getevents@plt'
Breakpoint 2 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) continue
Continuing.
And run the full segment scan again:
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x00000000009d1a58 in io_getevents@plt ()
Now we need to know the function call which Oracle uses for registering waits, for which we use the tracefile generated by SQL trace earlier. The first line in the tracefile of our current session, if we issue the full scan again, will be the ending of the wait ‘SQL*Net message from client’. The call which Oracle uses for writing this line is write(). Let’s clean up the current breakpoints, and set a breakpoint for the write call (stop the debugger’s current action using CTRL-c):
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break write
Breakpoint 3 at 0x3f38a0e4c0: file ../sysdeps/unix/syscall-template.S, line 82. (3 locations)
(gdb) c
Continuing.
(‘c’ is an abbreviation of ‘continue’)
And run the scan again.
Because of the breakpoint set, gdb will break execution when the write() call is issued:
Breakpoint 3, write () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
And generate a backtrace (a list of all the functions which were called to reach the write() call), with the ‘backtrace’ command (‘bt’):
(gdb) bt
#0 write () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000000086014dd in sdbgrfuwf_write_file ()
#2 0x0000000008601426 in sdbgrfwf_write_file ()
#3 0x00000000085caafb in dbgtfdFileWrite ()
#4 0x00000000085ca685 in dbgtfdFileAccessCbk ()
#5 0x00000000085c9d6d in dbgtfPutStr ()
#6 0x0000000001a8d93a in dbktWriteTimestamp ()
#7 0x0000000001a8d6b5 in __PGOSF61_dbktWriteNfy ()
#8 0x00000000085caa06 in dbgtfdFileWrite ()
#9 0x00000000085ca685 in dbgtfdFileAccessCbk ()
#10 0x00000000085c9d6d in dbgtfPutStr ()
#11 0x0000000001a8e291 in dbktPri ()
#12 0x00000000009f143e in ksdwrf ()
#13 0x00000000009f1d8f in ksdwrfn ()
#14 0x0000000005abf0e1 in kxstTraceWait ()
#15 0x000000000821e88d in kslwtectx ()
#16 0x00000000083c2a79 in __PGOSF24_opikndf2 ()
#17 0x000000000143b790 in opitsk ()
#18 0x00000000014406da in opiino ()
#19 0x00000000083c54fd in opiodr ()
#20 0x0000000001437b60 in opidrv ()
#21 0x00000000018aac97 in sou2o ()
#22 0x00000000009d3ef1 in opimai_real ()
#23 0x00000000018aff26 in ssthrdmain ()
#24 0x00000000009d3e5d in main ()
What is interesting to see here is the function at number 15, kslwtectx(). We know this write reports the ending of waiting on ‘SQL*Net message from client’. This function is ‘Kernel Service Layer WaiT End ConTeXt’ (educated guess; I’ve had a little help from Tanel Poder to find the Oracle wait interface functions). The function which marks the start of a wait is called ‘kslwtbctx’, and the function which ends a wait is called ‘kslwtectx’. Let’s put a break and continue on io_submit, io_getevents, kslwtbctx and kslwtectx:
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break 'io_getevents@plt'
Breakpoint 5 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtbctx
Breakpoint 6 at 0x8217fda
(gdb) commands
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtectx
Breakpoint 7 at 0x821e3d8
(gdb) commands
Type commands for breakpoint(s) 7, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.
Now run the full scan again; this is what is seen in the debugger:
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
This goes on, until the IO calls end, and the foreground does some other stuff at the end, ending with the beginning of waiting on ‘SQL*Net message from client’:
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
Okay, back to ‘kfk: async disk IO’. We see that the io_getevents() call ALWAYS has a kslwtbctx() call before it, and always a kslwtectx() call after it. Please mind we are on 11.2.0.1. This strongly suggests ‘kfk: async disk IO’ is the instrumentation of the io_getevents() call, or that io_getevents() is part of this wait event.
Let’s make the io_getevents() call take longer, and see if that is reflected in the wait, to prove the hypothesis made above. This can be done using the ‘catch’ function of the debugger:
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) catch syscall io_getevents
Catchpoint 8 (syscall 'io_getevents' [208])
(gdb) commands
Type commands for breakpoint(s) 8, one per line.
End with a line saying just "end".
>silent
>shell sleep 0.01
>c
>end
(gdb) c
Continuing.
Please mind it’s ‘io_getevents’ here instead of ‘io_getevents@plt’, because the catch function looks the syscall up in a header file, instead of in the symbol table of the executable.
Now run the full scan again, with SQL trace turned on at level 8 to show the waits. Previously we saw waits with an elapsed time of mostly just a few microseconds. Here is a snippet from the trace:
PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356626115353145 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=203,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115353116
EXEC #3:c=1000,e=80,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115355176
WAIT #3: nam='SQL*Net message to client' ela= 30 driver id=1413697536 #bytes=1 p3=0 obj#=73426 tim=1356626115355832
WAIT #3: nam='kfk: async disk IO' ela= 26404 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115384177
WAIT #3: nam='kfk: async disk IO' ela= 25050 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115411677
WAIT #3: nam='kfk: async disk IO' ela= 24885 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115437990
WAIT #3: nam='kfk: async disk IO' ela= 25002 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115464244
*** 2012-12-27 17:35:15.492
WAIT #3: nam='kfk: async disk IO' ela= 25192 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115492891
WAIT #3: nam='kfk: async disk IO' ela= 25175 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115521419
WAIT #3: nam='kfk: async disk IO' ela= 24804 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115549124
WAIT #3: nam='kfk: async disk IO' ela= 24980 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115576955
WAIT #3: nam='kfk: async disk IO' ela= 24988 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115604879
A little explanation here: the debugger triggers the syscall catcher on both entry and exit of the syscall, so each io_getevents() call gets the 0.01 second sleep twice, roughly 20ms in total. The times being somewhat more than 20ms, instead of a few microseconds, give a fair indication the wait is related. We can also look in the debugger:
(gdb) info break
Num Type Disp Enb Address What
8 catchpoint keep y syscall "io_getevents"
catchpoint already hit 174 times
silent
shell sleep 0.01
c
Yes, it has been hit 174 times. So ‘kfk: async disk IO’ is the instrumentation of the io_getevents() call, or of a larger set of actions of which io_getevents() is part.
Okay, now that we know that, let’s switch to Oracle version 11.2.0.3…
This is what I find on doing EXACTLY the same (full scan) in the tracefile:
PARSING IN CURSOR #140382932318408 len=23 dep=0 uid=84 oct=3 lid=84 tim=1356626776003874 hv=1020534364 ad='7f491298' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #140382932318408:c=0,e=41,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003873
EXEC #140382932318408:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003962
WAIT #140382932318408: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356626776004009
WAIT #140382932318408: nam='direct path read' ela= 340 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1356626776045733
FETCH #140382932318408:c=164975,e=188719,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356626776192752
STAT #140382932318408 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=20944 pr=20941 pw=0 time=188708 us)'
STAT #140382932318408 id=2 cnt=1000000 pid=1 pos=1 obj=75579 op='TABLE ACCESS FULL T2 (cr=20944 pr=20941 pw=0 time=68726 us cost=5738 siz=0 card=1000000)'
Okay. The first thing to notice is the ‘kfk: async disk IO’ events are gone. We see the PARSE line, the SQL*Net message, a single ‘direct path read’ wait, and then the FETCH line. This is absolutely different behaviour from version 11.2.0.1!
What is happening? I think the only way to understand more about this is to run the debugger again with notification of io_submit, io_getevents, kslwtbctx and kslwtectx:
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
That’s odd! There are no calls to the code locations kslwtbctx and kslwtectx during IO processing! So, despite ‘kfk: async disk IO’ still being a wait event, it doesn’t seem to be instrumenting io_getevents() anymore. What if we make io_getevents() take longer (I’ve used the same sleep as with the earlier catching of the syscall, 0.01 second):
PARSE #140382932318408:c=0,e=168,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973832834
EXEC #140382932318408:c=0,e=68,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973833385
WAIT #140382932318408: nam='SQL*Net message to client' ela= 671 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356627973835367
WAIT #140382932318408: nam='asynch descriptor resize' ela= 680 outstanding #aio=0 current aio limit=152 new aio limit=182 obj#=75579 tim=1356627973837361
*** 2012-12-27 18:06:16.440
FETCH #140382932318408:c=185972,e=2604507,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356627976440376
No, the time in the FETCH line increases (elapsed: e=2604507, CPU: c=185972; there are 2418535 microseconds which are not spent on CPU), but there is only little time instrumented by wait events. A reasonable conclusion is that the ‘kfk: async disk IO’ wait event is different in this version (11.2.0.3). I also tried to slow down the io_submit() call the same way, with the same result: the time in the FETCH line increases, but there is nothing for which the database writes a wait line.
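For completeness, the io_submit() variant was done with the same catchpoint construction, mirroring the io_getevents() setup above (gdb’s responses omitted here):
(gdb) delete
(gdb) catch syscall io_submit
(gdb) commands
>silent
>shell sleep 0.01
>c
>end
(gdb) c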
What if I slow down the disk? This can be done using cgroups, as described in this post. I've set both my ASM disks to 1 IO per second. That should result in waits! This is the resulting trace output:
PARSE #140382932318408:c=0,e=20,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964787836
EXEC #140382932318408:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964788256
WAIT #140382932318408: nam='SQL*Net message to client' ela= 703 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356633964790122
WAIT #140382932318408: nam='asynch descriptor resize' ela= 699 outstanding #aio=0 current aio limit=238 new aio limit=264 obj#=75579 tim=1356633964792014
WAIT #140382932318408: nam='direct path read' ela= 497938 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1356633965295483
*** 2012-12-27 19:46:05.795
WAIT #140382932318408: nam='direct path read' ela= 495498 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1356633965795382
*** 2012-12-27 19:46:06.295
WAIT #140382932318408: nam='direct path read' ela= 495890 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1356633966295208
WAIT #140382932318408: nam='direct path read' ela= 495889 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1356633966795127
So, if I slow the IO down, I get ‘direct path read’ wait events. How does that look when I use the break/continue technique on this configuration?
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
The interesting part is the io_getevents() calls: there are consistently 4 calls before kslwtbctx is called, and another one or two calls before kslwtectx is called and wait time is registered (see above). Interesting, but no ‘kfk: async disk IO’.
Now let’s look at the same output when Oracle 11.2.0.1 is throttled:
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x00000000009d1a58 in io_getevents@plt ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
#0 0x000000000821e3d8 in kslwtectx ()
#0 0x0000000008217fda in kslwtbctx ()
This looks remarkably similar to the 11.2.0.3 version!
It seems the database issues a number of io_getevents() calls after asynchronous IOs are submitted with io_submit(); these calls return immediately (they are ‘non-blocking’). They just peek at the completion queue: if an IO is present in the completion queue, the process processes that IO. The number of non-blocking calls is 3 with version 11.2.0.1 and 4 with 11.2.0.3 (please mind the ‘slot’ mechanism, which controls the number of simultaneous AIO requests, is probably in play here too). If these calls do not return any IO request, the wait interface is entered (kslwtbctx()), and a blocking io_getevents() call is issued, which waits for at least one request to complete; the wait time of this call is visible as the event ‘direct path read’.
It also seems the first non-blocking io_getevents() call is instrumented by the ‘kfk: async disk IO’ event with Oracle 11.2.0.1, for reasons unknown to me. This instrumentation is not present in version 11.2.0.3.
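As an aside: the same call pattern can also be observed without gdb, for example with strace; a sketch (io_submit and io_getevents are the Linux syscall names on x86-64, <pid> is the PID of the oracle foreground process):
# as root:
strace -e trace=io_submit,io_getevents -p <pid>
Please mind libaio can complete a non-blocking io_getevents() entirely in userspace via the AIO ring, in which case no system call is made, so strace may show fewer io_getevents() calls than the gdb breakpoints do.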
In my previous post I touched on the topic of a “new” codepath (codepath meaning “the way the operating system is utilised”) for full segment scans. A full segment scan means scanning an entire segment (I use “segment” because it could also be a partition of a table or index), which is commonly visible in an explain plan as (but not restricted to) the rowsources “TABLE ACCESS FULL”, “FAST FULL INDEX SCAN” and “BITMAP FULL SCAN”.
Look at my presentation About multiblock reads to see how and when direct path reads kick in, and how they differ from buffered multiblock reads. Most notably, Oracle has released very little information about asynchronous direct path reads.
This post is about the implementation of direct path reads in the Oracle executable in 11.2.0.3 and the OS implementation on Linux.
For completeness: this is Oracle Linux 6.3 X64 with kernel 2.6.39-300.17.3.el6uek.x86_64. Database 11.2.0.3 64 bit, no PSU/CPU.
The database uses two ASM disks and the clusterware in single machine mode.
The system is running in a VM on my Macbook OSX 10.8.2 with SSD, VMWare Professional 5.0.2.
The database action which is traced with gdb is a simple “select count(*) from t2”, where t2 is a table without any indexes or constraints, which is big enough to make the database engine choose a direct path read. Oracle can choose among a number of new, largely undocumented mechanisms specifically for optimising direct path reads. The mechanism observed in this blogpost is truly asynchronous reads, which means the foreground process can issue several IO read requests, reap them when they are finished, and increase the number of simultaneous IO requests.
First we restore the setup from my previous post, which means a normal database process (using sqlplus), and a root session with gdb attached to this process using “gdb -p <pid>”.
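One way to find the PID of the server process belonging to the sqlplus session is via the process list; a sketch, assuming the instance is called ORCL and sqlplus uses a local (bequeath) connection, so the server process is named oracle<SID>:
ps -ef | grep oracleORCL
# then, as root:
gdb -p <pid>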
Now set breakpoints on the operating system IO calls (with the intention to see io_submit() and io_getevents()), to show the IO calls of the oracle foreground process, and let the oracle process continue processing by entering “c” (continue) in the debugger:
(gdb) rbreak ^io_.*
Breakpoint 1 at 0xa08c20
io_prep_pwritev;
Note: breakpoint 1 also set at pc 0xa08c20.
...
Breakpoint 45 at 0x7f2e71b1dc0c
io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 1-45, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
This is where ‘plt’ needs an introduction. ‘plt’ means procedure linkage table. This is a construction used for functions in an executable that uses dynamically linked shared libraries, which is exactly what the ‘oracle’ executable is. If you look at the address of the @plt call, you see it lies inside the oracle executable, while ‘io_submit () from /lib64/libaio.so.1’ is, as the line indicates, inside the dynamically linked library libaio.so.1; both mappings can be seen in /proc/<pid>/maps.
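This is easy to verify; a sketch, where <pid> is the PID of the oracle foreground process:
# show the mappings of the oracle executable and of libaio, with their address ranges:
grep -e oracle -e libaio /proc/<pid>/maps
The @plt address (0x0000000000a09030) falls inside the address range of the oracle executable, while 0x0000003f38200660 falls inside the range of libaio.so.1.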
Back to io_getevents. We see only the io_getevents@plt call, which means that either the executable fakes the system call (?), or that we are missing something. This led me to investigate the libaio library itself. This can be done using ‘nm -D’ (nm is a tool which lists the symbols in an executable or library):
# nm -D /lib64/libaio.so.1
0000000000000000 A LIBAIO_0.1
0000000000000000 A LIBAIO_0.4
0000003f38200710 T io_cancel
0000003f38200670 T io_cancel
0000003f38200690 T io_destroy
0000003f382006a0 T io_getevents
0000003f38200620 T io_getevents
0000003f38200570 T io_queue_init
0000003f38200590 T io_queue_release
0000003f382005b0 T io_queue_run
0000003f382005a0 T io_queue_wait
0000003f382006e0 T io_queue_wait
0000003f38200680 T io_setup
0000003f38200660 T io_submit
Let’s see which io_getevents call or calls are used by the oracle executable:
(gdb) del
Delete all breakpoints? (y or n) y
(gdb) rbreak ^io_get.*
Breakpoint 46 at 0x3f382006a0
io_getevents;
...
Breakpoint 53 at 0xa09030
io_getevents@plt;
(gdb) info break
Num Type       Disp Enb Address            What
46  breakpoint keep y   0x0000003f382006a0 io_getevents
47  breakpoint keep y   0x0000000000a09030 io_getevents@plt
48  breakpoint keep y   0x0000003f382006a0 io_getevents
49  breakpoint keep y   0x0000000000a09030 io_getevents@plt
50  breakpoint keep y   0x0000003f382006a0 io_getevents
51  breakpoint keep y   0x0000003f382006a0 io_getevents
52  breakpoint keep y   0x0000003f382006a0 io_getevents
53  breakpoint keep y   0x0000000000a09030 io_getevents@plt
Let’s make gdb break and continue at the second io_getevents address too!
(gdb) break *0x0000003f38200620
Breakpoint 54 at 0x3f38200620
(gdb) commands
Type commands for breakpoint(s) 54, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
So, we now have breakpoints set in gdb which are able to show us the asynchronous operating system calls Oracle uses in a scan: io_submit and io_getevents. Oracle uses more calls for managing asynchronous IO, but for clarity I focus on io_submit and io_getevents.
At this point I can add two breakpoints for the wait instrumentation (kslwtbctx (enter wait) and kslwtectx (end wait)) using:
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 55 at 0x8f9a652
kslwtbctx;
Breakpoint 56 at 0x8fa1334
kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 55-56, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
As introduced in the previous post, the last function call seen is the entering of a wait using ‘kslwtbctx’, in this case for the event ‘SQL*Net message from client’, which is the foreground process waiting for user input to process:
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000008fa1334 in kslwtectx ()
#0 0x0000000008f9a652 in kslwtbctx ()
I slowed down both my ASM disks to 1 IOPS, and ran the scan in sqlplus again. This is a snippet of the gdb output:
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0 0x0000000008fa1334 in kslwtectx ()
The pattern can take multiple forms. For example, when the scan starts, after the initialisation of the asynchronous IO dependencies (like an IO context on the OS side and the IO slots in Oracle), the first calls are two io_submit() calls, in order to get the minimal number of asynchronous IOs (2) in flight.
Apparently, the four io_getevents calls after io_submit are non-blocking. This is my assumption, simply because I see these calls scrolling over my screen until the wait event is registered, and the scrolling stops at another io_getevents call. But can we be sure?
In order to look deeper into the io_getevents() calls, specifically into the arguments of the io_getevents() function, the debuginfo package of libaio needs to be installed (the debuginfo packages are available at http://oss.oracle.com/ol6/debuginfo). After that, the gdb session needs to be restarted in order to pick up the debug information.
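A sketch of the installation; mind that the exact debuginfo version must match the installed libaio package (the version number below is an assumption), and that rpm can install straight from a URL:
# as root:
rpm -ivh http://oss.oracle.com/ol6/debuginfo/libaio-debuginfo-0.3.107-10.el6.x86_64.rpm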
This is how the gdb output looks with the libaio debuginfo installed:
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 io_submit (ctx=0x7ff6ceb2c000, nr=1, iocbs=0x7fff2a4e09e0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e9048, timeout=0x7fff2a4ea050) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ec128, timeout=0x7fff2a4ed130) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e8e48, timeout=0x7fff2a4e9e50) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ebf28, timeout=0x7fff2a4ecf30) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=1, nr=128, events=0x7fff2a4e8e38, timeout=0x7fff2a4e9e40) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008fa1334 in kslwtectx ()
First of all, all the IO calls involved are done in the same “aio_context”: ctx=0x7ff6ceb2c000. This means that any asynchronous IO that is done during this query reports its readiness here, which in turn means IOs can be reaped out of order(!).
Next is min_nr. The non-instrumented io_getevents calls have min_nr set to 2, the io_getevents call which is instrumented has min_nr set to 1.
nr is the maximum number of IOs that can be reaped by this call in this aio_context. As far as I know, the number of slots cannot be higher than 32, which means that, if I am right, there will never be more than 32 requests outstanding.
The last io_getevents() argument is timeout, which is really interesting for understanding and verifying the behaviour I described. This value is a pointer to a struct (a timespec: tv_sec seconds plus tv_nsec nanoseconds) which holds the timeout specification. In order to actually know the timeout value, we need to print the contents of the struct.
This is where gdb, once the debuginfo is available, can help. Let’s modify the breakpoints the following way:
(gdb) del
Delete all breakpoints? (y or n) y
(gdb) break *0x0000003f38200620
Breakpoint 38 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 38, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 39 at 0x8f9a652
kslwtbctx;
Breakpoint 40 at 0x8fa1334
kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 39-40, one per line.
End with a line saying just "end".
>silent
>f
>end
(gdb) rbreak ^io_.*
Breakpoint 41 at 0x3f38200570: file io_queue_init.c, line 28.
int io_queue_init(int, io_context_t *);
...
Breakpoint 74 at 0x7f549fd44c0c
io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 41-74, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.
#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 io_submit (ctx=0x7f54a1956000, nr=1, iocbs=0x7fff78f059b0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0 0x0000000000a09030 in io_getevents@plt ()
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0dfd8, timeout=0x7fff78f0efe0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$21 = {tv_sec = 0, tv_nsec = 0}
#0 0x0000000000a09030 in io_getevents@plt ()
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f110b8, timeout=0x7fff78f120c0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 0, tv_nsec = 0}
#0 0x0000000000a09030 in io_getevents@plt ()
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0ddd8, timeout=0x7fff78f0ede0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 0, tv_nsec = 0}
#0 0x0000000000a09030 in io_getevents@plt ()
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f10eb8, timeout=0x7fff78f11ec0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=1, nr=128, events=0x7fff78f0ddc8, timeout=0x7fff78f0edd0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 600, tv_nsec = 0}
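This confirms the assumption: the io_getevents() calls issued outside of the wait interface pass a zeroed timespec ({tv_sec = 0, tv_nsec = 0}), so they only poll the completion queue and return immediately, while the io_getevents() call issued after kslwtbctx() passes {tv_sec = 600, tv_nsec = 0}: a blocking call which waits up to 600 seconds for at least one (min_nr=1) IO to complete.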
The reason for this post is to show the essence of how direct path reads work. There is much more to be said on this subject, especially about the adaptive or “auto tune” mechanism, which scales up the number of asynchronous IOs in flight.
Recently I was asked to look at a virtual (Linux) system which needed to be moved to a new datacenter. If you want to determine whether you are running on VMware, you can use either lspci or dmidecode. A little searching on the internet revealed it’s reasonably easy to determine the version of VMware ESX using the BIOS information:
case $( dmidecode | grep -A4 "BIOS Information" | grep Address | awk '{ print $2 }' ) in
  "0xE8480" ) echo "ESX 2.5" ;;
  "0xE7C70" ) echo "ESX 3.0" ;;
  "0xE7910" ) echo "ESX 3.5" ;;
  "0xE7910" ) echo "ESX 4" ;;   # same address as ESX 3.5; this branch can never be reached
  "0xEA550" ) echo "ESX 4U1" ;;
  "0xEA2E0" ) echo "ESX 4.1" ;;
  "0xE72C0" ) echo "ESX 5" ;;
  "0xEA0C0" ) echo "ESX 5.1" ;;
  * ) echo "Unknown version:"
      dmidecode | grep -A4 "BIOS Information" ;;
esac
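The lspci route is even simpler, because VMware presents its own virtual devices to the guest; a sketch:
lspci | grep -i vmware
Any output (the VMware SVGA adapter, for example) means you are running on VMware, although it does not tell you the ESX version.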
Sources:
http://virtwo.blogspot.com/2010/10/which-esx-version-am-i-running-on.html
http://dag.wieers.com/blog/detecting-vmware-esx-from-the-guest-os