2007-11-17

2K ill | NOT 2K ill (on the essence of KeepAlive)

I have one article, intended for publishing as a MetaLink note quite long ago (2001-2002?). Inspite of of its quite venerable age, I use it quite often as the recommendation for the customers, who experience the problems with the dead sessions.

Usually this is due to unsophisticated (to say the least) quality of the middle-tier in use (stateless WEB connection, repeated logins for a couple of statements and forgetting to close the connection to DB - i.e. not using proper connection pooling), the other cases may raise due to bad quality of the network. Anyway, the DBA should find some method to deal with that situation.

One of the many ways is to utilize the KeepAlive feature of TCP/IP protocol.

So, the article is here. Your comments/corrections are very welcome.

---------

Adressee's

This short article may be usefull for of those advanced DBA's, whose client's PCs are connected to the Oracle Server via slow/noisy channels, i.e. where temporary breaks in the network (using TCP/IP) connections occur quite often.

Preface

In most cases the Oracle foreground/server/shadow process detects that a communication channel to the client is broken in a timely fashion, so the rest of connection failure processing is successfully performed by PMON background process. However, this is not always the case.

Typical situation:

a) for some reason (for example, a user have pressed RESET or POWER-OFF/ON button on his/her PC, or there was a short break in the WAN network, or there are any time-outs specified on the firewall/router/switch) the client part of the connection terminated abnormally. Which means, that server processs "sirvived" while client program was terminated or detected a network failure.

b) the user had re-established connection to the server (i.e. created the new session)

c) the "old" session was in 'INACTIVE' state during the abnormal disconnect (otherwise - status 'ACTIVE' - the broken connection would be detected when the server process will try to return the result or status of the current command to the client)

d) this "old" session never dies (or will die too late)

e) DBA wants to "cleanup" the instance, because PMON does nothing on the "old" session and this session holds the resources like locks, which bother the other sessions.

f) an attept to kill the old session via Oracle doesn't help: status='KILLED', process='PSEUDO'

alter system kill session '<sid>,<serial#>';
alter system disconnect session '<sid>,
<serial#>' immediate;


g) an attept to kill the session after f) also causes some difficulties
select
p.spid from v$session s, v$process p
where s.paddr=p.addr
and s.sid=<sid>;

returns nothing. Of course, there are methods to find the runaway process as it is still present in V$PROCESS.

One more comment: though in unix-like systems the kill -KILL usually solves the problem (when the right process was killed :-)), this is not the case with Oracle 32-bit on the 32-bit Windows and versions <9i.

The Oracle Net (Net8/SQL*Net) product provides for a mechanism called Dead Connection Detection (DCD). This is configured via Client Profile (sqlnet.ora) using SQLNET.EXPIRE_TIME parameter on a server side. Such a behaviour is protocol independent. Unfortunately, the implementation of this DCD feature is not always satisfactory. Namely, while the setting is in effect, some long running PL/SQL procedures might be erroneously recognized as "DEAD". Or the long running export utility sessions also could be "killed". The most typical mistake here is when the DCD is configured on the client (and not server) side. However, when the dblink is used, then the server becomes a client for the other server. Here is the problem.

TCP/IP "Keep Alive" feature

The TCP/IP protocol is by the fact an industry standard.

Oracle networking products just rely on the features of ANY industry standard protocol.

Oracle TCP/IP adapter also could use a common feature of TCP/IP, which is a subject of this note, namely "Keep Alive".

IMPORTANT NOTE: If the KeepAlive message is not responded upon in a required time interval, than the shadow (server) process dies gracefully, allowing the PMON to do the clear "process failure cleanup".

The mechanism of the KeepAlive

is very straightforward - the two communicating parties (client and server) periodically (by default the time interval equals to 2 hours) are sending each other some "empty" or "probe" TCP/IP packets, which just confirm, that the partner of the connection is able to respond. If the partner doesn't respond, than the sending party resends the KeepAlive packets at more frequent intervals for some time. Finally, if there is still no response, the connection is marked as "faulted" and corresponding TRAP is sent to the connection "owner" on the OS level. Oracle server processes do not provide any handlers for the TRAP, so the process dies. The death of the server process allows the PMON to cleanup.

The mechanism of the DCD

is a little bit different. The PMON (IIRC) periodically enqueues some IPC (Inter-Process Communication) message in the SHARED POOL and "posts" this message (via OS semaphore, post-wait driver or similar). The it waits for the reply through ICP. It is assumed, that each server process will check its own incoming message queue, sends an Oracle Net "probe" packet and in case of cuccess will respond to PMON in timely fashion. The question then is - how often this incoming queue is checked? Is it checked while the call execution is going on (i.e. asynchronously with respect to the main processing)? In the tests on Oralce10g there is an evidence of such asynchronous processing for DCD messages. But not in the earlier versions.

TCP/IP "Keep Alive" feature - how to (employ)

To switch on the KeepAlive feature of the TCP/IP protocol, one should:

- specify in the "service specification" in his/her client's tnsnames.ora/Oracle Names config the following clause (an example):

xxx.domain.com = (DESCRIPTION=(ENABLE=BROKEN)(ADDRESS= ....)... )

Some details can be found in the MetaLink Note:39357.1

- decrease (if required) the interval between KeepAlive messages on the OS level. The later is platform-depenent. Find below some examples for a number of the platforms:

a) Windows: change the registry settings (see Microsoft support site for case 120642

HKEY_LOCAL_MACHINE/SYSTEM/CurrentControlSet/Services/Tcpip/Parameters
Edit - Add value : KeepAliveTime (REG_DWORD)
Decimal -> 180000 (3 minutes, for example)

b) Solaris/HP-UX

/usr/sbin/ndd –set /dev/tcp tcp_keepalive_interval 180000
(sets 3 minutes interval)

c) Linux

/sbin/sysctl -w net.ipv4.tcp_keepalive_time=1800 ?
(sets 3 minutes interval)

d) Tru64

/usr/sbin/netconfig tcp_keepidle 360
(sets 3 minutes interval)

NOTE: time is specified in 1/2 second units

e) AIX

/usr/sbin/no -o tcp_keepidle=360
(sets 3 minutes interval)

NOTE: time is specified in 1/2 second units

2 comments:

Unknown said...

Гммм... интересная статья!

Anonymous said...

1 й блин комом ! Ждем других статей, более родных для нас, ораклистов