Wednesday, February 27, 2013

Troubleshoot performance part 8 - generic short approach


Recommended approach for performance troubleshooting

I recommend running a benchmark for an hour (or longer) and starting the following processes in the background. They generate reports once a minute for the next hour and five minutes. Too short an interval may affect the results. Use the guidelines from the previous parts to pinpoint areas for improvement or to prove that everything runs perfectly :-)

top -b -d 60.0 -n 65 >top.out 2>&1 &
iostat -kt 60 65 >iostat.out 2>&1 &
mpstat 60 65 >mpstat.out 2>&1 &
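
If you want to script the whole run, a minimal sketch along these lines starts the collectors, runs the benchmark and waits for the reports to finish. The run_benchmark.sh name is just a placeholder for your own test driver:

#!/bin/bash
# Start the collectors in the background; 65 one-minute samples each.
top -b -d 60.0 -n 65 >top.out 2>&1 &
iostat -kt 60 65 >iostat.out 2>&1 &
mpstat 60 65 >mpstat.out 2>&1 &

# Placeholder for your own (roughly hour-long) benchmark driver.
./run_benchmark.sh

# Wait for the background collectors to finish writing their reports.
wait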

Troubleshoot performance part 7. Additional tools

Additional tools for performance troubleshooting

An alternative tool for performance analysis - SAR

Sar is the "system activity report" program found on *nix systems. In Linux, you can usually find it in the sysstat package, which includes programs and scripts to capture and summarize performance data, then produce detailed reports. This suite of programs can be useful in tracking down performance bottlenecks and providing insight into how the system is used throughout the day.
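
As a quick, hedged example of typical sar usage (the exact flags and data file locations depend on your sysstat version and distribution):

sar -u 60 5               # CPU utilization, 5 samples at 60-second intervals
sar -d 60 5               # block device activity
sar -r 60 5               # memory and swap usage
sar -f /var/log/sa/sa14   # report from an earlier day's data file collected by sysstat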


Material from: http://www.linux.com/archive/feature/52570

Showing io usage per process

Showing IO usage per process is not possible out of the box in the Linux world (at least to my knowledge). Open source tools exist. They require kernel 2.6.20 or newer (you almost certainly have a newer version than that by now), with the TASK_DELAY_ACCT and TASK_IO_ACCOUNTING kernel options enabled.
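
One quick way to check this, assuming your distribution ships the kernel configuration under /boot (most do), is to grep it:

grep -E 'TASK_DELAY_ACCT|TASK_IO_ACCOUNTING' /boot/config-$(uname -r)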

From a quick Google search, at least the following tools are available (I have no experience using them):

Iotop

Home page: http://guichaz.free.fr/iotop/
see a note on using iotop here: http://taint.org/2009/04/15/095426a.html
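
For reference, a typical invocation might look like the following (flags as documented for iotop: -o shows only processes actually doing IO, -b runs in batch mode, -n limits the number of iterations):

iotop -o -b -n 5 > iotop.out 2>&1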

collectl
Collectl can monitor processes in much the same way as ps or top do.
To Probe Further
There is a large set of other options in collectl. See the documentation at:
http://collectl.sourceforge.net/Process.html
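
As a hedged sketch of the kind of invocations described there (option letters can differ between collectl versions):

collectl --top   # interactive, top-like display of the busiest processes
collectl -sZ     # include the process (Z) subsystem in the collected data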



Troubleshoot performance - part 6. Disk issues with iostat

Iostat for pinpointing bottleneck devices

If the problem is on the IO side, iostat can be used to check for “hot” devices. These figures should be compared to the theoretical capacity of the devices.

There are two limiting factors for disk speed: the channel (bus or cable) and the disks themselves. The channel (IDE, SATA, USB etc.) is seldom the limiting factor in personal computers or workstations (perhaps if you put an external SSD on the oldest possible USB port there could theoretically be some impact), not even if there are multiple disks on the same channel. Also in external disk arrays (EMC and the like) the connections are usually optical and the system is designed so that there won’t be a bottleneck there. You can look at the bit rates of different channels here: http://en.wikipedia.org/wiki/List_of_device_bit_rates

Usually the disk itself is the limiting factor. Manufacturers publish performance numbers, but in practice they are often not directly comparable. A common way to express performance is IOPS (Input/Output Operations Per Second). A disk may for example do 200 IOPS; if each IO is 32 kB, that works out to about 6.4 MB/s. But naturally it is not so simple: manufacturers do not normally say what type of load the performance numbers were measured with, and it makes a big difference whether we are talking about random access or sequential reads/writes. As a rule of thumb you may usually assume the manufacturer figures are from sequential reads/writes. In addition, caches (both the operating system and the disk controller have caches) affect the results. Luckily the net is full of instructions and programs that let you estimate the speed of the disk if you want to do more research on the topic and your specific environment.

A crude first estimate is to compare the iostat numbers with the speed published by the manufacturer. If you are over that figure, the disk is having a bad day. If you have a good safety margin and mpstat still reports a lot of io-waits, then there could be some program doing random access to the disk like there is no tomorrow.

[root@soaserver ~]# iostat 2 2
Linux 2.6.9-55.0.0.0.2.ELsmp (soaserver) 10/14/2012
avg-cpu: %user %nice %sys %iowait %idle
1.64 0.00 0.65 0.07 97.64
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 4.83 13.98 150.48 42224192 454375484
sda1 0.00 0.00 0.00 948 4
sda2 19.10 13.98 150.48 42222140 454375416
dm-0 18.95 13.63 149.67 41147754 451931448
dm-1 0.15 0.36 0.81 1073808 2444032

avg-cpu: %user %nice %sys %iowait %idle
1.00 0.00 1.25 0.00 97.75

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 4.02 0.00 144.72 0 288
sda1 0.00 0.00 0.00 0 0
sda2 18.09 0.00 144.72 0 288
dm-0 18.09 0.00 144.72 0 288
dm-1 0.00 0.00 0.00 0 0
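
For a closer look, the extended statistics mode of iostat is often more informative than the default output shown above; await and %util in particular make a hot device easy to spot (flags as provided by the sysstat iostat):

iostat -xk 60 5   # extended per-device statistics in kB/s, 5 samples at 60-second intervals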

Troubleshoot performance part 5 - /proc/loadavg for seeing if the CPU is the bottleneck



/proc/loadavg for seeing if the CPU is the bottleneck

Another way of looking at CPU usage is to take a look at /proc/loadavg. This file provides a view of the load average with regard to both CPU and IO over time, as well as additional data used by uptime and other commands. A sample /proc/loadavg file looks similar to the following:

[root@soaserver ~]# cd /proc
[root@soaserver proc]# cat loadavg
0.00 0.00 0.00 1/146 21886
[root@soaserver proc]#
The format is as follows:
avg1 avg2 avg3 running_threads/total_threads last_running_pid

Values of /proc/loadavg
  • avg1, avg2, avg3. These are the number of threads in the running state, averaged over the last minute, the last 5 minutes, and the last 15 minutes, respectively. 
  • running_threads. Number of threads currently in state "running". 
  • total_threads. Total number of threads currently existing in the system. 
  • last_running_pid. PID of the last process observed in state "running". 
If the load averages are larger than the number of CPU cores, then there are more runnable threads than available CPU cores and the system is loaded.
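
A minimal sketch of that check from the shell (assuming a Linux system where /proc/cpuinfo and /proc/loadavg are available):

cores=$(grep -c ^processor /proc/cpuinfo)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "1-minute load average is $load1 on $cores cores"

If load1 stays clearly above the core count for long periods, the CPUs are a likely bottleneck.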

Troubleshoot performance part 4 - top to pinpoint the bottleneck process


Top for pinpointing the bottleneck process

Top can be used to pinpoint the exact process eating CPU resources. Top sorts processes by the amount of CPU they use, so if some process is hogging all the CPU, it will be at the top of the list.

Top is different from most other commands: they produce their output and exit, whereas top displays its results on the screen and keeps refreshing them with new information until you stop it by pressing Ctrl-C.

There is also a command line option -b that allows you to run top in batch mode. If you run it in batch mode you can use -n to indicate how many iterations to run. Sometimes this is handy if you, for example, want to make a test script that runs stability tests: you can use batch mode to redirect top's output into a file. By inspecting the results you can quickly see if there are any memory leaks etc.

Below is an example of top in batch mode:

[root@soaserver ~]# top -b -n 1
top - 12:48:20 up 79 days, 16:09, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.1%sy, 0.0%ni, 98.8%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2621440k total, 2544616k used, 76824k free, 223376k buffers
Swap: 2104504k total, 131176k used, 1973328k free, 1974716k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 15 0 10348 684 572 S 0.0 0.0 1:17.69 init
2 root RT -5 0 0 0 S 0.0 0.0 0:04.96 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.30 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0.0 0.0 0:00.04 events/0
6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
9 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenbus
20 root RT -5 0 0 0 S 0.0 0.0 0:03.27 migration/1
21 root 34 19 0 0 0 S 0.0 0.0 0:00.34 ksoftirqd/1
22 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
23 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 events/1
26 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
27 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kblockd/1
28 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/0
29 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/1
33 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
35 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
102 root 15 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
105 root 10 -5 0 0 0 S 0.0 0.0 0:29.50 kswapd0
106 root 12 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
107 root 12 -5 0 0 0 S 0.0 0.0 0:00.00 aio/1
225 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenfb thread
243 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
270 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 ata/0
271 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 ata/1
272 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 ata_aux
282 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kstriped
298 root 10 -5 0 0 0 S 0.0 0.0 0:01.87 kjournald
331 root 10 -5 0 0 0 S 0.0 0.0 0:03.82 kauditd
364 root 11 -4 14040 2240 488 S 0.0 0.1 0:00.04 udevd
847 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0
848 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 kmpathd/1
849 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 kmpath_handlerd
916 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kjournald
918 root 10 -5 0 0 0 S 0.0 0.0 0:04.80 kjournald
920 root 10 -5 0 0 0 S 0.0 0.0 0:03.26 kjournald
1247 root 11 -4 27324 828 584 S 0.0 0.0 0:00.44 auditd
1249 root 7 -8 81800 772 616 S 0.0 0.0 0:00.19 audispd
1271 root 15 0 5908 608 488 S 0.0 0.0 0:00.13 syslogd
1274 root 15 0 3804 428 344 S 0.0 0.0 0:00.00 klogd
1313 root 18 0 10760 372 244 S 0.0 0.0 0:00.50 irqbalance
1334 rpc 15 0 8052 572 452 S 0.0 0.0 0:00.00 portmap
1365 root 12 -5 0 0 0 S 0.0 0.0 0:00.00 rpciod/0
1366 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 rpciod/1
1375 root 17 0 10160 796 656 S 0.0 0.0 0:00.00 rpc.statd
1409 root 19 0 55180 768 304 S 0.0 0.0 0:00.00 rpc.idmapd
1434 dbus 18 0 31500 1100 832 S 0.0 0.0 0:00.00 dbus-daemon
1500 haldaemo 15 0 30520 3524 1520 S 0.0 0.1 0:00.11 hald
1501 root 18 0 21692 1032 868 S 0.0 0.0 0:00.00 hald-runner
1536 root 20 0 119m 1520 1116 S 0.0 0.1 0:00.32 automount
1587 root 15 0 62608 1212 656 S 0.0 0.0 0:00.00 sshd
1626 root 17 0 21644 880 668 S 0.0 0.0 0:00.00 xinetd
1643 ntp 15 0 23388 5028 3904 S 0.0 0.2 0:00.07 ntpd
1657 root 15 0 74808 1220 644 S 0.0 0.0 0:00.52 crond
1691 xfs 18 0 20828 1636 704 S 0.0 0.1 0:00.00 xfs
1753 oracle 15 0 81248 12m 9416 S 0.0 0.5 0:01.94 tnslsnr
1897 root 17 0 3792 484 412 S 0.0 0.0 0:00.00 mingetty
1898 root 16 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1899 root 16 0 3792 484 412 S 0.0 0.0 0:00.00 mingetty
1900 root 16 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1901 root 16 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1906 root 20 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1913 root 18 0 3800 540 464 S 0.0 0.0 0:00.00 agetty
2048 root 15 0 0 0 0 S 0.0 0.0 0:00.30 pdflush
2059 root 10 -5 0 0 0 S 0.0 0.0 0:01.24 kjournald
2096 root 15 0 0 0 0 S 0.0 0.0 0:00.02 pdflush
2382 oracle 15 0 1257m 391m 373m S 0.0 15.3 0:03.40 oracle
8025 oracle 18 0 1235m 59m 57m S 0.0 2.3 0:00.16 oracle
8032 root 15 0 90112 3384 2608 S 0.0 0.1 0:00.03 sshd
8034 root 15 0 66060 1528 1144 S 0.0 0.1 0:00.00 bash
8195 oracle 15 0 1236m 27m 24m S 0.0 1.1 0:00.04 oracle
8197 oracle 15 0 1236m 32m 29m S 0.0 1.3 0:00.05 oracle
8199 oracle 15 0 1235m 14m 13m S 0.0 0.6 0:00.02 oracle
8200 root 15 0 12604 948 708 R 0.0 0.0 0:00.00 top
20456 oracle 15 0 1237m 18m 16m S 0.0 0.7 0:01.98 oracle
20458 oracle -2 0 1235m 15m 13m S 0.0 0.6 0:00.05 oracle
20462 oracle 15 0 1235m 15m 13m S 0.0 0.6 0:00.10 oracle
20464 oracle 18 0 1235m 15m 13m S 0.0 0.6 0:00.18 oracle
20466 oracle 15 0 1235m 123m 121m S 0.0 4.8 0:00.40 oracle
20468 oracle 15 0 1235m 15m 13m S 0.0 0.6 0:32.55 oracle
20470 oracle 18 0 1235m 19m 17m S 0.0 0.8 0:03.46 oracle
20472 oracle 15 0 1235m 34m 33m S 0.0 1.4 0:00.20 oracle
20474 oracle 15 0 1263m 194m 169m S 0.0 7.6 0:30.04 oracle
20476 oracle 15 0 1251m 37m 35m S 0.0 1.5 15:30.81 oracle
20478 oracle 16 0 1235m 26m 24m S 0.0 1.0 0:08.56 oracle
20480 oracle 15 0 1245m 417m 412m S 0.0 16.3 2:49.99 oracle
20482 oracle 15 0 1236m 124m 121m S 0.0 4.9 0:00.39 oracle
20484 oracle 15 0 1241m 397m 391m S 0.0 15.5 0:39.48 oracle
20486 oracle 15 0 1235m 64m 62m S 0.0 2.5 0:01.28 oracle
20488 oracle 18 0 1241m 15m 13m S 0.0 0.6 0:00.07 oracle
20490 oracle 18 0 1236m 14m 12m S 0.0 0.6 0:00.08 oracle
20528 oracle 15 0 1235m 17m 15m S 0.0 0.7 0:00.33 oracle
20540 oracle 15 0 1240m 451m 443m S 0.0 17.6 10:28.91 oracle
20552 oracle 15 0 1240m 288m 280m S 0.0 11.3 0:02.58 oracle
20608 oracle 15 0 1235m 16m 14m S 0.0 0.6 0:01.02 oracle

Let's go through the data reported by top.

First line:
top - 12:48:20 up 79 days, 16:09, 1 user, load average: 0.00, 0.00, 0.00
The first line tells the current time (12:48:20), that the system has been up 79 days, that there is only one user logged on and that the system is completely idle. The three numbers are the load averages for the last 1, 5 and 15 minutes. The uptime command gives the same report as the first line of top.

The second line tells the number of processes in the system:
Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie

The third line tells the CPU utilization. The system is nearly idle, with 98.8% idle time. In a multi-CPU system you can get a separate line for each CPU (in interactive top, pressing 1 toggles the per-CPU view).
Cpu(s): 0.7%us, 0.1%sy, 0.0%ni, 98.8%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
The following two lines report on memory usage:
Mem: 2621440k total, 2544616k used, 76824k free, 223376k buffers
Swap: 2104504k total, 131176k used, 1973328k free, 1974716k cached

There is about 2.5 GB of main memory, and the free field shows only about 77 MB free. That alone might look alarming, but Linux uses otherwise free memory as an IO cache, so the effective amount of free memory is roughly free + buffers + cached; in my example almost all of the memory is in practice available. For more information see: http://serverfault.com/questions/377617/how-to-interpret-output-from-linux-top-command.
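
The free command gives the same picture in one shot; on the procps versions of that era the "-/+ buffers/cache" row shows used and free memory with the IO cache discounted (newer versions show an "available" column instead):

free -m

Read the free value on the "-/+ buffers/cache" row (or the "available" column) as the memory that is really free for applications.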

The lines that follow in top output tell information about individual processes running on the system.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20587 root 15 0 10860 1060 772 R 0.3 0.0 0:00.03 top
1 root 15 0 10348 696 584 S 0.0 0.0 0:00.82 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.17 migration/0

Check details on the fields from here (and more on top command):

http://linux.die.net/man/1/top

The fields used are:
PID = process id
USER = user who started the process
PR = priority of the process
NI = nice value; higher values indicate lower priority. You can change the priority of processes with the nice command
VIRT = virtual memory used by this process
RES = resident (physical) memory used by this process
SHR = shared memory used by this process
S = the status of the task, which can be one of: 'D' = uninterruptible sleep, 'R' = running, 'S' = sleeping, 'T' = traced or stopped, 'Z' = zombie
%CPU = the share of CPU time the process has used since the last update, expressed as a percentage of one CPU (so the total over all processes can exceed 100% on a multi-core machine)
%MEM = percentage of physical memory used by this process
TIME+ = total CPU time used by this process
COMMAND = the command that was used to start this process


Some formatting and display options of top
If you run top in interactive mode, pressing the uppercase M key sorts the output by memory usage. (Note that using lowercase m will turn the memory summary lines on or off at the top of the display.) This is very useful when you want to find out who is consuming the memory.

The most useful command-line option is -d, which indicates the delay between screen refreshes. To refresh every second, use top -d 1.

The other useful option is -p. If you want to monitor only a few processes, not all, you can specify only those after the -p option. To monitor processes 13609, 13608 and 13554, issue:

top -p 13609 -p 13608 -p 13554

Tip for Oracle database

If the process that is causing either CPU or IO load is an Oracle database process, you can use the following handy query to find out what part of the database is the cause:

select s.sid, s.username, s.program
from v$session s, v$process p
where spid = <process id from top command>
and p.addr = s.paddr
/

This tip is from this good article (the actual tip is in the middle of the article):

http://www.oracle.com/technetwork/articles/linux/part2-085179.html
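
For example, taking one of the busy Oracle PIDs from the top output above (20476, purely as an illustration), the query can be run in sqlplus roughly like this:

sqlplus / as sysdba
SQL> select s.sid, s.username, s.program
  2  from v$session s, v$process p
  3  where spid = '20476'
  4  and p.addr = s.paddr
  5  /

The sid and program columns then tell which session and which part of the database is behind the load.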

Troubleshoot performance part 3 - Using mpstat for CPU analysis

Mpstat for CPU
If the problem is on the CPU side, mpstat (multiprocessor statistics) gives a cleaner report. In particular, mpstat can report each CPU separately (with the -P ALL option), so you can see whether a single CPU is overloaded while the others are free. The syntax for mpstat is:

mpstat <interval> <count>

Interval is the time in seconds between printing out a line of statistics. Count is the number of lines of output you want.
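
For example, to see every processor separately (the -P ALL flag is documented in the sysstat mpstat):

mpstat -P ALL 2 5   # all CPUs individually, 5 samples at 2-second intervals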

The report generated by the mpstat command has the following format:

  • CPU: Processor number, starts with 0. The keyword all indicates that statistics are calculated as averages among all processors.
  • %user: Percentage of CPU utilization that occurred while executing at the user level (application).
  • %nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
  • %sys: Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing interrupts or softirqs.
  • %iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
  • %irq: Percentage of time spent by the CPU or CPUs to service interrupts.
  • %soft: Percentage of time spent by the CPU or CPUs to service softirqs. A softirq (software interrupt) is one of up to 32 enumerated software interrupts which can run on multiple CPUs at once.
  • %steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
  • %idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
  • intr/s: Total number of interrupts received per second by the CPU or CPUs.
[root@soaserver ~]# mpstat 2 3
Linux 2.6.9-55.0.0.0.2.ELsmp (soaserver) 10/14/2012
04:03:03 PM CPU %user %nice %system %iowait %irq %soft %idle intr/s
04:03:05 PM all 0.25 0.00 0.25 0.00 0.25 0.00 99.25 1129.65
04:03:07 PM all 0.25 0.00 0.00 0.00 0.00 0.00 99.75 1114.50
04:03:09 PM all 3.75 0.00 1.25 0.25 0.00 0.00 94.75 1124.00
Average: all 1.42 0.00 0.50 0.08 0.08 0.00 97.92 1122.70

On SMP machines a processor that does not have any activity at all is a disabled (offline) processor.

Below are some general tips you can use when interpreting the output:

If %user is very high, the CPUs are being overburdened by your application.

If %sys is high, your server is burdened by system (kernel) calls.

If %iowait is constantly non-zero, you may have some disk I/O contention. It is recommended to check the “time spent waiting for IO (wa)” column of vmstat to see whether there is waiting on the disk storage subsystem. You should also consult the iostat output.

Links:

http://en.wikipedia.org/wiki/Mpstat
http://linuxcommand.org/man_pages/mpstat1.html

Troubleshoot performance part 2 - using vmstat to see if it is the CPU or IO

Is it CPU or IO with vmstat?

Vmstat provides a coarse overview of the health of the system. You need to be root or have system admin rights to use this tool. Usually you can see from the vmstat report whether the problem is on the CPU or on the IO side.

Vmstat takes two arguments: interval and count. Interval is the number of seconds over which each set of values is measured and averaged before being reported. Count tells how many reports to produce.
  • Procs – r: Number of processes waiting for run time (runnable)
  • Procs – b: Number of processes in uninterruptible sleep (typically blocked waiting for IO)
  • Memory – swpd: Amount of virtual memory (swap) used
  • Memory – free: Amount of idle (free) memory
  • Memory – buff: Memory used as buffers
  • Memory – cache: Memory used as cache
  • Swap – si: Memory swapped in from disk (per second)
  • Swap – so: Memory swapped out to disk (per second)
  • IO – bi: Blocks in, i.e. blocks received from a block device (per second)
  • IO – bo: Blocks out, i.e. blocks sent to a block device (per second)
  • System – in: Interrupts per second
  • System – cs: Context switches per second
  • CPU – us, sy, id, wa, st: CPU user time, system time, idle time, IO wait time and time stolen from a virtual machine
[root@soaserver /]# vmstat 2 3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 1 0 0 12682768 147204 1459768 0 0 113 7 172 27 1 0 99 1 0
 0 0 0 12682768 147204 1459768 0 0 0 16 1005 308 0 0 100 0 0
 0 0 0 12682768 147212 1459764 0 0 0 22 1005 301 0 0 100 0 0
[root@soaserver /]#

The first line is a summary since the last boot, and for our purposes it is best to ignore it, as it does not really tell much about the current situation. Some people like to use the first line as an indication of whether the system is getting better or worse.

The bi/bo fields are especially worth watching. They show how much the system is reading from and writing to disk. A high value in either or both indicates that the system is IO bound, and you need to figure out which process is behind it.

You can use the user (us), system (sy) and idle (id) columns to see whether there is a lot of user space processing, a lot of kernel level activity, or mostly idle time on the CPUs, i.e. whether the CPU is loaded or not. However, as these are averages, in a multi-CPU node one CPU can be hot while another is free, so bear in mind that this is a coarse average value. The first field (r, the number of processes ready to run) tells how many processes have been waiting for an available CPU, which also indicates CPU load or saturation. You may have situations where a number of processes wake up at the same time and queue for a free CPU, while for the remainder of the measuring interval (2 seconds in our example) the CPU is free; so it is possible for the CPU to be saturated and still show a lot of idle time, since vmstat reports averages over the measuring interval.
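
A small, hedged sketch of that check from the shell, flagging samples where the run queue (r) exceeds the number of cores (column positions assume the vmstat layout shown above):

cores=$(grep -c ^processor /proc/cpuinfo)
vmstat 2 30 | awk -v c=$cores 'NR>2 && $1+0 > c {print "possible CPU saturation:", $0}'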

If there is a lot of si (swap in) and so (swap out) activity, it indicates that the system is swapping, meaning that the processes are using more memory than there is physical memory and the system has to constantly write unused memory pages to disk and read them back into main memory. This will kill the performance of the system. The remedy is to add memory or reduce the memory consumption of the processes.
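
To confirm, a couple of standard commands show the swap situation (output formats vary a little by distribution):

swapon -s    # which swap devices are in use and how full they are
free -m      # total versus used swap in megabytes
vmstat 5 12  # watch the si/so columns for a minute to see whether swapping is ongoing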

Links:

Lots of good examples of vmstat and other related tools:

http://www.thegeekstuff.com/2011/07/iostat-vmstat-mpstat-examples/

For a good video on vmstat that is Solaris based, see this set of videos:

http://dtrace.org/blogs/brendan/2011/04/27/vmstat-videos/

Troubleshoot performance - part 1. Using common Linux commands to troubleshoot the performance of your SOA Suite installation

Part 1. Overview

This set of posts covers a set of common Linux commands that you can use to do initial troubleshooting when the SOA Suite environment is up and running, but very slow.

Identify bottleneck area

The basic process is:

  • First identify whether the problem is on the CPU side or on the IO side
  • If it is on the CPU side, identify the process that is eating the CPU, or whether there are simply too many processes in the system (the system is over capacity)
  • If it is on the IO side, identify whether the system is thrashing (all processes combined use more main memory than is available in the system) or whether some process is eating all the IO capacity, then identify the process eating the IO
  • For the database, find out if some SQL query is not using indexes correctly. There are a number of guidelines from Oracle for this (not covered here, please move along...)
  • For advanced troubleshooting, EM and the diagnostics or statistics packs can be used (not covered here)
Executing tests one change at a time
In the process of troubleshooting you may make configuration or other changes to the system. In order to understand the root cause of the problem (you will need to build this understanding, or the problems are bound to reappear again and again), it is important to change only one parameter at a time and then monitor the system again (or re-run your performance tests if you have a replica of the production system available). This way you can see how each parameter affects the performance.