Linux Performance Analysis in 60 seconds
Assessing server status within 60 seconds using the CLI in a standard Linux environment
Reference: netflixtechblog.com
Quick overview
$ uptime
$ dmesg | tail
$ vmstat 1
$ mpstat -P ALL 1
$ pidstat 1
$ iostat -xz 1
$ free -m
$ sar -n DEV 1
$ sar -n TCP,ETCP 1
$ top
First 60 Seconds: Summary
1. uptime
uptime is the easiest way to check the load average, which indicates how many tasks (processes) want to run
On Linux systems, this value includes not only processes waiting for a CPU but also processes blocked on uninterruptible I/O such as disk I/O
This gives a high-level idea of resource demand, but it cannot be interpreted precisely without other tools
The 3 numbers above (1.88 2.16 2.07) represent load averages over 1 minute, 5 minutes, and 15 minutes, respectively
This allows you to see changes over time
ex)
If you log into an instance after hearing about a failure and the 1-minute value is much lower than the 15-minute value,
the problem has probably already passed and you logged in too late.
In the example from the referenced article, the 1-minute value is about 30 and the 15-minute value is about 19, indicating a recent increase
A high number here carries many implications.
It is likely a CPU demand issue, but to confirm this, you can use commands like vmstat or mpstat, described below.
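The comparison between the 1-minute and 15-minute values can be sketched in shell. This is a minimal illustration, not part of the referenced article: load_trend is a hypothetical helper name, and the sketch assumes a Linux system where /proc/loadavg is readable.

```shell
# load_trend is a hypothetical helper (not a standard tool).
# It compares the 1-minute and 15-minute load averages.
load_trend() {
    # $1 = 1-minute average, $2 = 15-minute average
    awk -v one="$1" -v fifteen="$2" 'BEGIN {
        if (one > fifteen)      print "rising"
        else if (one < fifteen) print "falling"
        else                    print "steady"
    }'
}

# On Linux, /proc/loadavg holds the same three averages that uptime prints.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    echo "1m=$one 5m=$five 15m=$fifteen trend=$(load_trend "$one" "$fifteen")"
fi
```

A "rising" result only says that demand has grown recently; it does not say which resource is responsible.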
2. dmesg | tail
dmesg is a command for checking system messages
Since all kernel messages are output starting from boot, tail is used to display only the last 10 lines
Through these messages, you can find errors that may cause performance issues.
In the example above, you can see that the oom-killer (out of memory) was invoked and TCP requests were dropped.
3. vmstat 1
vmstat, short for virtual memory stat, is a tool available in most environments
vmstat with argument 1 displays information every second
The first line shows average values since boot
Items to check
r
The number of processes running on a CPU or waiting for a turn to run
A good value for checking whether CPU resources are saturated, since it does not include I/O
If the r value is greater than the number of CPUs, the system is considered saturated.
free
Shows free memory in KB
If the free memory value has too many digits to read comfortably, you can use free -m for easier reading
si, so
Values for swap-in and swap-out
If these are not 0, the system currently has insufficient memory
us, sy, id, wa, st
A breakdown of CPU time, averaged across all CPUs
These represent user time, system time used by the kernel, idle, wait I/O, and stolen time respectively
Stolen time is the time a virtual CPU had to wait for a real CPU while the hypervisor was servicing other virtual CPUs
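The r check described above can be sketched as a small shell test. This is an illustrative sketch under assumptions: cpu_saturated is a hypothetical helper name, and the usage part assumes vmstat (procps) and nproc (coreutils) are installed.

```shell
# cpu_saturated is a hypothetical helper (not a standard tool).
# It succeeds when the r value exceeds the number of CPUs.
cpu_saturated() {
    # $1 = r value from vmstat, $2 = number of CPUs
    [ "$1" -gt "$2" ]
}

# Usage sketch: take r from the second vmstat sample, because the first
# sample reports averages since boot.
if command -v vmstat >/dev/null 2>&1 && command -v nproc >/dev/null 2>&1; then
    r=$(vmstat 1 2 | awk 'END { print $1 }')
    if cpu_saturated "$r" "$(nproc)"; then
        echo "r=$r exceeds $(nproc) CPUs: likely saturated"
    else
        echo "r=$r is within $(nproc) CPUs"
    fi
fi
```

A single sample can be noisy, so it is worth watching r over several intervals before drawing a conclusion.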
4. mpstat -P ALL 1
This command measures CPU time per individual CPU
This lets you spot an imbalance across CPUs;
a single busy CPU while the others are idle can be a sign that the application is running as a single thread
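One rough way to look for such an imbalance is to compare the busiest CPU with the mean across all CPUs. The sketch below is an assumption-laden illustration: cpu_spread is a hypothetical helper name, and the usage part assumes that sysstat's mpstat prints "Average:" rows whose last column is %idle, which may differ between versions.

```shell
# cpu_spread is a hypothetical helper (not a standard tool).
# It reads "cpu busy_percent" pairs on stdin and prints the busiest CPU,
# its busy percentage, and the mean busy percentage.
cpu_spread() {
    awk '
        { sum += $2; n++ }
        n == 1 || $2 > max { max = $2; cpu = $1 }
        END { if (n) printf "busiest=%s busy=%.1f mean=%.1f\n", cpu, max, sum / n }
    '
}

# Usage sketch: derive busy% as 100 - %idle from the per-CPU "Average:" rows.
# Assumption: the last column of those rows is %idle.
if command -v mpstat >/dev/null 2>&1; then
    LC_ALL=C mpstat -P ALL 1 1 |
        awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ { print $2, 100 - $NF }' |
        cpu_spread
fi
```

A busiest value near 100 next to a low mean matches the single-threaded pattern described above.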
5. pidstat 1
pidstat is similar to the per-process view of the top command
However, instead of clearing and redrawing the full screen, it prints a rolling summary, which makes it great for recording changes over time
Looking at the example above, the CPU usage of two Java processes is enormous
The %CPU item represents the total usage across all CPUs
Therefore, the Java processes using 1591% indicate that they are using close to 16 CPUs.
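The conversion from a %CPU total to a CPU count is a division by 100, since 100% corresponds to one fully busy CPU. A minimal sketch, where cpus_in_use is a hypothetical helper name:

```shell
# cpus_in_use is a hypothetical helper (not a standard tool).
# It converts a pidstat %CPU total into the equivalent number of
# fully busy CPUs (100% = one CPU).
cpus_in_use() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", pct / 100 }'
}

cpus_in_use 1591   # prints 15.91, i.e. close to 16 CPUs
```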
6. iostat -xz 1
A great tool for understanding how block devices (HDD, SSD, ...) are performing
Items to check
r/s, w/s, rkB/s, wkB/s
Shows read requests per second, write requests per second, read kB/s, and write kB/s
An important metric for characterizing the workload, for example which type of request is most frequent
Performance issues can sometimes be caused by excessive requests.
await
The average I/O processing time expressed in milliseconds
This includes both the time a request spends queued and the time it is being serviced, so it is the time the application actually waits
If it is longer than the normal request processing time of the device, it indicates that there is a problem with the block device itself, or that the device is saturated
7. free -m
Items to check
buffers: Buffer cache for block device I/O, usage amount
cached: Amount of page cache used by the file system
These values should not be close to 0
Near-zero values can lead to higher disk I/O (can be confirmed with iostat) and worse performance
In the example above, the values are 59MB and 541MB respectively, which are acceptable
8. sar -n DEV 1
This tool allows you to measure network throughput (Rx, Tx KB/s)
In the example above, the receive volume of eth0 is approximately 22 Mbytes/s (21999.10 rxkB/s)
This is 176 Mbits/s, which is still far below the 1 Gbit/s limit
The %ifutil value is the network device utilization, which can also be measured with nicstat
However, as with nicstat, it is difficult to get accurate values, and it does not work well in the example above either
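Because %ifutil may not be reliable, the throughput arithmetic above can be redone by hand. The sketch below follows the text's own calculation and assumes 1 kB = 1000 bytes; kbytes_to_mbits and link_util_pct are hypothetical helper names, and the link speed must be supplied by the user.

```shell
# Hypothetical helpers (not standard tools).
# Assumption: 1 kB = 1000 bytes, matching the arithmetic in the text.

# Convert kB/s (as shown in rxkB/s or txkB/s) to Mbit/s.
kbytes_to_mbits() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb * 8 / 1000 }'
}

# Express a kB/s rate as a percentage of a link speed given in Mbit/s.
link_util_pct() {
    awk -v kb="$1" -v link="$2" 'BEGIN { printf "%.1f\n", (kb * 8 / 1000) / link * 100 }'
}

kbytes_to_mbits 21999.10      # prints 176.0
link_util_pct 21999.10 1000   # prints 17.6 (percent of a 1 Gbit/s link)
```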
9. sar -n TCP,ETCP 1
Shows a summary of TCP traffic.
active/s: Shows the number of TCP connections per second initiated locally (e.g., connections via connect()).
passive/s: Shows the number of TCP connections per second requested from remote (e.g., connections via accept()).
retrans/s: Shows the number of TCP retransmissions per second.
Viewing the active and passive counts is convenient for roughly measuring server load
Based on the description above, you might think active is outbound and passive is inbound connections, but that is not always the case.
ex) Connections like localhost to localhost
retransmits indicate that there are network or server issues
This can mean an unreliable network environment (such as the public internet) or a server that is overloaded beyond its capacity and dropping packets
In the example above, you can see one TCP connection coming in per second.
10. top
The top command makes it easy to check the various metrics reviewed above.
It has the advantage of being easy to check overall system values
However, since the screen changes continuously, it is difficult to find patterns over time
Evidence of intermittent issues can also be lost if you do not pause the output quickly enough, before the screen is cleared
ex) Ctrl+S pauses updates, and Ctrl+Q resumes them