Linux Performance Analysis in 60 seconds

Assessing server status within 60 seconds using the CLI in a standard Linux environment

Reference: netflixtechblog.com

Quick overview

$ uptime
$ dmesg | tail
$ vmstat 1
$ mpstat -P ALL 1
$ pidstat 1
$ iostat -xz 1
$ free -m
$ sar -n DEV 1
$ sar -n TCP,ETCP 1
$ top

First 60 Seconds: Summary

1. uptime

  • uptime is a quick way to check the load averages, which indicate how many tasks (processes) want to run

    • On Linux systems, this value includes not only processes waiting to run on a CPU but also processes blocked in uninterruptible I/O (usually disk I/O)

      • This gives a high-level idea of resource demand, but it cannot be interpreted precisely without other tools

  • The three load average numbers represent averages over 1 minute, 5 minutes, and 15 minutes, respectively

    • This allows you to see changes over time

      • ex)

        • If you log into an instance after hearing about a problem and the 1-minute value is much lower than the 15-minute value,

          • you may have logged in too late and missed the issue.

        • In the example above, the 1-minute value is about 30 and the 15-minute value is about 19, indicating a recent increase

          • A load this high can mean many things.

            • It is probably CPU demand, but confirm this with commands such as vmstat or mpstat, described below.
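
The reading of the load averages described above can be sketched as a small helper. This is only an illustration: the function name `describe_load` and the CPU count of 16 are assumptions made for the example, not values from the original output.

```python
def describe_load(load1: float, load15: float, cpus: int) -> str:
    """Summarize the load averages the way the bullets above read them."""
    if load1 > load15:
        trend = "rising"
    elif load1 < load15:
        trend = "falling"
    else:
        trend = "flat"
    # A load above the CPU count hints that demand exceeds capacity, but on
    # Linux it can also come from tasks blocked on I/O, so it is only a hint.
    pressure = "above" if load1 > cpus else "at or below"
    return f"load is {trend}; 1-minute load is {pressure} the CPU count ({cpus})"


# 30 and 19 are the approximate values quoted in the text; 16 CPUs is hypothetical.
print(describe_load(30.0, 19.0, cpus=16))
```

On a live system, `os.getloadavg()` and `os.cpu_count()` from the Python standard library can supply the inputs.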

2. dmesg | tail

  • dmesg prints the messages held in the kernel's message buffer

    • Since all kernel messages are output starting from boot, tail is used to display only the last 10 lines

  • Through these messages, you can find errors that may cause performance issues.

    • In the example above, you can see that the oom-killer (out-of-memory killer) was invoked and that TCP dropped a request.
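
A rough way to triage kernel messages for the two kinds of errors mentioned above is a pattern scan. The patterns, labels, and sample lines below are made up for illustration; real kernel message wording varies by kernel version.

```python
import re

# Hypothetical patterns; extend this table with the messages you care about.
PATTERNS = {
    "oom": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "tcp_drop": re.compile(r"SYN flooding|Dropping request", re.IGNORECASE),
}


def flag_kernel_messages(lines):
    """Return (label, line) pairs for lines matching a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line))
                break  # one label per line is enough for triage
    return hits


# Made-up lines in the general shape of dmesg output.
sample = [
    "[100.0] eth0: link up",
    "[200.0] Out of memory: Kill process 1234 (perl)",
    "[300.0] TCP: Possible SYN flooding on port 7001. Dropping request.",
]
for label, line in flag_kernel_messages(sample):
    print(label, "->", line)
```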

3. vmstat 1

  • vmstat, short for virtual memory stat, is a tool available in most environments

    • vmstat with argument 1 displays information every second

    • The first line shows average values since boot

      • Items to check

        • r

          • The number of processes running on a CPU or waiting for a turn

          • A better signal than the load averages for CPU saturation, because it does not include I/O

          • If the r value is greater than the number of CPUs, the CPUs are considered saturated.

        • free

          • Shows free memory in KB

          • If the free memory has too many digits, you can use free -m for easier reading

        • si, so

          • Values for swap-in and swap-out

          • If these are not 0, the system currently has insufficient memory

        • us, sy, id, wa, st

          • These are breakdowns of CPU time, averaged across all CPUs

          • They represent user time, system time used by the kernel, idle, wait I/O, and stolen time respectively

            • Stolen time is the time a virtual CPU was ready to run but had to wait while the hypervisor serviced other virtual CPUs on the real CPU
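
The r and si/so checks above can be expressed as a short sketch that reads one vmstat row by its header names. The helper names, the sample row, and the CPU count of 32 are assumptions for illustration only.

```python
def parse_vmstat(header: str, row: str) -> dict:
    """Map vmstat column names to integer values using the header line."""
    return dict(zip(header.split(), map(int, row.split())))


def vmstat_flags(sample: dict, cpus: int) -> list:
    """Apply the two checks from the list above to one parsed row."""
    flags = []
    if sample["r"] > cpus:
        flags.append("cpu_saturated")  # more runnable tasks than CPUs
    if sample["si"] > 0 or sample["so"] > 0:
        flags.append("swapping")  # pages moving to or from swap
    return flags


# Illustrative row; 32 CPUs is a hypothetical machine size.
header = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
row = "34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0"
print(vmstat_flags(parse_vmstat(header, row), cpus=32))
```

Keying on the header rather than on column positions keeps the sketch tolerant of vmstat versions that print additional columns.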

4. mpstat -P ALL 1

  • This command measures CPU time per individual CPU

    • This breakdown lets you check for an imbalance across CPUs,

      • where a single busy CPU among idle ones can be evidence of a single-threaded application
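
One way to spot the imbalance described above is to look for a few busy CPUs while the rest sit nearly idle. The thresholds and the function name `hot_cpus` are arbitrary choices for this sketch; busy percentage is taken here as 100 minus mpstat's %idle.

```python
def hot_cpus(busy_by_cpu: dict, busy_threshold: float = 90.0,
             idle_threshold: float = 10.0) -> list:
    """Return the busy CPUs when every other CPU is close to idle."""
    hot = [cpu for cpu, busy in busy_by_cpu.items() if busy >= busy_threshold]
    idle = [cpu for cpu, busy in busy_by_cpu.items() if busy <= idle_threshold]
    # Imbalance: every CPU is either hot or idle, and the idle ones dominate.
    if hot and len(hot) + len(idle) == len(busy_by_cpu) and len(idle) > len(hot):
        return hot
    return []


# Made-up per-CPU busy percentages.
print(hot_cpus({0: 99.0, 1: 2.0, 2: 1.5, 3: 3.0}))
print(hot_cpus({0: 95.0, 1: 96.0, 2: 97.0, 3: 98.0}))
```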

5. pidstat 1

  • pidstat is similar to top's per-process summary

    • However, instead of redrawing the full screen, it prints a rolling summary, which makes it well suited to watching patterns over time and recording what you saw

      • Looking at the example above, the two java processes account for most of the CPU usage

        • The %CPU item is the total across all CPUs

          • Therefore, a java process at 1591% is using close to 16 CPUs.
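
The conversion from a summed %CPU figure to a CPU count is a single division, shown here as a minimal sketch (the function name is made up for the example).

```python
def cpus_in_use(pct_cpu: float) -> float:
    """Convert a %CPU total summed across all CPUs into a CPU count."""
    return pct_cpu / 100.0


# 1591% is the figure quoted in the text.
print(cpus_in_use(1591.0))  # 15.91
```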

6. iostat -xz 1

  • A great tool for understanding how block devices (HDD, SSD, ...) are performing

    • Items to check

      • r/s, w/s, rkB/s, wkB/s

        • Shows read requests, write requests, read kB, and write kB per second

        • Useful for characterizing the workload, for example which type of request dominates

          • A performance issue may simply be caused by excessive load on the device.

      • await

        • The average time per I/O, in milliseconds

        • This is the time the application experiences, since it includes both time spent queued and time being serviced

        • If it is longer than the expected service time for the device, it can indicate a problem with the block device itself or a saturated device
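
An average wait of this kind can be derived from two snapshots of cumulative per-device counters, such as the completed I/O counts and the milliseconds spent on reads and writes that Linux exposes in /proc/diskstats. The sketch below assumes those four counters as inputs; the `DiskSample` type and the numbers are illustrative, and this is not claimed to be iostat's exact implementation.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    reads: int     # reads completed (cumulative)
    writes: int    # writes completed (cumulative)
    read_ms: int   # milliseconds spent reading (cumulative)
    write_ms: int  # milliseconds spent writing (cumulative)


def average_wait_ms(prev: DiskSample, curr: DiskSample) -> float:
    """Average time per completed I/O between two snapshots, in ms."""
    ios = (curr.reads - prev.reads) + (curr.writes - prev.writes)
    if ios == 0:
        return 0.0  # no I/O completed in the interval
    total_ms = (curr.read_ms - prev.read_ms) + (curr.write_ms - prev.write_ms)
    return total_ms / ios


# Made-up counters: 200 I/Os completed, taking 1200 ms in total.
before = DiskSample(reads=1000, writes=500, read_ms=4000, write_ms=3000)
after = DiskSample(reads=1100, writes=600, read_ms=4500, write_ms=3700)
print(average_wait_ms(before, after))  # 6.0
```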

7. free -m

  • Items to check

    • buffers: Size of the buffer cache, used for block device I/O

    • cached: Size of the page cache, used by file systems

      • These values should not be near 0

        • Near-zero values can lead to higher disk I/O (confirm with iostat) and worse performance

          • In the example above, the values are 59 MB and 541 MB respectively, which is fine
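
The same two values can be read from /proc/meminfo-style text, where they appear as `Buffers` and `Cached` in kB. The sample text below is fabricated so that it rounds to the 59 MB and 541 MB quoted above; the function name is hypothetical.

```python
def cache_sizes_mb(meminfo_text: str) -> dict:
    """Extract Buffers and Cached from /proc/meminfo-style text, in MB."""
    wanted = {"Buffers", "Cached"}
    sizes = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            sizes[name] = int(rest.split()[0]) // 1024  # values are in kB
    return sizes


# Fabricated sample in the shape of /proc/meminfo.
sample = """MemTotal:       251902000 kB
Buffers:            60416 kB
Cached:            553984 kB
SwapCached:             0 kB"""
print(cache_sizes_mb(sample))  # {'Buffers': 59, 'Cached': 541}
```

Matching on the exact field name keeps `SwapCached` from being mistaken for `Cached`.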

8. sar -n DEV 1

  • This tool lets you check network interface throughput (rxkB/s and txkB/s)

    • In the example above, the receive volume of eth0 is approximately 22 Mbytes/s (21999.10 rxkB/s)

      • This is 176 Mbits/s, which is still far below the 1 Gbit/s limit

    • The %ifutil value is the network device utilization, which can also be measured with nicstat

      • However, as with nicstat, this value is hard to get right, and it does not appear to be working in the example above either
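
The throughput arithmetic above can be checked with a short sketch. It assumes 1 kB = 1000 bytes, which matches the rounding used in the text; the function names are made up for the example.

```python
def kb_per_s_to_mbit(kb_per_s: float) -> float:
    """Convert kB/s to Mbit/s, assuming 1 kB = 1000 bytes."""
    return kb_per_s * 8 / 1000


def link_utilization_pct(kb_per_s: float, link_mbit: float) -> float:
    """Share of the link capacity in use, in percent."""
    return 100 * kb_per_s_to_mbit(kb_per_s) / link_mbit


# 21999.10 rxkB/s is the figure quoted in the text; the link is 1 Gbit/s.
print(round(kb_per_s_to_mbit(21999.10)), "Mbit/s")
print(round(link_utilization_pct(21999.10, 1000), 1), "% of 1 Gbit/s")
```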

9. sar -n TCP,ETCP 1

  • Shows a summary of TCP traffic.

    • active/s: Shows the number of TCP connections per second initiated locally (e.g., connections via connect()).

    • passive/s: Shows the number of TCP connections per second requested from remote (e.g., connections via accept()).

    • retrans/s: Shows the number of TCP retransmissions per second.

  • Viewing the active and passive counts is convenient for roughly measuring server load

    • It can help to think of active as outbound and passive as inbound connections, but that is not strictly true.

      • ex) A localhost-to-localhost connection

  • Retransmits are a sign of a network or server issue

    • The cause may be an unreliable network (such as the public internet) or a server that is overloaded and dropping packets

      • In the example above, there is just one new TCP connection per second.
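
Per-second rates like these can be computed from two readings of the kernel's cumulative TCP counters. The sketch below assumes the counters come from the `Tcp:` lines of /proc/net/snmp (`ActiveOpens`, `PassiveOpens`, `RetransSegs`); the sample text is shortened and made up, and the helper names are hypothetical.

```python
def parse_tcp_counters(snmp_text: str) -> dict:
    """Parse the two 'Tcp:' lines (names, then values) into a dict."""
    tcp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    names, values = tcp_lines[0], tcp_lines[1]
    return dict(zip(names, map(int, values)))


def tcp_rates(prev: dict, curr: dict, seconds: float) -> dict:
    """Per-second rates between two counter snapshots."""
    keys = {"active/s": "ActiveOpens", "passive/s": "PassiveOpens",
            "retrans/s": "RetransSegs"}
    return {label: (curr[k] - prev[k]) / seconds for label, k in keys.items()}


# Shortened, made-up snapshots taken one second apart.
first = "Tcp: ActiveOpens PassiveOpens RetransSegs\nTcp: 100 200 5"
second = "Tcp: ActiveOpens PassiveOpens RetransSegs\nTcp: 100 201 5"
print(tcp_rates(parse_tcp_counters(first), parse_tcp_counters(second), 1.0))
```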

10. top

  • The top command includes many of the metrics checked above.

    • It has the advantage of making overall system values easy to check

      • However, since the screen is redrawn continuously, it is harder to see patterns over time than with rolling output such as vmstat or pidstat

        • Evidence of an intermittent issue can also be lost if you do not pause the output quickly enough

          • ex) Ctrl+S pauses updates and Ctrl+Q resumes them; without pausing, the screen clears on the next refresh
