...
Always check the stats, logs, and settings of the above components while troubleshooting.
CPU - CPU-related issues often stem from CPU power settings, so check the system CPU power policy and ensure that it is set to performance mode. If the system is overloaded with multiple processes, check the CPU load average. If the load average is consistently about double the number of available CPUs, try adding more CPUs.
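The CPU checks described above can be sketched as a small script. This is a minimal illustration, not an official tool: the function names `is_overloaded` and `report_cpu` are hypothetical, the factor of 2 reflects one reading of the "double" rule of thumb, and the cpufreq sysfs path is only present on systems that expose frequency scaling.

```shell
#!/bin/sh
# Sketch: compare the 1-minute load average with the CPU count and show the
# cpufreq governor. Function names and the default factor are illustrative.

# is_overloaded LOAD NCPU [FACTOR]
# Succeeds (exit 0) when LOAD >= FACTOR * NCPU. FACTOR defaults to 2.
is_overloaded() {
    awk -v load="$1" -v ncpu="$2" -v factor="${3:-2}" \
        'BEGIN { exit !(load >= factor * ncpu) }'
}

report_cpu() {
    ncpu=$(getconf _NPROCESSORS_ONLN)
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r load1 rest < /proc/loadavg
    echo "CPUs online: $ncpu, 1-minute load average: $load1"
    if is_overloaded "$load1" "$ncpu"; then
        echo "Load is at least double the CPU count; the system looks overloaded."
    fi
    gov=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    if [ -r "$gov" ]; then
        echo "cpu0 governor: $(cat "$gov")"
    else
        echo "cpufreq governor not exposed (common on virtual machines)."
    fi
}

# Run the report only where /proc/loadavg exists (Linux).
if [ -r /proc/loadavg ]; then
    report_cpu
fi
```

A governor other than `performance` in the output is a hint to review the power policy before looking further.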
Memory - A system has a limited amount of memory, which is divided among various system components. System and user processes consume the largest share. Frequent allocation and deallocation fragments memory, which can make allocations slow or cause them to fail, so the affected process runs very slowly. Always keep enough free memory for the process; otherwise, add more RAM.
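As a rough way to look at both points above (available memory and fragmentation), the following sketch reads `/proc/meminfo` and `/proc/buddyinfo`. The helper names are hypothetical, and the choice of order 8 as the threshold for "large" free blocks is an assumption made for illustration; with 4 KiB pages an order-8 block is 1 MiB.

```shell
#!/bin/sh
# Sketch: report available memory and count large free blocks.
# Helper names and the order-8 threshold are illustrative assumptions.

# mem_available_pct [FILE]
# Prints MemAvailable as a percentage of MemTotal (one decimal place).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * avail / total; else exit 1 }' \
        "${1:-/proc/meminfo}"
}

# free_blocks_from_order MIN_ORDER [FILE]
# Sums, over all zones, the free block counts of order >= MIN_ORDER.
# In /proc/buddyinfo the order-0 count is field 5 of each line.
free_blocks_from_order() {
    awk -v min="$1" '{ for (i = 5 + min; i <= NF; i++) sum += $i }
                     END { print sum + 0 }' "${2:-/proc/buddyinfo}"
}

if [ -r /proc/meminfo ] && [ -r /proc/buddyinfo ]; then
    echo "Available memory: $(mem_available_pct)% of total"
    echo "Free blocks of order >= 8: $(free_blocks_from_order 8)"
fi
```

A low count of high-order free blocks alongside plenty of available memory suggests fragmentation rather than a plain shortage.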
Storage - Storage media range from slow to extremely fast, and applications access them frequently to store and read data. Application performance suffers if the storage media is faulty, overloaded, or not configured properly. Confirm this by checking the kernel logs and the disk configuration, or by running a benchmark test such as Fio.
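A read-only Fio run of the kind mentioned above might look like the sketch below. The block size, queue depth, and runtime are illustrative values rather than recommended settings, and the `run_read_test` function plus the `FIO` and `FIO_TARGET` environment variables are hypothetical conveniences. The wrapper refuses any non-read pattern and passes `--readonly` so that Fio itself rejects writes.

```shell
#!/bin/sh
# Sketch: read-only fio benchmark against one disk or file.
# run_read_test, FIO, and FIO_TARGET are illustrative names, not fio features.

# run_read_test TARGET [PATTERN]
# PATTERN is "read" (sequential, default) or "randread".
run_read_test() {
    target=$1
    pattern=${2:-read}
    case $pattern in
        read|randread) ;;
        *) echo "refusing non-read pattern: $pattern" >&2; return 2 ;;
    esac
    # FIO can be set to "echo" for a dry run that only prints the arguments.
    "${FIO:-fio}" --name=readtest --filename="$target" --readonly \
        --rw="$pattern" --bs=1M --direct=1 --ioengine=libaio --iodepth=16 \
        --runtime=30 --time_based
}

# Run only when a target is given explicitly, e.g. FIO_TARGET=/dev/sdX.
if [ -n "${FIO_TARGET:-}" ]; then
    run_read_test "$FIO_TARGET"
fi
```

Running `FIO=echo run_read_test /dev/sdX randread` prints the arguments without touching the disk, which is a cheap way to review the command before a real run.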
Networking - Communication is an important component of a system. A bad configuration or link can trigger a false alarm; for example, bond0 can switch to a secondary slave if no activity is detected on the primary slave. A poor network connection causes frequent packet retransmissions and slows transmission over the network. An incorrect MTU size on a NIC also causes fragmentation issues. In such scenarios, check the path MTU: the MTU size of the sender NIC, the switch/router port, and the destination NIC must be the same. To debug network issues, check the kernel logs and the NIC and socket stats.
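One way to verify the path MTU described above is to send pings with the Don't Fragment flag set and a payload sized to the expected MTU. The sketch below assumes IPv4 and the Linux iputils `ping` (which provides `-M do`); the function names and the `PEER` and `MTU` environment variables are hypothetical.

```shell
#!/bin/sh
# Sketch: list local MTUs and probe the path MTU to a peer (IPv4, Linux).
# Function names and the PEER/MTU variables are illustrative.

# icmp_payload_for_mtu MTU
# IPv4 overhead is 20 bytes (IP header) + 8 bytes (ICMP header) = 28 bytes.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# list_mtus: reads `ip -o link show` output on stdin, prints "iface: mtu".
list_mtus() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") print $2, $(i + 1) }'
}

# check_path_mtu HOST MTU
# -M do sets Don't Fragment, so a hop with a smaller MTU makes the ping fail.
check_path_mtu() {
    size=$(icmp_payload_for_mtu "$2")
    if ping -M do -c 3 -s "$size" "$1" >/dev/null 2>&1; then
        echo "path to $1 carries MTU $2"
    else
        echo "path to $1 does NOT carry MTU $2 (or the host is unreachable)"
        return 1
    fi
}

# Probe only when a peer is named, e.g. PEER=10.0.0.5 MTU=9000.
if [ -n "${PEER:-}" ]; then
    ip -o link show | list_mtus
    check_path_mtu "$PEER" "${MTU:-1500}"
fi
```

A failure at 9000 that succeeds at 1500 usually points to a switch or router port that is not configured for jumbo frames.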
| System Component | Problem | How to Troubleshoot |
|---|---|---|
| CPU | | |
| CPU | System response is quick, but applications are running slow | |
| Memory | The system or a process is running slow, or a process fails to run | Under high memory pressure, the system starts behaving erratically and processes run slowly. When memory allocation is poor, the system tries to defragment memory, but this is slow. To identify such issues, run the following commands to check the memory availability. |
| Memory | The process is killed due to Out of Memory (OOM) | For example, `cat /proc/<process id>/oom_score` |
| Storage | The application is running slow due to slow reads and writes | Read and write performance is impacted by a bad disk, overloaded disks, multiple input-outputs (IOs) issued on a disk, or a full queue. |
| Storage | Disk benchmarking with and without the SWARM software stack | Customers often feel that SWARM reads and writes are slow on expensive hardware. To isolate the issue, or to show that SWARM is working as expected, DataCore Swarm runs the Fio benchmark test with and without SWARM to check whether the disks themselves are slow. Likewise, before installing a SWARM cluster, customers may want to know the compatibility and performance of their existing or new hardware (storage); DataCore Swarm runs the Fio benchmark test for this as well. Note: Do not run write tests on a production setup or a production (debug build) setup, because they may corrupt the data. Run the read test instead. SWARM 15.0 provides the option to run the read test through the SWARM GUI. To run the read performance tests on individual disks, refer to the steps below: On a debug non-production setup, run the commands below: |
| Storage | Benchmarking file system operations | To benchmark file system operations such as read, write, flush, mkdir, rmdir, and so on, use the `dbench` benchmarking tool. |
| Network | The file upload is slow or internode object replication is slow | Multiple factors can cause slow reads and writes. Check the following: network latency, network card and socket buffers, and protocol-based stats. |
| Network | Network monitoring | |
| Network | Debugging network throughput / latency issues on a multicast, TCP, or UDP configuration | Example server: `iperf -s -u -B <IP address> -i 1 -p 7070` Example client: `iperf -c <IP address> -u -T 1 -t 30 -i 1 -p 7070 -b 10m -l 1300` Note: Check the SWARM services and their port numbers when selecting an unused (non-SWARM) port. |
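For the OOM row in the table above, reading `oom_score` one process at a time can be tedious. The following sketch lists the processes with the highest scores, which are the most likely candidates for the OOM killer. The function name `top_oom_scores` and its arguments are hypothetical; the optional root directory exists only so that the function can be exercised against a test tree.

```shell
#!/bin/sh
# Sketch: list the processes with the highest oom_score.
# top_oom_scores and its parameters are illustrative names.

# top_oom_scores [PROC_ROOT] [COUNT]
# Prints "score pid command" lines, highest score first.
top_oom_scores() {
    root="${1:-/proc}"
    count="${2:-5}"
    for f in "$root"/[0-9]*/oom_score; do
        [ -r "$f" ] || continue
        pid=${f%/oom_score}
        pid=${pid##*/}
        # A process may exit between the glob and the read; skip it if so.
        score=$(cat "$f" 2>/dev/null) || continue
        name=$(cat "$root/$pid/comm" 2>/dev/null || echo '?')
        printf '%s %s %s\n' "$score" "$pid" "$name"
    done | sort -rn -k1,1 | head -n "$count"
}

# Run against the live system only where /proc is available (Linux).
if [ -d /proc/self ]; then
    top_oom_scores /proc 5
fi
```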