**System Component:** CPU
**Problem:** High CPU load; the system or applications are slow.
**How to Troubleshoot:** Use the `htop` command to check the CPU load average and the process/task thread count. Identify which processes are consuming the most CPU and check the CPU% they consume. The load averages for the last 1, 5, and 15 minutes are shown. If the load average is high, increase the CPU count on the system or reduce the application task/thread count.

To see where CPU time is being spent, run the `mpstat` command. It shows per-CPU statistics: whether each CPU is spending most of its time in kernel or user space, and which CPUs are busy or idle. For example:

```shell
mpstat -A
```
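The load-average check above can be scripted. This is a minimal sketch in POSIX shell, assuming a Linux `/proc` filesystem and the `nproc` utility:

```shell
#!/bin/sh
# Load averages for the last 1, 5 and 15 minutes live in /proc/loadavg.
read -r load1 load5 load15 _ < /proc/loadavg
cpus=$(nproc)
echo "load averages: $load1 $load5 $load15 (CPUs: $cpus)"
# A sustained 1-minute load above the CPU count suggests CPU saturation:
# either add CPUs or reduce the application's task/thread count.
if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l > c) }'; then
    echo "load is high relative to the CPU count"
else
    echo "load is within CPU capacity"
fi
```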
**System Component:** CPU
**Problem:** System response is quick, but applications are running slow.
**How to Troubleshoot:** Use the `htop` command to check the load average. Check the current CPU governor and ensure that it is set to `performance`. For example:

```shell
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

To learn more about CPU frequency scaling, refer to https://wiki.archlinux.org/title/CPU_frequency_scaling. Also check the kernel logs in case a driver is missing or is not compatible with the current CPU; the kernel logs warning/info messages to `dmesg`.
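As a sketch, the governor check above can be looped over all CPUs. Note that the cpufreq sysfs interface is often absent on virtual machines, which this script handles:

```shell
#!/bin/sh
# Print the scaling governor of every CPU; flag any that is not "performance".
found=0
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -r "$f" ] || continue
    found=1
    gov=$(cat "$f")
    printf '%s: %s\n' "$f" "$gov"
    [ "$gov" = "performance" ] || echo "  -> not set to performance"
done
[ "$found" -eq 1 ] || echo "cpufreq interface not available (common on VMs)"
```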
**System Component:** Memory
**Problem:** The system or a process is running slow, or a process fails to run.
**How to Troubleshoot:** Under high memory pressure the system starts behaving erratically and processes run slow. With poorly allocated (fragmented) memory, the system tries to defragment it, which is slow. To identify such issues, check the memory availability:

- Run the `htop` command to check memory statistics in visual form, or run `free -thw` to check the current memory usage.
- Check the system memory usage pattern of the past few days with the help of the `sar` command. For example:

```shell
sar -2 -rh
```

Here, `-2` denotes the last 2 days. This gives an idea of the past memory usage pattern; compare it with the system logs to identify which operations were executing during that period.
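A quick scripted check of how much memory is still available, reading the standard Linux `/proc/meminfo` fields (a sketch in POSIX shell):

```shell
#!/bin/sh
# MemAvailable is the kernel's estimate of memory usable without swapping.
avail_pct=$(awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d", a*100/t}' /proc/meminfo)
echo "memory available: ${avail_pct}%"
# Below roughly 10% the system is under memory pressure; inspect large
# processes with htop or free -thw as described above.
if [ "$avail_pct" -lt 10 ]; then
    echo "high memory pressure"
fi
```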
**System Component:** Memory
**Problem:** The process is killed due to Out of Memory (OOM).
**How to Troubleshoot:** Check the memory statistics using the commands above. Under continuous memory pressure the system tries to kill some process, though the process selection is complex. To check which process is likely to be killed next: each process has a score, and the system chooses the process to kill based on it. The higher the score, the higher the chance of being killed by the OOM killer. For example:

```shell
cat /proc/<process id>/oom_score
```
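For illustration, every process exposes its score under `/proc`. This sketch reads the current shell's own `oom_score`, along with `oom_score_adj`, which root can lower to protect a critical process:

```shell
#!/bin/sh
# The higher the oom_score, the more likely the OOM killer picks this process.
score=$(cat /proc/$$/oom_score)
adj=$(cat /proc/$$/oom_score_adj)
echo "pid $$: oom_score=$score oom_score_adj=$adj"
```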
**System Component:** Storage
**Problem:** The application is running slow due to slow reads and writes.
**How to Troubleshoot:** Read and write performance is impacted by a bad disk, overloaded disks, many I/Os issued to one disk, or a full queue. Run the `iostat -tkx 2` command to check the disk statistics: it shows how many I/Os are queued and the I/O latency for reads and writes. The output shows statistics for all disks; look for unusual latency or queue sizes. It is also possible that a disk has gone bad. To check the disk health, run the following command and check for errors:

```shell
smartctl -a /dev/<device name>
```
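When `iostat` is unavailable, the same raw counters can be read from `/proc/diskstats`. In this sketch, field 12 is the number of I/Os currently in flight; a persistently large value points at an overloaded disk:

```shell
#!/bin/sh
# Columns of /proc/diskstats: $3 device name, $4 reads completed,
# $8 writes completed, $12 I/Os currently in progress.
awk '$3 !~ /^(loop|ram)/ {
    printf "%-12s reads=%-12s writes=%-12s inflight=%s\n", $3, $4, $8, $12
}' /proc/diskstats
```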
**System Component:** Storage
**Problem:** Disk benchmarking with and without the SWARM software stack.
**How to Troubleshoot:** Customers often feel that SWARM reads and writes are slow on expensive hardware. To isolate the issue, or to show that SWARM is working as expected, DataCore Swarm runs the fio benchmark test with and without SWARM to determine whether the disks themselves are slow. Similarly, before installing a SWARM cluster, customers may want to know the compatibility and performance of their existing or new hardware (storage); for this, DataCore Swarm also runs the fio benchmark test.

Note: Do not run write tests on a production setup or a production (debug build) setup; it may corrupt the data. Instead, run the read test. SWARM 15.0 provides the provision to run the read test through the SWARM GUI. To run the read performance test on individual disks:

1. Go to `<hostname>:90/nodestatus`. For example, http://abc.caringo.com:90/nodestatus
2. Click the node info.
3. Select the fio read test from the drop-down.
4. Select the disk name. The test starts and takes a few minutes to complete.

On a debug (non-production) setup, log in to the setup using ssh and run the below command in the terminal:

```shell
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=readtest --filename=/dev/sda --iodepth=1 --size=128M --time_based --runtime=30 --readwrite=randread
```
**System Component:** Storage
**Problem:** Benchmarking file system operations.
**How to Troubleshoot:** To benchmark filesystem operations such as read, write, flush, mkdir, rmdir, and so on, use the `dbench` benchmarking tool. See also https://linux.die.net/man/1/dbench.
**System Component:** Network
**Problem:** File upload is slow or internode object replication is slow.
**How to Troubleshoot:** There are multiple reasons for slow reads and writes. If the storage and memory subsystems are not a bottleneck, then it is most likely the network causing the issue. A major problem in networks is retransmission, which occurs due to a bad link, a jittery connection, or overflowing sender-side or receiver-side network buffers.

**Network latency:** Check the network latency using the ping command:

```shell
ping <destination ip address>
```

If the ping latency exceeds 5-6 milliseconds, it indicates a bad link; check the physical link/switch/router. It is also possible that the MTU set in SWARM and the MTU of the link (switch) are not the same. To verify, run the ping command below, which sends ping packets of size 9000 bytes:

```shell
ping -M do -s 8972 <destination IP address>
```

If the above ping command fails, check the MTU size of the host NIC, switch port, and router port. The size must be 9000 bytes.
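For reference, the `-s 8972` value follows from the packet headers: the ICMP payload is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header. A one-line sanity check:

```shell
#!/bin/sh
# Payload size for an MTU probe = MTU - IPv4 header (20) - ICMP header (8).
mtu=9000
payload=$((mtu - 20 - 8))
echo "for MTU $mtu, use: ping -M do -s $payload <destination IP address>"
```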
**Network card and socket buffers:** An overflowing buffer causes packet retransmission. Check the NIC RX and TX ring sizes and set them to the maximum if not already set:

```shell
ethtool -g <interface name>
```

Check the current settings and set them to the maximum if needed. For example, the maximum hardware ring size is 4096 entries on a virtual machine; it could differ on other hardware, so check the size before setting it:

```shell
ethtool -G <interface name> rx 4096 tx 4096
```

Also check the protocol statistics; if retransmission is above 5%, increase the socket buffer size.
**Protocol-based statistics:** Check the per-NIC statistics and the protocol-based statistics for retransmissions.
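The exact commands intended here are not shown in the source; commonly used ones are `netstat -s` (protocol-level counters) and `ip -s link` (per-NIC counters). As a sketch under that assumption, the TCP retransmission percentage can also be computed directly from `/proc/net/snmp`:

```shell
#!/bin/sh
# RetransSegs / OutSegs gives the TCP retransmission rate; above roughly 5%,
# consider enlarging the socket buffers.
awk '/^Tcp:/ {
    if (!hdr) { for (i = 1; i <= NF; i++) col[$i] = i; hdr = 1 }
    else { out = $(col["OutSegs"]); ret = $(col["RetransSegs"]) }
}
END { if (out > 0) printf "retransmitted %d of %d segments (%.2f%%)\n", ret, out, ret * 100 / out }' /proc/net/snmp
```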
Change the network buffer sizes based on the statistics, if required. The current buffer sizes are available at the following locations:

- /proc/sys/net/ipv4/udp_mem
- /proc/sys/net/core/rmem_max
- /proc/sys/net/ipv4/tcp_rmem
- /proc/sys/net/ipv4/tcp_wmem

The commands to change the network buffer sizes are:

```shell
echo 'net.core.wmem_max=<max size>' >> /etc/sysctl.conf
echo 'net.core.rmem_max=<max size>' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem=<minimum> <initial> <max>' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem=<minimum> <initial> <max>' >> /etc/sysctl.conf
echo 'net.ipv4.udp_mem=<minimum> <initial> <max>' >> /etc/sysctl.conf
```

Run `sysctl -p` afterwards to apply the new values.
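As a worked example with illustrative values (the 16 MB maxima are an assumption, not a recommendation), this sketch builds the sysctl fragment in a temporary file so it can be inspected safely; on a real node you would append to /etc/sysctl.conf and apply with `sysctl -p`:

```shell
#!/bin/sh
# Build the sysctl fragment in a temp file for inspection (values are
# examples only; udp_mem is measured in pages, not bytes).
conf=$(mktemp)
{
    echo 'net.core.rmem_max=16777216'
    echo 'net.core.wmem_max=16777216'
    echo 'net.ipv4.tcp_rmem=4096 87380 16777216'
    echo 'net.ipv4.tcp_wmem=4096 65536 16777216'
    echo 'net.ipv4.udp_mem=65536 131072 262144'
} > "$conf"
cat "$conf"
```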
**System Component:** Network
**Problem:** Network monitoring.
**How to Troubleshoot:** To monitor the network load on each NIC, use the `nload` tool. To monitor network packets, use `tcpdump`. For example:

- Display available/known interfaces: `tcpdump -D`
- Capture all packets on eno1 and display the output in ASCII format: `tcpdump -A -i eno1`
- Display captured packets in HEX and ASCII: `tcpdump -xx -i eno1`
- Capture packets without resolving names, showing numeric IP addresses: `tcpdump -n -i eno1`
- Capture only TCP packets: `tcpdump -i eno1 tcp`
- Capture packets on a specific port: `tcpdump -i eno1 port 90`
- Capture packets from a specific sender, based on IP address: `tcpdump -i eth0 src 192.168.2.20`
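Alongside nload, per-NIC byte counters can be read directly from `/proc/net/dev` (a sketch; sampling this twice and diffing the counters gives the throughput per interface):

```shell
#!/bin/sh
# /proc/net/dev: after two header lines, each row is
# "iface: rx_bytes rx_packets ... | tx_bytes tx_packets ...".
awk -F':' 'NR > 2 {
    iface = $1; gsub(/ /, "", iface)
    split($2, f, " ")   # f[1]=rx_bytes ... f[9]=tx_bytes
    printf "%-10s rx_bytes=%-14s tx_bytes=%s\n", iface, f[1], f[9]
}' /proc/net/dev
```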
**System Component:** Network
**Problem:** Debugging network throughput/latency issues on multicast, TCP, or UDP configurations.
**How to Troubleshoot:** High network latency and low throughput are very common issues; to identify them, use the `iperf` tool. Based on the protocol configuration, run iperf with different parameters to debug multicast, TCP, or UDP.

Debugging multicast: run an iperf server on each cluster node (make sure you don't select SWARM ports, i.e. ports already in use), then run an iperf client on one of the cluster nodes. Repeat this for the rest of the nodes in the cluster.

Example iperf server:

```shell
iperf -s
```

Example iperf client:

```shell
iperf -c <iperf server ip> -b 9000M -t 120
```

Note: refer to [Setting up the Swarm Network](https://caringo.atlassian.net/wiki/spaces/public/pages/2443808571/Setting+up+the+Swarm+Network) to learn about SWARM services and their port numbers; this helps when selecting the non-SWARM (unused) ports.