Debugging CAstor node using Linux tools
Runtime Environments
DataCore Swarm runs in three different modes: production, debug, and probe. Debugging works differently on each of these builds.
The production build does not have SSH (Secure Shell) access, so debug issues through the cluster interface (GUI).
The debug build is interactive and has SSH enabled. Log in to the cluster nodes and run commands to debug issues, or go through the logs to understand them.
The probe build runs a minimal stack and does not have clustering or storage capabilities. It is used to check the system capabilities, configuration, network connectivity, and overall performance of the system.
Issue Types
Issues are divided into four categories: CPU, memory, storage, and networking. Always check the stats, logs, and settings of these components while troubleshooting.
CPU - CPU-related issues can occur due to CPU power settings, so check the system CPU power policy and ensure that it is set to performance mode. If the system is overloaded with multiple processes, check the CPU load average. If the load average is consistently about double the number of CPUs, consider adding more CPUs.
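As a rough illustration of the CPU checks described above, the sketch below divides the 1-minute load average by the CPU count and lists any cores whose frequency governor is not `performance`. It assumes a debug build with shell access and a kernel that exposes the standard cpufreq sysfs files. The function names `load_per_cpu` and `non_performance_cpus` are hypothetical helpers made up for this example, not Swarm tools.

```shell
# Print the 1-minute load average divided by the CPU count.
# $1: loadavg file (default /proc/loadavg), $2: CPU count (default: nproc)
load_per_cpu() {
    file=${1:-/proc/loadavg}
    ncpu=${2:-$(nproc)}
    awk -v n="$ncpu" '{ printf "%.2f\n", $1 / n }' "$file"
}

# List cpufreq governor files whose value is not "performance".
# $1: sysfs CPU directory (default /sys/devices/system/cpu)
non_performance_cpus() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # cpufreq may be absent (assumption: it is present)
        gov=$(cat "$f")
        [ "$gov" = "performance" ] || echo "$f: $gov"
    done
    return 0
}
```

Calling `load_per_cpu` with no arguments reads the live system. A ratio that stays well above 1.0 suggests more runnable work than available cores. `non_performance_cpus` prints nothing when every core already uses the performance governor.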
Memory - A system has limited memory, which is divided among various system components. System and user processes consume the largest share of it. Frequent allocation and deallocation fragments memory and can cause memory allocation to fail. This leads to slow or failed allocations, and the affected process runs very slowly. Always keep enough free memory for the process; otherwise, add more RAM.
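One way to look at memory availability and fragmentation on a debug build is to read `/proc/meminfo` and `/proc/buddyinfo`. The sketch below is only an illustration under that assumption; the helper names `mem_available_pct` and `free_high_order_blocks` are hypothetical, and the default order threshold of 4 is an arbitrary choice for the example.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# $1: meminfo file (default /proc/meminfo)
mem_available_pct() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Sum the free blocks of at least a given order across all zones.
# In buddyinfo, field 5 is order 0, field 6 is order 1, and so on.
# $1: minimum order (default 4, an arbitrary example threshold)
# $2: buddyinfo file (default /proc/buddyinfo)
free_high_order_blocks() {
    awk -v min="${1:-4}" '{ for (i = 5 + min; i <= NF; i++) sum += $i }
                          END { print sum + 0 }' "${2:-/proc/buddyinfo}"
}
```

A low count from `free_high_order_blocks` while `mem_available_pct` still reports plenty of available memory is a hint that memory is fragmented rather than exhausted.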
Storage - Storage media come in several classes, from slow to fast to extremely fast. Applications frequently access these media to store and read data. Application performance suffers if the storage media is faulty, overloaded, or not configured properly. Confirm this by checking the kernel logs and disk configuration, or by running a benchmark test such as Fio.
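For the Fio benchmark mentioned above, a read-only run is the safe choice on disks that hold data. The sketch below only prints a candidate command so it can be reviewed before anyone runs it. The helper name `fio_read_cmd`, the job name, the block size, the queue depth, and the use of the `libaio` engine are all assumptions for illustration, and `/dev/sdX` in the usage note is a placeholder rather than a real device.

```shell
# Print (do not run) a sequential, read-only fio command for one device.
# --readonly enables fio's safety checks that prevent writes and trims.
# $1: block device to read from, $2: runtime in seconds (default 30)
fio_read_cmd() {
    dev=$1
    secs=${2:-30}
    echo "fio --name=seqread --filename=$dev --readonly --rw=read --bs=1M" \
         "--direct=1 --ioengine=libaio --iodepth=16 --runtime=$secs --time_based"
}
```

For example, `fio_read_cmd /dev/sdX 60` prints a 60-second read test for the placeholder device. Review the printed command, then run it only on a debug, non-production setup.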
Networking - Communication is an important component of a system. A bad configuration or link can trigger a false alarm; for example, bond0 can switch to a secondary slave if no activity is detected on the primary slave. A poor network connection causes frequent packet retransmission and slows packet transmission over the network. A mismatched MTU size on a NIC also causes fragmentation issues. In such scenarios, check the path MTU size (the MTU size of the sender NIC, the switch/router port, and the destination NIC must be the same). To debug network issues, check the kernel logs, NIC stats, and socket stats.
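The path MTU and bonding checks described above can be sketched as follows. The example assumes IPv4, the iputils `ping` (whose `-M do` option prohibits fragmentation), and the Linux bonding driver's `/proc/net/bonding/bond0` status file. The helper names are hypothetical.

```shell
# Largest ICMP payload that fits in an IPv4 packet of the given MTU:
# MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print (do not run) a ping command that fails if the path cannot carry
# the full MTU without fragmentation.
# $1: destination host, $2: expected MTU
pmtu_ping_cmd() {
    echo "ping -c 3 -M do -s $(icmp_payload_for_mtu "$2") $1"
}

# Print the currently active slave of a bond.
# $1: bonding status file (default /proc/net/bonding/bond0)
active_bond_slave() {
    awk -F': ' '/^Currently Active Slave/ { print $2 }' "${1:-/proc/net/bonding/bond0}"
}
```

If the printed ping command reports that the message is too long or gets no replies while a smaller payload works, some hop on the path has a smaller MTU than expected. Comparing the output of `active_bond_slave` over time shows whether the bond has failed over.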
System Component | Problem | How to Troubleshoot |
---|---|---|
CPU | | |
CPU | System response is quick, but applications are running slow | |
Memory | The system or a process is running slow, or a process fails to run | Under high memory pressure, the system behaves erratically and processes run slow. When memory allocation is poor, the system tries to defragment memory, which is slow. To identify such issues, run the following commands to check the memory availability. |
Memory | The process is killed due to Out of Memory (OOM). | For example, cat /proc/<process id>/oom_score |
Storage | The application is running slow due to slow reads and writes | Read and write performance is impacted by a bad disk, overloaded disks, multiple input/output operations (I/Os) issued on a disk, or a full queue. |
Storage | Disk benchmarking with and without the Swarm software stack | Customers often feel that Swarm reads and writes are slow on expensive hardware. To isolate the issue, or to show that Swarm is working as expected, DataCore Swarm runs the Fio benchmark test with and without Swarm to determine whether the disks are slow. Alternatively, before installing a Swarm cluster, customers want to know the compatibility and performance of their existing or new hardware (storage). For this, DataCore Swarm runs the Fio benchmark test. Note: Do not run write tests on a production setup, including a production setup running the debug build, because they may corrupt the data. Run the read test instead. Swarm 15.0 can run the read test through the Swarm GUI. To run the read performance tests on individual disks, refer to the below steps: On a debug non-production setup, run the below commands: |
Storage | Benchmarking file system operations | To benchmark file system operations such as read, write, flush, mkdir, rmdir, and so on, use the “dbench” benchmarking tool. |
Network | The file upload is slow or internode object replication is slow | Multiple reasons for slow reads and writes include: network latency; network card and socket buffers; protocol-based stats. |
Network | Network monitoring | |
Network | Debugging network throughput / latency issues on multicast, TCP, or UDP configuration. | Note: Refer to the following link for Swarm services and their port numbers; this helps when selecting non-Swarm (unused) ports. |
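For the OOM row in the table above, the per-process `oom_score` check can be extended to rank all processes. The sketch below is a minimal example that assumes shell access on a debug build; `top_oom_scores` is a hypothetical helper name. It prints the score, PID, and command name of the processes the kernel currently considers the most likely OOM-kill candidates.

```shell
# Print "oom_score pid name" for the N processes with the highest score.
# $1: how many entries to show (default 5), $2: proc root (default /proc)
top_oom_scores() {
    count=${1:-5}
    root=${2:-/proc}
    for entry in "$root"/[0-9]*; do
        [ -r "$entry/oom_score" ] || continue
        # A process can exit between the glob and the read, so tolerate errors.
        score=$(cat "$entry/oom_score" 2>/dev/null) || continue
        name=$(cat "$entry/comm" 2>/dev/null)
        echo "$score ${entry##*/} $name"
    done | sort -rn | head -n "$count"
}
```

Running `top_oom_scores 10` lists the ten highest-scoring processes. A higher score means the kernel is more likely to pick that process when memory runs out.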
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.