Document Identifier:	TechNote 2015005 (replaces TechNote 2012003)
Document Date:	August 31, 2015
Software Package:	SWARM
Version:	Swarm 6.5.0 or later

Abstract

Among other uses, Swarm relies on network multicast to decide where replicas of objects should reside in the cluster, both on the initial write and during subsequent load balancing and validation by the health processor. Understanding how Swarm tracks multicast responses can help administrators evaluate overall communication health in the cluster and potentially isolate communication issues.

The target audience for this paper is Swarm administrators.

Multicast Histogram Overview

Every Swarm cluster has different hardware and configuration, and within a cluster, client loads may change over time. Swarm maintains a multicast histogram to monitor cluster conditions and adjust its expectations of multicast and UDP performance. Each time a node sends a multicast, it measures the round-trip time of every response it hears from nodes in the cluster. The most recent N response times are kept in a circular buffer. The mean and tail values are used to determine how long to wait for UDP responses to multicast queries, where the tail is the value that a fraction f of the samples fall below. The default values for N and f are 10,000 and 0.9998, respectively. Both are configuration settings that can be adjusted.
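
The bookkeeping described above can be sketched in a few lines of Python. This is an illustration only, not Swarm's implementation: the class name ResponseTimeWindow and its methods are hypothetical, and the tail rule used here (the smallest recorded value that covers at least a fraction f of the window) is one reasonable reading of the definition above.

```python
import math
from collections import deque


class ResponseTimeWindow:
    """Hypothetical helper: keep the last `size` response times (N)
    and report their mean and tail for a given tolerance (f)."""

    def __init__(self, size=10_000, tolerance=0.9998):
        # A deque with maxlen behaves like a circular buffer:
        # once full, each new sample pushes out the oldest one.
        self.samples = deque(maxlen=size)
        self.tolerance = tolerance

    def record(self, seconds):
        self.samples.append(seconds)

    def mean(self):
        return sum(self.samples) / len(self.samples)

    def tail(self):
        # Assumed rule: smallest sample that is >= a fraction
        # `tolerance` of the samples currently in the window.
        ordered = sorted(self.samples)
        # The small epsilon guards against floating-point rounding.
        rank = math.ceil(self.tolerance * len(ordered) - 1e-9)
        return ordered[max(rank, 1) - 1]
```

With a window of 5 and a tolerance of 0.8, recording six samples drops the first one, and a single slow response (for example 0.100 seconds among values near 0.005) raises the mean but is excluded from the tail as an outlier.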

Histogram Log Messages

The state of the multicast histogram is logged periodically. A set of sample log messages follows, with an explanation of each. The explanations refer to the messages by position: Message 1 is the first line, Message 2 the second, and so on.

Oct 11 09:52:44 omega15 2012-10-11 14:52:44,380 INTERNODE INFO: Multicasts: 474 (WRITEQUERY:12, REPQUERY:462)
Oct 11 09:52:44 omega15 2012-10-11 14:52:44,380 INTERNODE INFO: Multicast responses sent late/stale: 1
Oct 11 09:52:44 omega15 2012-10-11 14:52:44,380 INTERNODE INFO: Multicast Response Histogram
Oct 11 09:52:44 omega15 2012-10-11 14:52:44,381 INTERNODE INFO: Total samples: 10.80 million, histogram of last 10000 samples
Oct 11 09:52:44 omega15 2012-10-11 14:52:44,387 INTERNODE INFO: (min,mean,max)=(0.000456, 0.003729, 0.195042)
Oct 11 09:52:44 omega15 2012-10-11 14:52:44,387 INTERNODE INFO: 99.980% of the samples are smaller than 0.150
Oct 11 09:52:44 omega15 2012-10-11 14:52:44,387 INTERNODE INFO: [9753, 19, 18, 16, 22, 15, 25, 20, 27, 36, 38, 7, 0, 0, 2, 1, 0, 0, 0, 1]

Message 1: This message gives the number of multicasts sent since the last report, broken down by type. WRITEQUERY multicasts are used to determine where an object replica should best reside in the cluster. The health processor (HP) sends REPQUERY multicasts to look for other replicas in the cluster of the stream being examined.

Message 2: This message counts responses that were delayed internally and arrived after the write query or rep query had stopped waiting for them. These do not indicate a system error, as there are transparent retry mechanisms where needed.

Message 3: This message is a header line that introduces the histogram statistics in the lines that follow.

Message 4: This message indicates how many samples have been taken in total and how many are retained for the histogram statistics in subsequent lines. Samples are the response times for replica queries (one sample per response), whether the query was multicast or unicast. However, since multicasts ask for responses from the entire cluster, multicast responses will dominate the samples.

Message 5: This message reports the minimum, mean, and maximum of the sampled response times, in seconds.

Message 6: The "99.980% of the samples are smaller" line identifies a statistic called the "tail": the estimated time one would need to wait to hear a large majority of the responses, ignoring true outliers. The tail value affects how long each replica query waits for responses, and it can be adjusted using the 'cip.histogramTolerance' configuration setting.

Message 7: This message is a textual depiction of the histogram, where each number is the count of responses seen in one 0.01-second-wide bucket. It is common to see a longer list than the one above, which has been shortened for clarity. Notice that the majority of the samples, 9,753 out of 10,000, are in the first bucket, with relatively few responses spread among the remaining buckets. This is a typical pattern.
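
As a cross-check of how Messages 6 and 7 relate, the sketch below recomputes the tail from the bucket counts in the sample log. The function name tail_from_buckets is hypothetical, and the rule it applies (the upper edge of the first bucket at which the running total covers the tolerance fraction of all samples) is an assumption about how the logged tail is derived, not a statement of Swarm's internals.

```python
import math

BUCKET_WIDTH = 0.01  # seconds per bucket, as described for Message 7


def tail_from_buckets(counts, tolerance=0.9998, width=BUCKET_WIDTH):
    """Return the upper edge of the first bucket at which the running
    total reaches at least `tolerance` of all samples."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram contains no samples")
    # The small epsilon guards against floating-point rounding.
    needed = math.ceil(tolerance * total - 1e-9)
    running = 0
    for index, count in enumerate(counts):
        running += count
        if running >= needed:
            return round((index + 1) * width, 6)
    # Only reachable if tolerance > 1; fall back to the last bucket edge.
    return round(len(counts) * width, 6)


# Bucket counts copied from Message 7 in the sample log.
sample = [9753, 19, 18, 16, 22, 15, 25, 20, 27, 36, 38, 7, 0, 0, 2, 1, 0, 0, 0, 1]
print(sum(sample))                # 10000
print(tail_from_buckets(sample))  # 0.15
```

Under this assumption the counts reproduce the sample log: the running total first reaches 9,998 of 10,000 samples in the fifteenth bucket, whose upper edge is 0.15 seconds, matching the "smaller than 0.150" value in Message 6.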

Interpreting Histograms

Multicast histograms will vary from cluster to cluster, and on a single cluster over time, depending on the number of nodes, the network characteristics of the cluster, and the current activity within the cluster. In general, a cluster at rest, with no inbound client activity and normal health processing activity (i.e., no volume recoveries in process), should have a sub-second histogram tail unless the cluster has high latency characteristics.

Under load, the histogram mean, max, and tail values will increase as nodes take longer to respond while they complete other tasks. This is normal and not indicative of a problem.

However, a tail value higher than one second for a cluster at rest, or multi-second tail values for a cluster under light load, can indicate network latency issues that should be investigated. Additionally, a single node with a much higher tail than the other nodes in the cluster can indicate a hardware-related network problem for that node. Similarly, if all nodes in the cluster except one have long histogram tails, this may indicate disk issues on the one node that is different, particularly if consumer-grade drives are mistakenly in use. In this situation there should be log messages for ‘long I/Os’ on that node.
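
For administrators who collect node logs centrally, the comparison described above could be scripted roughly as follows. This is a sketch under stated assumptions: the regular expression expects syslog lines laid out like the sample messages in this document, the function names latest_tails and review_tails are hypothetical, and the factor of 5 used to decide that a node's tail is "much higher" than the rest is an arbitrary illustrative choice, not a Swarm recommendation.

```python
import re
from statistics import median

# Matches the "tail" line (Message 6) and captures the syslog host name.
TAIL_LINE = re.compile(
    r"^\S+\s+\d+\s+\S+\s+(?P<node>\S+)\s.*"
    r"INTERNODE INFO: [\d.]+% of the samples are smaller than "
    r"(?P<tail>\d+(?:\.\d+)?)"
)


def latest_tails(lines):
    """Map each node name to the most recent tail value in its log lines."""
    tails = {}
    for line in lines:
        match = TAIL_LINE.search(line)
        if match:
            tails[match.group("node")] = float(match.group("tail"))
    return tails


def review_tails(tails, limit=1.0, factor=5.0):
    """Return (nodes whose tail exceeds `limit` seconds,
    nodes whose tail exceeds `factor` times the cluster median)."""
    if not tails:
        return [], []
    typical = median(tails.values())
    over_limit = sorted(n for n, t in tails.items() if t > limit)
    outliers = sorted(n for n, t in tails.items() if t > factor * typical)
    return over_limit, outliers


line = ("Oct 11 09:52:44 omega15 2012-10-11 14:52:44,387 INTERNODE INFO: "
        "99.980% of the samples are smaller than 0.150")
print(latest_tails([line]))  # {'omega15': 0.15}
```

A flagged node is only a starting point for investigation. As noted above, tails rise normally under load, so the results should be read against what the cluster was doing when the messages were logged.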

Since the multicast histogram is adaptive, it will react to events and cluster changes other than the ones described above. Changes do not by themselves indicate a problem in the cluster, but knowing the general histogram characteristics can be a valuable tool for evaluating both load and health in the cluster.

TechNote 2015005: Interpreting Multicast Histograms