How to check integrity of streams
This article describes how to check the integrity of erasure-coded streams written to a Swarm cluster. This process is not needed under normal circumstances. It presumes familiarity with the Support tools and their general installation and usage. If you are unfamiliar with the Support tools, please visit: https://perifery.atlassian.net/wiki/spaces/KB/pages/1374027783
Instructions
The following instructions walk through enumerating the cluster (or some subset of it), running an integrity check against those streams, and then interpreting the output.
Update the support bundle on any one of your SCS/CSN/Gateways. This process was primarily tested on an SCS/CentOS 7. If online, run:
./updateBundle.sh
from the support tools directory. Otherwise, download a fresh copy from here: https://support.cloud.datacore.com/tools/swarm-support-tools.tgz
Depending on the primary reason for running integrity checks, you may want to check the integrity of a particular virtual folder within a bucket, a single bucket, a domain, or all domains. Depending on the number of logical streams, it might make sense to run enumeration against a smaller subset of streams at a time. Enumerating multiple millions of streams can take quite a while, and running the integrity checks will take even longer. Performance will vary depending on your particular cluster and requirements.
To get a count of streams in the entire cluster including versioned streams, you would run:
./indexer-enumerator.sh -d ALL -a [Swarm IP] -NZzy -c
You can remove -y if there are no versions for your streams (or versioning is disabled); it should not hurt to leave the option included if you are unsure. -c is the flag for counting the streams instead of enumerating them, and it is a good idea to get a count before doing a full enumeration. If you prefer to get a list of all buckets with their object counts in all domains, you can run:
./indexer-enumerator.sh -d ALL -iy
The output will show you the file to look at for these bucket usage counts. To restrict the query to a specific domain, use -d [domain]. To restrict it to a bucket in that domain, add -b [bucket]. For example:
./indexer-enumerator.sh -d example.com -b bucket1 -a [Swarm IP] -NZzy -c
will get a count of all versioning-capable, erasure-coded streams in bucket1 in domain example.com. If you want to check a subset of a bucket, you can additionally add -p [virtual folder name]. Add -N to only get named streams, as there may be failed partial uploads that will eventually get cleaned up by garbage collection but are not interesting in this workflow.
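To make the scoping options concrete, here are a couple of example invocations built from the flags described above. The domain, bucket, and folder names are illustrative only (the virtual folder name in particular is a made-up placeholder), so substitute your own values and Swarm IP.
# Count EC streams in a single domain (illustrative domain name)
./indexer-enumerator.sh -d example.com -a [Swarm IP] -NZzy -c
# Count EC streams under a virtual folder within a bucket (hypothetical folder name)
./indexer-enumerator.sh -d example.com -b bucket1 -p photos2023 -a [Swarm IP] -NZzy -c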
Once you have determined the subset or superset of streams you want to check the integrity of, remove the -c flag and add the -J flag. The -J flag works with the -Z flag to capture not only the etag (UUID) of each stream but also the name of the stream itself. Primarily we will be using the etag for the integrity checks, but we may need the name of the stream later if there are manifest 404s. The final command might look like this:
> ./indexer-enumerator.sh -d 1-domain-for-versioning -b sync-test -JyzZN
Starting to enumerate the requested streams in domain: 1-domain-for-versioning
All of your objects can be found here: ./OUTPUTDIR-2023_0512-140120/All-Objects.txt
All of your etags from those objects can be found here: ./OUTPUTDIR-2023_0512-140120/All-etags.csv
Only streams that are erasure coded (EC) are listed- (the -z flag was chosen). Not supported with older Swarm/ES versions.
Only streams in buckets with versioning enabled are listed. If versions are present, they will be listed- (the -y flag was chosen)
1-domain-for-versioning/sync-test has 6 unique objects of stype: named, withreps, uses 14.43MB disk space.
You can see what the file All-Objects.txt captures. All-etags.csv contains only the first column (the etag), and that is what the check integrity script will iterate over.
> cat ./OUTPUTDIR-2023_0512-140120/All-Objects.txt
3dd0e346f2d093a5dccf716aa496bbd2,1-domain-for-versioning/sync-test/philipp-pilz-QZ2EQuPpQJs-unsplash.jpg
a9276be3b88d0d12833289bb234ae5b5,1-domain-for-versioning/sync-test/marek-szturc-CM1oVEUzsNM-unsplash.jpg
3abdd116e6cc65779774af84714148f4,1-domain-for-versioning/sync-test/taylor-Qvwl_jOJxfs-unsplash.jpg
69a57bdd0c98f34dd2f8c50f194de347,1-domain-for-versioning/sync-test/marc-olivier-jodoin-tauPAnOIGvE-unsplash.jpg
dedc408c590a0127da728ed1acd64b94,1-domain-for-versioning/sync-test/gregoire-bertaud-wK_DZlAJJ_Q-unsplash.jpg
ad382ce9793a0eae4c92df99a2df19cb,1-domain-for-versioning/sync-test/taylor-Qvwl_jOJxfs-unsplash.jpg
Verify that your All-etags.csv output only has etags.
> cat ./OUTPUTDIR-2023_0512-140120/All-etags.csv
3dd0e346f2d093a5dccf716aa496bbd2
a9276be3b88d0d12833289bb234ae5b5
3abdd116e6cc65779774af84714148f4
69a57bdd0c98f34dd2f8c50f194de347
dedc408c590a0127da728ed1acd64b94
ad382ce9793a0eae4c92df99a2df19cb
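If you want a quick sanity check on that CSV, something like the following can help. This is a minimal sketch assuming the etags are bare 32-character hexadecimal strings, as in the example above, and it uses the output directory from this example.
# Count the etags and flag any line that is not a bare 32-character hex etag
wc -l ./OUTPUTDIR-2023_0512-140120/All-etags.csv
grep -vE '^[0-9a-f]{32}$' ./OUTPUTDIR-2023_0512-140120/All-etags.csv || echo "all lines look like etags"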
For this next step especially, it is good to run in a screen session or from a console terminal so the job is not interrupted if your SSH session drops, as it may take hours, if not days, to finish. This is very important.
Run the following from the support tools directory. If the partition holding the support tools directory does not have plenty of available space, run the command from another partition that does; quite a few files may be output depending on the scope of the work and the number of errors found. This will start the main script, which uses xargs with 10 processes at a time to check the integrity of each stream. You can use -p [processes] to change the number of processes xargs uses from the default of 10. I have run it with -p 20 without issue, but your results may vary.
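For context, the parallelism described above follows the usual xargs fan-out pattern: each etag from the enumeration output is handed to a worker, with 10 workers running at once. This is only an illustrative sketch, not the actual tool; the worker name check-integrity-worker.sh is a hypothetical placeholder.
# Illustrative sketch of the fan-out pattern (worker name is a placeholder)
cat ./OUTPUTDIR-2023_0512-140120/All-etags.csv | xargs -n 1 -P 10 ./check-integrity-worker.sh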
Each integrity check does a GET to the stream with a special query argument to find each segment delineated by the EC manifest. A smaller stream may have only 6 segments, for example, each of which is verified; a larger stream's manifest may contain hundreds or thousands of segments that each need to be verified. The purpose of the script is to check EACH segment and report the results. To reiterate, this can take a long time. As the script runs, you will see the check integrity worker script being called for each etag; this is normal. When the script is finished, it will tar the results but also leave them in place. Once run, here is a very short example of what it looks like:
There are a few things that the script outputs. You can download the resulting .tgz file to your computer and extract it, or look at the results in place. There will be a directory like CHECKINTEGRITY-[timestamp]. In that new directory, type:
ls *csv *txt
The dates.txt file records when the command started and finished, which gives you an idea of how long it took to run, since you might not be watching when it finishes.
successes.csv lists the etags where the stream was found in the cluster and all of its segments were returned as expected.
errors.csv lists the etags where the stream manifest was found but segments are missing. There is also damaged-objects.csv, which looks up the name of the stream along with the etag (important when versions are involved). When the main script runs, a header file and a segment-listing file are generated per UUID. These are deleted for brevity if the stream is found intact, as there is no problem to look at. If there are missing segments, though, a file like h-[etag].out is left that shows the headers for the stream, along with a file like s-[etag].out that lists the segments and their return codes. In my example, I have 1 error: I deliberately deleted two of the segments, and you can see that here.
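If you want to scan the segment listings for failures in bulk, something like the following can help. This is a rough sketch that assumes each line in an s-[etag].out file ends with an HTTP-style return code; verify the pattern against your actual file layout before relying on it.
# Rough sketch: print segment lines whose return code is not 200
grep -H -v ' 200$' s-*.out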
If there are not enough segments remaining to recreate the stream (this depends on how many segments are missing compared to the k:p value for the stream; with a 5:2 encoding, for example, each segment set can tolerate at most 2 missing segments), that stream is corrupt and will need to be deleted and re-uploaded. This is a data loss case. Contact support before continuing.
It is possible for the manifest itself to be missing from the cluster (likely along with all of its segments) while a reference still exists in Elasticsearch (the indexer feed). In that case, you will see a 404s.csv file containing a list of etags for which the manifest was not available in the cluster. Because the 404s.csv file contains only an etag, you won't know what the original filename was, since it can't be INFO'd from the storage cluster. Instead, you can grep for the etag in the original All-Objects.txt file created during the initial enumeration. What likely happened in these cases is that the ES cluster missed a delete of the stream and so contains information that it should not. This "ghost" entry in ES will need to be manually removed. Contact support before continuing.
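For example, to recover the original object name for a 404'd etag, grep the enumeration output; the etag and output directory below are just the ones from the earlier example.
# Look up the original object name for an etag reported in 404s.csv
grep 3dd0e346f2d093a5dccf716aa496bbd2 ./OUTPUTDIR-2023_0512-140120/All-Objects.txt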
Note: in the indexer-enumerator.sh output shown above, the output filenames (All-etags.csv and All-Objects.txt) will also include a timestamp in the filename.