Table of Contents
minLevel1
maxLevel2
outlinefalse
styledisc
typelist
printabletrue

Overview

Listing Cache (LC) is a performance optimization feature designed to improve the speed of listing large datasets within Swarm storage. It works by caching pseudo-folder listings, reducing the time and resource consumption required to fetch and display object listings repeatedly.

The Listing Cache solves a scalability problem with the gateway's delimited folder listing functionality. To determine if a folder has subfolders, an Elasticsearch query has to enumerate all objects with the folder name as a prefix to their object names. This can run into the millions of objects for large buckets. When such queries are issued repeatedly and at high frequencies, the resulting CPU use brings an entire Elasticsearch cluster to a halt.
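
For a sense of what this means in practice, a delimited "folder" listing like the one below is the kind of request the Listing Cache accelerates; without the cache, every such request forces Elasticsearch to enumerate all objects sharing the prefix. The bucket, prefix, and endpoint names are placeholders, not values from your deployment.

Code Block
# Hypothetical delimited listing against an S3 domain served by the gateway.
aws s3api list-objects-v2 \
    --bucket demo \
    --prefix "backups/" \
    --delimiter "/" \
    --endpoint-url https://domaina.acme.com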

Limitations

  • Client-specific binding: Bound to a dedicated client, with no cross-gateway sharing allowed. Once one or more domains are served by a listing-cache-enabled gateway, that gateway must serve all requests to those domains exclusively. This is achieved by configuring your load balancer with dedicated host-based traffic redirection rules.

  • Non-persistent cache: The disk/memory cache is discarded by default on restart.

  • Limited lifecycle and recursive deletion support: No support for bucket lifecycle policies, delete lifepoints, or recursive deletes. All writes and deletes must originate from the gateway.

  • Memory constraints: Caching large volumes of data can quickly consume system memory. Misconfiguring cache sizes can lead to memory exhaustion or excessive eviction, reducing cache effectiveness.

  • Delimiter support: Custom delimiters are not yet supported; only the forward slash "/" is supported.

  • Replication support: Do not set up replication when LC is enabled.

  • Unsupported functionality: Custom delimiters, S3 lifecycle policies, and recursive deletes.

Prerequisites

The Listing Cache can be enabled on gateway 8.1.2 or above. Ensure the following prerequisites are met before deploying Listing Cache:

  • Hardware Requirements:

    • 8 vCPUs

    • 16GB RAM

    • 200GB dedicated partition formatted with XFS

  • Load Balancing Configuration:

    • Hardcode domains to a single gateway with Listing Cache (LC).

Info

Shared gateway support is currently not available.

Assuming you are using recommended settings, you will need to do the following:

Set Java Memory Heap

Panel
bgColor#DEEBFF

vim /etc/sysconfig/cloudgateway

HEAP_MIN="12288m"
HEAP_MAX="12288m"

Create disk cache partition

Panel
bgColor#DEEBFF

vgcreate swarmspool /dev/sdb
lvcreate -L 195G -n diskcache swarmspool
mkfs.xfs /dev/swarmspool/diskcache
mount /dev/swarmspool/diskcache /var/spool/caringo/

Persist the mount by adding the following line at the end of /etc/fstab:

/dev/mapper/swarmspool-diskcache /var/spool/caringo xfs defaults 0 0
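
A quick sanity check of the new partition can be run before starting the gateway; the commands below only assume the mount point created above.

Panel
bgColor#DEEBFF

# Mount everything listed in /etc/fstab, then confirm size and filesystem type.
mount -a
df -h /var/spool/caringo
xfs_info /var/spool/caringo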

Do Not Use Listing Cache If:

  1. You use multipart S3 operations.

  2. You use custom delimiters in search queries.

  3. You need the ability to do recursive deletes of domains and buckets.

  4. You use S3 lifecycle policies.

  5. You need support for delete lifepoints.

  6. You do not use pseudo folders or all objects are in a single pseudo folder.

How to Enable Listing Cache

The procedure to enable Listing Cache in Swarm is outlined below:

  1. Add the following to /etc/caringo/cloudgateway/gateway.cfg:

Code Block
[storage_cluster]
disableListingCache=false
  2. After testing in a staging environment, roll out the Listing Cache to production by deploying the necessary configuration changes.

  3. Monitor performance impact closely during the rollout phase.

  4. Optional. Pre-warm the cache with commonly accessed listings before enabling it in production, so the initial requests are served from the cache (see the example after this list).
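
A minimal sketch of steps 1 and 4, assuming the gateway service is named cloudgateway and that an S3 profile and a commonly listed bucket/prefix already exist (all names below are placeholders):

Code Block
# Restart the gateway so the gateway.cfg change takes effect.
# Service name assumed from /etc/sysconfig/cloudgateway; adjust if different.
systemctl restart cloudgateway

# Pre-warm: issue delimited listings for frequently accessed folders so
# their per-folder databases are primed before the production cutover.
aws s3 ls s3://demo/backups/ \
    --endpoint-url https://domaina.acme.com \
    --profile swarm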

How Does Listing Cache Work

  • Ensure Sufficient Disk Space: Listing Cache stores each folder in a separate SQLite database, which consumes disk space. Provide ample disk space to avoid frequent evictions of folder databases, as this impacts performance.

  • Automatic Folder Detection: Listing Cache automatically learns about folders through ongoing list, write, and delete requests. No manual intervention is required to create or manage databases for each folder.

  • Monitor Cache Population: Initially, for any new folder, the cache starts with an "infinite gap," meaning it has no data cached and queries Elasticsearch. Over time, as more listings are cached, the gap reduces until the folder is fully cached and can be served without querying Elasticsearch.

  • Real-Time Cache Updates: Ongoing write and delete requests are intercepted and used to keep the folder databases updated, ensuring the cache remains consistent with the actual data.

  • LRU-Based Eviction: The system automatically evicts the least recently used (LRU) databases when disk space is full. If a folder's database is evicted and later requested, the cache process restarts for that folder.

  • Disk Space Directly Impacts Performance: The more disk space available, the fewer evictions occur, allowing more folders to remain fully cached and reducing the need for frequent Elasticsearch queries (see the disk-usage check after this list).

  • Prepare for Elasticsearch Querying: In case of cache misses or folder database evictions, Elasticsearch will be queried. Ensure that Elasticsearch is properly configured to handle such requests, especially during periods of high cache turnover.
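
One way to keep an eye on disk sizing is to watch the disk-related metrics listed in the Metrics section below. This sketch assumes the gateway's Prometheus metrics are reachable through your telemetry setup; the address used here is a placeholder.

Code Block
# Placeholder metrics endpoint; substitute the address your telemetry stack scrapes.
METRICS_URL="http://gateway.example.com:9100/metrics"

# Bytes of folder databases on disk versus free space as seen by the Listing Cache.
curl -s "$METRICS_URL" | grep -E 'caringo_listingcache_(disk_cached_bytes|disk_free)'

# A steadily climbing eviction counter means folders are being dropped and
# re-primed from Elasticsearch; consider adding disk space.
curl -s "$METRICS_URL" | grep 'caringo_listingcache_disk_evicted'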

How to Determine If the Listing Cache is Working Correctly

  1. Monitor Cache Hit Rate

    • If you have telemetry and Grafana available, check the Listing Cache dashboard (see also the raw-metric check after this list).

  2. Check Response Time

    • Compare the response time before and after enabling the Listing Cache. Reduced response times, particularly for frequently requested folder listings, indicate the cache functions correctly.

  3. Resource Utilization

    • Monitor memory usage and CPU utilization. Increased memory usage and steady CPU activity are normal in a caching system, but excessively high CPU or memory usage may indicate misconfiguration.
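
If Grafana is not available, the same signal can be read from the raw metrics (same placeholder endpoint assumption as above): listings served from the per-folder databases should grow while Elasticsearch query activity flattens out.

Code Block
METRICS_URL="http://gateway.example.com:9100/metrics"   # placeholder address

# Request counts (and errors) handled by the Listing Cache.
curl -s "$METRICS_URL" | grep 'caringo_listingcache_request'

# Elasticsearch queries still issued for priming/listing; these should level
# off as folders become fully cached.
curl -s "$METRICS_URL" | grep 'caringo_listingcache_backend_query'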

Deployment Steps

Follow these steps to deploy the Listing Cache:

Step 1: Prepare the Environment

  1. Provision a server with the specified hardware requirements.

  2. Ensure the server’s 200GB partition is formatted with the XFS file system.

  3. Verify network connectivity to other components of the S3 environment.

Step 2: Configure Load Balancer

  1. Modify load-balancing rules to hardcode domains to a single gateway.

  2. Ensure that all LC-enabled domains point to the appropriate gateway.

  3. Test the load balancer configuration to confirm proper routing.

Step 3: Install and Configure Listing Cache

  1. Download the LC installation package from the designated repository.

  2. Install the package on the prepared server.

  3. Configure LC settings according to your environment’s specifications:

    • Set up domain-specific configurations.

    • Enable pseudo folder support as required.

Step 4: Validate Deployment

  1. Perform basic functionality tests:

    • Verify data retrieval and storage through LC.

    • Test operations within pseudo folders.

  2. Check system logs for any errors or warnings.

  3. Monitor performance metrics to ensure the hardware is sufficient.

Step 5: Go Live

  1. Enable LC for production workloads.

  2. Monitor system performance and address any issues promptly.

Post-deployment Recommendations

  • Regularly monitor LC’s performance and resource utilization.

  • Plan for updates as new features and improvements are released.

  • Document any environment-specific configurations for future reference.

HAProxy Configuration for LC-Enabled Gateway

Below is a suggested generic HAProxy configuration for a Listing Cache (LC)-enabled gateway.

  1. This configuration is designed for HAProxy version 2.2 and higher.

  2. Failover without failback is enabled for Listing Cache. Since restarting LC clears its cache, it is optimal to fail over only when the gateway becomes unavailable.

  3. SCSP traffic is not routed to the Listing Cache. The configuration is primarily intended for handling S3 traffic.

  4. Specific domains are redirected to the LC-enabled gateway, while all other traffic is routed to the regular non-cached pool.

The following is an example /etc/haproxy.conf file.

Code Block
global
    log 127.0.0.1 local2 alert
    chroot /var/lib/haproxy
    stats socket /var/lib/haproxy/stats mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

    ca-base /etc/pki/ca-trust/
    crt-base /etc/haproxy/certs

    ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS:!3DES
    ssl-default-bind-options no-sslv3
    maxconn 2048
    tune.ssl.default-dh-param 2048

defaults
    log     global
    mode    http
    option  forwardfor
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  130000

frontend HTTP_IN
    bind *:80 name *:80
  option http-keep-alive
  acl acl_is_http req.proto_http
    http-request redirect scheme https if acl_is_http

frontend stats
    mode http
    bind 0.0.0.0:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if LOCALHOST

frontend HTTPS_IN
    bind *:443 name *:443 ssl alpn h2,http/1.1 crt /etc/haproxy/certs/wildcard.acme.com.pem
  mode http
  option http-keep-alive
  option httplog

    acl acl_is_content_ui path -m beg /_admin/portal
  acl acl_awsauth hdr_sub(Authorization) -i AWS
    acl acl_aws url_reg -i (?<=[?&])(AWSAccessKeyId|X-Amz-Credential)=
    # Define an acl per domain you want to send to LC
  acl acl_is_domain_a hdr(host) -i  domaina.acme.com

  use_backend POOL-S3-listingcache if acl_is_domain_a
  use_backend POOL-S3 if acl_awsauth || acl_aws
  use_backend POOL-scsp if acl_is_content_ui

backend POOL-scsp
    mode http
    balance leastconn
    stick-table type ip size 50k expire 30m  
    stick on src
    http-reuse safe 
    server GW01 10.11.21.33:8080 check inter 10s
    server GW02 10.11.21.34:8080 check inter 10s

backend POOL-S3-listingcache
     balance source
   stick-table type ip size 50k expire 24d  
     stick on src

   option httpchk
     http-check connect
     http-check send meth HEAD uri / ver HTTP/1.1 hdr Host haproxy-healthcheck
     http-check expect status 403

   server GW03 10.11.21.35:8090 check inter 10s fall 3 rise 2  
     server GW04 10.11.21.36:8090 check inter 10s fall 3 rise 2  backup

backend POOL-S3
    balance leastconn
    stick-table type ip size 50k expire 30m  
    stick on src  

    option httpchk
    http-check connect
    http-check send meth HEAD uri / ver HTTP/1.1 hdr Host haproxy-healthcheck
    http-check expect status 403

    server GW01 10.11.21.33:8090 check inter 10s fall 3 rise 2
    server GW02 10.11.21.34:8090 check inter 10s fall 3 rise 2
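
Once the configuration is loaded, the host-based rule can be spot-checked from any client: a request carrying the LC domain in its Host header should be served by the POOL-S3-listingcache backend (visible on the HAProxy stats page, port 8404, path /stats). The load-balancer hostname below is a placeholder.

Code Block
# Expect an S3-style 403 (no credentials supplied); then confirm on the stats
# page that the request was served by POOL-S3-listingcache.
curl -sk -o /dev/null -w '%{http_code}\n' -H 'Host: domaina.acme.com' https://lb.acme.com/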

Metrics

Code Block
caringo_listingcache_request (Summary)
        Request counts and latencies for write/delete/list, versioned/nonversioned.
        Labels: method=[write, delete, list], mode=[V, NV]

caringo_listingcache_request_errors (Counter)
        Request error counts for write/delete/list, versioned/nonversioned.
        Labels: method=[write, delete, list], mode=[V, NV]

caringo_listingcache_listed_recs (Counter)
        Total number of records returned by the cache, versioned/nonversioned.
        Labels: method=["list", "prime", "reconciliation"], mode=[V, NV]

caringo_listingcache_backend_query (Summary)
        Counts and latencies of ES queries for priming/listing, versioned/nonversioned.
        Labels: method=["list", "prime"], mode=[V, NV]

caringo_listingcache_backend_query_recs (Counter)
        Number of ES records queried for priming/listing, versioned/nonversioned.
        Labels: method=["list", "prime"], mode=[V, NV]

caringo_listingcache_cache_query (Summary)
        Counts and latencies of SqliteDB queries for priming/listing, versioned/nonversioned.
        Labels: method=["list", "prime", "reconciliation"], mode=[V, NV]

caringo_listingcache_cache_query_recs (Counter)
        Number of SqliteDB records queried for priming/listing, versioned/nonversioned.
        Labels: method=["list", "prime", "reconciliation"], mode=[V, NV]

caringo_listingcache_flushes_pending (Gauge)
        Folder updates pending flush to SqliteDB disk cache.

caringo_listingcache_flushes_done (Counter)
        Folder updates flushed to SqliteDB disk cache.

caringo_listingcache_trims_pending (Gauge)
        Folders pending trim in memory cache.

caringo_listingcache_trims_done (Counter)
        Folders trimmed in memory cache.

caringo_listingcache_folder_pulls_pending (Gauge)
        Folders marked to be internally pulled into cache.

caringo_listingcache_folder_pulls_done (Counter)
        Folders internally pulled into cache.

caringo_listingcache_mem_cached (Gauge)
        Folders currently in memory cache.

caringo_listingcache_mem_evicted (Counter)
        Folders evicted from memory cache.

caringo_listingcache_dbhandle_cached (Gauge)
        SqliteDB handles currently in memory cache.

caringo_listingcache_dbhandle_evicted (Counter)
        SqliteDB handles evicted from memory cache.

caringo_listingcache_disk_cached (Gauge)
        SqliteDBs currently in disk cache.

caringo_listingcache_disk_evicted (Counter)
        Folders evicted from disk cache.

caringo_listingcache_disk_cached_bytes (Gauge)
        Size in bytes of SqliteDBs currently in disk cache.

caringo_listingcache_disk_evicted_bytes (Counter)
        Size in bytes of SqliteDBs evicted from disk cache.

caringo_listingcache_reconciliations_done (Counter)
        Number of cache records reconciled (versionid mismatches corrected based on etag).
        Labels: origin=[backend,cache]

caringo_listingcache_memory_used (Gauge)
        Memory use as perceived by the listing cache.

caringo_listingcache_disk_free (Gauge)
        Disk free space as perceived by the listing cache.