Per Domain Index

Overview

With Swarm 16.1.2 and Gateway 8.1.0, we introduced a new feature called “Per Domain Index” (PDI). Metadata from each domain can now be stored in a separate index within the Elasticsearch database. This allows more efficient searching, querying, and retrieval of objects within a specific domain. Per Domain Index is enabled by default in Swarm.

Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.

Key benefits of the Per Domain Index include:

  1. Improved Search Performance: Because each domain's objects are indexed separately, searches can be limited to that domain, reducing the search scope and speeding up retrieval.

  2. Granular Data Management: Each domain can have its own index, allowing administrators to manage indexing based on domain-specific needs.

  3. Scalability: As the number of objects grows, having separate indices for each domain helps in scaling the system without performance degradation.

Per Domain Index is useful in multi-tenant environments, where each tenant (domain) may need separate indexing to optimize their storage and access.

With Per Domain Index (PDI), customers in different domains use distinct indices. As a result, the search latency for each domain depends on the number of documents in that domain's index, which makes performance easier to manage per domain.

Prerequisites

  • Gateway 8.1.0 or higher

  • Swarm Storage 16.1.2 or higher

Hardware Sizing

  • Shard Size: Recommended shard size is between 20 GB and 50 GB.

  • Shard Alignment: Align the number of shards with the number of nodes (shards should be a multiple of nodes).

  • Heap Size: Maximum heap size of 30.5 GB per node.

  • Shard Capacity: At 20 shards per GB of heap, a node with the maximum 30.5 GB heap can handle 30.5 × 20 = 610 shards, so plan for a maximum of approximately 600 shards per node.

  • Elasticsearch Limit: Elasticsearch has a default limit of 1,000 shards per node (the cluster.max_shards_per_node setting). It is advisable to stay below 600 shards per node for optimal performance.
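
The sizing arithmetic above can be sketched as a small shell script. This is only an illustration built from the figures in this section (30.5 GB heap, 20 shards per GB of heap, shards of at most 50 GB, shard count a multiple of the node count). The function names and the 120 GB, 2-node example are assumptions made for illustration and are not part of Swarm or Elasticsearch.

```shell
#!/bin/sh
# Shard budget sketch using the figures from this section.
# Function names are illustrative and not part of Swarm or Elasticsearch.

# Shard ceiling for one node: heap size (GB) * shards per GB of heap.
max_shards_per_node() {
  awk -v heap="$1" -v per_gb="${2:-20}" 'BEGIN { printf "%d\n", heap * per_gb }'
}

# Fewest primary shards that keep every shard at or below max_gb (default 50),
# rounded up to a multiple of the node count so shards align with nodes.
shards_for_index() {
  awk -v size="$1" -v nodes="$2" -v max_gb="${3:-50}" 'BEGIN {
    n = int(size / max_gb)
    if (n * max_gb < size) n++                     # cover the remainder
    if (n < 1) n = 1
    if (n % nodes != 0) n += nodes - (n % nodes)   # align with node count
    print n
  }'
}

max_shards_per_node 30.5   # 30.5 GB heap * 20 shards per GB of heap
shards_for_index 120 2     # hypothetical 120 GB index on 2 nodes
```

Run as written, the script prints 610 (the basis for the "approximately 600" guideline) and 4, which gives 30 GB per shard for the hypothetical 120 GB index and stays inside the recommended 20 GB to 50 GB range.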

When to Use Per Domain Index (PDI)

Per Domain Indexing (PDI) is ideal for customers with multi-tenant environments or large domain-specific workloads. PDI is recommended in the following scenarios:

  1. Multi-Tenant Setups: If the customer has multiple clients, departments, or tenants, and each requires isolated and independent indexing, PDI ensures that search and metadata operations are scoped only to that particular domain. This enhances performance and simplifies management for each tenant. For example, a service provider offers object storage to multiple clients, and each client needs to search and index its data independently.

  2. Domain-Specific Workloads: For customers who organize their data into domains and need domain-level indexing for better search performance, PDI keeps each index smaller and more manageable, which makes searches within a domain more efficient. For example, a large enterprise divides storage into domains for different departments, projects, or users, and each domain may require its own indexing structure for customized search capabilities.

  3. High Search Query Performance Needs: If the customer frequently runs complex search queries or metadata lookups within specific domains, PDI makes those operations faster and more efficient by scoping each index to an individual domain rather than the entire cluster. For example, a media company needs to quickly search through content metadata stored within specific domains for rapid content retrieval.

  4. Heavy Index Operations: If the customer performs many index-heavy operations, such as frequent updates to metadata, PDI reduces the load on the overall system by distributing index operations across different domain indexes rather than centralizing all updates in one index. For example, an e-commerce platform updates product catalogs frequently, and each catalog (domain) needs fast indexing for search and retrieval.

When Not to Use Per Domain Index (PDI)

Here are some scenarios where PDI is not an ideal approach and a global index is preferable for customers:

  1. Small Deployments or Low Indexing Needs: For customers with small clusters, low object counts, or low volumes of search or metadata queries, managing individual domain indexes adds unnecessary complexity. A global index will suffice in such cases. For example, small businesses or customers with limited object storage usage may find that the overhead of maintaining separate domain indexes outweighs the benefits.

  2. Single-Tenant / Homogeneous Data Environments: If a customer only uses a single domain, there is no need for PDI. However, if there are multiple domains, even with homogeneous data, PDI may still be beneficial as it helps manage index growth over time and improves performance.

Implementing Per Domain Index (PDI) Without Downtime

  1. Let the new search feed reach 100% after PDI is enabled: When PDI is enabled, a new set of indices is created, each corresponding to a specific domain. The new search feed must be fully populated (100%) before you switch to those indices, so that all contexts and objects are indexed without any gaps or missing data.

  2. Once completed, set the new search feed as the default: When the new search feed is at 100%, set it as the default. Remember to verify indexerHosts in gateway.cfg and restart the Gateways whenever the default search feed is changed. The cluster then serves listings from the per-domain indices. This switch should be seamless, so users experience no disruption.

  3. Clean up the previous index (created without PDI) as needed: The old index, which stored data without per-domain segregation, is now redundant. Cleaning it up can free up resources and reduce storage costs.

Implementing Per Domain Index (PDI) With Downtime

  1. Set the new search feed as the default after enabling PDI: If downtime is acceptable, the process is simpler. You can switch to the new search feed immediately after enabling PDI, even if the new indices are not fully populated. Users will experience downtime or incomplete search results until reindexing is complete.

  2. The new index will be temporarily unavailable for listing: During reindexing, the new domain indices may not be fully operational. Users may experience missing or incomplete data in search results.

  3. Listing will resume after the new search feed reaches 100% completion: Once the reindexing is complete, the search functionality will return to full capacity with complete and accurate listings.

Disabling Per Domain Index

  1. Set search.perDomainIndex = False and delete the associated search feed: To disable PDI, set the search.perDomainIndex configuration setting to False. This setting controls whether the next search feed that is created uses per-domain indices. After changing it, delete all per-domain search feeds, which also deletes all the per-domain indices.

  2. Create a new search feed: Since search.perDomainIndex = False, this will use a single index for all domains. This simplifies data management but loses the benefits of domain-specific indexing.

Limitations

  • Increased Complexity: Managing multiple indexes can complicate the architecture of the search engine or database system. It requires more sophisticated algorithms and infrastructure to handle indexing and searching across domains.

  • Resource Intensive: Each index requires storage space and processing power. Maintaining multiple indexes can lead to higher resource consumption, increasing costs for hardware, maintenance, and energy.

  • Index Synchronization: Keeping indexes up-to-date across multiple domains can be challenging. Changes in the data must be reflected in all relevant indexes, which can introduce delays or errors.

  • Scalability Issues: As the number of domains increases, scaling the infrastructure to support numerous indexes can become difficult. Performance may suffer if not properly managed.

How to Determine Whether PDI Is Working

Search Feed Schema

curl -X GET "http://ESNODE:9200/index_{CLUSTERNAME}{feed-id}*/_mappings?pretty"

  • When search.perDomainIndex = False, the schema will have a null_value of "2.1".

  • When search.perDomainIndex = True, the schema will have a null_value of "3.0".
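
As a convenience, the null_value check can be scripted rather than read from the full mapping by eye. The sketch below is a minimal example and assumes only what is stated above: that the schema version appears as a quoted null_value string in the _mappings output. The helper name schema_null_values is made up for illustration. If other fields in the mapping also define a null_value, those values are listed too.

```shell
#!/bin/sh
# Reads the _mappings JSON on stdin and prints each distinct null_value string.
# schema_null_values is an illustrative helper name, not a Swarm tool.
schema_null_values() {
  grep -o '"null_value"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*"\([^"]*\)"$/\1/' |
    sort -u
}

# Usage (ESNODE, {CLUSTERNAME} and {feed-id} are the placeholders used above):
#   curl -s "http://ESNODE:9200/index_{CLUSTERNAME}{feed-id}*/_mappings?pretty" | schema_null_values
```

A result that includes 3.0 indicates a per-domain schema, and 2.1 indicates a single global index.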

Per Domain Index

curl -X GET "http://ESNODE:9200/_cat/indices?s=index" | grep '{CLUSTERNAME}{feed-id}'

This command lists the indices that belong to the search feed. With PDI enabled, there is a separate index for each domain in the cluster.
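
Because PDI creates one index per domain, it can help to total the indices and shards for a feed and compare the result with the shard guidance under Hardware Sizing. The following is a rough sketch that assumes the index names contain the {CLUSTERNAME}{feed-id} string, as in the grep above. It relies on the index, pri, and rep columns of the Elasticsearch _cat/indices API. The helper name feed_index_summary is hypothetical.

```shell
#!/bin/sh
# Reads "index pri rep" rows on stdin and prints "<index count> <total shards>"
# for index names containing the given feed string.
# Total shards = primaries * (1 + replicas). Illustrative helper only.
feed_index_summary() {
  awk -v feed="$1" 'index($1, feed) > 0 { n++; shards += $2 * (1 + $3) }
                    END { print n + 0, shards + 0 }'
}

# Usage (placeholders as in the commands above):
#   curl -s "http://ESNODE:9200/_cat/indices?h=index,pri,rep&s=index" | feed_index_summary '{CLUSTERNAME}{feed-id}'
```

Dividing the total shard count by the number of Elasticsearch nodes gives an approximate shards-per-node figure to compare with the 600-shard guideline.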

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.