Overview

With Swarm 16.1.2 and Gateway 8.1.0, we introduced a new feature called “Per Domain Index” (PDI). An index is created for each domain (a domain being a logical grouping of objects, similar to a namespace), so metadata from each domain is stored in a separate index within the Elasticsearch database. This allows for more efficient searching, querying, and retrieval of objects within a specific domain. It is enabled by setting search.perDomainIndex=True before creating a Search Feed. The commands below set it along with search.numberOfShards:

scsctl storage config set -d "search.perDomainIndex=true"

scsctl storage config set -d "search.numberOfShards=5"

Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.

Key benefits of Per Domain Index include:

  1. Improved Search Performance: By indexing objects within a domain, searches can be limited to the domain, reducing the search scope and speeding up retrieval.

  2. Granular Data Management: Each domain can have its own index, allowing administrators to manage indexing based on domain-specific needs.

  3. Scalability: As the number of objects grows, having separate indices for each domain helps in scaling the system without performance degradation.

Per Domain Index is especially useful in multi-tenant environments, where each tenant (domain) may need separate indexing to optimize their storage and access.

With Per Domain Index (PDI), customers are separated by different domains and use distinct indices. As a result, the latency for each domain is directly related to the number of documents in its respective index, allowing for more efficient performance management.

This should only be enabled with Support guidance. The number of domains that the Swarm cluster can support is limited by the number of shards each Elasticsearch node can hold.

Prerequisites

  • Gateway 8.1.0 or higher

  • Swarm Storage 16.1.2 or higher

Hardware Sizing

  • Shard Size: Recommended between 20 GB and 50 GB.

  • Shard Alignment: Align the number of shards with the number of nodes (shards should be a multiple of nodes).

  • Heap Size: Maximum heap size of 30.5 GB per node.

  • Shard Capacity: At roughly 20 shards per GB of heap, a node with a 30.5 GB heap can handle 30.5 * 20 = 610 shards, or approximately 600 shards per node.

  • Elastic Limit: Elasticsearch has a default maximum shard limit of 1,000 per node. It is advisable to stay below 600 shards per node for optimal performance.
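The sizing rules above can be turned into a rough shard budget. The following is a minimal sketch, not an official sizing tool: the function names are hypothetical, and it assumes every domain gets one index with a fixed number of primary shards plus a fixed number of replicas, which may not match your feed settings.

```shell
#!/usr/bin/env bash
# Rough shard-budget arithmetic for the sizing rules above.
# Assumption (not from the product docs): each domain gets one index with
# a fixed count of primary shards and replica copies of each primary.

# Heap-based ceiling: heap size in GB * 20 shards per GB of heap.
max_shards_per_node() {
  awk -v heap="$1" 'BEGIN { printf "%d\n", heap * 20 }'
}

# Upper bound on domains: total shard budget / shards consumed per domain.
# Args: <es_nodes> <shard_budget_per_node> <primary_shards> <replicas>
max_domains() {
  echo $(( $1 * $2 / ($3 * (1 + $4)) ))
}

max_shards_per_node 30.5   # 30.5 GB heap
max_domains 3 600 5 1      # 3 nodes, 600 shards/node, 5 primaries, 1 replica
```

With these example inputs the script prints 610 and 180. Treat the second figure as an upper bound only, since other indices in the same Elasticsearch cluster also consume shards.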

How to Implement/Enable in Production

Implementing PDI requires the following configurations:

  • Set search.numberOfShards = 5 to optimize performance and avoid exceeding the 600 shards per Elasticsearch node limit.

  • Set search.perDomainIndex = True in the cluster configuration before creating the Search Feed.

  • Gateway version 8.1.0 or higher.


...

  • Create separate indices for each domain. For example, domain1_index, domain2_index, etc.

  • Store data from multiple domains in a single index with a domain identifier. For example, { "domain": "domain1", "content": "..." }.

Implementing Per Domain Index (PDI) Without Downtime

  1. Keep the new search feed index at 100% after PDI is enabled: When enabling PDI, a new set of indices is created, each corresponding to a specific domain. The term "search feed" refers to the data being indexed. The new indices need to be fully populated (100%) before switching to them. This means that all documents should be indexed without any gaps or missing data.
    Implementation Steps:

    1. Enable PDI in the system settings.

    2. Create new indices for each domain dynamically based on predefined templates or configurations.

    3. Reindex all existing documents from the old, non-PDI index to the new domain-specific indices.

    4. Monitor the progress of the reindexing process until all indices reflect 100% of the expected documents.

  2. Once completed, set the new search feed as the default: After all domain-specific indices are fully populated, the system should switch to using these indices for search queries. This switch should be seamless to ensure that users experience no disruption.

    • Implementation Steps:

      1. Confirm that reindexing is complete for all domain indices.

      2. Update the search configuration to point to the new Per Domain Indices.

      3. Conduct tests to ensure that search results from the new indices are accurate and complete.

  3. Clean up the previous index (created without PDI) as needed: The old index, which stored data without per-domain segregation, is now redundant. Cleaning it up can free up resources and reduce storage costs.

    • Implementation Steps:

      1. Verify that the new indices are performing correctly and no issues exist.

      2. Take a final backup of the old index, if necessary, before deletion.

      3. Delete or archive the old index to remove unnecessary data.
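One way to cross-check reindexing progress before switching feeds is to compare document counts between the old index and the new per-domain indices. The sketch below is illustrative only: ES_URL, OLD_INDEX, and NEW_PATTERN are placeholders you would supply, the helper names are hypothetical, and document counts are a rough signal rather than a replacement for the feed's own progress reporting.

```shell
#!/usr/bin/env bash
# Cross-check reindex progress by comparing document counts.
# ES_URL, OLD_INDEX and NEW_PATTERN are placeholders set by the operator.

# Sum the docs.count column of `_cat/indices?h=index,docs.count` output (stdin).
sum_docs() {
  awk '{ total += $2 } END { printf "%.0f\n", total }'
}

# Integer percentage of new-index documents relative to the old index.
percent_indexed() {
  local old=$1 new=$2
  if [ "$old" -eq 0 ]; then
    echo 100
  else
    echo $(( new * 100 / old ))
  fi
}

# Total document count for an index name or wildcard pattern.
docs_in() {
  curl -s "${ES_URL}/_cat/indices/$1?h=index,docs.count" | sum_docs
}

# Only query the cluster when all three placeholders are set.
if [ -n "${ES_URL:-}" ] && [ -n "${OLD_INDEX:-}" ] && [ -n "${NEW_PATTERN:-}" ]; then
  percent_indexed "$(docs_in "$OLD_INDEX")" "$(docs_in "$NEW_PATTERN")"
fi
```

For example, setting NEW_PATTERN to the feed prefix followed by `*` sums the counts across all per-domain indices of the new feed. Counts can differ slightly while writes are still arriving, so a value near 100 is a prompt to verify, not proof of completion.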

Implementing Per Domain Index (PDI) With Downtime

  1. Set the new search feed as the default after enabling PDI: If downtime is acceptable, the process is simpler. You can switch to the new indices immediately after enabling PDI, even if they aren’t fully populated. Users will experience downtime or incomplete search results until reindexing is complete.
    Implementation Steps:

    1. Enable PDI and create new domain-specific indices.

    2. Immediately update the search configuration to use these new indices.

    3. Reindex documents into the new indices while search may be temporarily affected.

  2. The new index will be temporarily unavailable for listing: During reindexing, the new domain indices may not be fully operational. Users may experience missing or incomplete data in search results.
    Implementation Steps:

    1. Inform users about the planned downtime and potential service disruption.

    2. Monitor the reindexing process to track progress and identify any issues.

  3. Listing will resume after the new search feed reaches 100% completion from the reindexing process: Once the reindexing is complete, the search functionality will return to full capacity with complete and accurate listings.
    Implementation Steps:

    1. Monitor reindexing progress and ensure that all expected documents are indexed.

    2. Conduct validation checks to ensure data integrity and accuracy.

    3. Notify users when the service is fully restored.

Disabling Per Domain Index

  1. Set search.perDomainIndex = False and delete the associated Search Feed: To disable PDI, set search.perDomainIndex, the setting that controls whether the system uses per-domain indices, to False. After disabling, delete all per-domain indices to consolidate data back into a single index.
    Implementation Steps:

    1. Set the configuration search.perDomainIndex to False in your application or search engine settings.

    2. Delete or archive the Per Domain Indices as they are no longer needed.

  2. All Per Domain Indices will be removed, leaving a single index operational: After disabling PDI, the system will revert to using a single index for all domains. This simplifies data management but loses the benefits of domain-specific indexing.
    Implementation Steps:

    1. Ensure that the single index is configured to handle all domain data.

    2. Reindex documents from the per-domain indices back into the single consolidated index, if necessary.

  3. The number of shards will remain unchanged, so there won’t be any issues with single indexing: The shard count configuration is typically set at the cluster level and doesn’t change when switching from PDI to a single index. This ensures that the system’s capacity for handling search queries remains stable.
    Implementation Steps:

    1. Review the shard configuration of your search engine to confirm it is optimal for a single index.

    2. Adjust the shard count if necessary to balance performance and resource usage.

Limitations

  • Customers with multiple domains and large datasets may experience delays in search operations.

  • A single large index for all domains can lead to slower search performance.

  • Increased Complexity: Managing multiple indexes can complicate the architecture of the search engine or database system. It requires more sophisticated algorithms and infrastructure to handle indexing and searching across domains.

  • Resource Intensive: Each index requires storage space and processing power. Maintaining multiple indexes can lead to higher resource consumption, increasing costs for hardware, maintenance, and energy.

  • Data Duplication: Some data may be relevant to multiple domains, leading to potential duplication across different indexes. This can result in inconsistencies and increased storage requirements.

  • Index Synchronization: Keeping indexes up-to-date across multiple domains can be challenging. Changes in the data must be reflected in all relevant indexes, which can introduce delays or errors.

  • Search Performance: While Per Domain Index can improve search performance for specific queries, it may degrade overall performance when a query spans multiple domains. This can result in longer query times as the system needs to aggregate results from various indexes.

  • Scalability Issues: As the number of domains increases, scaling the infrastructure to support numerous indexes can become difficult. Performance may suffer if not properly managed.

  • Complex Querying: Queries to access information from multiple domains can become more complex. Users may have to specify which domain to search within, leading to a less user-friendly experience.

  • Difficulty in Aggregating Insights: Analyzing data trends across multiple domains can be more complicated, as the separation of indexes may obscure holistic insights.

  • Limited Inter-Domain Relationships: Understanding relationships and connections between different domains may be more challenging when they are indexed separately, limiting the ability to perform cross-domain analytics.

  • Potential for Over-Optimization: Focusing too much on optimizing for specific domains may lead to neglecting broader optimization strategies that could enhance overall system performance.

How to Determine PDI is Working

Search Feed Schema

Code Block
languagebash
curl -X GET "http://ESNODE:9200/index_{CLUSTERNAME}{feed-id}*/_mappings?pretty"

...

This command returns the mappings for every index that matches the feed prefix, which lists the indices created for each domain in the cluster.

ES Node Template

Code Block
languagebash
curl -s -X GET "http://ESNODE:9200/_template?pretty" | grep '{CLUSTERNAME}{feed-id}'

This command will check if the cluster template is visible on the ES node.
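As an additional check, you can count how many indices share the feed prefix; with PDI active you would expect more than one. This is a hedged sketch: ES_URL and FEED_PREFIX (standing in for index_{CLUSTERNAME}{feed-id}) are placeholders, and count_feed_indices is a hypothetical helper, not part of Swarm or Elasticsearch.

```shell
#!/usr/bin/env bash
# Count the indices that belong to one search feed.
# FEED_PREFIX stands for index_{CLUSTERNAME}{feed-id}; ES_URL is the ES node URL.

# Count lines of `_cat/indices?h=index` output (stdin) that start with the prefix.
count_feed_indices() {
  awk -v p="$1" 'index($1, p) == 1 { n++ } END { print n + 0 }'
}

# Only query the cluster when both placeholders are set.
if [ -n "${ES_URL:-}" ] && [ -n "${FEED_PREFIX:-}" ]; then
  curl -s "${ES_URL}/_cat/indices/${FEED_PREFIX}*?h=index" \
    | count_feed_indices "$FEED_PREFIX"
fi
```

A result of 1 suggests the feed is still writing to a single index, while a larger number is consistent with one index per domain. How the count maps to domains depends on how the feed names its indices, so compare it against the number of domains you expect.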

Split Index Utility

For customers with indexes exceeding 50 GB, it is recommended to use the split index utility to enhance performance.
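To find candidates for splitting, the Elasticsearch `_cat/indices` API can report the primary store size of each index. The sketch below only identifies oversized indices and does not invoke the split index utility; ES_URL is a placeholder and oversized_indices is a hypothetical helper.

```shell
#!/usr/bin/env bash
# List indices whose primary store size exceeds a threshold in GB.
# ES_URL is a placeholder for the Elasticsearch node URL.

# Read "<index> <size_in_gb>" lines on stdin and print names above the limit.
oversized_indices() {
  awk -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}

# Only query the cluster when ES_URL is set.
if [ -n "${ES_URL:-}" ]; then
  curl -s "${ES_URL}/_cat/indices?h=index,pri.store.size&bytes=gb" \
    | oversized_indices 50
fi
```

Because `bytes=gb` reports whole gigabytes, an index slightly above 50 GB may be shown as 50 and not be flagged; lower the threshold if you want earlier warning.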