OpenSearch is a distributed, open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a managed service that makes it easy to secure, deploy, and operate OpenSearch clusters at scale. Amazon OpenSearch Service provides a broad range of cluster configurations to meet your use cases. In 2021, we released automated memory management under Auto-Tune. Auto-Tune is an adaptive resource management system in Amazon OpenSearch Service that continuously monitors incoming workloads and optimizes cluster resources to improve efficiency and performance.
Today, we’re excited to announce the release of admission control for Auto-Tune. Admission control in Amazon OpenSearch Service enhances the overall resiliency of OpenSearch clusters by limiting new incoming requests early, at the REST layer, when a node is stressed. This mechanism prevents potential node failures and cascading effects on the cluster.
Overview of admission control
Admission control acts like a lever to regulate traffic based on cluster state. It does so by allocating tokens for each OpenSearch request, based on predicted resource usage. It releases the tokens when the process is complete. After all the tokens are acquired, any additional requests to the node are throttled with a “too many requests” exception until tokens are available again for request processing. In some cases, an operator can utilize admission control to completely shut down traffic and prevent frequent node drops until a certain condition is met, such as shards being assigned.
Admission control is a gatekeeper for nodes, limiting the number of requests processed to a node based on its current capacity.
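The token mechanism described above can be illustrated with a small sketch. This is not the OpenSearch implementation: the class name `TokenAdmissionController`, the `TooManyRequests` exception, and the idea of passing a precomputed `predicted_cost` are all hypothetical, chosen only to show the acquire, release, and reject cycle.

```python
import threading


class TooManyRequests(Exception):
    """Stands in for the "too many requests" rejection (hypothetical name)."""


class TokenAdmissionController:
    """Hypothetical token pool: a request is admitted only if its tokens fit."""

    def __init__(self, total_tokens: int) -> None:
        self._total = total_tokens
        self._in_use = 0
        self._lock = threading.Lock()

    def acquire(self, predicted_cost: int) -> int:
        """Reserve tokens for a request, or reject it immediately."""
        with self._lock:
            free = self._total - self._in_use
            if predicted_cost > free:
                raise TooManyRequests(
                    f"need {predicted_cost} tokens, only {free} free"
                )
            self._in_use += predicted_cost
            return predicted_cost

    def release(self, tokens: int) -> None:
        """Return tokens to the pool once request processing is complete."""
        with self._lock:
            self._in_use = max(0, self._in_use - tokens)

    @property
    def available(self) -> int:
        with self._lock:
            return self._total - self._in_use
```

In this sketch a caller would wrap request handling in `acquire` and `release` (for example in a `try`/`finally` block), so that tokens are returned even when processing fails. Rejecting immediately instead of queueing mirrors the idea of limiting requests early, before they consume node resources.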
Admission control prevents Amazon OpenSearch Service domains from being overloaded by both steady increases and surges in traffic. It’s resource-aware, so it tunes the cluster based on the incoming request cost (the content length of the request payload) and the point-in-time state of the node (overall Java Virtual Machine (JVM) memory pressure). This awareness enables real-time, state-based admission control on the node. Admission control for Auto-Tune is available in all AWS Regions on domains running OpenSearch 1.0, or Elasticsearch 6.7 and higher.
By default, admission control throttles _bulk requests when JVM memory pressure and request size thresholds are breached.
- JVM memory pressure threshold – Admission control keeps track of the current state of JVM memory pressure and throttles incoming requests based on a preconfigured JVM memory pressure threshold. When the threshold is breached, all configured _bulk requests are throttled until the memory is released on the node and memory pressure is below the threshold.
- Request size threshold – The size of a particular request is determined by its content length. Admission control keeps track of in-flight requests and allocates tokens to every request based on this content length. Admission control then throttles incoming requests based on memory occupancy when the aggregated size of in-flight requests breaches the preconfigured threshold. All new _bulk requests are throttled until the in-flight requests complete, relinquishing the quota to be occupied by new requests.
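The two default checks can be combined into a single decision, sketched below. The names `NodeState` and `should_throttle_bulk` are hypothetical, and the sketch assumes the request size threshold is a percentage of the JVM heap, which is an interpretation and not something the service documents here. The JVM memory pressure threshold is left as a required parameter because its actual value is not stated.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical point-in-time snapshot of a node."""

    jvm_memory_pressure_pct: float  # current JVM memory pressure, 0-100
    heap_bytes: int                 # total JVM heap (assumed basis for the size budget)
    in_flight_bytes: int = 0        # summed content length of in-flight requests


def should_throttle_bulk(
    node: NodeState,
    content_length: int,
    jvm_pressure_threshold_pct: float,
    request_size_threshold_pct: float = 10.0,
) -> bool:
    """Return True if a new _bulk request should be rejected."""
    # Check 1: the node is already above the JVM memory pressure threshold.
    if node.jvm_memory_pressure_pct > jvm_pressure_threshold_pct:
        return True
    # Check 2: admitting this request would push the aggregated in-flight
    # size past the request size budget (assumed to be a share of the heap).
    size_budget = node.heap_bytes * request_size_threshold_pct / 100.0
    return node.in_flight_bytes + content_length > size_budget
```

Either check alone is enough to throttle in this sketch, which keeps the decision cheap: it needs only the request's content length and two node-level counters.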
How Auto-Tune works
Auto-Tune uses performance and usage metrics from OpenSearch clusters to suggest memory-related configuration changes to improve cluster speed and stability. You can view its recommendations on the Amazon OpenSearch Service console. Admission control is a non-disruptive change, meaning that the changes can be applied without rebooting the node.
Admission control’s predefined request size threshold of 10% satisfies most use cases. However, Auto-Tune can now dynamically increase and decrease the default threshold, typically between 5–15%, based on the amount of JVM memory that is currently occupied on the system. Request size threshold auto-tuning is enabled by default when you enable Auto-Tune.
Auto-Tune currently doesn’t tune the JVM memory pressure threshold.
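To make the request size threshold tuning concrete, here is a minimal sketch of one way a threshold could move within the 5–15% range. The actual tuning logic is not described in this post, so the function name, the linear mapping, and the direction (a fuller JVM gives a lower threshold) are all assumptions made for illustration only.

```python
def tuned_request_size_threshold(
    jvm_occupied_pct: float,
    low_pct: float = 5.0,
    high_pct: float = 15.0,
) -> float:
    """Map JVM occupancy (0-100) to a request size threshold in percent.

    Assumed behavior, not the service's algorithm: the threshold falls
    linearly from high_pct (empty JVM) to low_pct (fully occupied JVM).
    """
    # Clamp the input so out-of-range readings cannot leave the 5-15% band.
    occupancy = min(max(jvm_occupied_pct, 0.0), 100.0)
    return high_pct - (high_pct - low_pct) * occupancy / 100.0
```

Under these assumptions, a half-occupied JVM yields the 10% default, and the result always stays inside the configured band.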
Monitoring admission control
Amazon OpenSearch Service sends two Auto-Tune metrics to Amazon CloudWatch, including AutoTuneFailed. Each metric contains a sub-category called AutotuningType, which indicates the specific type of change in question. Admission control adds a new type to this sub-category.
Admission control introduces request-based rejections of _bulk requests when there are too many requests or JVM usage is high, breaching thresholds. This prevents the nodes from running into cascading effects of failures arising due to the following:
- Surges in traffic – Sudden surges or spikes in request traffic, leading to quick buildup in usage across the nodes
- Skew in shard distribution – Improper distribution of shards, leading to hot spots and bottlenecks, affecting the overall performance
- Slow nodes – An entire data node starts to slow down because of degraded hardware (such as disk or network volumes) or software bugs
Stay tuned for more exciting updates about Amazon OpenSearch Service and features.
About the Authors
Saurabh Singh is a Senior Software Engineer working on AWS OpenSearch at Amazon Web Services. He is passionate about solving problems related to data retrieval and large-scale distributed systems. He is an active contributor to OpenSearch.