In large-scale distributed systems, users interact with data through both programmatic APIs and user-friendly web consoles. Regardless of the interface, every incoming request usually passes through a proxy layer that ensures secure, reliable, and efficient routing. Envoy is one such high-performance edge and service proxy, and is often the core of this layer.
Envoy is widely adopted in cloud-native systems and handles not just routing but also observability, load balancing, and authentication. It isn’t deployed as a single instance, but rather as a distributed set of containerized services. This architecture provides scalability, fault isolation, and efficient resource utilization, making Envoy a natural choice for latency-sensitive workloads such as payment gateways, storage backends, or real-time applications.
In these systems, resilience is as important as raw speed. Even a few milliseconds of added latency, or an outage in a dependent service, can cascade into large-scale failures. This tutorial will show you how to configure Envoy for resilience, tune it for low latency, and validate performance under real-world conditions.
This tutorial covers essential strategies for optimizing Envoy proxy performance and resilience in production environments, including filter chain tuning, fail-open versus fail-close authorization, and load testing with Nighthawk.
Reducing latency in Envoy requires optimizations across filter chains, caching, service placement, resource provisioning, and configuration management.
Envoy processes requests through filter chains and each filter adds overhead. Poorly designed chains increase request latency.
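As a rough illustration, a chain that carries only the filters a route actually needs keeps per-request overhead low. The sketch below shows a minimal HTTP filter chain containing just the terminal router filter; any filter you add ahead of it (auth, Lua, tap, and so on) costs time on every request:

```yaml
# Minimal sketch: a filter chain with only the router filter,
# so no extra per-request processing is added.
http_filters:
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```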
When Envoy relies on an external authorization service, you must decide how to handle failures of that service. The `ext_authz` filter controls this with the `failure_mode_allow` flag:
```yaml
http_filters:
- name: envoy.filters.http.ext_authz
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
    failure_mode_allow: true  # true = fail-open, false = fail-close
    http_service:
      server_uri:
        uri: auth.local:9000
        cluster: auth_service
        timeout: 0.25s
```
This YAML configuration defines an external authorization filter for Envoy proxy and determines how authentication failures are handled. Here’s what each section does:

**Filter declaration:**

- `http_filters`: Declares this as an HTTP filter in Envoy’s filter chain.
- `envoy.filters.http.ext_authz`: The specific filter for external authorization.

**Filter configuration:**

- `typed_config`: Specifies the configuration type using Protocol Buffers.
- `@type`: Points to the ExtAuthz v3 API definition.

**Critical resilience setting:**

- `failure_mode_allow`: This is the key decision point for resilience:
  - `true` (fail-open): If the auth service is down or unreachable, allow requests to proceed.
  - `false` (fail-close): If the auth service is down or unreachable, block all requests.

**External service configuration:**

- `uri`: Points to the external authorization service at `auth.local:9000`.
- `cluster`: References a cluster named `auth_service` (likely defined elsewhere in the config).
- `timeout`: Sets a 250 ms timeout for auth service calls.

When a call to `auth.local:9000` fails:

- `failure_mode_allow: true` → the request continues (fail-open).
- `failure_mode_allow: false` → the request is blocked (fail-close).

In short, fail-open (`true`) means requests continue if the auth service fails, prioritizing uptime at the cost of security; fail-close (`false`) means requests are blocked, prioritizing security at the risk of downtime. Envoy enforces the behavior, making it the decision point for availability vs. security.
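The `auth_service` cluster referenced by the filter must be defined elsewhere in the same Envoy config. A minimal sketch is shown below; the endpoint address is assumed to match the `auth.local:9000` URI from the filter, and your deployment may use a different discovery type:

```yaml
# Hypothetical cluster definition backing the ext_authz filter above.
clusters:
- name: auth_service
  type: STRICT_DNS
  connect_timeout: 0.25s
  load_assignment:
    cluster_name: auth_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: auth.local
              port_value: 9000
```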
Fail-open and fail-close are not just config flags; they represent operational philosophies. Operators can choose the right mode per service depending on business risk.
Configuration changes must be validated under realistic conditions. Nighthawk is Envoy’s dedicated load testing tool, designed to simulate real-world traffic patterns and record latency metrics.
Use Docker to run Nighthawk against your Envoy deployment:
```bash
docker run --rm envoyproxy/nighthawk --duration 30s http://localhost:10000/
```
This generates sustained load, recording throughput, latency distributions, and error rates.
Nighthawk provides comprehensive performance metrics that are crucial for understanding Envoy’s behavior under load:
Requests per second (RPS): Measures throughput capacity, i.e., the number of requests Envoy can process per second. Higher RPS indicates better performance, but it’s important to balance throughput with latency requirements.
Latency percentiles (e.g., p50, p95, p99): Show how response times are distributed. The median (p50) reflects the typical user experience, while the tail percentiles (p95, p99) expose the slow requests that often matter most in latency-sensitive systems.
Error percentage under load: The percentage of failed requests when the system is under stress. This metric helps identify when Envoy starts failing and at what load threshold resilience mechanisms activate.
These metrics work together to provide a complete picture of Envoy’s performance characteristics, helping you identify bottlenecks, validate resilience configurations, and ensure your system can handle production traffic patterns.
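To make the percentile metrics concrete, here is a small Python sketch (not part of Nighthawk) that computes p50/p95/p99 from a list of measured latencies using the nearest-rank method; the sample values are hypothetical:

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * N), expressed with floor division.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds.
latencies_ms = [5, 3, 6, 4, 45, 5, 7, 4, 6, 12, 5, 8, 3, 9, 6, 4, 10, 7, 18, 5]

print("p50:", percentile(latencies_ms, 50))  # 6
print("p95:", percentile(latencies_ms, 95))  # 18
print("p99:", percentile(latencies_ms, 99))  # 45
```

Note how a single 45 ms outlier is invisible at p50 but dominates p99; this is why tail percentiles, not averages, are the right lens for resilience work.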
Running these tests validates whether Envoy is resilient under both expected and extreme conditions.
Fail-open and fail-close are two resilience strategies for handling external service failures in Envoy:
Fail-open (`failure_mode_allow: true`): When the external authorization service fails, Envoy allows requests to proceed. This prioritizes availability over security and is suitable for non-critical services where uptime is more important than strict access control.
Fail-close (`failure_mode_allow: false`): When the external authorization service fails, Envoy blocks all requests. This prioritizes security over availability and is recommended for sensitive services where unauthorized access could cause significant damage.
The choice depends on your business requirements and risk tolerance. Payment systems typically use fail-close, while public APIs might use fail-open.
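The semantics can be sketched as a tiny model. This is an illustration of the decision logic, not Envoy’s actual implementation; `AuthResult` and `should_forward` are hypothetical names:

```python
from enum import Enum

class AuthResult(Enum):
    ALLOWED = "allowed"          # auth service responded: permit
    DENIED = "denied"            # auth service responded: reject
    UNREACHABLE = "unreachable"  # auth service down or timed out

def should_forward(result: AuthResult, failure_mode_allow: bool) -> bool:
    """Model of the ext_authz decision for a single request."""
    if result is AuthResult.ALLOWED:
        return True
    if result is AuthResult.DENIED:
        return False  # an explicit denial is always honored
    # Auth service unreachable: the flag decides the outcome.
    return failure_mode_allow  # True = fail-open, False = fail-close

# Fail-open keeps traffic flowing during an auth outage;
# fail-close blocks it.
assert should_forward(AuthResult.UNREACHABLE, failure_mode_allow=True)
assert not should_forward(AuthResult.UNREACHABLE, failure_mode_allow=False)
```

Note that the flag only matters in the `UNREACHABLE` case: an explicit denial from a healthy auth service is enforced in either mode.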
Use Nighthawk, Envoy’s dedicated load testing tool, to measure performance metrics:
```bash
docker run --rm envoyproxy/nighthawk --duration 30s http://localhost:10000/
```
Key metrics to monitor include requests per second (RPS), latency percentiles (p50, p95, p99), and the error percentage under load.
Set up continuous monitoring with tools like Prometheus and Grafana to track these metrics in production.
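Envoy’s admin interface can expose metrics in Prometheus format at `/stats/prometheus`. A minimal scrape job might look like the sketch below; the admin port `9901` and the job name are assumptions to adjust for your deployment:

```yaml
# Hypothetical Prometheus scrape job for Envoy's admin endpoint.
scrape_configs:
- job_name: envoy
  metrics_path: /stats/prometheus
  static_configs:
  - targets: ['localhost:9901']  # Envoy admin address
```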
Optimizing Envoy filter chains starts with keeping the chain short: since every filter runs on each request, remove filters a route doesn’t need, place cheap filters that can reject requests early ahead of expensive ones, and set tight timeouts on filters that call external services.
Regular performance testing helps identify which filters are causing latency issues.
The choice depends on your service’s criticality and business requirements. Consider factors like the cost of downtime versus the cost of unauthorized access, compliance requirements, and your overall risk tolerance.
Envoy integrates well with observability tools such as Prometheus for metrics collection and Grafana for dashboards. Set up comprehensive monitoring that covers throughput (RPS), latency percentiles, and error rates.
This observability stack helps you identify performance bottlenecks and validate that your resilience configurations are working as expected.
Envoy isn’t just a proxy: it’s the critical decision point where trade-offs between availability and security are enforced in your microservices architecture. This guide has shown you how to configure fail-open and fail-close behavior with the `ext_authz` filter, reduce latency in your filter chains, and validate performance under load with Nighthawk.
The strategies covered here, from filter optimization to resilience testing, provide a solid foundation for running Envoy in production environments where every millisecond matters. Remember that the right approach depends on your specific use case, compliance requirements, and business risk tolerance.
Ready to implement these strategies? Start with DigitalOcean’s managed Kubernetes service to deploy your Envoy-powered microservices with built-in monitoring and observability tools.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
Senior Platform Engineer
I help Businesses scale with AI x SEO x (authentic) Content that revives traffic and keeps leads flowing | 3,000,000+ Average monthly readers on Medium | Sr Technical Writer @ DigitalOcean | Ex-Cloud Consultant @ AMEX | Ex-Site Reliability Engineer(DevOps)@Nutanix