Retry

A retry setting specifies the maximum number of times an Envoy proxy attempts to connect to a service if the initial call fails. Retries can enhance service availability and application performance by making sure that calls don’t fail permanently because of transient problems such as a temporarily overloaded service or network. The interval between retries prevents the called service from being overwhelmed with requests.

Envoy Gateway supports the following retry settings:

  • NumRetries: is the number of retries to be attempted. Defaults to 2.
  • RetryOn: specifies the retry trigger condition.
  • PerRetryPolicy: is the retry policy to be applied per retry attempt.

Envoy Gateway introduces a new CRD called BackendTrafficPolicy that allows the user to describe their desired retry settings. This instantiated resource can be linked to a Gateway, HTTPRoute or GRPCRoute resource.

Note: There are distinct circuit breaker counters for each BackendReference in an xRoute rule. Even if a BackendTrafficPolicy targets a Gateway, each BackendReference in that gateway still has separate circuit breaker counter.

Prerequisites

Follow the steps from the Quickstart task to install Envoy Gateway and the example manifest. Before proceeding, you should be able to query the example backend using HTTP.

Verify the Gateway status:

kubectl get gateway/eg -o yaml
egctl x status gateway -v

Test and customize retry settings

Before applying a BackendTrafficPolicy with retry setting to a route, let’s test the default retry settings.

curl -v -H "Host: www.example.com" "http://${GATEWAY_HOST}/status/500"

It will return 500 response immediately.

*   Trying 172.18.255.200:80...
* Connected to 172.18.255.200 (172.18.255.200) port 80
> GET /status/500 HTTP/1.1
> Host: www.example.com
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< date: Fri, 01 Mar 2024 15:12:55 GMT
< content-length: 0
<
* Connection #0 to host 172.18.255.200 left intact

Let’s create a BackendTrafficPolicy with a retry setting.

The request will be retried 5 times with a 100ms base interval and a 10s maximum interval.

cat <<EOF | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: retry-for-route
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: backend
  retry:
    numRetries: 5
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 10s
      timeout: 250ms
    retryOn:
      httpStatusCodes:
        - 500
      triggers:
        - connect-failure
        - retriable-status-codes
EOF

Save and apply the following resource to your cluster:

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: retry-for-route
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: backend
  retry:
    numRetries: 5
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 10s
      timeout: 250ms
    retryOn:
      httpStatusCodes:
        - 500
      triggers:
        - connect-failure
        - retriable-status-codes

Execute the test again.

curl -v -H "Host: www.example.com" "http://${GATEWAY_HOST}/status/500"

It will return 500 response after a few while.

*   Trying 172.18.255.200:80...
* Connected to 172.18.255.200 (172.18.255.200) port 80
> GET /status/500 HTTP/1.1
> Host: www.example.com
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< date: Fri, 01 Mar 2024 15:15:53 GMT
< content-length: 0
<
* Connection #0 to host 172.18.255.200 left intact

Let’s check the stats to see the retry behavior.

egctl x stats envoy-proxy -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-name=eg,gateway.envoyproxy.io/owning-gateway-namespace=default | grep "envoy_cluster_upstream_rq_retry{envoy_cluster_name=\"httproute/default/backend/rule/0\"}"

You will expect to see the stats.

envoy_cluster_upstream_rq_retry{envoy_cluster_name="httproute/default/backend/rule/0"} 5

Last modified November 14, 2024: docs: fix api doc (#4711) (639a441)