envoy: grpc-web takes a few requests until it no longer times out
Question
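Problem: When I restart the envoy container, it takes a few requests until the grpc-web calls stop timing out. This happens after the envoy container has fully started. Waiting longer does not prevent it (I waited for hours). After these roughly 3 failed requests, everything works fine until the container is restarted.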
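Possible causes: The problem may be related to Envoy's startup process and to the initialization of the gRPC services. Right after the Envoy container starts, an initialization delay or a race condition could make the first requests fail. Possible reasons for this behavior include:
- Envoy startup time: Even once the Envoy container is up and running, Envoy may need additional time to load its configuration, establish connections, and become ready. Initial requests made during this period may fail.
- gRPC service initialization: If the gRPC service needs time to start and become ready to accept requests, requests sent immediately after Envoy starts may time out.
- Connection setup and reuse: Envoy may not build its upstream connection pool immediately at startup; connections are opened when the first requests arrive, and those requests can fail while a connection is still being set up. Once a connection is established and reused, requests go through normally.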
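Which of these causes applies can be checked from Envoy's own logs. The fragment below is a sketch of the question's gRPC access logger with %RESPONSE_FLAGS% added to the format string; everything else is copied from the question's config. If the failing requests after a restart are logged with the flag UF (upstream connection failure), the problem lies in establishing the upstream connection rather than in Envoy's own startup.

```yaml
# Sketch: the question's gRPC access logger, with only the format line changed.
# %RESPONSE_FLAGS% adds Envoy's response flags, e.g. "UF" when the upstream
# connection could not be established.
access_log:
  - name: envoy.access_loggers.file
    filter:
      header_filter:
        header:
          name: "x-grpc-web"
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /dev/stdout
      format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [gRPC-status: %GRPC_STATUS%] [flags: %RESPONSE_FLAGS%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
```

The same information is available as a counter: the admin interface (port 9901 in the question's config) exposes cluster.grpcserver_gRPCCluster.upstream_cx_connect_timeout, which increases each time a connection attempt runs into connect_timeout.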
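Solutions: To address the problem, you could consider the following options:
- Delay starting the gRPC service: If possible, delay the start of the gRPC service so that Envoy has fully started and is ready to receive requests.
- Give Envoy more startup time: If Envoy takes a long time to start, consider allowing the container more startup time so that Envoy is fully running.
- Warm up the connection pool: Consider configuring Envoy to warm up its connection pool at startup so that connections are already available.
- Use health checks: Configure health checks in Envoy so that traffic is only routed to the gRPC service once the service is ready to receive it.
One or more of these solutions may help mitigate the problem. The best choice depends on your specific deployment and requirements.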
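The health-check option might look roughly like the fragment below, added to the question's grpcserver_gRPCCluster entry. It is a sketch that assumes the gRPC server implements the standard gRPC health-checking service (grpc.health.v1.Health); the timing values are illustrative, not tuned.

```yaml
# Sketch: keys added to the grpcserver_gRPCCluster entry; the remaining cluster
# fields (type, load_assignment, transport_socket, ...) stay as in the question.
# Assumes the upstream serves grpc.health.v1.Health; values are illustrative.
health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 1
    grpc_health_check: { }   # probes grpc.health.v1.Health/Check
common_lb_config:
  # With a single endpoint, the default panic threshold (50%) would make Envoy
  # route to it even while it is unhealthy; 0 disables panic mode.
  healthy_panic_threshold:
    value: 0
```

With panic mode disabled, requests that arrive while the endpoint is marked unhealthy fail immediately with "no healthy upstream" instead of waiting on a connection attempt.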
Original question:
Context:
I run envoy with grpc-web. I have a bunch of gRPC servers to route to. Each server has a dedicated route and cluster (see config below). Envoy runs inside a docker-container with no special changes (only config and SSL). Envoy and the gRPC servers are connected via a docker network.
Problem:
Whenever I restart the envoy-container, it takes a few requests until the grpc-web calls go through and don't time out. This is after the envoy-container is 100% started. Leaving it running for longer does not prevent this (left it for hours). After these first ~3 failing requests everything works fine until the container is restarted.
Relevant configs:
I removed any obviously unnecessary config from the docker-compose file and otherwise condensed the config as much as possible (removing all the parts that repeat for each server).
envoy:
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      listener_filters:
        - name: "envoy.filters.listener.tls_inspector"
          typed_config: { }
      filter_chains:
        # Use HTTPS (TLS) encryption for ingress data
        # Disable this to allow tools like bloomRPC which don't work via https
        - transport_socket:
            name: envoy.transport_socket.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain:
                      filename: "/etc/envoy/envoy.pem"
                    private_key:
                      filename: "/etc/envoy/envoy.key"
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: auto
                stat_prefix: ingress_http
                access_log:
                  - name: envoy.access_loggers.file
                    # Logger for gRPC requests (can be identified by the presence of the "x-grpc-web"-header)
                    filter:
                      header_filter:
                        header:
                          name: "x-grpc-web"
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: /dev/stdout
                      format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [gRPC-status: %GRPC_STATUS%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
                  - name: envoy.access_loggers.file
                    # Logger for HTTP(s) requests (everything that is not a gRPC request)
                    filter:
                      header_filter:
                        header:
                          name: "x-grpc-web"
                          invert_match: true
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: /dev/stdout
                      format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [http(s)-status: %RESPONSE_CODE%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
                stream_idle_timeout: 43200s # 12h
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: gRPC-Web-Proxy
                      domains: [ "*" ]
                      request_headers_to_add:
                        - header:
                            key: "source"
                            value: "envoy"
                          append: false
                        - header:
                            key: "downstream-address"
                            value: "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
                          append: false
                      cors:
                        allow_origin_string_match:
                          - prefix: "*"
                        allow_methods: GET, PUT, DELETE, POST, OPTIONS
                        allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout,x-envoy-retry-grpc-on,x-envoy-max-retries,auth-token,x-real-ip,client-ip,x-forwarded-for,x-forwarded,x-cluster-client-ip,forwarded-for,forwarded
                        max_age: "1728000"
                        expose_headers: grpc-status,grpc-message
                      routes: # https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto
                        - name: grpcserver_gRPCRoute
                          match:
                            prefix: "/api/services.grpcserver"
                          route:
                            cluster: grpcserver_gRPCCluster
                            prefix_rewrite: "/services.grpcserver"
                            timeout: 0s # No timeout. Otherwise, streams will be aborted regularly
                http_filters:
                  - name: envoy.filters.http.grpc_web
                  - name: envoy.filters.http.cors
                  - name: envoy.filters.http.router
  clusters:
    - name: grpcserver_gRPCCluster
      connect_timeout: 0.25s
      type: static
      http2_protocol_options: { }
      lb_policy: round_robin
      load_assignment:
        cluster_name: grpcserver_gRPCCluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 20001
      transport_socket:
        # Connect to microservice via TLS
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            tls_certificates:
              - certificate_chain: { "filename": "/etc/envoy/envoy.pem" }
                private_key: { "filename": "/etc/envoy/envoy.key" }
            # Validate CA of microservice
            validation_context:
              match_subject_alt_names:
              trusted_ca:
                filename: /etc/ssl/certs/ca-certificates.crt
docker-compose.yml:
version: '2.4'
networks:
  core:
    name: Service_Core
    driver: bridge
    ipam:
      config:
        - subnet: 198.51.100.0/24
          gateway: 198.51.100.1
services:
  envoy:
    container_name: "envoy"
    image: "envoyproxy/envoy:v1.17.1"
    ports:
      - 8080:8080
    networks:
      - core
    restart: always
    security_opt:
      - apparmor:unconfined
    environment:
      - ENVOY_UID=17200
      - ENVOY_GID=17200
    volumes:
      - "/somepath/envoy.pem:/etc/envoy/envoy.pem:ro"
      - "/somepath/envoy.key:/etc/envoy/envoy.key:ro"
      - "/somepath/ca.pem:/etc/ssl/certs/ca-certificates.crt:ro"
      - "/somepath/envoy.yml:/etc/envoy/envoy.yaml:ro"
  grpcserver:
    image: "<grpcserver>"
    container_name: "grpcserver"
    restart: always
    networks:
      - core
    security_opt:
      - apparmor:unconfined
  frontend:
    image: "<frontend>" # an nginx with the files for the UI
    container_name: "frontend"
    restart: always
    networks:
      - core
    ports:
      - 80:80
      - 443:443
    volumes:
      - "/somepath/ssl/:/opt/ssl/"
    security_opt:
      - apparmor:unconfined
What could be causing this behavior?
I am only interested in a fix regarding the docker or envoy config. I already considered using a workaround, but I would rather fix it instead.
Answer 1
Score: 1
In your cluster configuration, you specify a connect_timeout of 250 ms.
If a new connection to the service cannot be established within this time, the request fails.
It seems that the first call to the service isn't able to finish connecting within this short window; setting the timeout to a higher value (a few seconds) should do the trick.
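Applied to the question's config, the fix is a single field on each cluster. A minimal sketch of the changed cluster entry follows; 5s is only an illustrative "few seconds" value, not a measured one.

```yaml
clusters:
  - name: grpcserver_gRPCCluster
    # Was 0.25s. This limit applies to every new upstream connection; with the
    # TLS transport socket used here it also has to cover the TLS handshake.
    connect_timeout: 5s
    type: static
    http2_protocol_options: { }
    lb_policy: round_robin
    # load_assignment and transport_socket unchanged from the question
```

Envoy opens upstream connections when requests need them and then keeps reusing them, so only requests that have to open a new connection are exposed to this limit. That fits the observed pattern of a few failures right after a restart followed by normal operation. Because every gRPC server has its own cluster, the change has to be repeated for each one.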