Unify spans of different services with OpenTelemetry

Question

I am just starting with OpenTelemetry and have created two (micro)services for this purpose: Standard and GeoMap.

The end-user sends requests to the Standard service, which in turn sends requests to GeoMap to fetch information before returning the result to the end-user. I am using gRPC for all communication.

I have instrumented my functions as follows:

For Standard:

type standardService struct {
    pb.UnimplementedStandardServiceServer
}

func (s *standardService) GetStandard(ctx context.Context, in *pb.GetStandardRequest) (*pb.GetStandardResponse, error) {
    // Open a client connection to GeoMap (the error is discarded here).
    conn, _ := createClient(ctx, geomapSvcAddr)
    defer conn.Close()

    // Start the span for this handler; newCtx carries it to the outgoing call.
    newCtx, span1 := otel.Tracer(name).Start(ctx, "GetStandard")
    defer span1.End()

    countryInfo, err := pb.NewGeoMapServiceClient(conn).GetCountry(newCtx,
        &pb.GetCountryRequest{
            Name: in.Name,
        })

    //...

    return &pb.GetStandardResponse{
        Standard: standard,
    }, nil
}

func createClient(ctx context.Context, svcAddr string) (*grpc.ClientConn, error) {
    return grpc.DialContext(ctx, svcAddr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
    )
}

For GeoMap:

type geomapService struct {
    pb.UnimplementedGeoMapServiceServer
}

func (s *geomapService) GetCountry(ctx context.Context, in *pb.GetCountryRequest) (*pb.GetCountryResponse, error) {
    _, span := otel.Tracer(name).Start(ctx, "GetCountry")
    defer span.End()

    span.SetAttributes(attribute.String("country", in.Name))

    span.AddEvent("Retrieving country info")

    //...

    span.AddEvent("Country info retrieved")

    return &pb.GetCountryResponse{
        Country: &country,
    }, nil
}

Both services are configured to send their spans to a Jaeger backend and share an almost identical main function (small differences are noted in comments):

const (
    name        = "mapedia"
    service     = "geomap" // or standard
    environment = "production"
    id          = 1
)

func tracerProvider(url string) (*tracesdk.TracerProvider, error) {
    // Create the Jaeger exporter
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
    if err != nil {
        return nil, err
    }
    tp := tracesdk.NewTracerProvider(
        // Always be sure to batch in production.
        tracesdk.WithBatcher(exp),
        // Record information about this application in a Resource.
        tracesdk.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName(service),
            attribute.String("environment", environment),
            attribute.Int64("ID", id),
        )),
    )
    return tp, nil
}

func main() {
    tp, err := tracerProvider("http://localhost:14268/api/traces")
    if err != nil {
        log.Fatal(err)
    }

    defer func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Fatal(err)
        }
    }()
    otel.SetTracerProvider(tp)

    listener, err := net.Listen("tcp", ":"+port)
    if err != nil {
        panic(err)
    }

    s := grpc.NewServer(
        grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
    )
    reflection.Register(s)
    pb.RegisterGeoMapServiceServer(s, &geomapService{}) // or pb.RegisterStandardServiceServer(s, &standardService{})
    if err := s.Serve(listener); err != nil {
        log.Fatalf("Failed to serve: %v", err)
    }
}

When I look at a trace generated by an end-user request to the Standard service, I can see that it is, as expected, making calls to the GeoMap service:

[Screenshot: Jaeger trace for the Standard service request]

However, I don't see any of the attributes or events I added to the child span (I added one attribute and two events when instrumenting the GetCountry function of GeoMap).

What I notice, however, is that these attributes are available in a separate trace (listed under the "geomap" service in Jaeger) with a span ID totally unrelated to the child spans in the Standard service:

[Screenshot: separate Jaeger trace under the "geomap" service]

What I expected is a single trace, with all attributes and events related to GeoMap shown in the child span within the Standard span. How do I get to the expected result from here?

Answer 1

Score: 2

The span context (which contains trace ID and span ID, as described in "Service Instrumentation & Terminology") should be propagated from the parent span to the child span in order for them to be part of the same trace.

With OpenTelemetry, this is often done automatically by instrumenting your code with the provided plugins for various libraries, including gRPC.
However, the propagation does not seem to be working correctly in your case.
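To make "span context" concrete, here is a minimal standard-library sketch of the W3C `traceparent` header value, which is how the trace ID and parent span ID travel between services when the W3C Trace Context propagator is in use. The `SpanContext` type and both helper functions are hypothetical stand-ins written for illustration, not the OpenTelemetry API, and the parser is simplified (it does not validate hex characters or other flag bits).

```go
package main

import (
	"fmt"
	"strings"
)

// SpanContext is a hypothetical, simplified stand-in for the identifiers
// that a tracing library carries between services.
type SpanContext struct {
	TraceID string // 32 hex characters, shared by every span in a trace
	SpanID  string // 16 hex characters, unique per span
	Sampled bool
}

// FormatTraceparent renders the context as a W3C "traceparent" value:
// version-traceid-spanid-flags.
func FormatTraceparent(sc SpanContext) string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)
}

// ParseTraceparent does the reverse on the receiving side (length checks only).
func ParseTraceparent(h string) (SpanContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return SpanContext{}, fmt.Errorf("malformed traceparent %q", h)
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2], Sampled: parts[3] == "01"}, nil
}

func main() {
	parent := SpanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	}
	header := FormatTraceparent(parent)
	fmt.Println(header)

	// The callee recovers the same trace ID, so its spans join the caller's trace.
	got, err := ParseTraceparent(header)
	fmt.Println(got.TraceID == parent.TraceID, err)
}
```

If the callee never receives such a value, it has no trace ID to reuse and starts a new trace, which matches the two unrelated traces described in the question.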

In your code, you are starting a new span in the GetStandard function, and then using that context (newCtx) when making the GetCountry request. That is correct, as the new context should contain the span context of the parent span (GetStandard).
But the issue might be related to your createClient function:

func createClient(ctx context.Context, svcAddr string) (*grpc.ClientConn, error) {
    return grpc.DialContext(ctx, svcAddr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
    )
}

You are correctly using the otelgrpc.UnaryClientInterceptor here, which should ensure that the context is propagated correctly, but it is not clear when this function is being called. If it is being called before the GetStandard function is invoked, then the context used to create the client will not include the span context from GetStandard.

For testing, make sure that the client is created after the GetStandard span has been started, and that the same context is used throughout the request.

You can do this by passing the newCtx directly to the GetCountry function, as illustrated with this modified version of your GetStandard function:

func (s *standardService) GetStandard(ctx context.Context, in *pb.GetStandardRequest) (*pb.GetStandardResponse, error) {
    newCtx, span1 := otel.Tracer(name).Start(ctx, "GetStandard")
    defer span1.End()

    conn, _:= createClient(newCtx, geomapSvcAddr)
    defer conn.Close()

    countryInfo, err := pb.NewGeoMapServiceClient(conn).GetCountry(newCtx,
        &pb.GetCountryRequest{
            Name: in.Name,
        })

    //...

    return &pb.GetStandardResponse{
        Standard: standard,
    }, nil
}

Now, the context used to create the client and make the GetCountry request will include the span context from GetStandard, and they should appear as part of the same trace in Jaeger.

(As always, do check the returned errors from functions like createClient and GetCountry, not shown here for brevity).


In addition:

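  • Check your propagator: make sure both services use the same context propagator, preferably the W3C TraceContext propagator. Note that in OpenTelemetry Go the global propagator is a no-op until you set one, so without this step the gRPC interceptors have nothing to inject or extract.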

    You can set the propagator explicitly as follows:

    otel.SetTextMapPropagator(propagation.TraceContext{})
    

    Add the above line to the beginning of your main function in both services.

  • Ensure metadata is being passed: The gRPC interceptor should automatically inject/extract the tracing context from the metadata of the request, but double-check to make sure it is working properly.

    After starting a span in your GetCountry function, you can log the trace ID and span ID:

    ctx, span := otel.Tracer(name).Start(ctx, "GetCountry")
    sc := trace.SpanContextFromContext(ctx)
    log.Printf("Trace ID: %s, Span ID: %s", sc.TraceID(), sc.SpanID())
    defer span.End()
    

    And do the same in your GetStandard function:

    newCtx, span1 := otel.Tracer(name).Start(ctx, "GetStandard")
    sc := trace.SpanContextFromContext(newCtx)
    log.Printf("Trace ID: %s, Span ID: %s", sc.TraceID(), sc.SpanID())
    defer span1.End()
    

    The trace IDs in the two services should match if the context is being propagated correctly.
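The effect of a missing propagator can be simulated without any tracing library. The sketch below is a simplified model, not the OpenTelemetry API: the `Propagator` interface, the two implementations, the `trace-id` key, and `callDownstream` are all hypothetical names chosen for illustration. It compares the trace ID the callee ends up with when nothing is injected against the case where the ID is carried in the request metadata.

```go
package main

import "fmt"

// Propagator is a hypothetical, simplified text-map propagator: Inject writes
// the trace ID into outgoing metadata, Extract reads it back on the server.
type Propagator interface {
	Inject(traceID string, md map[string]string)
	Extract(md map[string]string) (traceID string, ok bool)
}

// noopPropagator writes and reads nothing.
type noopPropagator struct{}

func (noopPropagator) Inject(string, map[string]string)         {}
func (noopPropagator) Extract(map[string]string) (string, bool) { return "", false }

// headerPropagator carries the trace ID under a single metadata key.
type headerPropagator struct{}

func (headerPropagator) Inject(id string, md map[string]string) { md["trace-id"] = id }
func (headerPropagator) Extract(md map[string]string) (string, bool) {
	id, ok := md["trace-id"]
	return id, ok
}

// callDownstream simulates one RPC and returns the trace ID the server uses.
func callDownstream(p Propagator, clientTraceID string) string {
	md := map[string]string{}
	p.Inject(clientTraceID, md) // what a client interceptor would do
	if id, ok := p.Extract(md); ok { // what a server interceptor would do
		return id
	}
	return "new-trace" // nothing extracted: the server starts its own trace
}

func main() {
	fmt.Println(callDownstream(noopPropagator{}, "abc123") == "abc123")   // false
	fmt.Println(callDownstream(headerPropagator{}, "abc123") == "abc123") // true
}
```

The first case corresponds to the symptom in the question (a separate "geomap" trace); the second is what the logged trace IDs should look like once both services share a working propagator.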

huangapple
  • Posted on July 3, 2023 at 17:10:49
  • When reposting, please keep this link: https://go.coder-hub.com/76603368.html