Sending OpenTelemetry spans from Apigee hybrid to Dynatrace

Introduction

A common enterprise requirement for production deployments, especially in the Financial Services and Telecommunications industries, is to instrument the API gateway with Business Activity Monitoring.

Its objective is to provide end-to-end monitoring of calls from all invoking clients to all backends, for all use cases, in order to detect timeouts, latency and performance degradations in near real time. Metrics are captured at the first component in the architecture that handles API client requests and, following Google's SRE principles and best practices, provide Service Level Indicators and allow for granular monitoring and alerting.

Apigee hybrid allows architects to choose from a wealth of deployment options for the runtime plane locations, including cloud providers other than Google Cloud Platform as well as private data centers, using OpenShift or Anthos.

Distributed tracing is available within the Apigee hybrid runtime plane; it is not enabled by default, and it supports both GCP Cloud Trace and Jaeger formats. Apigee runtime planes need to be configured to send trace data to either a Cloud Trace or a Jaeger system.

A detailed description of Cloud Trace is presented here.

You can find a detailed description of the tracing features of Apigee hybrid, how to enable them from the control plane, and how to consume them from GCP Cloud Trace here. All the default trace variables in the tracing report are listed here. Detailed steps to customize the configuration are here.

Unfortunately, distributed tracing in Apigee only allows you to configure a probabilistic sampling rate, with a maximum of 50%.

If probabilistic sampling at a rate within the maximum allowed by Apigee hybrid is sufficient, Cloud Trace can still be used, even when a different target deployment infrastructure has been selected for the hybrid runtime plane(s). The HTTP request that creates individual trace spans in Cloud Trace is documented here.

Where complete coverage of all API traffic is required, an approach is needed that guarantees every API call is traced. Moreover, runtime planes deployed outside GCP cannot always rely on Cloud Trace, for example when the customer must integrate with an existing tracing deployment from another vendor. In these scenarios, customers often require integration with Dynatrace, a third-party independent software vendor; this requirement is common where Dynatrace has already been adopted as the system of choice for tracing and monitoring all other components and applications in the production system.

In this article I describe how to send 100% of the API call traces to a Dynatrace deployment, independently of the Kubernetes flavor used to deploy the hybrid runtime plane. This is possible by leveraging open standards, without enabling the native distributed tracing option in Apigee hybrid.

 

Approach description

 

This approach relies on deploying the Dynatrace OneAgent Operator (more on this later) within the Kubernetes or OpenShift cluster(s) that also host the hybrid runtime planes. This component receives OpenTelemetry and OpenTracing spans, which are then combined with additional OneAgent data into Dynatrace PurePath® distributed traces (Dynatrace's patented distributed tracing and code-level analysis technology). Once this is in place, Dynatrace can correctly ingest and process the spans coming from the Apigee runtime plane for all API calls.

Step 1: Install Dynatrace OneAgent 

Detailed instructions on how to set up Dynatrace on Kubernetes and OpenShift clusters are here.

https://www.dynatrace.com/news/blog/enable-dynatrace-oneagent-in-istio-service-mesh/ 

Dynatrace automatically captures all OpenTracing and OpenTelemetry spans, but you can control and adapt how OpenTelemetry and OpenTracing spans are combined with OneAgent data. The span settings are available in the Server-side service monitoring section of the OneAgent settings.

https://www.dynatrace.com/support/help/extend-dynatrace/extend-tracing/span-settings

Note that there are two different ways to send data to Dynatrace using OpenTelemetry. The first works only in combination with a OneAgent deployment.

The second does not require OneAgent: the OTLP trace ingest API can be used instead. This approach is suited to services that cannot be instrumented by a OneAgent code module, or to components that already expose - or can produce - trace data in the standard OpenTelemetry format (OTLP).

For the Apigee runtime plane, the last part of this article shows how to reconstruct span data from the individual API proxies by adding Apigee policies that produce trace data in a format ready for Dynatrace.

OneAgent Operator for Kubernetes and OpenShift

Dynatrace OneAgent 

The following two documents from Dynatrace describe in detail OneAgent and how to deploy it in K8S and OpenShift environments. 

https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-oneagent

https://www.dynatrace.com/news/blog/oneagent-operator-release-for-your-k8s-and-openshift-environment...

OneAgent deployment via container (OneAgent Operator) is possible on both Kubernetes and OpenShift, although it has some limitations compared to the standard OneAgent installation.

https://www.dynatrace.com/support/help/technology-support/oneagent-platform-and-capability-support-m...

Step 2: Instrument the Apigee hybrid runtime plane for OpenTelemetry

The Apigee runtime cluster has two different points where traces are generated. The first is the istio-ingress pods, in their corresponding Kubernetes namespace; the second is in the apigee namespace and consists of the spans corresponding to the message processor pods, the policy enforcement points within the API gateway. If you are not familiar with how the Apigee hybrid runtime plane processes API calls, refer to this article for a detailed description of the API call flow within each runtime Kubernetes cluster.

Both components, istio-ingress and the Apigee hybrid message processors, need to be instrumented.

a) Instrumentation of the istio-ingress pods:

Instrumenting the istio-ingress pods within the runtime plane is straightforward: instrumentation of a native istio-ingress with an OpenTelemetry-standard collector for Dynatrace is supported natively and documented here in the official Istio documentation, since observability is built into the Istio design. This is no more complicated than following the steps to enable Envoy access logging. Instrumentation of the native istio-ingress is thus achieved with the OpenTelemetry collector available for Envoy; this exports to Dynatrace the metrics collected from the cluster ingress.
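As a rough sketch of what this native Istio configuration can look like (the collector service name, namespace and port here are assumptions for illustration; follow the Istio documentation linked above for your environment), an OpenTelemetry tracing provider is declared in the mesh config and then enabled with a Telemetry resource:

```yaml
# MeshConfig fragment: declare an OpenTelemetry tracing provider.
# The collector service name/namespace/port are assumptions for this sketch.
meshConfig:
  extensionProviders:
    - name: otel-tracing
      opentelemetry:
        service: otel-collector.dynatrace.svc.cluster.local
        port: 4317
---
# Telemetry resource: enable the provider for the mesh at 100% sampling,
# so the ingress spans cover all API traffic.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: ingress-tracing
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel-tracing
      randomSamplingPercentage: 100
```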

b) Instrumentation of the Apigee hybrid API proxies

For each API call, each Apigee hybrid API proxy can prepare the payload to feed Dynatrace OneAgent with an individual HTTP POST, executed as a PostClient (response) flow operation, without introducing any additional latency in the client response. If you are not familiar with the PostClient (response) flow in Apigee proxies, you can review this document.

The reference with the definitions of all Apigee hybrid "flow variables" is available in this document.

As the Message Processor pods will emit information to Dynatrace using the OpenTelemetry Protocol (OTLP), I will first describe the payload to send over HTTP.

OpenTelemetry Protocol (OTLP) 

The OpenTelemetry Protocol (OTLP) specifications are documented here, in the official GitHub repository of the OpenTelemetry CNCF project.

The OTLP/HTTP protocol is documented here. Note that while the binary format is considered stable, the JSON format is still classified as experimental, hence it might be subject to change.

An additional valuable document on how to prepare and send telemetry data as a JSON OTLP-compliant payload over HTTP POST to Dynatrace is provided on the Lightstep website here.

Extensive documentation on the trace/v1 message type is provided here. Make sure to review it, as it describes how to prepare and populate the Status section, including the permitted values for the StatusCode field within the span: StatusCode = 1 corresponds to no error (success), while StatusCode = 2 represents an error.

"status": {
            "code": {StatusCode},
            "message": "{StatusMessage}"
          }

One example of a valid OTLP/HTTP request, including the full URI, headers and JSON payload for the POST, is presented below. The most important section of the JSON payload is "spans" under the "instrumentationLibrarySpans" stanza. Pay attention to the "attributes" section.

According to the OpenTelemetry specifications, an Attribute is a key-value pair, which must have the following properties: 

  • The attribute key must be a non-null and non-empty string, 
  • The attribute value is either:
    • A primitive type: string, boolean, double precision floating point (IEEE 754-1985) or signed 64 bit integer.
    • An array of primitive type values. The array MUST be homogeneous, i.e., it must not contain values of different types.

For protocols that do not natively support non-string values, non-string values should be represented as JSON-encoded strings. You will find additional definitions and limits here, while you can read how to map arbitrary data to the OTLP AnyValue type, if needed, here. Attribute naming must follow these prescriptions, too.
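To illustrate the rules above, here is a small Python sketch (illustrative only; it simplifies the exact OTLP JSON value shape) that validates an attribute key, keeps primitive and homogeneous-array values as-is, and JSON-encodes everything else as a string:

```python
import json

# Primitive types permitted by the OpenTelemetry attribute specification.
PRIMITIVES = (str, bool, float, int)

def encode_attribute(key, value):
    """Build a simplified attribute pair following the rules above:
    non-empty string key; primitive values and homogeneous arrays kept
    as-is; anything else represented as a JSON-encoded string."""
    if not isinstance(key, str) or not key:
        raise ValueError("attribute key must be a non-empty string")
    if isinstance(value, PRIMITIVES):
        return {"key": key, "value": value}
    # A homogeneous array (all elements of one type) is also permitted.
    if isinstance(value, list) and len({type(v) for v in value}) <= 1:
        return {"key": key, "value": value}
    # Fallback: JSON-encode the value as a string.
    return {"key": key, "value": json.dumps(value)}

print(encode_attribute("http.status_code", "200"))
print(encode_attribute("mixed", [1, "a"]))  # heterogeneous -> JSON string
```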

Specifications

In the following paragraphs, I will show how this is used to populate additional key-value pairs (dynamically assigned from Apigee proxy flow variables) so that they are propagated to Dynatrace, in a JSON schema that Dynatrace accepts and parses directly. The example is a valid curl command line for an HTTP POST to a local endpoint exposed within the cluster by the Dynatrace collection service, listening on local TCP port 55681, at the resource /v1/traces; it includes the mandatory Content-Type header and the JSON payload.

The endpoint host of the HTTP (or HTTPS) POST is either an external Dynatrace deployment FQDN (for example, its load balancer hostname) or localhost, to indicate the in-cluster (same-node) deployment of the OneAgent application. The endpoint TCP port is, correspondingly, either the external TCP port (for example, 443 for default HTTPS), the default port of the Dynatrace OneAgent listener (TCP 55681), or a customized port number.

Example

curl -X POST 'http{s}://{Dynatrace-hostname|localhost}:{port|55681}/v1/traces' \
-H 'Content-Type: application/json' \
--data-raw 
'{
   "resourceSpans":[
      {
         "resource":{
            "attributes":[
               {
                  "key":"service.name",
                  "value":{
                     "stringValue":"unknown_service"
                  }
               },
               {
                  "key":"telemetry.sdk.language",
                  "value":{
                     "stringValue":"webjs"
                  }
               },
               {
                  "key":"telemetry.sdk.name",
                  "value":{
                     "stringValue":"opentelemetry"
                  }
               },
               {
                  "key":"telemetry.sdk.version",
                  "value":{
                     "stringValue":"0.23.0"
                  }
               }
            ],
            "droppedAttributesCount":0
         },
         "instrumentationLibrarySpans":[
            {
               "spans":[
                  {
                     "traceId":"5661215315a87ad7dd8448b4101a59a9",
                     "spanId":"29f50492db6b0ced",
                     "name":"files-series-info-0",
                     "kind":1,
                     "startTimeUnixNano":1625600800211400200,
                     "endTimeUnixNano":1625600800700400000,
                     "attributes":[
                        
                     ],
                     "droppedAttributesCount":0,
                     "events":[
                        {
                           "timeUnixNano":1625600800700400000,
                           "name":"fetching-span1-completed",
                           "attributes":[
                              
                           ],
                           "droppedAttributesCount":0
                        }
                     ],
                     "droppedEventsCount":0,
                     "status":{
                        "code":0
                     },
                     "links":[
                        
                     ],
                     "droppedLinksCount":0
                  }
               ],
               "instrumentationLibrary":{
                  "name":"example-tracer-web"
               }
            }
         ]
      }
   ]
}'

 

Under "spans", the following attributes, events and status will be populated, using the Apigee flow variable reference notation:

  • {response.status.code}
  • {request.verb}
  • {system.timestamp}
  • {escapeJSON(EventMessage)}
  • {StatusCode}
  • {StatusMessage}

 

Attributes, events and status

In the following I show sample content of the three main sections of the JSON document above: attributes, events and status.

"attributes": [
   {
      "key": "http.status_code",
      "value": {
         "string_value": "{response.status.code}"
      }
   },
   {
      "key": "http.method",
      "value": {
         "string_value": "{request.verb}"
      }
   }
]
"events": [
   {
      "time_unix_nano": {system.timestamp}000000,
      "name": "event1",
      "attributes": [
         {
            "key": "exception.message",
            "value": {
               "string_value": "{escapeJSON(EventMessage)}"
            }
         }
      ]
   }
]
"status": {
   "code": {StatusCode},
   "message": "{StatusMessage}"
}
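A note on the `000000` suffix appended to `{system.timestamp}` in the events section: Apigee's system.timestamp flow variable is expressed in epoch milliseconds, while OTLP expects nanoseconds, so appending six zeros performs the conversion. A quick Python check of the equivalence (the sample value is illustrative):

```python
# system.timestamp in Apigee is epoch milliseconds; OTLP wants nanoseconds.
system_timestamp_ms = 1625600800700  # sample value

# Appending six zeros, as done in the payload template...
time_unix_nano = int(str(system_timestamp_ms) + "000000")

# ...is equivalent to multiplying by 1,000,000 (ms -> ns).
assert time_unix_nano == system_timestamp_ms * 1_000_000
print(time_unix_nano)
```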
 

The traceId and spanId values need to be globally unique for correlation to work correctly in Dynatrace distributed tracing. If needed, Apigee exposes a function to create real UUIDs (guaranteed to be globally unique within the application), documented here; it can be invoked within an Assign Message policy in the following way:

 

<AssignVariable>
    <Name>flow.request.uuid</Name>
    <Template>{createUuid()}</Template>
</AssignVariable>

It is worth noting that if the API clients add a request header named X-Request-Id, it will be automatically assigned to the internal flow variable messageid. If not, messageid will be populated by Apigee; it can be used as traceId.
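Keep in mind that OTLP expects traceId as 16 bytes (32 hex characters) and spanId as 8 bytes (16 hex characters), while a UUID is typically rendered with dashes. A minimal Python sketch of deriving compliant identifiers from a UUID (illustrative only; in an Apigee proxy the equivalent can be done with a JavaScript policy, and the sample UUID below is made up):

```python
import uuid

# A UUID such as the one createUuid() returns (36 characters, with dashes).
u = "5661215e-315a-87ad-7dd8-448b4101a59a"

# OTLP traceId: 16 bytes = 32 lowercase hex characters -> strip the dashes.
trace_id = u.replace("-", "").lower()
assert len(trace_id) == 32

# OTLP spanId: 8 bytes = 16 hex characters -> e.g. half of a fresh UUID.
span_id = uuid.uuid4().hex[:16]
assert len(span_id) == 16

print(trace_id, span_id)
```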

 

In conclusion

You can prepare the payload variable for the HTTP POST operation using the Apigee Assign Message policy, as described here. You can then assemble the Apigee Service Callout policy, following the policy documentation here. In the example below, the Assign Message policy populates a flow variable called _requestBody that contains the inner payload of the requestBody; this is subsequently used in the second policy to perform the HTTP POST.

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ServiceCallout async="false" continueOnError="false" enabled="true" name="DynatraceOneAgentCallout">
    <DisplayName>DynatraceOneAgentCallout</DisplayName>
    <Properties/>
    <Request clearPayload="true" variable="myRequest1">
        <Set>
            <Headers>
                <Header name="Content-Type">application/json</Header>
            </Headers>
            <Verb>POST</Verb>
            <Payload contentType="application/json" variablePrefix="@" variableSuffix="#">
                {
                   "requestBody" : "@_requestBody#"
                }
            </Payload>
        </Set>
        <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
    </Request>
    <Response>dynatraceResponse</Response>
    <HTTPTargetConnection>
        <Properties/>
        <URL>http://localhost:55681/v1/traces</URL>
    </HTTPTargetConnection>
</ServiceCallout>
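For completeness, a minimal sketch of the Assign Message policy that could populate _requestBody is shown below. This is an illustration only: the policy name and the flow variable holding the assembled span JSON (otel.span.json) are hypothetical, and in practice the Template would contain the full resourceSpans document built from the flow variables described earlier.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<AssignMessage async="false" continueOnError="false" enabled="true" name="AM_BuildDynatracePayload">
    <DisplayName>AM_BuildDynatracePayload</DisplayName>
    <AssignVariable>
        <Name>_requestBody</Name>
        <!-- Hypothetical: escapeJSON makes the assembled span JSON safe to embed
             as the string value of "requestBody" in the Service Callout payload. -->
        <Template>{escapeJSON(otel.span.json)}</Template>
    </AssignVariable>
    <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</AssignMessage>
```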

 

Full ServiceCallout Policy Example 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ServiceCallout continueOnError="true" enabled="true" name="SC_DYNATRACE">
<DisplayName>SC_DYNATRACE</DisplayName>
<Properties/>
<Request clearPayload="false" variable="dyna-req">
     <Add>
         <Headers>
             <Header name="Content-Type">application/json</Header>
         </Headers>
     </Add>
     <Set>
         <Verb>POST</Verb>
         <Payload>
{
  "resourceSpans": [
{
   "resource": {
     "attributes": [
       {
         "key": "service.name",
         "value": {
           "stringValue": "apigee-runtime"
         }
       }
     ]
   },
   "instrumentationLibrarySpans": [
     {
       "spans": [
         {
           "trace_id": "{traceid}",
           "span_id": "{span_id}",
           "parent_span_id": "{parent_span_id}",
           "name": "{request.uri}",
           "kind": 2,
           "start_time_unix_nano": {client.received.start.timestamp}000000,
           "end_time_unix_nano": {client.sent.end.timestamp}000000,
           "droppedAttributesCount": 0,
           "droppedEventsCount": 0,
           "attributes": [
               {
                  "key": "http.host",
                  "value":{
                     "string_value": "localhost"
                  }
               },
               {
                  "key": "http.status_code",
                 "value":{
                    "string_value": "{response.status.code}"
                 }
              },
              {
                 "key": "http.method",
                 "value":{
                    "string_value": "{request.verb}"
                 }
              },
              {
                 "key": "http.request_content_length",
                 "value":{
                    "string_value": "{request.header.content-length}"
                 }
              },
              {
                 "key": "http.response_content_length",
                 "value":{
                    "string_value": "{response.header.content-length}"
                 }
              },
              {
                 "key": "http.server_name",
                  "value":{
                    "string_value": "{target.host}"
                 }
              }
           ],
            "events": [
               {
                  "time_unix_nano": {system.timestamp}000000,
                  "name": "event1",
                  "attributes": [
                     {
                        "key": "exception.message",
                        "value":{
                           "string_value": "{escapeJSON(EventMessage)}"
                        }
                     }
                  ]
               }
            ],
            "status": {
               "code": {StatusCode},
               "message": "{StatusMessage}"
            }
         }
       ],
       "instrumentationLibrary": {
         "name": "local-curl-example"
       }
     }
   ]
}
  ]
}
         </Payload>
     </Set>
     <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</Request>
<Response>dyna-res</Response>
<HTTPTargetConnection>
     <Properties/>
        <URL>http://otel-collector.dynatrace.svc.cluster.local:4318/v1/traces</URL>
    </HTTPTargetConnection>
</ServiceCallout>

Further resources

OpenTelemetry 

https://github.com/open-telemetry/opentelemetry-js/tree/main/examples/tracer-web

https://open-telemetry.github.io/opentelemetry-js-api/

Dynatrace OneAgent

 https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-oneagent

Dynatrace OpenTelemetry instrumentation without OneAgent

https://www.dynatrace.com/support/help/extend-dynatrace/opentelemetry#tabgroup--opentelemetry--instr...

Dynatrace OpenTelemetry metrics collector

https://www.dynatrace.com/support/help/extend-dynatrace/opentelemetry/opentelemetry-metrics/opentele... 

 

          
Comments

cneumueller

Posting OTLP in JSON format to Dynatrace does not work directly, as the Dynatrace ingest endpoint only accepts OTLP/HTTP in protobuf format.

To fix that, you need to deploy a component that takes OTLP/JSON and forwards OTLP/protobuf to Dynatrace. This can be achieved with the OpenTelemetry collector. Here is the documentation on how to configure it properly for Dynatrace: https://www.dynatrace.com/support/help/extend-dynatrace/opentelemetry/basics/collector
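Following the comment above, a minimal OpenTelemetry Collector configuration sketch for this setup might look as follows (the environment ID and token are placeholders, and the exporter endpoint follows the Dynatrace documentation linked in the comment):

```yaml
# Receive OTLP over HTTP (JSON or protobuf) from the Apigee Service Callout...
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

# ...and forward OTLP/HTTP (protobuf) to the Dynatrace trace ingest API.
exporters:
  otlphttp:
    endpoint: https://{your-environment-id}.live.dynatrace.com/api/v2/otlp
    headers:
      Authorization: "Api-Token {your-api-token}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```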

Dg03cloud

Very informative. I have a similar use case: integrating Dynatrace with the GCP environment through a GKE Autopilot cluster. Are there any helpful references with examples?

Version history
Last update:
‎09-06-2022 06:26 AM