Span Metrics

Span metrics are aggregated from the spans your instrumented services emit and provide insight into the performance of your application. On our platform, these metrics are generated by the Grafana Tempo metrics generator and exported to Prometheus. They help monitor request rates, error rates, latency, and span sizes.

The following metrics are available:

traces_spanmetrics_calls_total: Counter for the total number of spans (requests) processed, labeled by service, operation, status code, and more.
traces_spanmetrics_latency_bucket: Histogram bucket for span latency (duration), useful for analyzing latency distribution and setting SLOs.
traces_spanmetrics_latency_count: Total count of observed span latency measurements.
traces_spanmetrics_latency_sum: Sum of all span latencies, used to calculate average latency.
traces_spanmetrics_size_total: Counter for the total size in bytes of all observed spans, useful for monitoring how much span data a service produces.
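
As a rough illustration of how the _sum and _count series combine, the sketch below divides their rates to get the average span latency over a five-minute window. The service name my-app is a placeholder, and the result is in whatever unit the latency histogram uses on your platform.

```promql
# Average span latency for one service over the last 5 minutes.
# "my-app" is a placeholder; replace it with your own service_name.
  sum(rate(traces_spanmetrics_latency_sum{service_name="my-app"}[5m]))
/
  sum(rate(traces_spanmetrics_latency_count{service_name="my-app"}[5m]))
```

An average hides outliers, so for SLOs it is usually paired with a percentile computed from traces_spanmetrics_latency_bucket.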

Available Labels

Span metrics include a rich set of labels (dimensions) for detailed analysis. Note that in Prometheus any . (dot) in the label name is replaced by an _ (underscore), so the span attribute http.status_code appears as the label http_status_code. The following labels are available:

Label	Scope	Description
service_name	General	The name of the service executing the operation.
service_namespace	General	The namespace in which the service is deployed.
k8s_cluster_name	General	The name of the Kubernetes cluster hosting the service.
span_kind	General	The kind of span (e.g., SERVER, CLIENT, PRODUCER, CONSUMER).
span_name	General	The name of the span, typically the operation or endpoint name.
status_code	General	The status code of the span (e.g., OK, ERROR, UNSET).
server_address	HTTP	The address of the server handling the request.
http_status_code	HTTP	The HTTP status code returned for the request.
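http_response_status_code	HTTP	The HTTP response status code under its newer OpenTelemetry semantic-convention name (http.response.status_code). Which of the two status code labels is populated depends on the instrumentation version.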
http_host	HTTP	The host header value from the HTTP request.
db_system	Database	The type of database (e.g., PostgreSQL, MySQL).
db_name	Database	The name of the database being accessed.
db_operation	Database	The database operation performed (e.g., SELECT, INSERT).
messaging_system	Messaging (Async)	The messaging system used (e.g., Kafka, RabbitMQ).
messaging_destination_name	Messaging (Async)	The destination name, such as a topic or queue.
messaging_operation	Messaging (Async)	The operation performed on the messaging system.

These labels help in analyzing metrics by service, namespace, cluster, span kind, span name, status code, HTTP status, database, messaging system, and more.
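
To illustrate the dot-to-underscore mapping and the label scopes, the sketch below queries database spans: the span attribute db.system is addressed as the label db_system. It assumes a service called my-app (a placeholder) that emits database client spans. The db_system!="" matcher keeps only series where that label is present.

```promql
# Database calls per second, broken down by system, database, and operation.
# db_system!="" drops series that do not carry the db_system label at all.
sum by (db_system, db_name, db_operation) (
  rate(traces_spanmetrics_calls_total{service_name="my-app", db_system!=""}[5m])
)
```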

Span Name

The span name identifies the traced operation or endpoint. It should be a concise and descriptive label that consistently reflects the purpose of the span, so that all spans for the same operation share one name. This consistency assists in aggregating and visualizing metrics, enabling effective filtering and troubleshooting.

Key recommendations for span names:

Use clear, descriptive names that capture the specific operation performed.
Maintain consistency across similar operations to facilitate aggregation.
Avoid generic names to ensure clarity when filtering or grouping data.
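
Because the span name becomes the span_name label, naming choices show up directly in queries. A minimal sketch, assuming a placeholder service my-app: the first query lists the busiest operations, and the second counts distinct span names, which can help spot names that embed IDs or other highly variable values.

```promql
# Run each query separately.

# Ten busiest operations of a service, by spans per second.
topk(10, sum by (span_name) (
  rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])
))

# Number of distinct span names currently reported for the service.
count(count by (span_name) (traces_spanmetrics_calls_total{service_name="my-app"}))
```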

Span Kind

Metrics are collected only for the following span kinds defined by the OpenTelemetry specification (INTERNAL spans are not included):

SERVER: For server-side handling of requests (e.g., an HTTP server receiving a request).
CLIENT: For client-side operations (e.g., outbound HTTP or database requests).
PRODUCER: For sending messages to messaging systems (e.g., publishing to a Kafka topic).
CONSUMER: For receiving messages from messaging systems (e.g., reading from a Kafka topic).

Spans with a resource attribute resource.service.name equal to nais-ingress are excluded from metrics to ensure that only application-relevant data is recorded.
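
To see how traffic splits across span kinds, a query along these lines can help. It is a sketch with a placeholder service name (my-app). Grouping by span_kind also reveals the exact label values in use, which may be prefixed (for example SPAN_KIND_SERVER) rather than the short forms listed above.

```promql
# Spans per second for one service, split by span kind.
sum by (span_kind) (
  rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])
)
```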

Span Status

Span status indicates the outcome of an operation, providing context for troubleshooting. The common span status codes are:

OK: The span completed successfully.
ERROR: An error occurred during the span's execution.
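UNSET: The default. No status was explicitly set. Instrumentation usually leaves successfully completed spans as UNSET, so this value does not by itself indicate a problem.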

Monitoring span status alongside other metrics helps you quickly identify failed operations and unexpected behavior.
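
One common use of the status label is an error ratio: the share of spans that ended with ERROR. The sketch below assumes a placeholder service my-app and uses a regular expression so that it matches either ERROR or a prefixed value such as STATUS_CODE_ERROR; check which form your data uses.

```promql
# Fraction of spans with an ERROR status over the last 5 minutes (0 to 1).
# Prometheus anchors regex matchers, so ".*ERROR" matches "ERROR" and "STATUS_CODE_ERROR".
  sum(rate(traces_spanmetrics_calls_total{service_name="my-app", status_code=~".*ERROR"}[5m]))
/
  sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))
```

If the service has not reported any ERROR spans, the numerator is empty and the query returns no data rather than 0.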

Example PromQL Queries

Below are some example PromQL queries for using span metrics in Prometheus or Grafana:

Requests per second for a specific service:

sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))

99th percentile latency for successful (HTTP 200) requests to a specific host:

histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket{service_name="my-app", http_host="api.example.com", http_status_code="200"}[5m])) by (le))

Rate of server errors for a service (HTTP 5xx responses per second):

sum(rate(traces_spanmetrics_calls_total{service_name="my-app", http_status_code=~"5.."}[5m]))

95th percentile latency of database operations (e.g., PostgreSQL SELECT queries):

histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket{db_system="postgresql", db_operation="SELECT"}[5m])) by (le))

Requests per second by HTTP host (for multi-tenant apps):

sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])) by (http_host)

90th percentile latency per cluster, for comparing environments:

histogram_quantile(0.90, sum(rate(traces_spanmetrics_latency_bucket{service_name="my-app"}[5m])) by (le, k8s_cluster_name))
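
The messaging labels can be used in the same way. A sketch, assuming a placeholder service my-app that talks to Kafka and that messaging_system carries the value kafka on its spans:

```promql
# Messaging spans per second, by destination (topic or queue) and operation.
sum by (messaging_destination_name, messaging_operation) (
  rate(traces_spanmetrics_calls_total{service_name="my-app", messaging_system="kafka"}[5m])
)
```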

Note: Adjust the label filters to match your application's configuration. Use only the appropriate set of labels relevant to each span context to maintain data clarity and consistency.

These metrics and queries provide a comprehensive overview of your application's distributed traces, aiding in observability and troubleshooting.