Span Metrics
Span metrics are derived from the spans your application emits and provide insight into its performance. On our platform, these metrics are generated by the Grafana Tempo metrics generator and exported to Prometheus. They help monitor request rates, error rates, latency, and payload sizes for instrumented services.
The following metrics are available:
- traces_spanmetrics_calls_total: Counter for the total number of spans (requests) processed, labeled by service, operation, status code, and more.
- traces_spanmetrics_latency_bucket: Histogram bucket for span latency (duration), useful for analyzing latency distribution and setting SLOs.
- traces_spanmetrics_latency_count: Total count of observed span latency measurements.
- traces_spanmetrics_latency_sum: Sum of all span latencies, used to calculate average latency.
- traces_spanmetrics_size_total: Total size in bytes of all observed spans for monitoring payload sizes.
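As a rough illustration of how these series combine, the PromQL sketch below derives an average latency from the sum and count series, and an average span size from the size and call counters. The service name my-app and the 5m window are placeholders. The latency result is in whatever unit the histogram is recorded in (commonly seconds for Tempo, but verify this on your setup), and the second query assumes that size_total and calls_total cover the same set of spans.

```promql
# Average span latency for one service over the last 5 minutes.
# "my-app" is a placeholder service name.
  sum(rate(traces_spanmetrics_latency_sum{service_name="my-app"}[5m]))
/
  sum(rate(traces_spanmetrics_latency_count{service_name="my-app"}[5m]))

# Average span size in bytes for the same service (run as a separate query).
  sum(rate(traces_spanmetrics_size_total{service_name="my-app"}[5m]))
/
  sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))
```

Dividing rates rather than raw counter values keeps the result meaningful across counter resets.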
Available Labels
Span metrics include a rich set of labels (dimensions) for detailed analysis. Note that in Prometheus, any . (dot) in a label name is replaced by an _ (underscore). The following labels are available:
Label | Scope | Description |
---|---|---|
service_name | General | The name of the service executing the operation. |
service_namespace | General | The namespace in which the service is deployed. |
k8s_cluster_name | General | The name of the Kubernetes cluster hosting the service. |
span_kind | General | The kind of span (e.g., SERVER, CLIENT, PRODUCER, CONSUMER). |
span_name | General | The name of the span, typically the operation or endpoint name. |
status_code | General | The status code of the span (e.g., OK, ERROR, UNSET). |
server_address | HTTP | The address of the server handling the request. |
http_status_code | HTTP | The HTTP status code returned for the request. |
http_response_status_code | HTTP | An alternative label for the HTTP response status code. |
http_host | HTTP | The host header value from the HTTP request. |
db_system | Database | The type of database (e.g., PostgreSQL, MySQL). |
db_name | Database | The name of the database being accessed. |
db_operation | Database | The database operation performed (e.g., SELECT, INSERT). |
messaging_system | Messaging (Async) | The messaging system used (e.g., Kafka, RabbitMQ). |
messaging_destination_name | Messaging (Async) | The destination name, such as a topic or queue. |
messaging_operation | Messaging (Async) | The operation performed on the messaging system. |
These labels help in analyzing metrics by service, namespace, cluster, span kind, span name, status code, HTTP status, database, messaging system, and more.
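To show how scoped labels can be used for grouping, here is a minimal sketch that breaks down messaging traffic by system and destination. Following the dot-to-underscore rule, an attribute such as messaging.destination.name appears as messaging_destination_name. The service name my-app is a placeholder, and the query assumes the service actually emits messaging spans.

```promql
# Message operations per second, split by messaging system and destination.
# The != "" matcher keeps only series that carry the messaging_system label.
sum by (messaging_system, messaging_destination_name) (
  rate(traces_spanmetrics_calls_total{service_name="my-app", messaging_system!=""}[5m])
)
```

The same pattern should work for the database labels by swapping in db_system and db_operation.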
Span Name
The span name uniquely identifies the traced operation or endpoint. It should be a concise and descriptive label that consistently reflects the purpose of the span. This consistency assists in aggregating and visualizing metrics, enabling effective filtering and troubleshooting.
Key recommendations for span names:
- Use clear, descriptive names that capture the specific operation performed.
- Maintain consistency across similar operations to facilitate aggregation.
- Avoid generic names to ensure clarity when filtering or grouping data.
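One way to check how span names behave in practice is to list the busiest ones for a service. The query below is a sketch; my-app and the limit of 10 are placeholders.

```promql
# Ten busiest span names for one service, by calls per second.
topk(10,
  sum by (span_name) (
    rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])
  )
)
```

If the result is dominated by a single catch-all name, the names are probably too generic to filter on. If it contains many near-identical names that differ only in an ID or similar variable part, they are probably too specific to aggregate well.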
Span Kind
Metrics are collected only for specific span kinds as defined by the OpenTelemetry specification. These include:
- SERVER: For server-side handling of requests (e.g., an HTTP server receiving a request).
- CLIENT: For client-side operations (e.g., outbound HTTP or database requests).
- PRODUCER: For sending messages to messaging systems (e.g., publishing to a Kafka topic).
- CONSUMER: For receiving messages from messaging systems (e.g., reading from a Kafka topic).
Spans with a resource attribute resource.service.name equal to nais-ingress are excluded from metrics to ensure that only application-relevant data is recorded.
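To see which span kinds a service actually reports, a grouping query along these lines may help. The service name my-app is a placeholder.

```promql
# Calls per second for one service, grouped by span kind.
sum by (span_kind) (
  rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])
)
```

Depending on how the metrics generator is configured, the label values may carry a prefix (for example SPAN_KIND_SERVER rather than SERVER), so it is worth checking the output of this query before filtering on an exact value.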
Span Status
Span status indicates the outcome of an operation, providing context for troubleshooting. The common span status codes are:
- OK: The span completed successfully.
- ERROR: An error occurred during the span's execution.
- UNSET: No explicit status was set. This is the default in OpenTelemetry and usually means the operation completed without a recorded error.
Monitoring span status alongside other metrics can quickly identify issues related to failed operations or unexpected behavior.
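As a sketch of how span status can be monitored, the query below computes the share of spans that ended in an error for one service. The service name my-app is a placeholder. The regex matcher is an assumption made to cover both a plain ERROR value and a prefixed form such as STATUS_CODE_ERROR, since the exact label value depends on the generator configuration.

```promql
# Fraction of spans with an error status over the last 5 minutes.
# PromQL regex matchers are fully anchored, so ".*ERROR" matches any
# value that ends in ERROR.
  sum(rate(traces_spanmetrics_calls_total{service_name="my-app", status_code=~".*ERROR"}[5m]))
/
  sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))
```

When no error series exist in the window, the query returns no data rather than 0, which matters if it is used in an alert rule.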
Example PromQL Queries
Below are some example PromQL queries for using span metrics in Prometheus or Grafana:
- Requests per second for a specific service:
- 99th percentile latency for a specific HTTP endpoint:
histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket{service_name="my-app", http_host="api.example.com", http_status_code="200"}[5m])) by (le))
- Error rate for a service (using HTTP status code):
- Database operation latency (e.g., for PostgreSQL SELECT queries):
histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket{db_system="postgresql", db_operation="SELECT"}[5m])) by (le))
- Request count by HTTP host (for multi-tenant apps):
- Compare latency across environments (clusters):
histogram_quantile(0.90, sum(rate(traces_spanmetrics_latency_bucket{service_name="my-app"}[5m])) by (le, k8s_cluster_name))
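Three of the items above (requests per second, error rate by HTTP status code, and request count by HTTP host) are listed without a query. The following are possible sketches for them, not necessarily the queries originally intended. The service name my-app, the 5xx pattern used to define an error, and the time windows are all assumptions.

```promql
# Requests per second for a specific service.
sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))

# Error rate for a service: share of calls with a 5xx HTTP status code.
  sum(rate(traces_spanmetrics_calls_total{service_name="my-app", http_status_code=~"5.."}[5m]))
/
  sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))

# Approximate request count per HTTP host over the last hour.
sum by (http_host) (
  increase(traces_spanmetrics_calls_total{service_name="my-app"}[1h])
)
```

Each expression is meant to be run as a separate query. If your services report the status under http_response_status_code instead, substitute that label in the second query.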
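Note: Adjust the label filters to match your application's configuration. Filter only on labels that apply to the kind of span you are querying: for example, db_system is relevant to database spans and http_host to HTTP spans, so combining both in one selector is unlikely to match anything.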
These metrics and queries provide a comprehensive overview of your application's distributed traces, aiding in observability and troubleshooting.