Overview
The cluster logging system is composed of the following components. They can be deployed together as a complete in-cluster logging solution, or separately to store logs outside the cluster.
- Collector
    - collects logs from containers and nodes.
    - adds meta-data describing where the logs came from.
    - forwards annotated logs to an in-cluster Loki store.
    - forwards logs off-cluster: syslog, kafka, cloudwatch and more.
- Store
    - aggregates logs from the entire cluster in a central place.
    - accepts complex queries to select, combine and filter logs.
- Console
    - displays logs selected by simple menu selections or complex store queries.
Cluster Logging Operator
The Cluster Logging Operator (CLO) provides the ClusterLogForwarder (CLF) resource.
This is a simple but flexible API to describe "what you want": which logs to forward, and where to send them.
The operator generates a more complex "how to do it" configuration and deploys a daemonset running DataDog Vector daemons on each cluster node.
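For illustration only, a minimal ClusterLogForwarder that forwards application logs to an off-cluster syslog receiver might look like the sketch below. The field names follow the logging.openshift.io/v1 API, but the exact schema depends on the installed CLO version, and the output URL is a placeholder.
oc apply -f - <<EOF
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-syslog              # placeholder off-cluster destination
      type: syslog
      url: tls://syslog.example.com:6514
  pipelines:
    - name: forward-app-logs
      inputRefs: [application]         # which logs to forward
      outputRefs: [remote-syslog]      # where to send them
EOF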
LokiStack Operator
The LokiStack operator deploys a Grafana Loki log store to hold aggregated logs, and a proxy that controls access to logs based on OpenShift credentials.
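As a rough sketch of querying the store, a LogQL query can be run with logcli against the LokiStack gateway. The route, token handling, and stream label names below are illustrative; they depend on the deployment and on how the collector labels log streams.
export LOKI_ADDR=https://<lokistack-gateway-route>   # placeholder: route of the LokiStack gateway
logcli query --bearer-token="$(oc whoami -t)" --limit=20 '{log_type="application"} |= "error"'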
Log types
Logs are categorized into three types:
Application | Container logs from non-infrastructure containers.
Infrastructure | Container logs from infrastructure containers, plus node logs.
Audit | Node logs from the node audit system (auditd) and audit logs from the Kubernetes and OpenShift API servers.

Container logs are the stdout and stderr output from containers in pods in the cluster. Node logs come from the cluster node operating system: journald and files under /var/log/.
Normalization
Kubernetes does not enforce a uniform format for logs.
Anything that a containerized process writes to stdout or stderr is considered a log.
This "lowest common denominator" approach allows pre-existing applications to run on the cluster.
Traditional log formats write entries as ordered fields, but the order, field separators, format, and meaning of the fields vary.
Structured logs write log entries as JSON objects on a single line; however, the names, types, and meanings of fields in the JSON object vary between applications.
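For example (purely illustrative), the same event might appear as a traditional ordered-field entry or as a structured JSON entry:
127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
{"level":"info","ts":"2023-10-10T13:55:36Z","msg":"request served","path":"/index.html","status":200}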
The Kubernetes Structured Logging proposal will standardize the log format for some k8s components, but there will still be diverse log formats from non-k8s applications running on the cluster.
The collector adds meta-data to container logs as per the cluster logging data model.
Infrastructure node logs lack the kubernetes section since they are not associated with a container.
Audit and k8s event logs are structured logs that contain their own meta-data; they are forwarded unmodified.
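An annotated container log record might look roughly like the abbreviated sketch below; the field names follow the cluster logging data model referenced above, and all values are invented for illustration.
{
  "message": "request served",
  "level": "info",
  "@timestamp": "2023-10-10T13:55:36.000Z",
  "log_type": "application",
  "hostname": "worker-0.example.com",
  "kubernetes": {
    "namespace_name": "my-app",
    "pod_name": "my-app-7d4b9c6f5-abcde",
    "container_name": "server",
    "pod_id": "0f29e8a1-4c3b-4f6e-9d2a-1b2c3d4e5f60"
  }
}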
Metrics and Labels
This section describes the set of metric labels used for logging.
Labels identifying a container
Metrics associated with a Pod get the following labels:
- namespace: namespace name
- pod: pod name
- uid: pod UUID
- node: node name, as returned by oc get node -o=jsonpath='{@.items[*].metadata.name}'

Note: this is the node resource name. It may or may not coincide with the host name, DNS name, or IP address.
Metrics that are associated with a container also get this label:
- container: container name
For example, the following metrics are associated with container logs, and have all the above labels:
- log_logged_bytes_total: provided by a separate agent that watches writes to log files.
- log_collected_bytes_total: provided by the collector.
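For instance, a PromQL expression like the following sketch uses those labels to report per-container log throughput:
sum by (namespace, pod, container) (rate(log_collected_bytes_total[5m]))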
These labels are compatible with kubelet; here’s an example kubelet metric:
# HELP kubelet_container_log_filesystem_used_bytes [ALPHA] Bytes used by the container's logs on the filesystem.
# TYPE kubelet_container_log_filesystem_used_bytes gauge
kubelet_container_log_filesystem_used_bytes{container="authentication-operator",namespace="openshift-authentication-operator",pod="authentication-operator-67c88594b5-zftcn ",uid="ead91de5-5e10-42b9-8ab9-6386f21cd554"} 3.44064e+07
Label identifying a cluster
Multi-cluster deployments need a cluster label.
OpenShift clusters provide a unique, human-readable name via the API, which can be retrieved by:
oc get infrastructure/cluster -o template="{{.status.infrastructureName}}"
This will not work on other clusters. Most cluster providers offer a unique name, but there is no universal, plain-Kubernetes way to get one.
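One common convention (not a standard Kubernetes API) is to use the UID of the kube-system namespace as a cluster identifier:
oc get namespace kube-system -o jsonpath='{.metadata.uid}'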
Prometheus standard labels
These labels are added by Prometheus; they are of limited relevance to logging. They identify the agent that collects logs, not the resource that produced them.
- instance: address of the scrape endpoint in the form "<ip-literal>:<port>"
- job: arbitrary string name to identify related endpoints (e.g. "log_collector")
Observability and Correlation
Observability means collecting, forwarding, storing, analyzing and correlating different types of signal from a cluster to monitor its health, identify and fix problems, and plan for capacity changes.
Types of signal
- Log entry: A block of text (usually one line) written by application or infrastructure processes, annotated by the logging system and forwarded for further processing.
- Metric: A statistic that changes over time and is sampled to produce a time series. Presently we can assume all metrics are in Prometheus format and are sampled by Prometheus. Processes to be monitored must provide an HTTP "scrape endpoint" where the metrics can be read.
- Alert: Alerts are not primary signals but summaries of metric time series. They identify conditions that need attention.
- Trace: Data attached to request-response scopes to track the progress and outcome of a request. Trace context can follow a chain of requests from server to server, provided the servers co-operate in passing the trace context.
Correlation Points
Correlation means associating different types of signal, for example:
- I have an alert saying that an application is in trouble. I want to see logs from that application around the time of the alert.
- I have a log entry showing that a request failed. I want to see traces for the entire life-cycle of that request and dependent requests.
Correlation requires that the signals have data that can be matched, for example:
timestamp | All signals carry a timestamp. Signals can always be correlated by time-interval.
origin resource | Many signals are associated with a resource. Resource signals can be correlated by the labels that identify that resource (namespace, pod, uid, node, container).
trace-id | Trace signals carry a trace-id. Applications that support tracing should include the trace-id in an identifiable way in their logs. [2] Metrics can’t carry trace-ids due to cardinality limitations.

Traces need more research.
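As a sketch of time-plus-resource correlation, an alert's namespace label and firing window could drive a store query such as the following; the label name, endpoint and times are illustrative.
logcli query --from="2023-10-10T13:50:00Z" --to="2023-10-10T14:00:00Z" '{kubernetes_namespace_name="my-app"} |= "error"'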
The names and formats of correlation fields or labels are not uniform across signals. OpenTelemetry does define a reference data model for all signal types, but it is not compatible with the current naming schemes of OpenShift Logging, Monitoring or Telemetry, or Kubernetes.
The data model we currently use is:
- Metrics: TODO. OpenShift and Kubernetes have consistent labels and naming schemes.
- Traces: TODO
Correlation Stories
- On receiving an alert, I want to see correlated logs or traces.
- On reading a log, I want to see correlated traces or metrics.
- On following a trace, I want to see correlated logs or metrics.
- I want to query for a report on logs, metrics, or traces and have correlated signals included as well.