Grafana Cloud
Grafana Cloud is a hosted version of Grafana, Prometheus, Loki, and Tempo.
If you prefer this observability stack, you no longer have to host and maintain it yourself. Self-hosting comes at a cost: providing a service with good uptime, retention, and durability brings problems that you have to solve on your own.
tip
If you decide to self-host, this chapter is still worth reading. Even though it is about setting up Grafana Cloud, the pricing model of Grafana Cloud forces you to be diligent with your metrics: store only a subset of all available metrics, with an emphasis on low cardinality. You will eventually have to do the same exercise with your self-hosted stack to be able to provide a reliable service.
Shipping metrics
Grafana Cloud has a mostly self-explanatory setup. You install Prometheus, Loki, and Tempo on your cluster in a shipping configuration. A Prometheus instance running in your cluster scrapes all metrics, and its remote_write configuration forwards a subset of them to Grafana Cloud, where you benefit from the hosted storage, dashboards, and retention. See the setup guide: https://grafana.com/docs/grafana-cloud/metrics-prometheus/
Alternatively, you can use the Grafana Agent project. It is based on the open-source Prometheus and Loki projects, and it packages only their metric and log shipping parts into a small binary.
If you chose Gimlet Stack as the installation method, it has a preconfigured Grafana Cloud integration with pruned metrics and logs.
Day-two operations
Billing alerts
When you use Grafana Cloud, you should always set billing alerts.
The built-in Grafana Cloud Billing dashboard allows you to track your usage. Make a copy of this dashboard, and set alerts for the total billable logs and metrics series.
On the included quota
Depending on your package, Grafana Cloud includes:
- 100GB logs per month
- 15000 metrics
The logging quota is fairly straightforward, but the metrics quota is not so self-evident.
- Most off-the-shelf exporters push you over the 15K limit
- To only ship metrics you use in your dashboards, put them on an allow list. See how.
- Cloud billing is a dark art; learn how Grafana bills.
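As a rough illustration of the allow list idea, the sketch below builds one regular expression from the metric names your dashboards use. It assumes the allow list is applied through a `write_relabel_configs` rule with `action: keep` on `__name__` in the Prometheus `remote_write` section; that is one common way to do it, not necessarily the way the linked guide does. The metric names and the helper `allow_list_regex` are hypothetical.

```python
import re

# Metric names may contain letters, digits, underscores, and colons.
VALID_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def allow_list_regex(metric_names):
    """Join metric names into a single alternation for a keep-style rule."""
    for name in metric_names:
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"not a valid metric name: {name!r}")
    # Deduplicate and sort so the generated regex is stable between runs.
    return "|".join(sorted(set(metric_names)))


# Hypothetical list of metrics referenced by dashboards.
dashboard_metrics = ["up", "image_process_time_bucket", "http_requests_total"]
pattern = allow_list_regex(dashboard_metrics)
print(pattern)

# Prometheus anchors relabel regexes on both ends; fullmatch mimics that here.
print(bool(re.fullmatch(pattern, "up")))              # kept
print(bool(re.fullmatch(pattern, "uptime_seconds")))  # dropped
```

The resulting string would go into the `regex` field of the keep rule, so that only the listed metrics are forwarded and everything else stays local to the cluster.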
On the cost of Histogram metrics
Histogram metrics weigh heavier than other metrics. Each distinct combination of label values counts as a metric series.
If you have 3 labels with 10 different values each, that is up to 10 x 10 x 10 = 1000 series. So be careful with the number of different values you have per label.
This is especially true for histograms, as every bucket is a series of its own: with 10 buckets (roughly the default, depending on the client library), a histogram coming from a single server/pod/thread counts as 10 series.
If you have 10 buckets and 10 workers, that is 100 series coming from a single metric line in code.
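The arithmetic above can be sketched in a few lines. The helper `series_count` is hypothetical and gives an upper bound only: it assumes every combination of label values actually occurs, and it counts bucket series only, not the additional `_sum` and `_count` series a Prometheus histogram also exposes.

```python
from math import prod


def series_count(label_cardinalities, buckets=1):
    """Upper bound on series for one metric.

    label_cardinalities: number of distinct values per label.
    buckets: number of histogram buckets (1 for a plain counter or gauge).
    """
    return prod(label_cardinalities) * buckets


# 3 labels with 10 distinct values each
print(series_count([10, 10, 10]))      # 1000

# a histogram with 10 buckets reported by 10 workers
print(series_count([10], buckets=10))  # 100
```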
To identify the largest metrics you have, you can run
topk(10, count by (__name__)({__name__=~".+"}))
The top metric for me had 672 metric series.
Querying the metric, I could see that it has only a few labels (cluster, job, and le), albeit with rather high cardinality.
- count(count by (le) (image_process_time_bucket)) shows 21 buckets
- count(count by (job) (image_process_time_bucket)) shows 31 distinct jobs
- count(count by (cluster) (image_process_time_bucket)) shows that they come from 2 clusters
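As a quick sanity check, the three counts above can be multiplied to get an upper bound and compared with the 672 series actually observed. The interpretation in the comments is an inference from the numbers, not something the queries show directly.

```python
from math import prod

# Distinct values per label, as reported by the queries above.
label_values = {"le": 21, "job": 31, "cluster": 2}
observed_series = 672

# If every (le, job, cluster) combination existed, there would be this many series.
upper_bound = prod(label_values.values())
print(upper_bound)  # 1302

# The observed count is lower because not every job runs in every cluster.
# Dividing by the bucket count gives the number of (job, cluster) pairs in use.
pairs = observed_series // label_values["le"]
print(pairs)  # 32
```

So the product of per-label counts overestimates here; counting series directly, as the topk query does, is the more reliable number.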
Since I pay $16 for 1000 series, this single metric (a single line in code) costs $5 a month. High-cardinality histograms are rather expensive.
tip
See how to analyze metrics cardinality
Grafana Cloud also includes a dashboard for cardinality analysis.