A Comprehensive Guide to Monitoring Linux Servers with Prometheus and Grafana

Dear fellow infrastructure geek,

Monitoring our Linux servers is critical to maintaining high performance and reliability as our systems scale. We can't improve what we don't measure!

In this comprehensive guide, I'll share my experiences and recommendations for building a robust monitoring stack using Prometheus and Grafana – two powerful open source tools.

Why Prometheus and Grafana?

There are many monitoring solutions out there – both open source and commercial. So why Prometheus and Grafana?

Prometheus is a next-generation monitoring system designed for dynamic, large-scale environments. Its multi-dimensional data model and powerful query language make it stand out from other tools.

Some key capabilities:

Pull-based scraping using exporters – avoids complex push configurations
Highly dimensional data allows slicing and dicing of metrics
Built-in time series database for operational intelligence
PromQL lets you ask non-trivial questions about your data
Handles millions of metrics and hundreds of instances

Grafana is the best metric visualization solution I've used, period. Its intuitive UI and many panel types enable you to build beautiful, information-rich dashboards with ease.

Notable features:

Support for dozens of data sources – integrate all your data in one place
Clean and customizable dashboard templates
Intuitive query builders and config options
Annotations, alerts and template variables for insights
Thriving community with 1000+ dashboard templates

Together, Prometheus and Grafana provide a monitoring experience that is truly greater than the sum of its parts. The deep integration, from data source setup to PromQL templating, enables powerful exploratory monitoring workflows.

According to a Datadog survey, over 50% of developers now use Prometheus and Grafana – significantly higher adoption than alternatives. This massive community provides long-term sustainability.

Now let's get to the good stuff – how to set up Prometheus and Grafana for monitoring Linux!

Step-by-Step Installation Guide

I'll be demonstrating installation on CentOS 7, but the instructions should be adaptable for any modern Linux distribution.

Installing Prometheus

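The Prometheus server is a single static binary, which makes it easy to deploy. Download a release (this guide uses v2.18.1; check prometheus.io for the latest), extract it, and copy the binaries to /usr/local/bin, where the systemd unit further down expects them:

wget https://github.com/prometheus/prometheus/releases/download/v2.18.1/prometheus-2.18.1.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cp prometheus-2.18.1.linux-amd64/prometheus prometheus-2.18.1.linux-amd64/promtool /usr/local/bin/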

I would recommend deploying Prometheus as a dedicated prometheus user for security:

useradd --no-create-home --shell /bin/false prometheus

We need to provide storage for Prometheus's time series database. I'll use /data/prometheus, owned by the prometheus user, and create /etc/prometheus for the config file at the same time:

mkdir -p /data/prometheus /etc/prometheus
chown prometheus:prometheus /data/prometheus

Now we can configure /etc/prometheus/prometheus.yml – Prometheus's main config file:

global:
  scrape_interval: 15s
  external_labels:
    monitor: linux-1

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: linux
    static_configs:
      - targets: ['localhost:9100'] # Node exporter

This config instructs Prometheus to scrape itself on port 9090 as well as Node Exporter on port 9100.
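
To make the scrape configuration concrete, here is a minimal Python sketch of how each static target becomes a scrape URL. It assumes Prometheus's defaults of the http scheme and the /metrics path. The `scrape_urls` helper and the dict layout are illustrative names of my own, not part of Prometheus.

```python
# Illustrative only: mirrors the scrape_configs section of prometheus.yml as a plain dict.
SCRAPE_CONFIGS = [
    {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "linux", "static_configs": [{"targets": ["localhost:9100"]}]},
]


def scrape_urls(scrape_configs):
    """Return (job, url) pairs, applying the default scheme and metrics path."""
    urls = []
    for job in scrape_configs:
        scheme = job.get("scheme", "http")            # Prometheus default
        path = job.get("metrics_path", "/metrics")    # Prometheus default
        for static in job.get("static_configs", []):
            for target in static["targets"]:
                urls.append((job["job_name"], f"{scheme}://{target}{path}"))
    return urls
```

Each pair is one endpoint that Prometheus polls every scrape_interval, which is why adding a host is just a matter of adding another entry to a targets list.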

Finally, we can create a systemd unit file (for example /etc/systemd/system/prometheus.service) to manage the Prometheus server process:

[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /data/prometheus/ 

[Install]
WantedBy=multi-user.target

Start it up, and enable it so it comes back after a reboot:

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus

You can validate it's working by opening port 9090 on the server in a browser (http://localhost:9090 if you are on the host itself). Let's move on to Grafana!
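
Before moving on, the same check can be scripted. The sketch below reads Prometheus's /api/v1/targets endpoint and reports the health of each scrape target. `PROMETHEUS_URL` is an assumption (adjust it to your server), and the helper names are mine.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: Prometheus runs on this host


def summarize_targets(payload):
    """Map (job, instance) to health from an /api/v1/targets response body."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        summary[(labels["job"], labels["instance"])] = target["health"]
    return summary


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Fetch the active targets from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `summarize_targets(fetch_targets())` against a running server lists every configured target. Expect the linux job to report "down" until Node Exporter is installed and listening on port 9100.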

Installing and Configuring Grafana

Grafana has official repositories for most major distros – I'll use their Yum repo on CentOS by creating /etc/yum.repos.d/grafana.repo:

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

Now install Grafana – I prefer the OSS package even though enterprise support is available:

yum install grafana
systemctl start grafana-server

By default Grafana runs on port 3000 – navigate there and log in with the initial admin credentials (admin / admin; Grafana asks you to change the password on first login).

Let's set up Prometheus as a data source right away – the defaults are perfect:

[Screenshot: Grafana Prometheus data source config]
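
If you would rather script this step than click through the UI, Grafana's HTTP API accepts a POST to /api/datasources. This is a minimal sketch, not the only way to do it: the Grafana URL and credentials are assumptions you must supply, and the helper names are mine.

```python
import base64
import json
import urllib.request


def datasource_payload(prometheus_url="http://localhost:9090"):
    """Build the JSON body for Grafana's POST /api/datasources endpoint."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",   # Grafana's backend makes the requests to Prometheus
        "isDefault": True,
    }


def create_datasource(grafana_url, user, password, payload):
    """POST the data source to Grafana using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return json.load(resp)
```

For example, `create_datasource("http://localhost:3000", user, password, datasource_payload())` would register the local Prometheus server, assuming Grafana is reachable on port 3000 and the account has admin rights.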

Now we can start building dashboards! I'll import the Node Exporter Full dashboard to monitor my Linux host.

Here's a peek at the system metrics it provides out of the box:

[Screenshot: Grafana dashboard for Node Exporter]

With Prometheus and Grafana set up, let's look at gathering metrics.

Exporters – Metrics Exposed

The most reliable way to get metrics into Prometheus is via exporters. Exporters are small programs that expose metric endpoints on your infrastructure.

Node Exporter is the standard for gathering Linux host metrics. It exposes OS, hardware, disk and network stats. Installation is just a matter of extracting the binary:

wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz 
cd node_exporter-1.0.1.linux-amd64
./node_exporter

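This exposes metrics on port 9100. Prometheus scrapes them through the linux job we defined in prometheus.yml; nothing is discovered automatically with static_configs, so each additional host has to be added to the targets list. Running ./node_exporter in the foreground is fine for a quick test, but for anything longer-lived run it under systemd, just like Prometheus itself.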
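
To see what Prometheus actually receives from an exporter, here is a small sketch that parses the plain-text exposition format served on /metrics. It is a simplified parser of my own for illustration (it does not unescape label values and ignores timestamps), not the parser Prometheus uses.

```python
import re
import urllib.request

# metric_name{labels} value   (labels and a trailing timestamp are optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:9100/metrics", timeout=5):
    """Download and parse the metrics page (assumes Node Exporter on this host)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Each tuple corresponds to one time series: the metric name plus its label set identifies the series, and the value is the sample recorded at scrape time.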

Another super useful exporter is cAdvisor for container metrics. cAdvisor integrates seamlessly with Docker to provide Prometheus metrics on container resource usage and performance.

There are hundreds of other exporters available for infrastructure like databases, storage, network etc. Exporters allow Prometheus to monitor almost anything!

Dashboarding and Alerting

Now that we have metrics flowing into Prometheus, let's unlock the power of Grafana. While the out-of-the-box dashboards are great starting points, you'll eventually want to build custom dashboards tailored to your stack.
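
Before building panels, it can help to confirm that a query returns data. The sketch below builds an instant query against Prometheus's /api/v1/query endpoint and flattens the result. The PromQL expression is one common way to express CPU utilization from Node Exporter metrics, and the helper names and base URL are assumptions of mine.

```python
import json
import urllib.parse
import urllib.request

# Percentage of CPU time not spent idle, per instance, over the last 5 minutes.
CPU_BUSY = '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'


def query_url(base_url, promql):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def vector_values(payload):
    """Extract {instance: value} from an instant-vector query response."""
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def run_query(promql, base_url="http://localhost:9090", timeout=5):
    """Run the query against a live server and return the flattened result."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return vector_values(json.load(resp))
```

The same expression can be pasted into a Grafana panel's query field once `run_query(CPU_BUSY)` returns sensible numbers.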

Creating Custom Dashboards

Grafana makes it easy to build custom dashboards with its intuitive UI. I follow these best practices for effective monitoring dashboards:

1. Focus on business priorities – What metrics align with your organizational goals? Response time, revenue, conversion metrics, uptime SLAs etc. Center your dashboard around what matters most.

2. Use annotations to correlate events – Outages, deployments, config changes. Call out events on dashboards with annotations.

3. Break up long dashboards – Avoid cramming everything into one massive dashboard. Use dashboard variables, links and folders to split things up.

4. Lay out and group related metrics – Logical visualization grouping helps users quickly reason about the data.

5. Summarize and aggregate – Use templates, repeats and queries to roll up metrics. Don't show too much granularity.

6. Use colors and thresholds thoughtfully – Color-code metrics meaningfully. Don't overdo it.

7. Monitor SLIs, SLAs and SLOs – Tracking service level indicators, agreements and objectives helps focus monitoring.

8. Include troubleshooting info – Provide links to documentation, runbooks and tools for debugging issues.

Following Grafana best practices results in effective, actionable dashboards.

Alerting and Integrations

To fully leverage monitoring data, we need alerting workflows.

Prometheus has a simple yet powerful alerting engine built in. You define alerting rules that fire on metric thresholds or patterns, and Prometheus hands firing alerts to Alertmanager, which sends the notifications. Rules live in a YAML file listed under rule_files in prometheus.yml. For example:

groups:
  - name: api
    rules:
      - alert: APIHighLatency
        expr: api_http_request_latencies_second{quantile="0.99"} > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: High API latency

This would trigger a critical alert if the 99th percentile request latency exceeds 1 second for over 1 minute.
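
The "for over 1 minute" part is worth spelling out: an alert sits in a pending state while the condition holds, and only fires once it has held for the whole duration. The following is a simplified model of that behaviour under my own assumptions (one evaluation per sample, a hypothetical `alert_states` helper), not Prometheus's actual rule evaluator.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each (timestamp, value) evaluation."""
    states = []
    active_since = None  # timestamp at which the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states
```

With a 60 second hold and evaluations every 15 seconds, a latency spike that clears after 30 seconds never leaves pending, which is how the clause filters out brief blips.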

Alertmanager, Prometheus's notification component, can route alerts to email, PagerDuty, Slack, Pushover and more. For example, you can send critical alerts to PagerDuty while informational alerts go to Slack.
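
As a rough illustration of severity-based routing, the sketch below picks a receiver from an alert's labels using first-match-wins rules. It is a deliberately simplified model: the receiver names and the `pick_receiver` helper are hypothetical, and real routing trees support nesting, regex matchers and grouping that are omitted here.

```python
# Hypothetical routes: (label matchers, receiver name), checked in order.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "info"}, "slack"),
]
DEFAULT_RECEIVER = "email"


def pick_receiver(alert_labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default
```

The severity label set in the alerting rule is what drives the decision, so it pays to agree on a small, fixed set of severity values across all rules.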

Grafana also has alerting capabilities via Grafana Alerts. You can define alert rules directly on dashboard queries, which are evaluated by Grafana's server.

In addition to alerts, Grafana integrates with hundreds of other systems via plugins. Some examples:

ChatOps with Slack, Teams and Discord plugins
Infrastructure workflows with Ansible, Terraform and Puppet plugins
Business analytics with Tableau and PowerBI plugins
Ticketing systems like ServiceNow, Jira and more

The integrations are endless!

ProTips for Prometheus

Here are some pro tips from my experience running Prometheus at scale to help you get the most from it:

Fine-tune scrape intervals: Scrape too often and you waste resources. Scrape too rarely and you miss short-lived spikes. I find 15-60s is a good range for most deployments.

Watch your cardinality: Every unique combination of label values is stored as its own time series, so a label with unbounded values (user IDs, request paths) multiplies the series count. Watch for label explosion.

Manage disk space: Prometheus needs sufficient space for its TSDB. Retention policies (--storage.tsdb.retention.time) help. Remote write can offload long-term storage.

Use metric relabelling: Relabelling lets you drop, rename or rewrite labels and series at scrape time for cleaner, more meaningful metrics.

Test with PromQL: Learn PromQL well and use it to verify alerts, troubleshoot issues and build dashboards.

Enable remote read/write: For long term storage and analysis, use remote read/write to tools like Cortex or Thanos.

Set up federation: Federation allows scaling and aggregating metrics from multiple Prometheus instances.
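
The cardinality and disk space tips above lend themselves to quick back-of-the-envelope arithmetic. This sketch uses the sizing rule of thumb from the Prometheus storage documentation (retention × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The helper names and the example numbers are assumptions for illustration.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def disk_bytes(retention_seconds, samples_per_second, bytes_per_sample=2.0):
    """Rough TSDB size: retention * ingestion rate * bytes per sample."""
    return retention_seconds * samples_per_second * bytes_per_sample


# Example: 8 CPUs x 8 modes x 100 hosts for a single CPU metric.
cpu_series = series_count({"cpu": 8, "mode": 8, "instance": 100})

# Example: 100,000 series scraped every 15s and kept for 15 days.
estimate = disk_bytes(15 * 86400, 100_000 / 15)
```

Here `cpu_series` comes to 6,400 and `estimate` to roughly 17 GB, which shows how one extra high-cardinality label or a halved scrape interval feeds straight into storage needs.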

Mastering Prometheus does take time and practice. Our robust monitoring depends on it!

Closing Thoughts

I hope this guide provided a comprehensive overview of monitoring Linux with Prometheus and Grafana. The possibilities are endless when you combine storage, visualizations and alerting together.

Here are some key takeaways as you build your monitoring stack:

Start with identifying your core metrics and SLIs/SLOs
Build dashboards tailored to your organization's needs
Leverage exporters to easily expose infrastructure metrics
Use Prometheus recording rules and alerts for automation
Integrate with other tools for workflows and analytics
Customize and tweak configurations as your environment evolves

Prometheus and Grafana are incredible open source tools for observability and monitoring. The vibrant community provides long-term sustainability. I'm excited to see all the monitoring best practices you discover and implement!

Let me know if you have any other questions. Happy to help out a fellow metrics geek!

Regards,
[Your Name]