How to Automatically Restart Crashed Services - A Sysadmin's Guide

As a fellow sysadmin, I know how frustrating it can be when a critical service or process crashes unexpectedly. Every minute of downtime impacts customers and revenue, not to mention the headaches it causes us scrambling to get things back up.

In this comprehensive guide, I'll share the methods I've learned over the years for setting up automatic restarts of key services when they crash or stop unexpectedly on Linux systems. Having these safeguards in place has saved my bacon many times!

I'll cover when to use auto restarts, different ways to implement them, key considerations, and some pro tips I've picked up dealing with unreliable services. My goal is to provide a detailed technical explainer of the tools and techniques you can add to your sysadmin toolkit for minimizing downtime.

Why Auto Restart Services?

Let's quickly cover why having auto restart capabilities is so critical for any sysadmin:

Minimizes Disruption – Quickly bringing services back up after a crash keeps disruption to our users and customers to a minimum. Even small amounts of downtime can result in lost revenue, SLA violations, and damage to our company's reputation.

Buys Diagnosis Time – Auto restarting buys us time to properly investigate the root cause of a crash, without the pressure and stress of ongoing downtime and user complaints.

Provides Resilience – Despite our best efforts, bugs, overloads and unplanned events happen. Auto restarts make our environment more resilient against inevitable crashes.

Covers Monitoring Gaps – Even robust monitoring and alerting can sometimes miss outages or not react quickly enough. Auto restarts provide another layer of protection.

You're probably already sold on the benefits. Now let's dig into the techniques and configs that make auto restarting happen.

Methods for Auto Restarting Services on Linux

There are several robust options available for automatically restarting stopped or crashed services on Linux:

1. Bash Scripts with Cron

Writing simple bash scripts that restart downed services and scheduling them with cron is a straightforward and portable approach. For example:

#!/bin/bash

# Restart Nginx if it's not running
pgrep nginx >/dev/null || sudo systemctl restart nginx

# Restart PHP-FPM if it's not running
pgrep php-fpm >/dev/null || sudo systemctl restart php-fpm

This uses pgrep to check whether a process with a matching name is running, and restarts the service if none is found.

We'd schedule it in crontab to run every five minutes:

*/5 * * * * /path/to/script.sh

Pros:

Simple, works on any Linux distro
Easy to customize checks for different services
Can set any interval with cron

Cons:

Need a script for each group of services
Only checks at cron intervals, not continuously
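
One weakness of the pgrep check is that it matches on process name, so an unrelated process with a similar name can mask an outage. A minimal sketch of an alternative, assuming the services are systemd units and the script runs from root's crontab (so no sudo is needed): ask systemd whether each unit is active and restart only the ones that are not. The `ensure_active` function and the `SYSTEMCTL` override variable are illustrative names of my own, not part of any tool.

```shell
#!/bin/sh
# Illustrative helper (not part of any tool): restart a systemd unit
# only when systemd reports it as not active.
# SYSTEMCTL is a hypothetical override hook, handy for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

ensure_active() {
    unit="$1"
    # "is-active --quiet" exits 0 only when the unit is active
    if "$SYSTEMCTL" is-active --quiet "$unit"; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $unit is not active, restarting"
    "$SYSTEMCTL" restart "$unit"
}

# Check every unit named on the command line
for unit in "$@"; do
    ensure_active "$unit"
done
```

The crontab entry would then pass the unit names as arguments, for example `*/5 * * * * /path/to/script.sh nginx php-fpm` (the exact PHP-FPM unit name varies by distro).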

2. systemd Service Restart Options

For services managed by systemd, you can configure restart directly in the unit file with options like:

[Service]
Restart=always
RestartSec=30

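With this configuration, systemd restarts the service whenever its process exits, whether cleanly or from a crash, after waiting 30 seconds. A deliberate systemctl stop is still honored and does not trigger a restart.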

Pros:

No extra scripts needed
Granular control over restart policy and timing
Can handle both crashes and clean exits, depending on the Restart= policy

Cons:

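Only applies to services managed by systemd
Reacts to process exits, not to a hung but still running service (unless a watchdog is configured)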
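
Rather than editing a packaged unit file directly, these settings are usually placed in a drop-in override so that package upgrades do not overwrite them. A minimal sketch, assuming the standard /etc/systemd/system layout; the `write_restart_dropin` function and the `restart.conf` file name are my own choices, and the optional fourth argument exists only so the function can be tried against a scratch directory first.

```shell
#!/bin/sh
# Illustrative helper: write a drop-in that sets a restart policy for a unit.
# Usage: write_restart_dropin UNIT POLICY DELAY [ROOT]
# ROOT defaults to /etc/systemd/system; pass a scratch dir to experiment.
write_restart_dropin() {
    unit="$1"      # full unit name, e.g. nginx.service
    policy="$2"    # e.g. always or on-failure
    delay="$3"     # e.g. 30 (seconds)
    root="${4:-/etc/systemd/system}"

    # Drop-ins live in a directory named after the unit plus ".d"
    mkdir -p "$root/$unit.d"
    cat > "$root/$unit.d/restart.conf" <<EOF
[Service]
Restart=$policy
RestartSec=$delay
EOF
}
```

For example, `write_restart_dropin nginx.service always 30` followed by `systemctl daemon-reload` lets systemd pick up the override, and `systemctl cat nginx.service` shows the unit together with its drop-ins.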

3. Sysadmin-Friendly Tools like Monit or God

There are open source tools like Monit and God specifically built for process monitoring and restarting.

For example, Monit lets you specify services to monitor and customize restart conditions:

check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start" 
  stop program = "/etc/init.d/nginx stop"
  if failed host localhost port 80 protocol http then restart

Pros:

Advanced and robust monitoring options
Flexible policies for service restarts
Additional features like alerts

Cons:

Learning curve and setup of 3rd party tools
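
To make the logic behind a rule such as `if failed host localhost port 80 protocol http then restart` more concrete, here is a rough shell equivalent of a single monitoring cycle with a consecutive-failure threshold (the kind of condition Monit expresses with `for 3 cycles`). It is a sketch of the idea, not how Monit is implemented: `check_cycle`, `probe_http`, the counter file and the `PROBE`/`RESTART` override variables are all illustrative names, and curl is assumed to be installed.

```shell
#!/bin/sh
# Sketch of one monitoring cycle: probe a URL and restart the service after
# a number of consecutive failed probes. All names here are illustrative.
# PROBE and RESTART are overridable so the logic can be dry-run.
PROBE="${PROBE:-probe_http}"
RESTART="${RESTART:-systemctl restart}"

probe_http() {
    # -f: treat HTTP errors as failure, -s: silent, --max-time: 5 s budget
    curl -fs --max-time 5 -o /dev/null "$1"
}

# check_cycle URL UNIT COUNT_FILE THRESHOLD
check_cycle() {
    url="$1"; unit="$2"; count_file="$3"; threshold="$4"
    if "$PROBE" "$url"; then
        echo 0 > "$count_file"      # healthy: reset the failure streak
        return 0
    fi
    fails=$(cat "$count_file" 2>/dev/null || true)
    fails=$(( ${fails:-0} + 1 ))
    if [ "$fails" -ge "$threshold" ]; then
        echo "$unit failed $fails checks in a row, restarting"
        $RESTART "$unit"            # unquoted on purpose: command plus args
        fails=0
    fi
    echo "$fails" > "$count_file"
}
```

Run from cron or a loop, a call such as `check_cycle http://localhost/ nginx /var/tmp/nginx.fails 3` would restart Nginx only after three failed probes in a row. A dedicated tool adds what this sketch lacks, such as alerting, timeouts on the restart itself, and a management interface.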

4. Commercial High Availability Solutions

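For truly mission-critical uptime, high availability and clustering solutions provide advanced failover and redundancy. Pacemaker, for example, is an open source cluster resource manager, and several vendors sell supported HA products built around it.

For instance, you can run a redundant pair of nodes with a floating IP that moves to the standby node if the primary crashes.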

Pros:

Maximum redundancy for critical services
Automated failover capabilities
Advanced monitoring and cluster management

Cons:

Expensive
Complex setup and management

5. Language and Framework Restart Hooks

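Some language ecosystems and frameworks provide hooks for restarting failed processes from within the application's own tooling.

For example, the Node.js cluster module lets a primary process detect a worker's exit and fork a replacement, and Python's Tornado framework ships tornado.autoreload, which restarts the process when source files change (intended mainly for development).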

Pros:

Handled automatically by the platform
No separate tools to configure
Fast restarting

Cons:

Only works for certain frameworks
Doesn't cover all failure scenarios
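
Whatever the language, these in-process mechanisms generally come down to a parent that watches a child and starts it again when it exits abnormally. A minimal sketch of that pattern as a plain shell wrapper; the `supervise` function, its arguments and the doubling backoff are illustrative choices, not any framework's API.

```shell
#!/bin/sh
# supervise MAX_RESTARTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND while it exits non-zero, up to MAX_RESTARTS restarts,
# doubling the delay after each failure. A clean exit (status 0) ends the loop.
supervise() {
    max="$1"; delay="$2"; shift 2
    restarts=0
    while :; do
        if "$@"; then
            return 0                # clean exit: nothing to restart
        else
            status=$?               # exit status of the failed run
        fi
        if [ "$restarts" -ge "$max" ]; then
            echo "giving up after $restarts restarts (last status $status)" >&2
            return "$status"
        fi
        restarts=$((restarts + 1))
        echo "exited with status $status, restart $restarts/$max in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))        # simple exponential backoff
    done
}
```

A call such as `supervise 5 1 python3 billing.py` would retry the job up to five times, waiting 1, 2, 4, 8 and 16 seconds between attempts. Like any in-process hook, the wrapper cannot help if it is itself killed, which is why it complements rather than replaces an external supervisor.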

So those are the most common and useful options for automatically restarting stopped or crashed services. The right approach depends on your environment and services, which leads into…

Key Considerations When Implementing Auto Restarts

Here are some important best practices I've learned for setting up and managing auto restarting mechanisms:

Tune the Restart Rate – Be careful not to restart too aggressively as that can make problems worse in some cases. Add delays and rate limiting as needed.

Comprehensive Monitoring – Make sure you have visibility into the restart events themselves, as well as system metrics.

Alert on Restarts – Send notifications when restarts occur so teams can investigate the root cause.

Log Analysis – Review logs to identify factors leading to crashes, such as memory leaks or high load.

Performance Tuning – Tune and profile services to minimize restarts; don't just rely on restarts as a band-aid.

Test Extensively – Simulate failures to ensure restart mechanisms work as intended.

Detailed Documentation – Document how the restart configurations work for each service.

Defense in Depth – Pair auto restarts with other practices such as capacity planning, QA processes, and safe deployment methods.
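
The rate-limiting advice above can be sketched for the cron-script approach as a small guard that refuses further restarts once too many have happened within a time window. This is a hedged illustration: `allow_restart`, the state-file format (one Unix timestamp per line) and the path in the usage line are my own assumptions. For systemd-managed services, the StartLimitIntervalSec= and StartLimitBurst= unit settings serve a similar purpose.

```shell
#!/bin/sh
# allow_restart STATE_FILE MAX WINDOW_SECONDS
# Succeeds (and records the attempt) if fewer than MAX restarts were
# recorded in the last WINDOW_SECONDS; fails otherwise.
allow_restart() {
    state="$1"; max="$2"; window="$3"
    now=$(date +%s)
    cutoff=$((now - window))
    [ -f "$state" ] || : > "$state"

    # Keep only the timestamps that are still inside the window
    recent=$(awk -v c="$cutoff" '$1 > c' "$state")
    count=$(printf '%s\n' "$recent" | awk 'NF { n++ } END { print n + 0 }')

    if [ "$count" -ge "$max" ]; then
        printf '%s\n' "$recent" > "$state"   # prune old entries, deny restart
        return 1
    fi
    # Record this attempt alongside the surviving entries
    { [ -z "$recent" ] || printf '%s\n' "$recent"; echo "$now"; } > "$state"
}
```

Used as `allow_restart /var/tmp/nginx.restarts 3 600 && systemctl restart nginx`, this would permit at most three restarts in any ten-minute window and leave the service down after that, which is usually the point at which a human should look at it.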

Auto Restart Services by Category

Let's look at some real examples of how I set up auto restarts for common categories of services:

Web Servers

For front-end web servers like Nginx, whose memory use can climb under heavy load or with misbehaving modules, I take two approaches:

Use Monit to restart Nginx if it becomes unresponsive, and to alert if it uses too much RAM:

check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program = "/etc/init.d/nginx stop"
  if failed host localhost port 80 protocol http for 3 cycles then restart
  if memory > 80% for 3 cycles then alert

Use systemd's memory limiting features:

[Service]
MemoryMax=1G

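This keeps Nginx from eating up all system RAM. If the limit is exceeded, the kernel's out-of-memory killer terminates processes in the service's control group, and systemd then restarts the service only if a Restart= policy is configured for it.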
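
To see how close a service is running to such a limit, systemd's own accounting can be queried from the shell. A small sketch, assuming memory accounting is enabled for the unit and a systemd version recent enough to support `systemctl show --value`; the `memory_percent` function and the `SYSTEMCTL` override variable are illustrative names.

```shell
#!/bin/sh
# SYSTEMCTL is a hypothetical override hook, handy for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# memory_percent UNIT LIMIT_BYTES
# Prints the unit's current memory use as a whole percentage of LIMIT_BYTES.
memory_percent() {
    unit="$1"; limit="$2"
    current=$("$SYSTEMCTL" show -p MemoryCurrent --value "$unit")
    case "$current" in
        ''|*[!0-9]*)
            # systemd reports a non-numeric value when accounting is unavailable
            echo "no memory accounting data for $unit" >&2
            return 1
            ;;
    esac
    echo $((current * 100 / limit))
}
```

With the 1G limit from the unit snippet above, `memory_percent nginx.service 1073741824` prints the share of the limit currently in use, which is a convenient number to graph or alert on.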

I also have Grafana graphs tracking web server RAM usage, load, and restart events to spot trends.

Database Servers

Databases like MySQL and Postgres are sensitive to restarting too frequently, so for them I rely on systemd's restart options:

[Service]
Restart=on-failure
RestartSec=5min

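This restarts the service only after an unclean exit (for example a non-zero exit status, a fatal signal, or a timeout) and waits 5 minutes before each restart, which in practice caps restarts at roughly one every 5 minutes.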

I also have Prometheus exporting database metrics so I can create alerts for things like connection saturation and high restart rates.
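
Where a full metrics pipeline is not available, a high restart count can also be detected with a few lines of shell. A minimal sketch, assuming a systemd version that exposes the NRestarts property (a count of automatic restarts of the unit); `restarts_exceeded` and the `SYSTEMCTL` override variable are illustrative names.

```shell
#!/bin/sh
# SYSTEMCTL is a hypothetical override hook, handy for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# restarts_exceeded UNIT THRESHOLD
# Exit 0 if the unit's automatic restart count is above THRESHOLD,
# 1 if it is not, and 2 if the count could not be read.
restarts_exceeded() {
    unit="$1"; threshold="$2"
    n=$("$SYSTEMCTL" show -p NRestarts --value "$unit")
    case "$n" in
        ''|*[!0-9]*) return 2 ;;    # property missing or not numeric
    esac
    [ "$n" -gt "$threshold" ]
}
```

A cron job could then run something like `restarts_exceeded postgresql.service 3 && echo "postgresql is restart-looping"` and route the output to whatever alerting channel is in use; the unit name here is only an example.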

Application Servers

For Java application servers like Tomcat, unexpected crashes are often caused by out-of-memory errors or blocked threads.

So in addition to Monit, I use JMX exporting and Grafana to graph heap usage, thread count, and garbage collection metrics. This helps identify memory leaks and performance issues proactively.

Key-Value Stores

For distributed stores like Redis I make use of their built-in high availability features like Redis Sentinel for auto promotion of replica nodes.

I also use exporter integrations to monitor Redis performance metrics and alert when evictions, latency, and errors are increasing. This helps identify problems before restarts are necessary.

Critical Background Jobs

For essential asynchronous jobs, like billing batch processes, I combine Monit monitoring with framework restart hooks:

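# "billing" unit name and systemctl path are placeholders
check process billing matching "billing.py"
  start program = "/usr/bin/systemctl start billing"
  stop program = "/usr/bin/systemctl stop billing"
  every 2 cycles
  if does not exist then restart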

The Python job also wraps its main routine in its own restart handler that retries on failure. This provides a second layer of recovery.

For all critical background jobs, I keep detailed runbook documentation on how to resolve failures, since auto restarting alone may not suffice.

As you can see, the optimal restart approach depends on the type of service and its normal failure modes.

Quality of Life Improvement for Fellow Sysadmins

While you may spend time tweaking and perfecting your auto restart setups, I promise this investment will improve your quality of life as a sysadmin long term.

Some of the benefits I've realized include:

More Restful Nights – Fewer late-night alert wake-ups because services recover automatically.

Increased On-Call Morale – Less scrambling during on-call shifts since services self-heal.

Happier Users – Fast auto restarts mean fewer user complaints about downtime.

Proactive Improvement – Using metrics and logs around restarts to address root causes.

Reduced Fatigue – Spending less time firefighting crashes with auto restart safety nets.

Lower Stress Levels – The peace of mind that comes with having redundancy for failures.

While nothing replaces building stable and resilient services in the first place, auto restarts provide an essential last line of defense that every sysadmin needs.

I hope this guide has provided you with some useful techniques and insights to implement auto restarting in your own environment. As always, feel free to reach out if you have any other questions!
