
As a fellow sysadmin, I know how frustrating it can be when a critical service or process crashes unexpectedly. Every minute of downtime impacts customers and revenue, not to mention the headaches it causes us scrambling to get things back up.
In this comprehensive guide, I'll share the methods I've learned over the years for setting up automatic restarts of key services when they crash or stop unexpectedly on Linux systems. Having these safeguards in place has saved my bacon many times!
I'll cover when to use auto restarts, different ways to implement them, key considerations, and some pro tips I've picked up dealing with unreliable services. My goal is to provide a detailed technical explainer of the tools and techniques you can add to your sysadmin toolkit for minimizing downtime.
Why Auto Restart Services?
Let's quickly cover why having auto restart capabilities is so critical for any sysadmin:
Minimizes Disruption – Quickly bringing services back up after a crash keeps disruption to our users and customers to a minimum. Even small amounts of downtime can result in lost revenue, SLA violations and damage to our company's reputation.
Buys Diagnosis Time – Auto restarting buys us time to properly investigate the root cause of a crash, without the pressure and stress of ongoing downtime and user complaints.
Provides Resilience – Despite our best efforts, bugs, overloads and unplanned events happen. Auto restarts make our environment more resilient against inevitable crashes.
Covers Monitoring Gaps – Even robust monitoring and alerting can sometimes miss outages or not react quickly enough. Auto restarts provide another layer of protection.
You're probably already sold on the benefits. Now let's dig into the techniques and configs that make auto restarting happen.
Methods for Auto Restarting Services on Linux
There are several robust options available for automatically restarting stopped or crashed services on Linux:
1. Bash Scripts with Cron
Writing simple bash scripts that restart downed services and scheduling them with cron is a straightforward and portable approach. For example:
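#!/bin/bash
# Run from root's crontab so systemctl can restart services without sudo.
# Restart Nginx if it's not running
pgrep -x nginx >/dev/null || systemctl restart nginx
# Restart PHP-FPM if it's not running (the unit name varies by distro and PHP version)
pgrep php-fpm >/dev/null || systemctl restart php-fpm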
This checks if the processes are running with pgrep and restarts them if not found.
We'd schedule it in crontab like:
*/5 * * * * /path/to/script.sh
Pros:
- Simple, works on any Linux distro
- Easy to customize checks for different services
- Can set any interval with cron
Cons:
- Need a script for each group of services
- Only checks at cron intervals, not continuously
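One weakness of a pgrep check is that it only looks at process names, so it can't tell a unit that systemd considers failed from one that is healthy. A variation I'd consider is asking systemd for the unit state instead. The sketch below assumes the services are systemd units; the function name restart_if_down and the SYSTEMCTL override variable are my own illustrative choices, not part of any tool.

```shell
#!/bin/bash
# Hypothetical helper: restart each named unit that systemd reports as not active.
# SYSTEMCTL can be overridden (for example for a dry run); it defaults to the real binary.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_if_down() {
    local svc
    for svc in "$@"; do
        if "$SYSTEMCTL" is-active --quiet "$svc"; then
            echo "ok: $svc"
        else
            echo "restarting: $svc"
            "$SYSTEMCTL" restart "$svc" || echo "restart failed: $svc" >&2
        fi
    done
}
```

At the end of the cron script you would call something like restart_if_down nginx php-fpm, adjusting the unit names to whatever your distro uses.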
2. systemd Service Restart Options
For services managed by systemd, you can configure restart directly in the unit file with options like:
[Service]
Restart=always
RestartSec=30
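With Restart=always, systemd restarts the service whenever its process exits, whether cleanly or from a crash, waiting 30 seconds before each attempt. A deliberate systemctl stop is still respected and does not trigger a restart.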
Pros:
- No extra scripts needed
- Granular control over restart policy and timing
- Handles both crashes and clean exits, depending on the Restart= value
Cons:
- Only works for services managed by systemd, with no custom check logic like a cron script offers
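Rather than editing a packaged unit file directly, I find it cleaner to put the restart policy in a drop-in directory, which systemd merges over the original unit. Here is a minimal sketch of that idea; the helper name write_restart_dropin, the file name restart.conf, and the optional base-directory argument are assumptions of mine for illustration, and the policy values are just examples.

```shell
# Hypothetical helper: write a drop-in that adds a restart policy to a unit
# without touching the packaged unit file. The optional second argument only
# exists so the function can be pointed at a scratch directory for a dry run.
write_restart_dropin() {
    local unit="$1" base="${2:-/etc/systemd/system}"
    local dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=30
EOF
    # Print the path that was written so callers can log or inspect it.
    echo "$dir/restart.conf"
}
```

After running write_restart_dropin nginx.service as root, a systemctl daemon-reload makes systemd pick up the change, and systemctl show -p Restart nginx.service should confirm which policy is in effect.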
3. Sysadmin-Friendly Tools like Monit or God
There are open source tools like Monit and God specifically built for process monitoring and restarting.
For example, Monit lets you specify services to monitor and customize restart conditions:
check process nginx with pidfile /var/run/nginx.pid
start program = "/etc/init.d/nginx start"
stop program = "/etc/init.d/nginx stop"
if failed host localhost port 80 protocol http then restart
Pros:
- Advanced and robust monitoring options
- Flexible policies for service restarts
- Additional features like alerts
Cons:
- Learning curve and setup of 3rd party tools
4. High Availability Clustering Solutions
For truly mission-critical uptime, HA clustering stacks like Pacemaker (which is open source, with commercial support available from several vendors) provide advanced failover and redundancy.
For instance, you can run a redundant pair of nodes with a floating IP that fails over if the primary node crashes.
Pros:
- Maximum redundancy for critical services
- Automated failover capabilities
- Advanced monitoring and cluster management
Cons:
- Expensive in extra hardware, engineering time, and any commercial support contracts
- Complex setup and management
5. Language and Framework Restart Hooks
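Some language runtimes and frameworks, for example in the Python and Node.js ecosystems, provide hooks for restarting failed processes.
For example, the Node.js cluster module lets a primary process fork a replacement when a worker exits, and Python's Tornado ships tornado.autoreload, which restarts the process when source files change (it is intended for development rather than crash recovery).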
Pros:
- Handled automatically by the platform
- No separate tools to configure
- Fast restarting
Cons:
- Only works for certain frameworks
- Doesn't cover all failure scenarios
So those are the most common and useful options for automatically restarting stopped or crashed services. The right approach depends on your environment and services, which leads into…
Key Considerations When Implementing Auto Restarts
Here are some important best practices I've learned for setting up and managing auto restarting mechanisms:
Tune the Restart Rate – Be careful not to restart too aggressively as that can make problems worse in some cases. Add delays and rate limiting as needed.
Comprehensive Monitoring – Make sure you have visibility into the restart events themselves, as well as system metrics.
Alert on Restarts – Send notifications when restarts occur so teams can investigate the root cause.
Log Analysis – Review logs to identify factors leading to crashes like memory leaks, high load etc.
Performance Tuning – Tune and profile services to minimize restarts; don't rely on restarts as a band-aid.
Test Extensively – Simulate failures to ensure restart mechanisms work as intended.
Detailed Documentation – Document how the restart configurations work for each service.
Defense in Depth – Pair auto restarts with other practices such as capacity planning, QA processes, and safe deployment methods.
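To make the "tune the restart rate" point concrete for the cron-based approach, here is a rough sketch of a rate limiter that a restart script could consult before acting. Everything in it is an assumption for illustration: the function names, the MAX_RESTARTS and WINDOW variables, and the idea of keeping one epoch timestamp per restart in a state file.

```shell
# Hypothetical rate limiter for a cron-driven restart script.
# One epoch timestamp is recorded per restart; further restarts are refused
# once MAX_RESTARTS of them fall within the last WINDOW seconds.
MAX_RESTARTS="${MAX_RESTARTS:-3}"
WINDOW="${WINDOW:-600}"

restart_allowed() {
    local state_file="$1" now="${2:-$(date +%s)}"
    local cutoff=$((now - WINDOW)) recent=0 ts
    if [ -f "$state_file" ]; then
        while read -r ts; do
            if [ -n "$ts" ] && [ "$ts" -gt "$cutoff" ]; then
                recent=$((recent + 1))
            fi
        done < "$state_file"
    fi
    [ "$recent" -lt "$MAX_RESTARTS" ]
}

record_restart() {
    # Second argument lets a caller supply the timestamp; defaults to now.
    echo "${2:-$(date +%s)}" >> "$1"
}
```

A restart script would then do something like: if restart_allowed /var/tmp/nginx.restarts; then restart and record_restart; else send an alert instead. For services managed by systemd, the built-in StartLimitIntervalSec= and StartLimitBurst= settings serve a similar purpose.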
Auto Restart Services by Category
Let's look at some real examples of how I set up auto restarts for common categories of services:
Web Servers
For front-end web servers like Nginx, which can become memory hungry under heavy load, I take two approaches:
- Use Monit to restart if unresponsive and to alert when it uses too much RAM:
check process nginx with pidfile /var/run/nginx.pid
if failed host localhost port 80 protocol http for 3 cycles then restart
if memory > 80% for 3 cycles then alert
- Use systemd's memory limiting features:
[Service]
MemoryMax=1G
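This ensures Nginx can't eat up all system RAM. If the limit is exceeded, the kernel's OOM killer terminates processes inside the service's cgroup, and systemd restarts the service only if a Restart= policy such as on-failure is also set.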
I also have Grafana graphs tracking web server RAM usage, load, and restart events to spot trends.
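If Monit isn't available, the HTTP check from the Monit rule above can be loosely approximated in a shell script with curl. This is only a sketch under my own assumptions: the function names are made up, and the retry loop imitates "for 3 cycles" by probing a few times with a pause in between, which is not the same as Monit's polling cycles.

```shell
# Hypothetical HTTP probe: succeeds only if the URL answers with a
# non-error status within the timeout.
http_healthy() {
    local url="$1" timeout="${2:-5}"
    curl --fail --silent --output /dev/null --max-time "$timeout" "$url"
}

# Report failure only after several probes in a row have failed.
probe_with_retries() {
    local url="$1" attempts="${2:-3}" pause="${3:-10}" i=1
    while [ "$i" -le "$attempts" ]; do
        http_healthy "$url" && return 0
        # Pause between attempts, but not after the final one.
        [ "$i" -lt "$attempts" ] && sleep "$pause"
        i=$((i + 1))
    done
    return 1
}
```

In a cron script this could be used as probe_with_retries http://localhost/ || systemctl restart nginx, which catches a hung server that a plain process check would miss.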
Database Servers
Databases like MySQL and Postgres don't cope well with being restarted too frequently, so for them I rely on systemd's restart options:
[Service]
Restart=on-failure
RestartSec=5min
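With Restart=on-failure, systemd restarts the service only after an unclean exit (a non-zero exit status, a fatal signal, or a timeout), and RestartSec=5min makes it wait 5 minutes before each restart attempt.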
I also have Prometheus exporting database metrics so I can create alerts for things like connection saturation and high restart rates.
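For a quick check of restart rates without a metrics stack, reasonably recent systemd versions expose an NRestarts property per service that counts automatic restarts. The sketch below reads it and warns above a threshold; the function names, the SYSTEMCTL override variable, and the message format are illustrative assumptions rather than anything systemd provides.

```shell
# Hypothetical check: warn when systemd's automatic restart counter for a
# unit reaches a threshold. SYSTEMCTL can be overridden for a dry run.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_count() {
    "$SYSTEMCTL" show --property=NRestarts --value "$1"
}

check_restart_count() {
    local unit="$1" threshold="$2" count
    count="$(restart_count "$unit")"
    count="${count:-0}"
    if [ "$count" -ge "$threshold" ]; then
        echo "WARN: $unit has been auto-restarted $count times"
        return 1
    fi
    echo "OK: $unit restarts=$count"
}
```

Running check_restart_count postgresql.service 3 from cron and mailing any WARN line would be one simple way to get notified; the unit name here is only an example.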
Application Servers
For Java application servers like Tomcat, unexpected crashes are often caused by out of memory errors or blocking threads.
So in addition to Monit, I use JMX exporting and Grafana to graph heap usage, thread count, and garbage collection metrics. This helps identify memory leaks and performance issues proactively.
Key-Value Stores
For distributed stores like Redis I make use of their built-in high availability features like Redis Sentinel for auto promotion of replica nodes.
I also use exporter integrations to monitor Redis performance metrics and alert when evictions, latency, or errors are increasing. This helps identify problems before restarts are necessary.
Critical Background Jobs
For essential asynchronous jobs, like billing batch processes, I combine Monit monitoring with framework restart hooks:
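# The start/stop paths are placeholders; point them at however the job is launched.
check process billing matching "billing.py"
start program = "/path/to/start-billing.sh"
stop program = "/path/to/stop-billing.sh"
every 2 cycles
# Monit also sends an alert for this event when a global "set alert" recipient is configured.
if does not exist then restart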
The Python job also wraps its main loop in its own restart handler that retries on failure, so there are two layers of recovery.
For all critical background jobs, I have detailed runbook documentation on how to resolve failures, since auto restarting alone may not suffice.
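When a job has no restart logic of its own, a small shell wrapper can play a similar role. The following is a minimal sketch, assuming the job signals failure through a non-zero exit status; the function name run_with_retries and the doubling delay are my own choices for illustration.

```shell
# Hypothetical wrapper: run a command, retrying with a growing delay until it
# succeeds or the maximum number of attempts is reached.
# Usage: run_with_retries MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
run_with_retries() {
    local max="$1" delay="$2" attempt=1
    shift 2
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}
```

A billing job could then be launched as run_with_retries 3 30 python3 /path/to/billing.py, where the path is a placeholder. Capping the attempts matters here: a job that keeps failing should end up in front of a human with the runbook, not loop forever.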
As you can see, the optimal restart approach depends on the type of service and its normal failure modes.
Quality of Life Improvement for Fellow Sysadmins
While you may spend time tweaking and perfecting your auto restart setups, I promise this investment will improve your quality of life as a sysadmin long term.
Some of the benefits I've realized include:
More Restful Nights – Fewer late-night alert wake-ups because services are automatically recovered.
Increased On-Call Morale – Less scrambling during on-call shifts since services self-heal.
Happier Users – Fast auto restarts mean fewer user complaints about downtime.
Proactive Improvement – Using metrics and logs around restarts to address root causes.
Reduced Fatigue – Spending less time firefighting crashes with auto restart safety nets.
Lower Stress Levels – The peace of mind that comes with having redundancy for failures.
While nothing replaces building stable and resilient services in the first place, auto restarts provide an essential last line of defense that every sysadmin needs.
I hope this guide has provided you with some useful techniques and insights to implement auto restarting in your own environment. As always, feel free to reach out if you have any other questions!