How to Find the Reason for a Linux Reboot: The Ultimate Troubleshooting Guide

Have you ever come back to your Linux server after a weekend only to find that it rebooted unexpectedly? Annoying, isn't it? As a fellow Linux admin, I feel your pain.

Determining why an unplanned reboot occurred is critical to stop it from happening again. Downtime costs money – we need to get to the bottom of this!

In this comprehensive troubleshooting guide, I'll walk you through the tools and techniques I use to diagnose reboot causes on Linux systems.

By the end, you'll have a clear methodology to analyze reboot reasons and prevent future surprises. Get ready for some pro Linux log sleuthing!

Overview: How Do We Tackle This?

Here's a high-level overview of the reboot forensic process we'll follow:

Confirm exact reboot time
Check relevant system logs
Verify audit logs
Inspect systemd journal
Analyze kernel crash dumps
Review hardware logs
Check monitoring system alerts
Correlate clues from multiple sources

Let's discuss each step to shed light on why your server restarted unexpectedly.

Confirm When the Reboot Happened

First, we need to confirm the last reboot time. This acts as the starting point for our log analysis.

Use the who -b command to show the last system boot time:

$ who -b
system boot  2022-02-15 17:03

Alternatively, last reboot shows the detailed reboot history:

$ last reboot
reboot   system boot  5.4.0-1069-aws Tue Feb 15 17:03 - 17:05 (00:01)   
reboot   system boot  5.4.0-1069-aws Mon Feb 14 22:34 - 22:35 (00:00)

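This quickly tells you when the system booted, but not whether the previous shutdown was clean. To check that, run last -x, which also lists shutdown entries: a boot with no shutdown entry before it points to a crash or power loss.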

Pro Tip: I prefer using who -b for quick checks and last reboot for detailed analysis. You can also verify against syslog timestamps.

Okay, let's bookmark the reboot time. Now we know which time window to focus our log investigation on.
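To tell a crash from a clean restart at a glance, one option is to pipe last -x (which prints shutdown records as well as boots) through a small filter. The following is only a sketch: classify_boots is a hypothetical helper name, and it assumes the usual newest-first last output in which every clean stop leaves a shutdown line.

```shell
# classify_boots: hypothetical helper; reads `last -x` output (newest first) on stdin.
# A "reboot" record whose next-older record is another "reboot" had no
# shutdown logged before it, which usually means a crash or power loss.
classify_boots() {
  awk '
    $1 == "reboot" || $1 == "shutdown" {
      if (pending != "") {
        # The record just read is the one logged right before the pending boot.
        tag = ($1 == "shutdown") ? "clean" : "UNCLEAN"
        print tag "\t" pending
      }
      pending = ($1 == "reboot") ? $0 : ""
    }
    # The oldest boot in the file has no earlier record to judge it by.
    END { if (pending != "") print "unknown\t" pending }
  '
}
```

Run it as last -x | classify_boots. Boots tagged UNCLEAN are the ones worth investigating first.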

Analyze System Logs for Clues

Linux systems log extensive information about events, warnings and errors on the OS.

Checking relevant log files leading up to the reboot can reveal valuable clues. Let's examine the critical ones.

Review /var/log/messages on RHEL/CentOS

On RHEL and CentOS systems, the /var/log/messages file records major system events.

Use grep to filter for common reboot related keywords:

$ sudo grep -i "reboot\|shutdown" /var/log/messages

This will display any reboot or shutdown messages from the log file around your time frame.

For example:

Feb 14 22:34:01 server1 systemd: Reached target Shutdown.  
Feb 14 22:34:01 server1 systemd: Starting Shutdown.
Feb 14 22:34:01 server1 systemd: Reached target Final Step.
Feb 14 22:34:01 server1 systemd: System going down for reboot NOW!

These sequential messages indicate a clean OS shutdown was initiated. We're on the right track!

But also look for any error or warning events leading up to the reboot. They provide critical troubleshooting context.
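A wider keyword search helps surface those errors and warnings. Here is a minimal sketch; scan_reboot_clues is a hypothetical helper, and the keyword list is my own starting point rather than a complete set, so extend it for your environment.

```shell
# scan_reboot_clues: hypothetical helper; prints numbered lines that match
# common crash indicators. Reads the file given as $1, or stdin if none.
# The keyword list is illustrative, not exhaustive.
scan_reboot_clues() {
  grep -inE 'panic|oops|out of memory|oom-killer|watchdog|machine check|hung task|segfault|i/o error' "${1:--}"
}
```

Because the log is usually readable by root only, it is easiest to pipe it in: sudo cat /var/log/messages | scan_reboot_clues.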

Check /var/log/syslog on Debian/Ubuntu

On Debian or Ubuntu systems, the /var/log/syslog file records kernel, system and application logs.

Search for reboot clues like before:

$ sudo grep -i "reboot\|shutdown" /var/log/syslog

Focus on entries from the auth, daemon, kern and user syslog facilities.

For example:

Feb 15 17:03:12 server2 kernel: [ 34.061550] reboot: Power down  
Feb 15 17:03:13 server2 systemd[1]: systemd-logind.service: Succeeded.
Feb 15 17:03:13 server2 systemd[1]: Stopping User Manager for UID 1000...
Feb 15 17:03:13 server2 systemd[1]: Started Shutdown.

The "reboot: Power down" kernel message shows that the kernel was asked to power off (a restart logs "reboot: Restarting system" instead). The systemd entries show that services were stopped gracefully.

Again, also check for any warnings or failures near reboot time.

Pro Tip: I like to load syslog in a text editor and browse ± 30 minutes around the reboot. This quickly gives me context before diving deep.
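If the log is too large to open comfortably in an editor, the same window can be cut out on the command line. This is a rough sketch under stated assumptions: log_window is a hypothetical helper, it expects the traditional "Mmm dd HH:MM:SS" syslog timestamp, and it compares times as strings, so the window cannot cross midnight.

```shell
# log_window: hypothetical helper; prints lines from a syslog-format file
# that fall between two times on a single day.
# Usage: log_window FILE "Feb 15" 16:33:00 17:33:00
log_window() {
  awk -v day="$2" -v from="$3" -v to="$4" '
    # Characters 1-6 hold "Mmm dd", characters 8-15 hold "HH:MM:SS".
    substr($0, 1, 6) == day {
      t = substr($0, 8, 8)
      if (t >= from && t <= to) print
    }
  ' "$1"
}
```

For a reboot at 17:03 on Feb 15, log_window syslog-copy.txt "Feb 15" 16:33:00 17:33:00 returns the 30 minutes either side, where syslog-copy.txt stands for a readable copy of the log.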

Don't Forget Application Logs!

In addition to kernel and system logs, many applications maintain their own log files as well:

Docker – /var/lib/docker/containers/*/*-json.log
MySQL – /var/log/mysql/error.log
Apache – /var/log/httpd/error_log (RHEL/CentOS) or /var/log/apache2/error.log (Debian/Ubuntu)
Nginx – /var/log/nginx/error.log

And so on…check logs of all critical services on your system.

View the last few lines of a service log like this:

$ sudo tail -n50 /var/log/mysql/error.log

If a service crashed or stopped abruptly, it could have triggered the reboot.

Pro Tip: I like to use Logrotate to archive old logs, so I can dig into history if needed.

Did the Kernel Oops? Check dmesg Output

The dmesg command shows the kernel ring buffer – handy for hardware and driver issues.

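To view kernel errors logged since the current boot, try:

$ sudo dmesg --level=emerg,alert,crit,err

Keep in mind that the ring buffer starts fresh at boot, so this covers the current boot only. If the journal is persistent, journalctl -k -b -1 -p err shows the kernel errors from the boot that ended in the reboot. Any kernel errors flagged there could have led to a panic and reboot.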

For example, a bad CPU/memory could generate "Machine Check Exception" logs.

Scan Cron Logs for Automated Reboots

Sometimes a cron job may have unexpectedly rebooted the server.

Double-check for any evidence in the cron log (/var/log/cron on RHEL/CentOS; on Debian/Ubuntu, cron entries go to /var/log/syslog):

$ sudo grep -i reboot /var/log/cron

Look for jobs running close to the reboot time.

Pro Tip: I once had a backup script accidentally reboot a production database server! Always check cron.
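Besides the cron log, it can help to search the crontab files themselves for jobs that call a reboot command. The sketch below is illustrative only: find_reboot_jobs is a hypothetical helper, and its command list (reboot, shutdown, poweroff, halt) is an assumption that will miss variants such as init 6 or reboots issued inside a script.

```shell
# find_reboot_jobs: hypothetical helper; lists crontab lines whose command
# mentions reboot, shutdown, poweroff or halt.
# Comment and blank lines are skipped, and a leading "@reboot" is stripped
# first because it is a schedule ("run at boot"), not a reboot command.
find_reboot_jobs() {
  awk '
    /^[ \t]*#/ { next }
    /^[ \t]*$/ { next }
    {
      line = $0
      sub(/^[ \t]*@reboot[ \t]+/, "", line)
      if (line ~ /(^|[^A-Za-z0-9_-])(reboot|shutdown|poweroff|halt)([^A-Za-z0-9_-]|$)/)
        print FILENAME ":" FNR ": " $0
    }
  ' "$@"
}
```

A typical call would be find_reboot_jobs /etc/crontab /etc/cron.d/*. Per-user crontabs live elsewhere and normally need root to read.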

Spot Hardware Issues in BIOS & BMC Logs

Don't forget the server hardware logs from the BIOS and BMC for device issues:

$ sudo ipmitool sel list 
$ sudo ipmitool sel get {record_id}

And:

$ sudo dmidecode --type 16

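Hardware faults like RAM errors can initiate a reboot. Note that dmidecode --type 16 describes the physical memory array (capacity and error-correction type), which is inventory data rather than an event log; the fault records themselves are in the ipmitool SEL output.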

Pro Tip: Experience has taught me that what look like software issues can sometimes be hardware issues in disguise!

Correlate Warnings Across Logs

Don't view logs in isolation. Cross-correlate any warnings across application, system and hardware logs to identify common patterns.

For example, you may notice:

App logs showing crashes
System logs indicating memory pressure
BMC logs reporting correctable ECC memory errors

Together, these point to faulty RAM causing the crashes and the reboot.
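One way to spot such patterns is to interleave several logs into a single timeline. The following is a sketch, not a polished tool: merge_timeline is a hypothetical helper, it assumes every file uses the traditional "Mmm dd HH:MM:SS" syslog timestamp within the same year, and it relies on the -M (month sort) option found in GNU and BSD sort.

```shell
# merge_timeline: hypothetical helper; merges syslog-format files into one
# time-ordered stream and tags each line with the file it came from.
merge_timeline() {
  for f in "$@"; do
    # Keep the 15-character timestamp first so sort can key on it,
    # then insert the source file name before the rest of the line.
    awk -v src="$(basename "$f")" '
      { print substr($0, 1, 15) " [" src "] " substr($0, 17) }
    ' "$f"
  done | LC_ALL=C sort -s -k1,1M -k2,2n -k3,3
}
```

Reading the merged output top to bottom makes it easier to see, for example, memory warnings in one log arriving just before a service crash in another.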

Verify the Audit Trail

The audit daemon auditd creates detailed logs of security events and system calls.

Search with the ausearch tool to view recorded reboots:

$ sudo ausearch -i -m SYSTEM_BOOT,SYSTEM_SHUTDOWN | tail -4

A SYSTEM_SHUTDOWN followed by a SYSTEM_BOOT indicates a successful restart.

But two consecutive SYSTEM_BOOT events may signal an ungraceful halt:

----
type=SYSTEM_BOOT
msg=...
---- 

----
type=SYSTEM_BOOT
msg=...
----

No shutdown logged between the reboots likely means a crash.
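On a system with a long audit history, this check can be automated. Here is a small sketch; find_unclean_boots is a hypothetical helper, and it assumes ausearch prints records in chronological order with the type= field on each record line.

```shell
# find_unclean_boots: hypothetical helper; reads ausearch output on stdin
# and prints each SYSTEM_BOOT record that directly follows another
# SYSTEM_BOOT, i.e. with no SYSTEM_SHUTDOWN logged in between.
find_unclean_boots() {
  awk '
    /type=SYSTEM_SHUTDOWN/ { prev = "shutdown"; next }
    /type=SYSTEM_BOOT/ {
      if (prev == "boot") print "no shutdown logged before: " $0
      prev = "boot"
    }
  '
}
```

Use it as sudo ausearch -i -m SYSTEM_BOOT,SYSTEM_SHUTDOWN | find_unclean_boots. Each line it prints marks a boot that probably followed a crash or power loss.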

Pro Tip: Always check the audit logs, even if there are no suspicious security events. Every boot and shutdown creates an audit entry.

Inspect the systemd Journal

The systemd journal stores logs from system services and the kernel, grouped by boot.

First, list all recorded historical boots:

$ journalctl --list-boots

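This shows each boot's offset (0 for the current boot, -1 for the previous one), its boot ID, and the timestamps of its first and last entries. If only the current boot is listed, the journal is not persistent: check the Storage= setting in /etc/systemd/journald.conf and whether /var/log/journal exists.

Now view the logs from the boot that ended with the reboot. This is usually the previous boot, not the current one:

$ journalctl -b {reboot-id}

Replace {reboot-id} with the boot ID or the offset (for example -1) from the --list-boots output.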

For example:

$ journalctl -b b4d934a4090a40bda231fa9b4bac353a   

Feb 15 17:03:13 server1 systemd[1]: Reached target Shutdown.
Feb 15 17:03:13 server1 systemd[1]: Starting Shutdown.  
Feb 15 17:03:13 server1 kernel: registers:
Feb 15 17:03:13 server1 kernel: cpsr: 60000010

The stack or register dump indicates the kernel crashed!

Pro Tip: Always check the journal for detailed system messages leading up to a reboot.

Inspect any Kernel Crash Dumps

If the Linux kernel crashed hard, it may have left behind a memory crash dump.

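Configure Kernel Crash Dumping:

A dump is only captured if kdump was set up before the crash, so configure it now if it is not already in place.

Install and enable the kdump service (kexec-tools on RHEL/CentOS, kdump-tools on Debian/Ubuntu).
Reserve memory for the capture kernel with the crashkernel= boot parameter.
Set the dump location: the path option in /etc/kdump.conf on RHEL/CentOS, or KDUMP_COREDIR in /etc/default/kdump-tools on Debian/Ubuntu.

Now if the kernel crashes again, a dump file will be saved for analysis.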

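View the crash report using the crash utility. It normally needs the matching debug kernel image (vmlinux from the kernel debuginfo package) as well as the vmcore. On RHEL/CentOS:

$ crash /usr/lib/debug/lib/modules/{kernel-version}/vmlinux /var/crash/127.0.0.1-2022-02-15-17:13/vmcore

Replace {kernel-version} with the version of the kernel that crashed.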

The stack trace and exception details can reveal why the kernel crashed.

Pro Tip: Kernel crash analysis requires expert Linux skills. Try asking on community forums for help.

Check Monitoring Systems for Alerts

Did your monitoring system detect any alerts around the reboot time?

Nagios, Zabbix, Datadog etc. may have set off warnings.
Notifications about service failures.
Custom metric thresholds crossed.
Server offline alerts post-reboot.

Analyze these monitoring events – they provide quick clues to investigate further using log data.

Pro Tip: I highly recommend using Nagios, OpsGenie or similar tools to proactively track server availability.

Eliminate Causes Methodically

Okay, by now we likely have several potential reasons for the reboot from different logs and alerts.

How do we pinpoint the exact root cause? Follow these best practices:

Note down all clues and timestamps in a spreadsheet.
Identify common patterns across clues.
Start troubleshooting software issues first.
Reproduce the issue if needed to capture more data.
Don't forget to check hardware – BMC and BIOS logs.
Kernel crashes require expert diagnosis – seek help!
If all else fails, enable debug logging and retry.

Methodically eliminating possible causes is key. Don't leave any stone unturned!

Conclusion

As you can see, Linux provides many tools to help troubleshoot unexpected server reboots – ranging from kernel logs to user space utilities.

Carefully analysing the evidence around reboot time can uncover the smoking gun. Correlating clues from different sources helps identify patterns.

Following a thorough forensic methodology gives you the best chance to determine why a reboot occurred. It will help you take corrective actions and prevent future disruption.

I hope these troubleshooting tips help you become a syslog super sleuth! Let me know if you have any other creative ways to investigate reboot causes.

Happy reboot hunting!
