Automate Service Monitoring with a Bash Script and Systemd

Keeping your essential services running smoothly is crucial for any server. Ensuring that services like Nginx and MariaDB are always up and running can be a daunting task. In this article, we’ll explore a robust solution using a bash script combined with systemd to automate service monitoring and restarting, ensuring your services remain operational with minimal intervention.

Why Automate Service Monitoring?

Manual monitoring and restarting of services can be time-consuming and prone to human error. Automation not only ensures consistent uptime but also frees you up to focus on more critical tasks. By leveraging a bash script and systemd, you can create a reliable mechanism to monitor and restart services automatically.

Prerequisites

Before diving into the setup, ensure you have the following:

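  1. A server running Debian 12 or a similar systemd-based Linux distribution.
  2. Basic knowledge of bash scripting.
  3. Root access, since the scripts install packages, write to /etc/systemd/system, and restart services.
  4. A SendGrid API key and an OpenAI API key for the notification and log-summary features.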

The Bash Script

Our script monitors specified services, restarts them if they fail, and sends email notifications using SendGrid. It also integrates with the OpenAI API to provide summarized log insights and suggested solutions.
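
Before the full listing, the core check-and-restart step can be sketched in a few lines. This is a minimal sketch rather than the script itself: `ensure_running` and the `SYSTEMCTL` override are hypothetical names introduced here so the logic can be dry-run against a stub, while `systemctl is-active --quiet` and `systemctl restart` are the same calls the full script relies on.

```bash
#!/bin/bash

# Hypothetical override: point SYSTEMCTL at a stub function to dry-run the logic.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Restart a service only if systemd reports it as not active.
ensure_running() {
    local service=$1
    if "$SYSTEMCTL" is-active --quiet "$service"; then
        echo "$service: active"
    else
        echo "$service: inactive, restarting"
        "$SYSTEMCTL" restart "$service"
    fi
}
```

Run as root, `ensure_running nginx` would restart Nginx only when it is down. The full script builds logging, retries, and notifications around this same check.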

Here’s the complete bash script:

#!/bin/bash

# Configuration
SENDGRID_API_KEY="YOUR_SENDGRID_API_KEY"
EMAIL_FROM="YOUR_SENDER_EMAIL_ADDRESS"
EMAIL_TO="YOUR_RECIPIENT_EMAIL_ADDRESS"
CHECK_INTERVAL=60  # in seconds
LOG_FILE="/var/log/service_monitor.log"
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"

# Function to log messages
log_message() {
    local message=$1
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $message" >> "$LOG_FILE"
}

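# Function to sanitize content for the HTML email body.
# jq already JSON-escapes every --arg value, so only HTML escaping is needed here.
sanitize_content() {
    local content=$1
    printf '%s' "$content" | python3 -c 'import html, sys; sys.stdout.write(html.escape(sys.stdin.read()))'
}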

# Function to get a summary and suggested solution from OpenAI
get_openai_summary() {
    local log_content=$1
    # $'\n\n' yields real newlines; inside double quotes, \n would stay a literal backslash-n
    local prompt="Summarize the following error log and suggest a solution:"$'\n\n'"$log_content"

    local payload=$(jq -n \
        --arg model "gpt-3.5-turbo" \
        --arg role1 "system" \
        --arg content1 "You are an assistant that summarizes error logs and provides suggested solutions." \
        --arg role2 "user" \
        --arg content2 "$prompt" \
        '{model: $model, messages: [{role: $role1, content: $content1}, {role: $role2, content: $content2}]}'
    )

    log_message "OpenAI payload: $payload"

    local response=$(curl --silent --request POST \
        --url https://api.openai.com/v1/chat/completions \
        --header "Content-Type: application/json" \
        --header "Authorization: Bearer $OPENAI_API_KEY" \
        --data "$payload")

    log_message "OpenAI response: $response"

    if [[ "$response" == *"choices"* ]]; then
        local summary=$(echo "$response" | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])' 2>> $LOG_FILE)
        echo "$summary"
    else
        echo "Error: Failed to get a valid response from OpenAI."
        log_message "Error: Failed to get a valid response from OpenAI. Response: $response"
    fi
}

# Function to send an email using SendGrid
send_email() {
    local subject=$1
    local body=$2

    log_message "Sending email with subject: $subject"

    local payload=$(jq -n \
        --arg email_to "$EMAIL_TO" \
        --arg email_from "$EMAIL_FROM" \
        --arg subject "$subject" \
        --arg body "$body" \
        '{personalizations: [{to: [{email: $email_to}]}], from: {email: $email_from}, subject: $subject, content: [{type: "text/html", value: $body}]}'    
    )

    log_message "Email payload: $payload"

    local response=$(curl --silent --request POST \
      --url https://api.sendgrid.com/v3/mail/send \
      --header "Authorization: Bearer $SENDGRID_API_KEY" \
      --header 'Content-Type: application/json' \
      --data "$payload")

    log_message "SendGrid response: $response"
}

# Function to restart service and send emails based on the outcome
check_and_restart_service() {
    local service=$1
    local log_content dmesg_content sanitized_log_content sanitized_dmesg_content
    local attempt=0
    local max_attempts=3
    local wait_times=(30 60 80)

    while [ $attempt -lt $max_attempts ]; do
        if ! systemctl is-active --quiet "$service"; then
            log_content=$(systemctl status "$service" --no-pager | head -c 1000)
            dmesg_content=$(dmesg | tail -n 10 | head -c 1000)
            sanitized_log_content=$(sanitize_content "$log_content")
            sanitized_dmesg_content=$(sanitize_content "$dmesg_content")

            log_message "$service is not active. Attempting to restart (Attempt $((attempt + 1)))..."

            local openai_summary=$(get_openai_summary "$sanitized_log_content")

            local email_body="<html>
            <body>
                <h2>The $service service is not active</h2>
                <p><strong>Service Log:</strong></p>
                <pre>$sanitized_log_content</pre>
                <p><strong>dmesg Output:</strong></p>
                <pre>$sanitized_dmesg_content</pre>
                <p><strong>Summary and Suggested Solution:</strong></p>
                <pre>$openai_summary</pre>
            </body>
            </html>"

            send_email "$service Service Failed" "$email_body"

            systemctl restart "$service"
            sleep ${wait_times[$attempt]}
            attempt=$((attempt + 1))

            if systemctl is-active --quiet $service; then
                log_message "$service has been restarted successfully after attempt $attempt."

                local restored_body="<html>
                <body>
                    <h2>The $service service has been restored</h2>
                </body>
                </html>"

                send_email "$service Service Restored" "$restored_body"
                break
            fi
        else
            log_message "$service is running normally."
            break
        fi
    done

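    # Report failure only if the service is still down after the final attempt;
    # a restart that succeeds on the last attempt also leaves attempt equal to max_attempts
    if [ "$attempt" -eq "$max_attempts" ] && ! systemctl is-active --quiet "$service"; then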
        log_message "Failed to restart $service after $max_attempts attempts."

        local failed_body="<html>
        <body>
            <h2>Failed to restart $service after $max_attempts attempts</h2>
            <p><strong>Service Log:</strong></p>
            <pre>$sanitized_log_content</pre>
            <p><strong>dmesg Output:</strong></p>
            <pre>$sanitized_dmesg_content</pre>
        </body>
        </html>"

        send_email "$service Failed to Restart" "$failed_body"
    fi
}

# Main monitoring loop
monitor_services() {
    sleep 60  # Wait for 1 minute after reboot before starting checks
    while true; do
        check_and_restart_service "nginx"
        check_and_restart_service "mariadb"
        sleep $CHECK_INTERVAL  # Wait before starting checks again
    done
}

# Test email function
test_email() {
    send_email "Test Email" "<html><body><h2>This is a test email to verify SendGrid settings.</h2></body></html>"
    echo "Test email sent."
}

# Check command line argument
case "$1" in
    start)
        monitor_services
        ;;
    test-email)
        test_email
        ;;
    *)
        echo "Usage: $0 {start|test-email}" >&2
        exit 1
        ;;
esac
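
The retry logic inside check_and_restart_service can also be looked at in isolation. The following is a minimal sketch, assuming the check and restart steps are passed in as command names; `retry_restart` is a hypothetical helper that is not part of the script above. It reports success or failure through its return code, so the caller does not need to compare attempt counters.

```bash
#!/bin/bash

# Hypothetical helper: run restart_cmd, wait, then run check_cmd,
# making one attempt per wait time given on the command line.
# Returns 0 as soon as check_cmd succeeds, 1 if every attempt fails.
retry_restart() {
    local check_cmd=$1 restart_cmd=$2
    shift 2
    local wait_time attempt=0
    for wait_time in "$@"; do
        attempt=$((attempt + 1))
        "$restart_cmd"
        sleep "$wait_time"
        if "$check_cmd"; then
            echo "recovered after attempt $attempt"
            return 0
        fi
    done
    echo "still down after $attempt attempts"
    return 1
}
```

With two small wrappers such as `check_nginx() { systemctl is-active --quiet nginx; }` and `restart_nginx() { systemctl restart nginx; }`, the call `retry_restart check_nginx restart_nginx 30 60 80` would mirror the wait times used in the script.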

Setting Up the Script as a Service

To ensure our monitoring script runs continuously and restarts automatically if it fails, we’ll set it up as a systemd service. This way, it starts on boot and systemd restarts it if it exits unexpectedly. The setup script below also installs the packages the monitor depends on.

Create the following setup script:

#!/bin/bash

# Configuration
SERVICE_MONITOR_SCRIPT="/path/to/your/service_monitor.sh"
SERVICE_MONITOR_DEST="/usr/local/bin/service_monitor.sh"
SERVICE_NAME="service_monitor"
SERVICE_FILE="/etc/systemd/system/${SERVICE_NAME}.service"

# Function to install necessary packages
install_packages() {
    echo "Installing necessary packages..."

    # Update the package list
    apt-get update

    # Install required packages: curl and jq for the API calls,
    # python3 for parsing responses and escaping log content
    apt-get install -y curl python3 systemd jq

    echo "Packages installed successfully."
}

# Function to setup the service monitor
setup_service_monitor() {
    echo "Setting up Service Monitor..."

    # Copy the service monitor script to the destination
    cp $SERVICE_MONITOR_SCRIPT $SERVICE_MONITOR_DEST
    chmod +x $SERVICE_MONITOR_DEST

    # Create the systemd service file
    cat <<EOL > $SERVICE_FILE
[Unit]
Description=Service Monitor for Nginx and MariaDB
After=network.target

[Service]
ExecStart=$SERVICE_MONITOR_DEST start
Restart=always
User=root

[Install]
WantedBy=multi-user.target
EOL

    # Reload systemd, enable and start the service
    systemctl daemon-reload
    systemctl enable $SERVICE_NAME
    systemctl start $SERVICE_NAME

    echo "Service Monitor has been set up and started."
}

# Function to remove the service monitor
remove_service_monitor() {
    echo "Removing Service Monitor..."

    # Stop and disable the service
    systemctl stop $SERVICE_NAME
    systemctl disable $SERVICE_NAME

    # Remove the script and service file
    rm -f $SERVICE_MONITOR_DEST
    rm -f $SERVICE_FILE

    # Reload systemd
    systemctl daemon-reload

    echo "Service Monitor has been removed."
}

# Check command line argument
if [ "$1" == "setup" ]; then
    install_packages
    setup_service_monitor
elif [ "$1" == "remove" ]; then
    remove_service_monitor
elif [ "$1" == "test-email" ]; then
    $SERVICE_MONITOR_DEST test-email
else
    echo "Usage: $0 {setup|remove|test-email}"
fi
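
The unit file that the setup script writes can be previewed before anything is installed. The sketch below prints it to stdout instead of writing to /etc/systemd/system. `render_unit` is a hypothetical helper, and `RestartSec=10` is an assumption added here (the unit above does not set it) so that systemd pauses briefly between restarts of the monitor.

```bash
#!/bin/bash

# Hypothetical helper: print a unit file for the monitor to stdout.
# RestartSec and its 10-second value are assumptions added for illustration.
render_unit() {
    local script_path=$1
    cat <<EOF
[Unit]
Description=Service Monitor for Nginx and MariaDB
After=network.target

[Service]
ExecStart=$script_path start
Restart=always
RestartSec=10
User=root

[Install]
WantedBy=multi-user.target
EOF
}
```

Running `render_unit /usr/local/bin/service_monitor.sh` lets you review the result first. Only the `ExecStart` line depends on the argument.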

Setting Up the Service

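  1. Save the Scripts: Save the monitoring script as service_monitor.sh and the setup script as setup_service_monitor.sh.
  2. Fill In the Configuration: Set the API keys and email addresses at the top of service_monitor.sh, and point SERVICE_MONITOR_SCRIPT in setup_service_monitor.sh at the saved service_monitor.sh.
  3. Make Scripts Executable: Run chmod +x service_monitor.sh setup_service_monitor.sh.
  4. Run the Setup Script: Execute ./setup_service_monitor.sh setup as root to install the packages and set up the service.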

Monitoring and Testing

To ensure everything is working correctly, you can test the email functionality by running:

./setup_service_monitor.sh test-email

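This should send a test email to verify that your SendGrid settings are configured correctly. The command calls the installed copy of the monitoring script, so run it after the setup step. If nothing arrives, check the SendGrid response that send_email writes to /var/log/service_monitor.log.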
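
Beyond the email test, the log file shows how often the monitor has had to intervene. The sketch below counts logged restart attempts for one service. `count_restart_attempts` is a hypothetical helper, and it assumes the log lines keep the "is not active. Attempting to restart" wording that the monitoring script passes to log_message.

```bash
#!/bin/bash

# Hypothetical helper: count the restart attempts logged for one service.
# Assumes the message format written by the monitoring script's log_message calls.
count_restart_attempts() {
    local service=$1
    local log_file=${2:-/var/log/service_monitor.log}
    # -F matches the text literally; -- keeps the leading "-" from being read as an option
    grep -c -F -- "- $service is not active. Attempting to restart" "$log_file"
}
```

For example, `count_restart_attempts nginx` reads the default log path, and a second argument points the helper at a different file.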

Conclusion

You can expand the script to include additional services.
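
One way to do that, sketched here under the assumption that every service gets the same treatment, is to keep the names in an array and loop over them. `SERVICES` and `for_each_service` are hypothetical names, and `redis-server` is only an example entry.

```bash
#!/bin/bash

# Hypothetical list; replace the names with the units you want to watch.
SERVICES=(nginx mariadb redis-server)

# Run a check function once for every configured service.
for_each_service() {
    local check_fn=$1
    local service
    for service in "${SERVICES[@]}"; do
        "$check_fn" "$service"
    done
}
```

Inside monitor_services, the two hard-coded calls could then become `for_each_service check_and_restart_service`, so adding a service means editing only the array.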

Automating service monitoring and restarting using a bash script and systemd can greatly enhance the reliability and uptime of your critical services. This method ensures consistent service availability and offers detailed logs and email notifications for improved tracking and troubleshooting.

Have any suggestions or feedback? We’d love to hear from you! Please share your thoughts in the comments below.
