Cloud Monitoring, Alerting, and Observability for Compute Engine on GCP

Timeline: December 2025
Role: Cloud Engineer / Site Reliability Engineer
Skills: Google Cloud Monitoring, Cloud Logging, Uptime Checks, Alerting Policies, Dashboards, Compute Engine, Ops Agent, Observability

Project Summary

This project focused on implementing observability and operational monitoring for a Compute Engine virtual machine on Google Cloud Platform. The work involved provisioning a Linux VM, installing Apache, enabling metrics and log collection through the Google Cloud Ops Agent, configuring uptime checks, defining alerting policies, and building a custom monitoring dashboard.

The implementation demonstrated how to move from simple VM deployment to an observable and operationally monitored workload, using Google Cloud’s native monitoring and logging services to track health, availability, performance, and incidents.

Objectives

Deploy a Compute Engine VM running Apache
Install the Google Cloud Ops Agent for metrics and logs collection
Create an uptime check for external service availability
Configure an alerting policy for network traffic thresholds
Build a custom dashboard with performance charts
Inspect logs to validate VM lifecycle events and service health

Architecture Overview

The architecture consisted of:

A Compute Engine VM (lamp-1-vm) running Debian and Apache
The Google Cloud Ops Agent installed on the VM for telemetry collection
Cloud Monitoring collecting VM and agent-based metrics
Cloud Logging ingesting system and instance logs
An uptime check validating HTTP reachability over the VM’s external IP
An alerting policy configured on inbound traffic thresholds
A custom dashboard displaying CPU load and received packets

Architecture Diagram

Implementation & Highlights

1. Compute Engine and Web Service Setup

Created a Compute Engine instance named lamp-1-vm
Configured the VM with HTTP firewall access
Installed and started Apache HTTP Server
Verified successful service response through the instance’s external IP address
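
The setup steps above could be scripted roughly as follows. This is a minimal sketch, not the commands actually used in the project: the zone, machine type, image family, and the http-server network tag are assumptions, and the helper name build_create_command is hypothetical.

```python
import shlex

# Assumed values; the write-up only names the instance (lamp-1-vm) and the OS (Debian).
INSTANCE = "lamp-1-vm"
ZONE = "us-central1-a"      # assumption
MACHINE_TYPE = "e2-medium"  # assumption


def build_create_command(instance: str, zone: str, machine_type: str) -> list[str]:
    """Return the gcloud argv that creates a Debian VM tagged for HTTP traffic."""
    return [
        "gcloud", "compute", "instances", "create", instance,
        f"--zone={zone}",
        f"--machine-type={machine_type}",
        "--image-family=debian-12",    # assumption: any current Debian image would do
        "--image-project=debian-cloud",
        # Tag used by the console's "Allow HTTP traffic" option; a firewall
        # rule that targets this tag must exist for port 80 to be reachable.
        "--tags=http-server",
    ]


# Commands to run on the VM itself (for example over SSH) to install and start Apache.
APACHE_SETUP = [
    "sudo apt-get update",
    "sudo apt-get install -y apache2",
    "sudo systemctl enable --now apache2",
]

if __name__ == "__main__":
    print(shlex.join(build_create_command(INSTANCE, ZONE, MACHINE_TYPE)))
    for line in APACHE_SETUP:
        print(line)
```

Running the script only prints the commands, so they can be reviewed before being executed by hand.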

2. Monitoring and Logging Agent Installation

Installed the Google Cloud Ops Agent on the VM
Enabled collection of:
- system metrics
- infrastructure metrics
- VM and service logs
Prepared the instance for deeper observability beyond default platform telemetry
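
As an illustration of the agent installation, the sketch below wraps Google's published Ops Agent install script in a small Python helper meant to run on the VM. The function name install_ops_agent and the injectable run parameter are hypothetical, and the final check assumes the google-cloud-ops-agent systemd unit reports "active" once the agent is up.

```python
import subprocess

# Install steps from Google's Ops Agent installation instructions.
INSTALL_STEPS = [
    ["curl", "-sSO",
     "https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh"],
    ["sudo", "bash", "add-google-cloud-ops-agent-repo.sh", "--also-install"],
]

# Assumption: the umbrella unit name is google-cloud-ops-agent.
STATUS_CHECK = ["systemctl", "is-active", "google-cloud-ops-agent"]


def install_ops_agent(run=subprocess.run) -> bool:
    """Run the install steps, then report whether the agent unit is active.

    `run` is injectable so the flow can be exercised without touching a real VM.
    """
    for step in INSTALL_STEPS:
        run(step, check=True)  # raises CalledProcessError if a step fails
    result = run(STATUS_CHECK, capture_output=True, text=True)
    return result.stdout.strip() == "active"
```

Passing a fake run function makes it possible to dry-run the sequence before trying it on the instance.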

3. Uptime Check Configuration

Created an HTTP uptime check targeting the VM’s external IP
Configured frequent health checks to validate service accessibility
Used the uptime check to validate availability from outside the VM, as an external client would see it
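
An uptime check of this kind can be described as a request body for the Cloud Monitoring API (projects.uptimeCheckConfigs.create). The sketch below builds such a body; the project ID, the placeholder IP address, the display name, and the 60-second period and 10-second timeout are all assumptions, since the write-up does not state the values used.

```python
import json


def build_uptime_check(project_id: str, host: str) -> dict:
    """Return an UptimeCheckConfig body for a plain HTTP check against `host`."""
    return {
        "displayName": "lamp-1-vm HTTP uptime check",  # assumed name
        "monitoredResource": {
            "type": "uptime_url",
            "labels": {"project_id": project_id, "host": host},
        },
        "httpCheck": {"path": "/", "port": 80, "useSsl": False},
        "period": "60s",   # assumption: the shortest supported interval
        "timeout": "10s",  # assumption
    }


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address standing in for the VM's external IP.
    print(json.dumps(build_uptime_check("my-project", "203.0.113.10"), indent=2))
```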

4. Alerting Policy Setup

Created an alerting policy based on network traffic metrics
Defined a threshold and retest window
Attached an email notification channel
Added alert documentation to support operational response and incident handling
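
The policy described above maps onto an AlertPolicy body in the Cloud Monitoring API. The following sketch is illustrative only: the metric type, the threshold of 500 bytes per second, the 60-second duration, and the documentation text are assumptions, and sustained_breach is a hypothetical helper that mimics how a retest window requires the threshold to be exceeded for the whole window before the condition fires.

```python
def build_alert_policy(channel_name: str) -> dict:
    """Return an AlertPolicy body with one threshold condition on inbound traffic."""
    return {
        "displayName": "Inbound network traffic alert",  # assumed name
        "combiner": "OR",
        "conditions": [{
            "displayName": "Received bytes above threshold",
            "conditionThreshold": {
                # Assumption: the built-in Compute Engine received-bytes metric.
                "filter": (
                    'resource.type = "gce_instance" AND metric.type = '
                    '"compute.googleapis.com/instance/network/received_bytes_count"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 500,   # assumption: bytes per second after ALIGN_RATE
                "duration": "60s",       # retest window (assumed length)
                "aggregations": [
                    {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE"}
                ],
            },
        }],
        "notificationChannels": [channel_name],
        "documentation": {
            "content": "Inbound traffic on lamp-1-vm exceeded the threshold.",
            "mimeType": "text/markdown",
        },
    }


def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    """True only if each of the last `window` samples is above `threshold`."""
    if window <= 0 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])
```

The channel name is expected in the form projects/PROJECT_ID/notificationChannels/CHANNEL_ID, which is how the API refers to an existing email channel.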

5. Custom Dashboard Creation

Built a custom dashboard for workload visibility
Added charts for:
- CPU Load
- Received Packets
Centralized key VM performance metrics in a reusable observability view
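
A dashboard with these two charts can also be expressed as a Dashboard body for the Cloud Monitoring Dashboards API. The sketch below assumes the CPU load chart uses the Ops Agent metric agent.googleapis.com/cpu/load_1m and the packets chart uses compute.googleapis.com/instance/network/received_packets_count; the write-up does not name the exact metrics, and the helper names and dashboard title are hypothetical.

```python
def line_chart(title: str, metric_type: str, aligner: str) -> dict:
    """Return one XY chart widget plotting `metric_type` for Compute Engine VMs."""
    return {
        "title": title,
        "xyChart": {
            "dataSets": [{
                "plotType": "LINE",
                "timeSeriesQuery": {
                    "timeSeriesFilter": {
                        "filter": (
                            f'metric.type = "{metric_type}" AND '
                            'resource.type = "gce_instance"'
                        ),
                        "aggregation": {
                            "alignmentPeriod": "60s",
                            "perSeriesAligner": aligner,
                        },
                    }
                },
            }]
        },
    }


def build_dashboard() -> dict:
    """Return a two-column grid dashboard with the CPU load and packets charts."""
    return {
        "displayName": "lamp-1-vm overview",  # assumed name
        "gridLayout": {
            "columns": 2,
            "widgets": [
                # Gauge metric from the Ops Agent, so a mean per period is suitable.
                line_chart("CPU Load", "agent.googleapis.com/cpu/load_1m", "ALIGN_MEAN"),
                # Delta counter, converted to a per-second rate.
                line_chart(
                    "Received Packets",
                    "compute.googleapis.com/instance/network/received_packets_count",
                    "ALIGN_RATE",
                ),
            ],
        },
    }
```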

6. Log Exploration and Operational Validation

Used Logs Explorer to inspect instance logs for lamp-1-vm
Observed operational events such as:
- service activity
- VM stop/start events
Correlated infrastructure actions with monitoring and logging outputs
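
A query along the following lines could be pasted into Logs Explorer to isolate stop and start events for the instance. It is a sketch under the assumption that these events appear as Compute Engine audit log entries whose method names contain compute.instances.stop and compute.instances.start; the helper name lifecycle_filter is hypothetical.

```python
def lifecycle_filter(instance_name: str) -> str:
    """Build a Logging query matching stop/start audit entries for one VM."""
    clauses = [
        'resource.type="gce_instance"',
        # Audit entries carry the full resource path; match on its tail.
        f'protoPayload.resourceName:"instances/{instance_name}"',
        '(protoPayload.methodName:"compute.instances.stop" OR '
        'protoPayload.methodName:"compute.instances.start")',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(lifecycle_filter("lamp-1-vm"))
```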

7. Incident and Availability Review

Restarted the VM to observe:
- temporary uptime check failures
- recovery state changes
- alerting behavior
Verified that monitoring, logging, and alerting reflected service interruptions and recovery events accurately
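
To make the review concrete, the sketch below shows one way to turn a series of uptime check results into outage intervals. The data is invented for illustration and does not come from the project; find_outages is a hypothetical helper, not part of any Google Cloud library.

```python
from datetime import datetime, timedelta
from typing import Optional


def find_outages(
    results: list[tuple[datetime, bool]],
) -> list[tuple[datetime, Optional[datetime]]]:
    """Collapse time-ordered (timestamp, passed) results into outage intervals.

    Each interval runs from the first failed check to the first passing check
    after it. The end is None if checks were still failing at the last result.
    """
    outages = []
    start = None
    for timestamp, passed in results:
        if not passed and start is None:
            start = timestamp                    # outage begins
        elif passed and start is not None:
            outages.append((start, timestamp))   # recovery observed
            start = None
    if start is not None:
        outages.append((start, None))            # still down at end of data
    return outages


if __name__ == "__main__":
    t0 = datetime(2025, 12, 1, 12, 0)  # invented sample data
    minute = timedelta(minutes=1)
    sample = [(t0 + i * minute, ok) for i, ok in enumerate([True, False, False, True])]
    for begin, end in find_outages(sample):
        print(begin, "->", end)
```

With one-minute checks, the measured outage length is only accurate to within one check interval on either side.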

Design Decisions

Used Apache on Compute Engine as a simple monitored workload
Installed the Ops Agent to extend observability with metrics and logs collection
Used uptime checks to validate service health from outside the VM
Added alerting to move from passive monitoring to active operational response
Created a dashboard to support quick performance visibility and ongoing monitoring workflows

Results & Impact

Successfully implemented a basic observability stack for a cloud-hosted VM
Demonstrated practical use of:
- metrics collection
- logging
- uptime monitoring
- alerting
- dashboards
Strengthened operational understanding of how infrastructure changes surface in monitoring systems
Built a foundation for production-style monitoring and incident response on GCP.

Tools & Technologies Used

Compute Engine – VM hosting
Apache HTTP Server – Web workload
Google Cloud Monitoring – Metrics and uptime monitoring
Google Cloud Logging – Log ingestion and exploration
Google Cloud Ops Agent – Metrics and logging collection
Alerting Policies – Threshold-based notifications
Custom Dashboards – Centralized observability views

Outcome

This project demonstrates the ability to implement monitoring, logging, and alerting for a cloud-hosted workload on Google Cloud. It highlights practical skills in observability, service health validation, incident awareness, and dashboard-driven operations, which are directly relevant to cloud engineering and site reliability roles.


Felix Otieno Arogo