Operations¶
This section documents the operational procedures for maintaining the brennan.page homelab.
Overview¶
The homelab requires regular maintenance to ensure optimal performance, security, and reliability.
Maintenance Schedule¶
Daily Tasks 📅¶
Time: 5 minutes
Frequency: Every day
- System Health Check: Quick status verification
- Log Review: Check for critical errors
- Service Monitoring: Verify service availability
Weekly Tasks 📅¶
Time: 30 minutes
Frequency: Every Sunday
- System Updates: Apply security updates
- Backup Verification: Check backup integrity
- Performance Review: Monitor resource usage
Monthly Tasks 📅¶
Time: 1 hour
Frequency: First Sunday of month
- Security Audit: Review security settings
- Performance Optimization: Clean up resources
- Documentation Update: Update documentation
Quarterly Tasks 📅¶
Time: 2 hours
Frequency: Every quarter
- Major Updates: Apply major version updates
- Capacity Planning: Review resource usage
- Disaster Recovery: Test recovery procedures
Operational Procedures¶
Wiki Management¶
Wiki deployment, maintenance, and content management procedures.
Deployment¶
Service deployment and update procedures.
Backups¶
Backup and recovery procedures.
Monitoring¶
System and service monitoring.
Maintenance¶
Regular maintenance procedures.
Quick Commands¶
System Status¶
# Quick system health check
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
echo '=== System Status ==='
docker ps
echo -e '\n=== Resource Usage ==='
free -h
df -h
echo -e '\n=== Service Health ==='
curl -I https://brennan.page
"
Service Health¶
# Check critical services
curl -I https://docker.brennan.page
curl -I https://monitor.brennan.page
curl -I https://files.brennan.page
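The three checks above can be wrapped in a small loop that turns each HTTP status code into an UP or DOWN verdict. This is a minimal sketch: the function names `classify_status` and `check_services` are hypothetical, and treating any 2xx or 3xx response as healthy is an assumption, not a documented policy.

```bash
# Hypothetical helper: map an HTTP status code to UP or DOWN.
# Assumption: 2xx and 3xx responses count as healthy.
classify_status() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) echo "UP" ;;
        *) echo "DOWN" ;;
    esac
}

# Hypothetical helper: send a HEAD request to each URL given as an argument.
# curl reports 000 as the status code when the connection itself fails.
check_services() {
    for url in "$@"; do
        code=$(curl -s -o /dev/null -I -m 10 -w '%{http_code}' "$url")
        printf '%s %s (%s)\n' "$(classify_status "$code")" "$url" "$code"
    done
}

# Usage:
#   check_services https://docker.brennan.page https://monitor.brennan.page https://files.brennan.page
```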
Log Review¶
# Check for critical errors
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
docker logs --tail 20 caddy 2>&1 | grep -i error
docker logs --tail 20 postgres 2>&1 | grep -i error
journalctl -n 50 --no-pager | grep -i error
"
Emergency Procedures¶
Service Outage¶
- Assess Impact: Check system status
- Restart Services: docker compose restart
- Check Logs: Review error logs
- Escalate: Contact support if needed
Data Recovery¶
- Stop Services: docker compose down
- Restore Backup: Use backup procedures
- Verify Data: Check data integrity
- Start Services: docker compose up -d
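The recovery steps above can be sketched as a short script that stops at the first failing step. The `run` and `recover` functions are hypothetical, and the sketch assumes it is started from /opt/homelab, where `./scripts/restore.sh` lives. Data verification is left as a manual step because the procedure does not say how integrity is checked.

```bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical recovery sequence; each step runs only if the previous one succeeded.
# $1 is the backup archive to restore.
recover() {
    run docker compose down &&
    run ./scripts/restore.sh "$1" &&
    echo "Restore finished. Verify the data, then run: docker compose up -d"
}

# Usage:
#   DRY_RUN=1 recover backup_file.tar.gz   # preview the steps
#   recover backup_file.tar.gz             # perform the restore
```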
Getting Help¶
Before Contacting Support¶
- Checked system status
- Reviewed error logs
- Attempted basic restart
- Checked documentation
Information to Include¶
- System status output
- Error messages
- Recent changes
- Steps already taken
References¶
- Services - Service documentation
- Infrastructure - Infrastructure documentation
- Configuration - Configuration management
- Troubleshooting - Troubleshooting guides
System Updates¶
# Check for available updates
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
apt update
apt list --upgradable
"
# Review and apply updates
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
apt upgrade -y
docker system prune -f
"
Service Updates¶
# Update Docker images
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab/services
for service in */; do
echo \"Updating \$service\"
# Run in a subshell so a failed cd skips this service and the loop stays in place
(cd \"\$service\" && docker compose pull && docker compose up -d)
done
"
Backup Verification¶
# List backups and show archives older than 7 days (age check only)
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
ls -la /opt/homelab/backups/
find /opt/homelab/backups/ -name '*.tar.gz' -mtime +7 -exec ls -la {} \;
"
Monthly Tasks 📅¶
Time: 2 hours
Frequency: First Sunday of month
Security Audit¶
# Check security logs
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
echo '=== Failed Login Attempts ==='
grep 'Failed password' /var/log/auth.log | tail -20
echo -e '\n=== UFW Status ==='
ufw status numbered
echo -e '\n=== SSL Certificate Status ==='
echo | openssl s_client -connect brennan.page:443 -servername brennan.page 2>/dev/null | openssl x509 -noout -dates
"
Performance Review¶
# Check resource trends
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
echo '=== Memory Usage Trend ==='
free -h
echo -e '\n=== Disk Usage Trend ==='
df -h
echo -e '\n=== Docker Resource Usage ==='
docker stats --no-stream
"
Database Maintenance¶
# Database optimization
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
docker exec postgres psql -U homelab -d homelab -c 'VACUUM ANALYZE;'
docker exec postgres psql -U homelab -d vikunja -c 'VACUUM ANALYZE;'
docker exec postgres psql -U homelab -d hedgedoc -c 'VACUUM ANALYZE;'
docker exec postgres psql -U homelab -d linkding -c 'VACUUM ANALYZE;'
docker exec postgres psql -U homelab -d navidrome -c 'VACUUM ANALYZE;'
"
Operational Procedures¶
Service Management¶
Starting Services¶
# Start single service
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab/services/service_name
docker compose up -d
"
# Start all services
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
docker compose up -d
"
Stopping Services¶
# Stop single service
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab/services/service_name
docker compose down
"
# Stop all services (emergency only)
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
docker compose down
"
Restarting Services¶
# Restart single service
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab/services/service_name
docker compose restart
"
# Graceful restart of all services
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
docker compose restart
"
Backup Operations¶
Manual Backup¶
# Create full backup
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
./scripts/backup.sh
"
# Backup specific service
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
./scripts/backup-service.sh service_name
"
Restore Operations¶
# Restore from backup
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
./scripts/restore.sh backup_file.tar.gz
"
# Restore specific service
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
./scripts/restore-service.sh service_name backup_file.tar.gz
"
Monitoring Operations¶
Health Checks¶
# Check all services
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
cd /opt/homelab
./scripts/health-check.sh
"
# Check specific service
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
curl -f https://service.brennan.page || echo 'Service DOWN'
"
Performance Monitoring¶
# Real-time monitoring
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
docker stats
"
# Point-in-time snapshot (docker stats does not keep history)
ssh -i ~/.omg-lol-keys/id_ed25519 -T -o BatchMode=yes root@159.203.44.169 "
docker stats --no-stream --format 'table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}'
"
Incident Response¶
Incident Classification¶
Critical (P1)¶
- Service completely down
- Data corruption or loss
- Security breach
- System unavailable
High (P2)¶
- Service degradation
- Performance issues
- Partial functionality loss
- Backup failures
Medium (P3)¶
- Minor bugs
- UI issues
- Documentation errors
- Non-critical features
Low (P4)¶
- Cosmetic issues
- Typos
- Minor improvements
- Feature requests
Response Procedures¶
P1 - Critical Response¶
- Immediate Action (5 minutes)
- Stabilization (15 minutes)
- Communication (30 minutes)
  - Document incident
  - Update status page
  - Notify stakeholders
P2 - High Response¶
- Assessment (30 minutes)
- Resolution (2 hours)
P3 - Medium Response¶
- Investigation (4 hours)
  - Review logs
  - Test in staging
  - Plan fix
- Implementation (1 day)
  - Deploy fix
  - Test thoroughly
  - Update documentation
P4 - Low Response¶
- Planning (1 week)
  - Add to backlog
  - Prioritize
  - Schedule
- Implementation (2 weeks)
  - Implement during regular maintenance
  - Test and deploy
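The first time box of each priority level can be kept in a small lookup, for example to print a reminder when an incident is logged. A minimal sketch with a hypothetical function name; the durations are the ones listed in the procedures above.

```bash
# Hypothetical helper: print the first response window for a priority level.
first_response_window() {
    case "$1" in
        P1) echo "5 minutes" ;;
        P2) echo "30 minutes" ;;
        P3) echo "4 hours" ;;
        P4) echo "1 week" ;;
        *)  echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   first_response_window P2
```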
Operational Metrics¶
Key Performance Indicators¶
- Uptime: Target > 99.5%
- Response Time: Target < 2 seconds
- Backup Success: Target 100%
- Security Incidents: Target 0
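An uptime target is easier to track as a downtime budget. At 99.5%, a 30-day period allows 0.5% of 43,200 minutes, which is 216 minutes. A minimal sketch of that arithmetic, using a hypothetical function name:

```bash
# Hypothetical helper: minutes of downtime allowed by an uptime target.
# $1 = uptime target in percent, $2 = length of the period in days.
allowed_downtime_minutes() {
    awk -v target="$1" -v days="$2" 'BEGIN { printf "%.0f\n", days * 24 * 60 * (100 - target) / 100 }'
}

# Usage:
#   allowed_downtime_minutes 99.5 30
```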
Monitoring Dashboards¶
- System Overview: https://monitor.brennan.page
- Service Status: https://brennan.page
- Documentation: https://wiki.brennan.page
Reporting¶
- Daily: Health check summary
- Weekly: Performance report
- Monthly: Executive summary
- Quarterly: Strategic review
Operational Tools¶
Automation Scripts¶
- backup.sh: Automated backup procedures
- health-check.sh: Service health monitoring
- deploy-service.sh: Service deployment
- restore.sh: Disaster recovery
Monitoring Tools¶
- Enhanced Monitor: System monitoring
- Docker: Container monitoring
- Caddy: Web server logs
- PostgreSQL: Database monitoring
Management Tools¶
- Portainer: Docker management
- SSH: Remote management
- Git: Configuration management
- Wiki: Documentation
Operational Security¶
Access Control¶
- SSH Keys: Key-based authentication only
- User Accounts: Minimal user accounts
- Sudo: Limited sudo access
- Audit Trail: All actions logged
Security Procedures¶
- Password Management: Regular password rotation
- Certificate Management: Automated SSL renewal
- Firewall Rules: Regular review and updates
- Security Updates: Prompt security patching
Backup Security¶
- Encryption: Backup encryption
- Offsite: Offsite backup storage
- Testing: Regular backup testing
- Retention: Backup retention policy
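Regular backup testing can start with a readability check on each archive. The sketch below uses two hypothetical helpers, `verify_archive` and `verify_backups`. It only confirms that an archive decompresses and that its file listing can be read; it does not prove that the contents are complete, so a periodic test restore is still needed.

```bash
# Hypothetical helper: succeed only if the archive decompresses cleanly and
# its tar listing can be read in full.
verify_archive() {
    gzip -t "$1" 2>/dev/null && tar -tzf "$1" >/dev/null 2>&1
}

# Hypothetical helper: report every *.tar.gz under a directory as OK or CORRUPT.
verify_backups() {
    find "$1" -name '*.tar.gz' -type f | while IFS= read -r archive; do
        if verify_archive "$archive"; then
            echo "OK $archive"
        else
            echo "CORRUPT $archive"
        fi
    done
}

# Usage on the server:
#   verify_backups /opt/homelab/backups
```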
Operational Documentation¶
Required Documentation¶
- Runbooks: Step-by-step procedures
- Service Docs: Service-specific documentation
- Network Diagrams: Infrastructure documentation
- Contact Lists: Emergency contact information
Documentation Standards¶
- Version Control: All docs in Git
- Review Process: Regular doc reviews
- Accessibility: Easy to find and use
- Accuracy: Regular updates
Training and Knowledge¶
Operator Training¶
- System Overview: Understanding the architecture
- Service Management: Service operations
- Troubleshooting: Problem resolution
- Emergency Procedures: Incident response
Knowledge Sharing¶
- Wiki: Central knowledge base
- Runbooks: Operational procedures
- Best Practices: Lessons learned
- Incident Reviews: Post-incident analysis
References¶
- Services - Service documentation
- Infrastructure - Infrastructure documentation
- Troubleshooting - Troubleshooting guides
- Configuration - Configuration management