Site Reliability Engineer at Nathan Digital

  • Nairobi
  • Permanent
  • Full-time
  • 8 days ago

We are looking for a Site Reliability Engineer with 3–5 years of experience to ensure the reliability, performance, and scalability of our cloud infrastructure and services. This role is perfect for someone passionate about monitoring, automation, and building resilient, observable, and highly available systems.

What You’ll Do
  • Design, implement, and maintain CI/CD pipelines to deliver software reliably and efficiently.
  • Containerize applications using Docker and manage deployments on AWS (ECS, EC2, ALB).
  • Monitor system performance, create dashboards, configure alerts, and analyze logs to proactively identify and resolve issues.
  • Manage infrastructure for scalability, cost optimization, and high availability.
  • Lead incident response, conduct root cause analysis, and implement improvements to prevent future issues.
  • Automate operational workflows using Python and Bash to enhance efficiency and reliability.
  • Collaborate closely with developers to optimize deployment processes and application instrumentation.
  • Plan and execute disaster recovery strategies, including backups, failover mechanisms, and resilience testing.
What We’re Looking For
  • 3–5 years of experience in DevOps, Site Reliability, or cloud operations roles.
  • Strong AWS experience (ECS, EC2, ALB) and cloud infrastructure management.
  • Hands-on expertise with monitoring and observability tools (Prometheus, Grafana, Loki/ELK).
  • Experience building and maintaining CI/CD pipelines.
  • Proficiency with Docker and container orchestration.
  • Skilled in scripting and automation using Python and Bash.
  • Strong problem-solving skills and the ability to troubleshoot complex production issues.
Nice to Have
  • Experience with Infrastructure as Code (Terraform).
  • Exposure to Kubernetes (EKS) environments.
  • Familiarity with MongoDB Atlas operations.
  • Experience with cloud cost optimization and performance tuning.
What Success Looks Like
  • Systems are highly reliable, scalable, and easy to operate.
  • Clear visibility into system health and performance across all services.
  • Reduced incident frequency and faster recovery times.
  • Deployment and operational workflows are automated and efficient.
Method of Application
Interested and qualified candidates should apply using the Apply Now button below.

Myjobmag