
Staff Site Reliability Engineer at Wikimedia Foundation
- Kenya
- Permanent
- Full-time
- Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
- Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
- Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
- Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
- Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
- Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.
- 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
- Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
- Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
- Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
- Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).
- Strong English communication skills and comfort working asynchronously across global teams.
Jobs in Kenya