➡️ Apply here: Senior Site Reliability Engineer

We are seeking a Senior Site Reliability Engineer to join our team. You will bring technical expertise and a proactive approach to independently manage and optimize our production environments. In this role, you will focus on maintaining system reliability, scaling infrastructure, and supporting continuous integration and deployment processes. If you have a strong background in Kubernetes and monitoring tools, we encourage you to apply and contribute to our clients’ success.

Responsibilities:
Manage and maintain Kubernetes clusters including deployment, scaling, and troubleshooting
Develop and optimize Jenkins CI/CD pipelines
Implement and utilize Instana for observability and monitoring
Manage the ELK stack for log management, alerting, and dashboard creation
Provide production support including incident management and root cause analysis
Tune performance to enhance system reliability and availability
Ensure adherence to site reliability engineering best practices
Work independently as the sole SRE/DevOps specialist in the team
Collaborate with development and operations teams to improve system performance
Automate deployment and monitoring processes where possible
Monitor system health and respond to alerts promptly
Document processes and share knowledge with team members
Continuously evaluate tools and technologies to improve operational efficiency
Participate in on-call rotation to support production systems

Requirements:
3+ years of strong hands-on experience with Kubernetes, including deployment, troubleshooting, scaling, and monitoring
Proficiency in Jenkins for CI/CD pipeline development and optimization
Experience with Instana for observability, tracing, and monitoring
Background in using the ELK stack for log management, alerting, and dashboarding
Solid application production support skills including incident management and root cause analysis
Strong understanding of site reliability engineering principles including reliability, availability, monitoring, and observability
Ability to work independently as the only site reliability engineer or DevOps specialist on a team
Experience with performance tuning in production environments
Strong written and verbal English communication skills (B2+)

Nice to have:
Experience with Amazon Web Services infrastructure setup and service integration
Knowledge of Terraform for infrastructure as code
Skills in Helm charts, templating, and deployment automation
Proficiency in scripting languages such as Python, Bash, or Groovy
Familiarity with Apache Kafka operations, monitoring, and troubleshooting

EPAM Systems is hiring a Senior Site Reliability Engineer
