**Job Title:** Site Reliability Engineer
**Department:** Tech Operations
**Location:** Tbilisi, Georgia
**Description:**
Intermedia is seeking a Site Reliability Engineer (SRE) to enhance service reliability and operational readiness, with a strong emphasis on metrics, alerting, and event management. The role involves building and maintaining monitoring platforms using Prometheus/VictoriaMetrics, integrating alerts and events with BigPanda, and participating in on-call rotations to ensure rapid incident response and continuous improvement across Windows and Linux environments.
**Key Responsibilities:**
* Build and operate metrics/monitoring platforms (Prometheus and/or VictoriaMetrics).
* Design and maintain an effective alerting strategy, including thresholds, anomaly detection, alert routing, deduplication, and noise reduction.
* Integrate monitoring, alerting, and events with BigPanda for correlation, enrichment, and incident workflow management.
* Create and maintain dashboards and operational visibility tools (e.g., Grafana).
* Develop and maintain runbooks, operational playbooks, and incident response procedures.
* Participate in on-call shifts, including triaging alerts, managing incidents, coordinating responses, and leading communication during outages.
* Conduct root-cause analysis and postmortems, and implement corrective/preventive actions.
* Improve service reliability through SLOs/SLIs, capacity planning, and automation to reduce toil.
* Support monitoring for core infrastructure and services on Windows and Linux.
* Collaborate with DevOps/Engineering teams to instrument applications and standardize telemetry (metrics, logs, traces).
**Skills, Knowledge, and Expertise:**
* Bachelor’s degree in Computer Science or a related field.
* Experience in SRE, Operations, or DevOps with production incident ownership.
* Hands-on experience with Prometheus and/or VictoriaMetrics.
* Experience integrating alerting/event pipelines with BigPanda or similar event correlation tools.
* Strong troubleshooting skills across Linux and Windows systems.
* Ability to build reliable alerting with minimal noise.
* Experience with Git-based workflows for monitoring-as-code and configuration management.
**Nice to have:**
* Grafana administration and dashboard design experience.
* Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry) experience.
* Automation skills (Python, PowerShell, Bash) and configuration management tools (Ansible).
* Experience with messaging/cache/proxy operations (RabbitMQ, Redis, Nginx).
* Experience with Windows clustering or HA environments.
* Experience defining SLOs/SLIs and operational KPIs.
* Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers).
* Experience with load balancing components (F5 LTM, F5 GTM).
* Experience with virtualization platforms (VMware, Hyper-V).
* Experience administering AWS or Azure tenants.
**On-call Expectations:**
* Participation in a rotating on-call schedule.
* Ownership of incident response, including triage, escalation, mitigation, and follow-up improvements.
* Commitment to improving monitoring quality to reduce alert fatigue and shorten MTTR.
**Seniority Level:** Associate
**Employment Type:** Full-time
**Job Function:** Engineering and Information Technology
**Industries:** IT Services and IT Consulting
