Senior Operations Engineer (Acting Lead) | Production Support | SRE | DevOps Operations

Naga Durga Prasad Talla

Senior Operations Engineer | Site Reliability | Production Support | DevOps Operations

“Ensuring 24×7 production stability, observability, and operational excellence.”

About Me

Senior Operations and Production Support professional with 8+ years of hands-on experience ensuring high availability, incident resilience, and service continuity across telecom and gaming environments. Proven track record in managing critical production incidents, driving observability maturity, and enabling faster root cause resolution through proactive monitoring frameworks. Currently leading a 12-member operations team, collaborating with cross-functional engineering units to stabilize complex production ecosystems, strengthen CI/CD operational practices, and deliver consistent service reliability under business-critical workloads.

Core Skills

IT Service Management

Incident Management

Problem Management

Change Management

Production Support

Release Management

Monitoring & Observability

DevOps Operations

Team Leadership

Root Cause Analysis

Infrastructure Operations

Cross-team Coordination

Tech Stack

Monitoring & Observability

Prometheus, Grafana, Loki, Dynatrace, Instana, Nagios, OpsGenie, Splunk

DevOps

Jenkins, Git, Docker, Kubernetes, Helm

Cloud

AWS, GCP

Tools

ServiceNow, JIRA, Rancher, PagerDuty, Postman, Salesforce

Databases

SQL, Oracle

Professional Experience

Senior Support Engineer (Acting Lead)

May 2023 – Present

Qvantel – Hyderabad

  • Leading a 12-member production operations team
  • Managing telecom BSS applications
  • Incident and outage management
  • Kubernetes cluster operations
  • Monitoring with Dynatrace, Instana, OpsGenie, and Thruk
  • RCA and service stabilization
  • CI/CD operational support using Jenkins and Git

System Engineer

May 2019 – Feb 2022

ValueLabs – Hyderabad

  • Delivered 24×7 production support services
  • Resolved Sev1–Sev4 incidents across critical services
  • Handled API monitoring and job failure recovery
  • Coordinated infrastructure upgrade windows
  • Built dashboards and improved monitoring visibility

Game Tester & Customer Care Representative

Oct 2018 – Apr 2019

Glu Mobile – Hyderabad

  • Performed game testing and bug reporting
  • Validated gameplay quality and release-readiness
  • Provided player support and feedback analysis

Customer Service Associate

2017 – 2018

Amazon – Hyderabad

  • Supported customer service and helpdesk processes
  • Handled incidents and service escalations
  • Tracked service requests for process closure

Production Operations & Site Reliability

Responsible for maintaining business-critical telecom production systems, ensuring 24×7 uptime and service reliability.

  • Kubernetes cluster lifecycle management and workload stability
  • Observability implementation with Prometheus and Grafana
  • Centralized log monitoring pipelines with Loki
  • Incident alerting and on-call action orchestration through OpsGenie
  • CI/CD operations support using Jenkins pipelines
  • Deployment automation with Helm charts and YAML configuration
  • Pod troubleshooting and remediation using kubectl workflows
  • RCA-driven reliability improvements and service hardening

Leadership & Achievements

  • Leading a 12-member operations team
  • Supporting business-critical telecom applications
  • Handling critical production incidents effectively
  • Stabilizing services during transition to steady state
  • Designing monitoring dashboards and alert frameworks
  • Conducting technical knowledge transfer sessions

GitHub

wild-apache

Automation, DevOps experiments and infrastructure tools.
