Site Reliability Engineer

5 - 10 years

10.0 - 15.0 Lacs P.A.

Shimla

Posted: 2 months ago | Platform: Naukri

Skills Required

Linux, Networking, MySQL, Database administration, Windows, Oracle, Troubleshooting, Middleware, RHCE, System administration

Work Mode

Work from Office

Job Type

Full Time

Job Description

The Site Reliability Engineer (SRE) ensures the availability, reliability, performance, and security of applications and infrastructure at the State Data Center (SDC). This role involves proactive monitoring, incident response, system optimization, and process improvements to maintain high service levels and compliance with security standards. The SRE will work closely with IT teams to enhance system resilience and efficiency.

Roles and Responsibilities

Implement infrastructure monitoring (CPU, memory, disk, network) using Zabbix, Prometheus, Grafana, or the ELK Stack.
Monitor database performance (PostgreSQL, MySQL, Oracle DB) and recommend optimizations.
Establish log aggregation and alerting mechanisms to detect anomalies.
Generate uptime and SLA compliance reports for management review.
Diagnose system and network issues, escalate as required, and track resolution.
Maintain a ticketing system for issue documentation and trend analysis.
Conduct root cause analysis (RCA) and implement preventive measures.
Perform post-incident reviews (PIRs) to improve system resilience.
Ensure high availability and failover readiness for critical services.
Optimize database indexing, query performance, and backup strategies.
Perform capacity planning to ensure systems can handle peak loads.
Implement automated scaling and load balancing for performance optimization.
Enforce access control policies, including firewalls, SSH restrictions, and IAM.
Ensure timely patching and hardening of operating systems, middleware, and databases.
Monitor for security vulnerabilities and implement necessary mitigations.
Ensure compliance with government security policies (CERT-In, ISO 27001).
Ensure real-time replication of databases to the disaster recovery (DR) site.
Conduct regular failover testing to validate DR readiness.
Maintain documentation and runbooks for disaster recovery scenarios.
Maintain incident reports, troubleshooting guides, and standard operating procedures (SOPs).
Track service-level agreements (SLAs) and prepare compliance reports.
Develop training sessions for internal teams on monitoring tools and processes.

Desired Skills/Background

5+ years of experience in SRE, IT Operations, or System Administration.
Strong knowledge of Linux (Ubuntu, RHEL, CentOS) and Windows Server.
Experience with monitoring tools (Zabbix, Prometheus, Grafana, ELK, Splunk).
Knowledge of networking, VPNs, firewalls, and load balancers.
Familiarity with cloud services and on-premises infrastructure.
Experience in database administration (PostgreSQL, MySQL, Oracle).
Strong troubleshooting and incident management skills.
Certification: AWS Certified SysOps Administrator, RHCE, ITIL, or Zabbix Certified Specialist.
Experience working with State Data Centers (SDCs) and government IT projects.

Supply Chain Management
Tech City
