The Enterprise Computing (EC) Core Infrastructure Services organization is looking for a Site Reliability Engineer to manage operations, reliability and services for Morgan Stanley's suite of Software Distribution products, which are owned by the Artifact Curation and Distribution Control squad. This squad is responsible for lifecycle management and tooling for the packaging, curation and distribution of runtime artifacts across the firm, and for building a container-based Software Deployment Pipeline to Hybrid Cloud (public and private). The Site Reliability Engineering (SRE) team drives the reliability, recoverability and operational efficiency of this product portfolio. The SRE is expected to drive the implementation of advanced observability, troubleshooting tools, automation and technical debt management, working closely with the user community, development, engineering and the global support team that provides first-line support. The candidate will have the technical skills required to support these products on a Kubernetes platform. Hands-on experience in automation and in at least one pillar of the observability toolset is required, with expertise in defining system monitoring, not just reacting to alerts. Cloud experience is not necessary, but it would be an advantage.
Responsibilities include:

- Managing operations for the firm's Artifactory-based software distribution platform
- Maximizing the availability and performance of supported systems through optimized and automated plant management, ongoing problem management, and architecture reviews with engineering-side peers
- Reducing the cost of support through the elimination of toil and operational issues, optimization and automation of tasks, development of operational tools, and driving client self-service to minimize constraints
- Identifying and prioritizing technical debt that impacts client developer productivity, reliability or the efficiency of the ops team
- Complex troubleshooting in a Kubernetes and cloud environment
- Consulting with clients (the firm's internal development community and IT service practitioners) to maximize their productivity, including troubleshooting the issues they have in using the Software Distribution products
- Minimizing the escalation rate to the dev-side product delivery team members to ensure the department has the greatest possible flow of feature delivery
- Being operationally responsive, including sharing the on-call rotation with the rest of the global team (with a time-off-in-lieu system)

Required Qualifications / Skills

- Strong Linux or Kubernetes experience
- JFrog Artifactory experience
- Task automation experience in any programming language
- Experience with an observability stack such as Prometheus and Grafana
- Effective communication and collaboration skills
- Working knowledge in at least ONE of the following areas:
  - SQL
  - REST services (API)
  - Load balancing and networking
  - Performance troubleshooting and resolution

Desired Skills

- Postgres experience
- Python development for task automation
- Experience with site reliability engineering practices such as service level objectives (SLOs), error budgets, blameless postmortems and toil reduction
- Prior experience creating operational dashboards (Splunk, Grafana, etc.)

Central Business Solutions

www.cbsinfosys.com

IT Services and IT Consulting

Newark CA +9

Linux SRE Engineer
