Keeping Twitter up and running 100% of the time is a challenging job. Accurately monitoring the health of every application that makes up the Twitter ecosystem 100% of the time is an order of magnitude more challenging. As a Site Reliability Engineer on the Observability Team at Twitter, you will work to improve the reliability and performance of the software systems that provide visibility into the services that run Twitter. Our monitoring, alerting, and visualization platform analyzes billions of metrics per minute and serves as the central nervous system of Twitter's architecture.
What You'll Do:
You will work shoulder-to-shoulder with our engineering teams to design and build the next generation of cloud and systems monitoring infrastructure, focusing on automation, availability, performance, and above all efficiency at 'reach every user on the planet' scale.
You will dive deep into gnarly operational issues from the software, systems, automation, and process perspectives. You will understand the challenges around integrating disparate infrastructures into new facilities, processes, and procedures.
Your responsibilities include but are not limited to:
* You will perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes.
* Solve issues across the entire stack: hardware, software, application and network.
* You will drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.
* You will mentor SREs on standard methodology for everything from monitoring to troubleshooting complex code issues.
* Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services.
* You will participate in code reviews for projects primarily written in Python, Java, and Scala, built on open source libraries such as Finagle, and running on both physical and containerized platforms.
* You will represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
Who You Are:
* Solid understanding of systems and application design, including the operational trade-offs of various designs.
* Practical, solid knowledge of shell scripting and at least one higher-level language (Python preferred).
* Practical knowledge of various aspects of service design, such as messaging protocols and behavior, caching strategies, and software design practices.
* Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
* Excellent understanding of Linux, specifically RHEL/CentOS.
* At least 2 years of experience handling services in a large-scale environment.
* Work well with and be able to influence a myriad of personalities at all levels.
* Ability to prioritize tasks and work independently.
* Be adaptable and able to focus on the simplest, most efficient, and most reliable solutions.
* Track record of successful practical problem solving, excellent written and social communication, and documentation skills.
* B.S. in computer science or similar field or equivalent experience.
* Practical experience in Python, Java, and/or Scala.
* Ability to lead technical teams through design and implementation across an organization.
We are committed to an inclusive and diverse Twitter. Twitter is an equal opportunity employer. We do not discriminate based on race, ethnicity, color, ancestry, national origin, religion, sex, sexual orientation, gender identity, age, disability, veteran status, genetic information, marital status or any other legally protected status.
Twitter is a company that provides a social networking platform.