Site Reliability Engineer Uber
New York, NY

Uber is a provider of a mobile application connecting passengers with drivers for hire.

Job Description

Uber Overview

At Uber, we ignite opportunity by setting the world in motion. We take on big problems to help drivers, riders, delivery partners, and eaters get moving in more than 600 cities around the world.

We welcome people from all backgrounds who seek the opportunity to help build a future where everyone and everything can move independently. If you have the curiosity, passion, and collaborative spirit, work with us, and let's move the world forward, together.

About the Role

The Observability team builds the tools and systems that every engineering team at Uber uses to develop, scale, understand, and monitor their systems. These systems are absolutely critical to Uber: without them it would be impossible to understand and debug problems in an environment with over three thousand microservices and hundreds of thousands of CPU cores across multiple data centers and cloud zones, serving hundreds of thousands of concurrent trips around the world.

The Observability suite includes:

* M3, our open source distributed metrics stack, handles hundreds of millions of raw metrics per second, and is used to monitor and alert for every product and microservice at Uber.
* Jaeger, our open source enterprise tracing system, provides actionable insight into individual flows through our microservice architecture, and comprehension of the entirety of Uber's software ecosystem.
* Synoptic, our Uber-aware automatic dashboarding system, which displays context-sensitive information from across the Uber ecosystem, enabling quick detection and mitigation of issues.
* Our deeply integrated On-Call Experience suite of tools, which gives on-call engineers everything they need to raise, track, and close outage incidents, to track the signal-to-noise ratio (SNR) of alerts, and to drive improvements in their team health by reducing alert load.
* Blackbox, our system for externally monitoring our critical business endpoints, via emulated workflows.
* A new system under development to provide enterprise logging, with deep integration into our Observability stack, including alerting, linkage to traces, etc.

What You'll Do

* Partner with fellow engineers to architect and build mission-critical systems that can stand the test of scale and availability, while limiting operational overhead through automation and tooling
* Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring, and root cause analysis

What You'll Need

* Good programming skills in one of Go, Java, Python, or C++ and an ability to pick up new languages
* Drive and a strong feeling of ownership
* Expertise with Linux and a good understanding of its fundamentals and internals: filesystems and modern memory management, threads and processes, the user/kernel-space divide, etc.
* Experience building cross-datacenter, highly-available distributed systems. We need engineers who think about fault-tolerance, durability, and scalability
* Experience shaving off cycles and bits for optimal performance: profiling and optimizing your code and runtime environment for the hardware that it runs on
* Working knowledge of the TCP/IP stack, internet routing, and load balancing
* BS or MS in Computer Science, a related technical discipline, or equivalent experience