Credit Karma is taking the convoluted world of personal finances and making it easy to understand. We are uniquely positioned to help our 80 million (and counting!) users take control of their financial lives. The Site Reliability Engineering team is responsible for taking the complicated world of site operations and making it easy for teams to run reliable and performant services. You'll be expected to improve automation, tools, processes, and communications related to running systems. You will work closely with our network, cloud, and platform engineering teams to deliver best-in-class service to the company and our members.
What the Job Entails:
* Manage the services that keep our systems running, including logging (Splunk), monitoring (Sensu), and configuration management (Salt, Terraform).
* Develop automation and tools to reduce toil and improve the repeatability of processes.
* Define reliability metrics (KPIs, SLOs) and work to ensure services meet them.
* Develop runbooks and processes to reduce MTTR in incidents.
* Collaborate with core infrastructure and service engineers to improve service reliability, scalability, and tooling.
* Troubleshoot issues across the entire stack: software, hardware, cloud, and networking.
* Participate in a 24x7 on-call rotation.
What's Great About It?
* The changes you make will directly improve our customers' experience by improving reliability.
* You'll improve the life of everyone around you by helping to reduce operational toil.
* You'll get broad exposure to our stack of technologies, such as Docker, Kubernetes, Splunk, GCP, and AWS.
* You'll learn a lot; we value continued learning and development.
* And, of course, all those awesome company perks that you probably already read about.
Our Ideal Candidate:
* 2-4 years of experience in a systems administration or DevOps role.
* You have a solid understanding of at least two of the following: Linux/Windows administration, networking, Docker/Kubernetes, or cloud infrastructure.
* Experience with Splunk administration and a solid understanding of its internals.
* You've built tooling to improve the reliability of systems, automate remediation of issues, or improve scalability.
* You have experience working in production environments at scale, and want to improve our availability and performance.
* Systems often need to be reconfigured, so you should have experience with a configuration management system like Puppet, Chef, or Salt. (We use Salt.)
* You should be able to clearly communicate technical details when speaking or writing.
* This position is part of a well-established team, and you should be excited about working closely with them and with product development teams.
* Working in the cloud is a little different, so it would be great if you have some experience with AWS or GCP.
* Our environment often has new challenges and technologies, so we want a candidate who is excited to learn.