Our Engineering teams at Kelley Blue Book use the latest technologies to build solutions that transform the way people research, buy and sell cars. We are constantly looking for ways to improve what we offer and are energized about building a platform that is revolutionizing the industry. We are currently seeking ambitious engineers who are comfortable working with many different teams in a fast-paced environment and have the passion and skills to take our product offering to the next level.
Our Production Engineering team is obsessed with speed and availability, passionate about automation, and devoted to building meaningful relationships with our software engineers to help evolve and scale our platforms. As a Site Reliability Engineer, you will work as a member of an engineering team focused on building and running large-scale, distributed, fault-tolerant systems in the cloud. You will collaborate with some very smart people, focusing on performance and reliability, while increasing the development velocity of our software engineering teams. The ideal candidate for this position has a deep understanding of systems architecture, tempered with knowledge of how applications interact with systems at scale. They will be comfortable working in AWS and investigating how software performs, how network traffic flows, and how service daemons interact with one another.
If you love to figure out how all the pieces are put together in a complex environment, have a passion for innovating in the newest technologies, and enjoy building solutions to manage applications, we want to talk to you.
What You Get To Do:
* Work with product managers and software engineers to increase the scalability, reliability, and performance of our systems
* Think at scale... big scale... with a focus on ensuring stability and maximizing the performance of services you own
* Take advantage of cloud computing capabilities using Amazon Web Services (AWS)
* Participate in service capacity planning and demand forecasting, software performance analysis and system tuning
* Write automation code for provisioning and operating infrastructure at large scale
* Partner with software engineering teams to make sure their applications fit nicely within the infrastructure and that scalability and reliability are designed and implemented from the ground up
* Roll up your sleeves to troubleshoot problems across the entire stack (hardware, network, datastores, and application) and build automation to prevent problem recurrence
* Identify underlying root causes and provide recommendations or solutions for long-term, permanent fixes to critical production issues
* Take ownership and strive to do work you're proud of. You believe in spreading (and acquiring) knowledge through mentorship and collaboration
* Develop effective documentation, tooling, and alerts to both identify and address reliability risks.
* Participate in on-call rotation with other members of the Site Reliability Engineering team.
What You Bring:
* Advanced proficiency with at least one scripting or programming language (we care that you know how to write code, not what language it is)
* Solid Windows and Linux experience
* Experience using a distributed version control system (Git preferred) for code check-ins, branching, merging, pull requests, code reviews, etc.
* Hands-on experience building infrastructure and supporting applications in AWS using services such as Elastic Beanstalk, Lambda, EC2, ECS, S3, SNS, Aurora, RDS, DynamoDB, Route53
* Familiarity with configuration management and infrastructure as code (IaC) tools such as Ansible, Terraform or CloudFormation
* Knowledge of CI/CD best practices and tools such as AWS CodeBuild, Jenkins and TeamCity
* Experience designing and delivering secure, high-performance and highly available cloud services
* Experience working with partners to define and track SLIs, SLOs and SLAs using metrics and monitoring to ensure the objectives are met or exceeded
* Strong understanding of networking and DNS
* Experience working with container technologies such as Docker
We'd Love to See:
* Someone who sees technology as a hobby and not work, and who enjoys taking things apart and then putting them back together to improve them
* Experience working in an agile, iterative, developer-empowered environment where software delivery teams deploy and monitor their applications throughout the application lifecycle
* Experience with monitoring, analysis, and alerting tools like New Relic, Splunk and Dynatrace
* Understanding and practice of cost containment and Game Day activities
* Experience with caching layer technologies (ElastiCache / Memcached, Redis) and CDN services such as Akamai or CloudFront
* Experience with database operations at scale (MySQL, MongoDB, DynamoDB, Postgres)
* You thrive on technical challenges and take pride in solving them
* Deliver insightful recommendations in a concise and persuasive manner
* Strong interpersonal and communication skills with a focus on customer service
* Bachelor's or Master's Degree in Computer Science or related field, or equivalent experience.