Job Description
Title: Sr. Site Reliability Engineer
Location: Boston, MA
Duration: 6 months+
Industry: Educational (Big 3)
Essential Accountabilities:
* Hands-on design, analysis and troubleshooting of highly distributed, large-scale production systems
* Ownership of reliability, uptime, capacity, and performance analysis thereof
* Ensuring the repeatability, traceability, and transparency of our infrastructure automation
* Identifying highest-impact opportunities to optimize existing systems
* System design consulting for teams seeking to leverage or improve their production infrastructure
* Anticipating, building and planning capacity for upcoming product/feature launches
Required Skills:
* Mastery of AWS services (IAM, EC2, S3, EBS/EFS, ELB/ALB, AutoScaling, RDS and replication techniques, VPC, Subnets, Elastic IP, Route53, CloudWatch, CloudFront, Lambda, CloudFormation, ECS, SNS, ElastiCache)
* Expertise in container/container-fleet-orchestration technologies (like Docker, Kubernetes, AWS ECS)
* Expertise in designing and managing escalation response plans, from monitoring through react, respond, remediate and retrospect, in culturally aligned (proactive, customer-focused, collaborative, data-driven and AUTOMATED) ways
* Mastery of infrastructure build and configuration automation technologies (like Terraform, Ansible, Puppet, CodeDeploy, Chef)
* Strong skills in reading, understanding and writing code in at least two of: JavaScript, Python, PHP, Go, or Ruby
* Strong network engineering skills
* Cloud- and container-native Linux administration/build/management skills (AWS AMIs, Packer, etc.)
* Significant experience troubleshooting concurrent and distributed system interactions
* Expertise with continuous-deployment software development lifecycles in the Cloud (e.g., CI/CD)
* Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora), caching operations and deployments (Memcached, Redis)
* Expertise with Lean/Agile deployment processes (ZDT: Blue/Green, Canary, DNS strategies)
* Familiarity with site and infrastructure monitoring systems (CloudWatch, Datadog, New Relic, Sumo Logic, ThousandEyes)
* Strong problem-solving, root cause analysis and systems engineering skills
* Good presentation and communication skills
* Expertise with SDLC branching, SCM, and code deployment systems (Git/Gitflow, Jenkins, CircleCI, etc.)
* BS Degree in Computer Science (or related technical field and/or equivalent industry experience).