Job Description
Overview
The Site Reliability Engineer is responsible for monitoring design, uptime, and problem-resolution automation for TravelClick production applications hosted in TravelClick data centers and in the cloud. Key technologies include Docker, JBoss, Tomcat, Kubernetes, SQL, and multiple web server technologies. The candidate must be able to provide expert and prompt technology operations support in a high-energy, fast-paced environment.
The successful candidate will be bright, motivated, detail-oriented, and willing to go the extra mile to ensure exceptional results for our customers. This is a great opportunity in technology operations at a growing company, with opportunities for advancement for the right candidate.
Responsibilities
* Manage technical projects from conception to completion.
* Become a subject matter expert (SME) on multiple production applications and operations tools.
* Provide expert-level guidance during outage situations and production escalations.
* Create tools and automation to increase productivity across technology teams and to minimize downtime.
* Collaborate with development teams to optimize system availability, scalability, reliability, and resiliency.
* Instrument production systems, collecting metrics to improve observability and capacity planning.
* Participate during application releases.
* Script the automation of application recovery and technical procedures.
* Analyze and interpret application logs to determine problem areas.
* Design new monitoring solutions.
* Enhance current application and device monitoring systems.
* Evaluate application performance statistics, including application and system response times.
* Must be willing to work in a 24x7x365 environment, including a 24/7 rotating on-call schedule.
Basic Qualifications
* High school diploma or GED required.
* Advanced knowledge of the Linux and Windows operating systems, including extensive knowledge of systems engineering tasks.
* Experience with troubleshooting web server technologies.
* Experience with middleware such as Tomcat or JBoss.
* Experience with monitoring, alerting, and pipeline analysis tools such as AppDynamics, Splunk, and Nagios.
* Experience with log analysis tools such as Splunk, ELK, or a custom tool.
* Understanding of network technology concepts and usage.
* Advanced knowledge of Linux shell scripting.
* Can troubleshoot issues throughout the application and infrastructure stack.
Additional Characteristics
* BA or BS degree preferred.
* Fluency in Python, Ruby, or another common scripting language.
* Experience troubleshooting network latency and connectivity issues.
* Experience developing operational automation in a distributed environment.
* Knowledge of performing database queries across multiple database platforms.
* Knowledge of automated and centralized job scheduling.
* Experience in a mixed on-premises and cloud environment.
* Experience with a CDN such as Akamai or Cloudflare.
* Experience with VMware.
* Experience with Docker and Kubernetes or another container platform.
EEO Statement
"All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or protected veteran status."
Note to Applicants
IMPORTANT: We contact all applicants via email throughout the hiring process. It is recommended that you add iCIMS (@agents.icims.com) to your Approved/Safe Sender list to ensure that our emails are properly delivered to your inbox and not marked as spam.