Scientific Computing Operations Expert
South San Francisco, CA

Job Description

Job Overview

As part of the Global Operations team, we are looking for a full-time infrastructure expert based in South San Francisco for our Scientific Computing Infrastructure environment. Your mission will be to provide multi-site operations support for the scientific infrastructure environments that serve our Research partners' needs for agility and performance. This includes installing, configuring, administering, monitoring and optimizing our High Performance Computing environment and related infrastructure components across the organization in a timely, cost-effective and efficient manner, as well as managing engagements with the external services that support this environment.

Key responsibilities:

* Take responsibility for the full operational support of the current scientific infrastructure and provide practical input into the evolution of the environment.


* Install, configure and operate the environments to deliver the performance and agility required by the diverse applications supported (several hundred).


* Contribute to the concept, planning and execution of projects.


* Provide in-depth technical skills and experience to enhance the overall capability of the team.


* Act as the point of escalation for managed services and for more junior members of the team, and work with them to resolve complex incidents and their underlying problems.


* Collaborate in solution design using an agile approach as required to meet specific business objectives.


* Create and implement automated solutions, and manage the implementation of major change initiatives with institution-wide impact.


* Participate in shaping the long-term strategy.


* Provide guidance to others on ways of increasing their contribution to the mission, objectives, and values of the organization.



Qualifications/Requirements:

* Bachelor's degree in Computer Science or equivalent work experience.


* Experience with ITSM and/or Agile/DevOps methodologies.


* Technical skills required:
* 10+ years of Scientific Computing Operations experience.


* Experience managing a parallel file system (e.g. GPFS, Lustre).


* Experience overseeing the integration and use of centralized identity management (AD, LDAP, Centrify) across all scientific infrastructure.


* Experience working with Linux engineers to resolve performance issues through kernel tuning and optimization of kernel extensions for scientific infrastructure.


* Demonstrated ability to partner with network engineers to troubleshoot and resolve inter-device network performance issues.


* Strong Linux administration knowledge (e.g. RH/CentOS).


* Technical operational skills, such as troubleshooting, capacity planning and root cause analysis.


* Familiar with multiple computational devices (blades, SMP, GPU, etc.) and experienced in optimizing their configurations to support diverse software stacks.


* Experience supporting the health, patching and maintenance of environments using compute cluster management tools (e.g. Bright CM).


* Demonstrated experience in managing the operations of a workload management scheduler (e.g. SLURM, UGE, Torque, LSF).


* Experience in environments using tiered storage and data lifecycle management, including object storage (NetApp StorageGRID), and data transfer (AFM with GPFS).


* Familiarity with the integration and operation of cloud-based scientific computing resources in a hybrid infrastructure model.


* Experience with hardware vendor management, including SLAs and delivery quality.


* Scripting experience: Bash, PowerShell, Perl, Ruby, Python.


* DevOps approach: infrastructure as code (IaC) and configuration management (e.g. Ansible, Puppet), automated build, test, release and deployment (e.g. Jenkins, Git), and testing frameworks (e.g. Pytest, testinfra).


* Monitoring tools/frameworks (e.g. Grafana, Ganglia, ELK, Nagios, Zabbix).


* Virtualization and containerization knowledge (e.g. Docker, Mesos-Marathon, Kubernetes, Singularity).


* An understanding of Computer Systems Validation and ITIL concepts is desired.





Other Requirements:

* Excellent customer orientation and delivery focus, with a good end-user perspective. Cares about the outcome and can drill down into a conversation with a developer or customer to reach agreement and solve problems.


* Proactively supports peers in own and other functions. Cares about people and can mentor others.


* Collaboration. Provides relevant and timely communication to teams.


* Lead people and get people thinking together about solving problems.


* Demonstrated experience in providing oversight and achieving quality outcomes through managed service engagements.


* Ability to communicate with emotional intelligence.


* Experience in fast-changing environments where solutions are deployed and retired at a high pace. Clear goal orientation and supportive of change.


* Advanced English language skills are a must; a second language (Spanish or German) would be considered an advantage.


* Ability to work effectively alone or within a team, including virtual teams.


* Proactivity, with a clear ability to think beyond boundaries, take controlled risks and assume responsibilities.


* Works to remove roadblocks and barriers to progress for the team.


* Experience in a global organization, working in an international and multicultural environment, is considered a valuable asset.


* Active open source contributor.


* Work experience in the pharmaceutical or scientific research sector is an advantage.


* Senior-level technical operational skills, such as troubleshooting, capacity planning, and root cause analysis.

