The École polytechnique fédérale de Lausanne (EPFL) is one of the most dynamic university campuses in Europe and ranks among the top 20 universities worldwide. The EPFL employs 6,000 people supporting the three main missions of the institution: education, research and innovation. The EPFL campus offers an exceptional working environment at the heart of a community of 16,000 people, including over 10,000 students and 3,500 researchers from 120 different countries.
Site Reliability Engineer (W/M)
Your mission :
The EPFL Blue Brain Project (BBP), situated on the Campus Biotech in Geneva, Switzerland, applies advanced neuroinformatics, data analytics, high-performance computing and simulation-based approaches to the challenge of understanding the structure and function of the mammalian brain in health and disease. The BBP provides the community with regular releases of data, models and tools to accelerate neuroscience discovery and clinical translation through open science and global collaboration.
Main duties and responsibilities include :
BBP’s Core Services section is now looking for an experienced Site Reliability Engineer (W/M) to work on BBP’s High Performance Computing (HPC) and other mission-critical IT systems. This opportunity presents you with a chance to:
Ensure the reliability of our critical IT services, e.g. by implementing SRE best practices for availability, performance, utilisation, change management, emergency response and capacity planning
Develop monitoring, logging and metrics tools to manage risk
Automate IT processes - in order to get rid of toil, technical debt and manual work - using modern software engineering practices
Ensure reliable product launches and system upgrades on our IT platforms such as HPC, virtualization, containerization and storage, using modern software development, configuration management, CI/CD and infrastructure-as-code approaches
Contribute to IT security, e.g. by establishing modern and clever system update/upgrade methodologies
Your profile :
What you'll need to succeed:
Deep understanding of IT operations using software engineering practices
Practical & recent hands-on experience with full-lifecycle config management, provisioning and CI tools (e.g. Puppet, Git, Jenkins, Foreman)
Practical & deep experience of using Linux (e.g. RedHat/CentOS, Ubuntu) in server environments
Extensive knowledge of monitoring infrastructure (e.g. Icinga, Prometheus, ELK)
Experience in programming and scripting (e.g. Python, Ruby, shell)
Understanding of networking fundamentals (e.g. HTTPS, DNS, TCP/IP & load balancing) with ability to implement changes and diagnose issues with routing, network protocols, subnets and DNS
Knowledge of industry best practices to run secure infrastructure
We count as an advantage any experience with:
Administering HPC clusters, cloud and container platforms (e.g. OpenStack, OpenShift/Kubernetes)
Administering storage systems (e.g. NetApp, GPFS, CEPH)
Streamlining processes, and an interest in process development
You:
Dislike cutting corners and sweeping technical debt under the rug
Have a Bachelor's or Master's degree in computer science, or equivalent work experience
Are experienced in working in a collaborative and multi-cultural environment
Have excellent interpersonal and communication skills, written and oral
Are a self-starter, fast learner and eager to expand your domains of expertise
Have a proven ability to work both independently and in team-based environments
Are fluent in English (written and spoken)
We offer :
An internationally recognized research project using state-of-the-art HPC infrastructure
A dynamic, inter-disciplinary and international working environment
An opportunity to get your hands dirty with new technologies as they emerge
Great colleagues and excellent coffee
Start date :
As soon as possible
Term of employment :
Unlimited (CDI)
Duration :
Negotiable : CDI or 1 year CDD (renewable)
Remark :
Only candidates who apply through the EPFL website or our partner Jobup's website will be considered.