SRE / HPC Engineer

  • Full Time
  • United Kingdom
  • 41,000 - 78,000 GBP / Year

Website: Fluidstack

Job description

Fluidstack is an AI cloud. We work with many of the top AI companies on the planet, including Poolside, Meta, Modal, Reka, and many more.

About the role:

Our HPC Engineers make sure our GPU infrastructure is working at peak performance and offer top-tier support to our customers.

At its core, you will have three main responsibilities:

• Deployment. We will be onboarding new clusters at least monthly – you will help take bare-metal servers and deploy them for our customers as high-performance compute as a service.
• Automation. Our GPU fleet is large and growing. You will help automate many of our processes and systems so that Fluidstack can continue to scale.
• Support. This will be a client-facing role – you will work closely with our customers to make sure that they are able to utilize our infrastructure to achieve their goals. You will work on everything from GPU debugging and Slurm management to training performance optimization.

Skills & Experience

• Experience with HPC systems, system administration, SRE, or DevOps.
• Experience with large-scale workloads utilizing orchestrators like Slurm or Kubernetes.
• Experience with automation of bare-metal machines and containers, using tools such as Ansible, Bash, or Python.
• Experience with shared storage on platforms such as NFS, DDN, Vast, CephFS, etc.
• Experience provisioning large-scale clusters and networks with tools such as BCM or UFM.
• Experience with large-scale GPU systems, working with NVIDIA GPUs and InfiniBand networks.
• Fast learner, adaptable, and passionate about Fluidstack’s mission!

If any of the above bullets resonate with you, please reach out!

To apply for this job please visit uk.indeed.com.