About the role
High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.
Role Overview
- Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
- Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
- Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.
Responsibilities
- Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements
- Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors
- Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost
- Troubleshooting across the full stack, including hardware, networking, and distributed systems
- Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency
- Participation in an on-call rotation (approximately one week per month)
Key Attributes
- Strong ownership mindset with focus on delivery and accountability
- Experience building maintainable, well-documented systems in complex environments
- Ability to operate effectively in ambiguous and rapidly evolving contexts
- Clear and effective communication skills with collaborative, low-ego approach
Minimum Requirements
- 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing
- Strong written and verbal communication skills in English
- Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)
- Programming or scripting experience in Go, Python, or Bash
- Familiarity with infrastructure automation and infrastructure-as-code tools
- Strong technical foundation in computing or related discipline
Preferred Experience
- Experience operating large-scale machine learning or AI-compute workloads
- Background in multi-tenant distributed systems at scale
- Hands-on experience with data centre or bare-metal infrastructure
- Knowledge of high-performance networking technologies
- Experience managing large-scale storage systems (commercial or open-source)
Compensation & Benefits
- Competitive salary and equity package
- Retirement or pension contributions aligned with local standards
- Health coverage including medical, dental, and vision
- Generous paid time off policy