Senior Site Reliability Engineer

Realm

London

Posted 3 days ago

About the role

High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.

Role Overview

Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.

Responsibilities

Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements
Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors
Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost
Troubleshooting across the full stack, including hardware, networking, and distributed systems
Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency

Participation in an on-call rotation required (approximately one week per month).

Key Attributes

Strong ownership mindset with focus on delivery and accountability
Experience building maintainable, well-documented systems in complex environments
Ability to operate effectively in ambiguous and rapidly evolving contexts
Clear and effective communication skills with collaborative, low-ego approach

Minimum Requirements

5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing
Strong written and verbal communication skills in English
Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)
Programming or scripting experience in Go, Python, or Bash
Familiarity with infrastructure automation and infrastructure-as-code tools
Strong technical foundation in computing or related discipline

Preferred Experience

Experience operating large-scale machine learning or AI-compute workloads
Background in multi-tenant distributed systems at scale
Hands-on experience with data centre or bare-metal infrastructure
Knowledge of high-performance networking technologies
Experience managing large-scale storage systems (commercial or open-source)

Compensation & Benefits

Competitive salary and equity package
Retirement or pension contributions aligned with local standards
Health coverage including medical, dental, and vision
Generous paid time off policy

About this listing

Screened by Joboru

Stott & May Professional Search Limited
