About the role
This role sits at the centre of how we measure and improve AI systems in production.
You’ll define what good performance means across LLMs, ASR, TTS, and full speech-to-speech pipelines, and build the datasets, metrics, and evaluation systems that make AI quality measurable and comparable in the real world.
You’ll work closely with engineering and product teams to ensure model changes lead to real improvements in user experience, not just better offline benchmarks.
What you’ll do
- Design and run evaluations across LLM, ASR, TTS, and speech-to-speech systems
- Build real-world datasets and test cases from production behaviour and edge cases
- Define metrics and scorecards for model and system quality
- Benchmark internal models against external and frontier systems
- Evaluate full pipelines (ASR → LLM → TTS), not just individual models
- Build Python tools to automate evaluation workflows
- Create internal leaderboards, red-teaming setups, and regression tests
- Work with engineers and product teams to diagnose system failures
- Turn vague product goals into measurable evaluation frameworks
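To give a flavour of the tooling involved, here is a minimal, hypothetical sketch of the kind of automated evaluation a candidate might build: a word error rate (WER) metric for ASR output plus a simple regression gate over a test set. The function names and the 15% threshold are illustrative assumptions, not part of this role's actual stack.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: token-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

def regression_check(cases, threshold=0.15):
    """Gate a model change: fail if average WER over the test set exceeds the threshold."""
    scores = [wer(ref, hyp) for ref, hyp in cases]
    avg = sum(scores) / len(scores)
    return avg <= threshold, avg
```

In practice a check like this would run in CI on curated production transcripts, comparing the candidate model's average WER against the current baseline before a change ships.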
What this role is about
- Defining and measuring AI quality in production systems
- Turning real user behaviour into structured evaluation signals
- Ensuring model changes improve real-world performance
- Understanding why AI systems fail, not just whether they do
What good looks like
- You can translate fuzzy notions of "better quality" into measurable metrics
- You think in terms of system impact (before vs after), not just accuracy
- You’re comfortable working across code, data, and production systems
- You care about real-world behaviour, not just benchmarks
Core skills
- Strong Python (scripting, data analysis, tooling)
- Experience with ML systems, evaluation, or experimentation
- Understanding of LLMs or speech systems (ASR / TTS)
- Ability to design test cases and structured datasets
- Comfortable working with engineers and product teams
Nice to have
- Experience with LLM evaluation or benchmarking
- Exposure to speech or multimodal systems
- Familiarity with production APIs or ML systems
- Experience with automated testing or CI-style workflows