Case Study 03

AI Voice Product

Turns out, AI can handle that call.

AI / Voice · Enterprise · IVR · Healthcare · 40% Agent load ↓

TL;DR

2024 was the year voice AI went from experiment to enterprise reality. Conversive moved fast to capture the opportunity, shipping an AI voice product that covered IVR modernisation, automated inbound and outbound call handling, and vertical-specific use cases for healthcare and recruitment. I owned the research, use case definition, quality requirements, and the evaluation framework that made the product enterprise-grade. The result: a 40% reduction in agent handling time across 10+ enterprise accounts.

Context

Company: Conversive by SMS-Magic
My Role: Associate Product Manager — owned voice AI product workstream
Verticals: Healthcare, Recruitment
Outcome: 40% reduction in agent handling time, 10+ enterprise accounts

For most of its history, Conversive was a text-first platform — SMS, WhatsApp, Email. Voice was a gap. In 2024, that gap became urgent. OpenAI debuted GPT-4o voice in May. ElevenLabs launched Conversational AI in November. 1-800-CHATGPT went live in December. Companies building voice AI represented 22% of the most recent YC class. The market was moving fast and enterprise buyers were starting to ask questions we couldn't answer.

Conversive acquired Voxgenie — a voice AI solution — and the decision was made to integrate it into the platform and ship a voice product. I was brought in to define what that product would actually be.

The Problem

What enterprise customers in healthcare and recruitment were dealing with

Both verticals share the same core pain: high-volume, high-stakes phone interactions that are expensive to staff and hard to scale.

In recruitment, staffing firms were running outbound screening calls at volume — calling candidates to verify availability, qualify for roles, schedule interviews. These are structured, repeatable conversations. Every one of them was being handled by a recruiter whose time was worth far more than reading from a screening script.

In healthcare, clinics and providers were managing appointment reminders, patient follow-ups, insurance verifications, and front-desk overflow. These calls followed predictable flows but were consuming significant staff time. After-hours calls were going to voicemail, meaning missed appointments and lost revenue.

Across both verticals, IVR (Interactive Voice Response) systems existed but were outdated — rigid, touch-tone menus that frustrated callers and couldn't handle natural language. Customers wanted to replace them with something that actually understood what callers were saying.

Why voice AI was the right moment

The timing mattered. By the end of 2024, three things had converged:

Model quality had crossed the threshold. Conversational quality — latency, interruptibility, naturalness — was now largely a solved problem. Voice agents were equalling or outperforming call centers on constrained, structured calls.
Cost had dropped dramatically. OpenAI dropped realtime API pricing by 60% for input and 87.5% for output in December 2024. Per-minute costs were no longer a blocker for enterprise ROI.
The wedge was clear. Enterprises rarely went from full human call-taking to full AI overnight — but after-hours overflow, net-new outbound, and structured screening calls were low-risk entry points with measurable value.

My Role

Led market and competitive research across the voice AI landscape (model companies, horizontal platforms, vertical specialists)
Defined the use case strategy: which call types to target first and why
Wrote the quality requirements PRD covering the foundational issues that had to be solved for enterprise readiness
Built the AI agent evaluation framework: scenarios, testcases, and evaluators for systematic quality testing
Worked with engineering on the Voxgenie integration and capability roadmap
Supported rollout to 10+ enterprise accounts across healthcare and recruitment

Discovery

The market research

Before defining what to build, I mapped the entire voice AI landscape. The market had three layers:

Model companies (ElevenLabs, Cartesia, Hume, PlayAI) — building the underlying voice models
Horizontal platforms (Vapi, Bland, Retell AI, Synthflow) — developer tools and no-code builders for deploying agents across use cases
Vertical specialists (Wayfaster for recruitment, Hyro and Hippocratic for healthcare, HappyRobot for logistics) — purpose-built for specific industries

I also mapped how the major CRMs were responding. Salesforce had Agentforce, but voice required Amazon Connect or ISV partners plus significant customisation. HubSpot supported voice channels via Breeze AI but had no automation capabilities, so customers needed a separate Retell AI subscription. Zoho Voice had simple workflows but no AI capabilities. None of them had a clean, native voice AI story.

This created a clear opening: Conversive could be the voice AI layer that CRM-native customers didn't have to stitch together themselves.

The build vs. partner decision

One of the first decisions was whether to build on existing platforms (Vapi, Retell, Synthflow) for faster time-to-market, or build on voice model platforms (ElevenLabs, Hume, Cartesia) for more control and differentiation.

Building on existing platforms meant faster launch but commoditised capabilities and pricing pressure as those platforms scaled. Building on model platforms meant more work but a genuine product layer we could own. Given the Voxgenie acquisition, we had a head start on the latter — and the ability to integrate voice directly into the Conversive conversation object (rather than treating it as a separate channel) was a real architectural advantage.

What enterprise customers actually needed

The wedge insight from market research proved out in customer conversations: enterprises didn't want to replace all calls with AI immediately. They wanted to start with:

After-hours and overflow calls — calls that would have gone to voicemail. Even if the AI just collects information and arranges a callback, that's captured intent that would have been lost.
Structured outbound calls — candidate screening in recruitment, appointment reminders and patient follow-ups in healthcare. Predictable flow, measurable outcome, easy to validate AI performance.
IVR modernisation — replacing touch-tone menu systems with natural language handling. Lower stakes than full call automation, high frustration-reduction for callers.

These three wedges became our Phase 1 use case strategy.

Key Decisions

1. Target constrained, high-volume call types first

Voice AI works best when calls have a clear structure and a measurable outcome. Recruitment screening and healthcare appointment flows are exactly that — a defined set of questions, a binary outcome (qualified/not qualified, confirmed/rescheduled), and enough volume that the economics work easily.

We explicitly deprioritised open-ended, high-complexity calls (escalations, complaints, negotiations) for Phase 1. The risk/trust bar for those was too high and the success criteria too fuzzy.

2. Quality as a prerequisite, not an afterthought

Shipping a voice product that felt robotic or unreliable would be worse than not shipping at all — especially in healthcare and recruitment, where a bad call experience directly damages trust with end customers.

I defined seven foundational quality requirements that had to be met before we could go to enterprise accounts:

Immediate agent response — agent must respond within 0.5–2 seconds after the user's first utterance. Silence at the start of a call reads as a dropped line.
Accurate live transcription during interruptions — when a user interrupts mid-response, the transcript must reflect only what was actually spoken. The LLM generates its next response from the transcript; any inaccuracy cascades.
End-of-turn detection — the agent should not interrupt prematurely or wait too long. Target: <5% premature interruptions, <5% over-pauses.
Ambient sound — calls should not feel sterile. Complete silence makes callers think the line has dropped.
Backchanneling — natural listening cues ("uh-huh", "okay") during long user speech. Without them the agent feels inattentive.
p95 end-to-end latency under 2 seconds — LLM generation latency was the main bottleneck; outliers had to be visible in monitoring.
Background noise handling — the agent must remain stable in real-world environments: traffic, office chatter, speakerphone. False interruptions from background noise break trust fast.
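
The measurable requirements in this list (p95 latency under 2 seconds, under 5% premature interruptions, under 5% over-pauses) lend themselves to an automated gate. The Python sketch below is illustrative only: the `TurnSample` fields, the `quality_gate` function, and the nearest-rank percentile method are assumptions made for this example, not the production monitoring code.

```python
import math
from dataclasses import dataclass


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


@dataclass
class TurnSample:
    # Hypothetical per-turn measurements, assumed to be logged by the voice stack.
    latency_s: float           # end of user speech -> first agent audio
    premature_interrupt: bool  # agent spoke before the user had finished
    over_pause: bool           # agent waited too long after the user finished


def quality_gate(turns):
    """Check a batch of turns against the thresholds stated in the requirements."""
    n = len(turns)
    p95 = percentile([t.latency_s for t in turns], 95)
    premature_rate = sum(t.premature_interrupt for t in turns) / n
    over_pause_rate = sum(t.over_pause for t in turns) / n
    return {
        "p95_latency_s": p95,
        "premature_rate": premature_rate,
        "over_pause_rate": over_pause_rate,
        "passed": p95 < 2.0 and premature_rate < 0.05 and over_pause_rate < 0.05,
    }
```

Reporting p95 rather than the mean keeps slow outliers visible: a handful of 4-second responses can hide inside a healthy average but will move the tail.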

None of these are features. They're table stakes. Without them, enterprise customers won't trust the product with their customers.

3. Build a systematic evaluation framework

Manually testing voice AI conversations doesn't scale. I built a structured evaluation pipeline so we could test agent quality systematically rather than ad hoc.

The framework had four layers:

Capability extraction — testable rules pulled from the agent's system prompt (what it must do, must not do, flow constraints)
Scenario generation — real-world situation buckets that expose failure modes: happy path, hesitant users, edge cases, knowledge boundary tests
Testcase generation — 3–5 realistic multi-turn conversation variations per scenario, covering different user behaviours and phrasings
Three independent evaluators — scenario outcome (did the call achieve its goal?), capability assertions (did the agent follow its rules?), and KB grounding (did the agent stay within what it actually knows?)

This separation mattered: an agent can pass its scenario outcome but still violate a capability rule. Evaluating them independently surfaces different classes of failure.
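
To make the separation concrete, here is a minimal Python sketch of three evaluators that each return their own verdict. Everything in it is a simplification for illustration: the `CallTestcase` and `Verdict` types, the substring-based rule check, and the pre-extracted `claims` list are assumptions, and a real pipeline would likely use a far richer judge for each step.

```python
from dataclasses import dataclass


@dataclass
class CallTestcase:
    scenario: str           # scenario bucket, e.g. "happy path" or "hesitant user"
    agent_turns: list[str]  # what the agent said, in order
    claims: list[str]       # factual claims the agent made (extraction assumed done upstream)
    goal_reached: bool      # recorded outcome of the simulated call


@dataclass
class Verdict:
    evaluator: str
    passed: bool
    detail: str = ""


def scenario_outcome(tc: CallTestcase) -> Verdict:
    """Did the call achieve its goal?"""
    return Verdict("scenario_outcome", tc.goal_reached)


def capability_assertions(tc: CallTestcase, forbidden_phrases: list[str]) -> Verdict:
    """Did the agent follow its rules? Here: a naive check for forbidden phrases."""
    text = " ".join(tc.agent_turns).lower()
    hits = [p for p in forbidden_phrases if p.lower() in text]
    return Verdict("capability_assertions", not hits, ", ".join(hits))


def kb_grounding(tc: CallTestcase, kb_facts: set[str]) -> Verdict:
    """Did the agent stay within what the knowledge base supports?"""
    unsupported = [c for c in tc.claims if c not in kb_facts]
    return Verdict("kb_grounding", not unsupported, ", ".join(unsupported))


def evaluate(tc, forbidden_phrases, kb_facts) -> list[Verdict]:
    # Each evaluator runs independently; one failing never masks another.
    return [
        scenario_outcome(tc),
        capability_assertions(tc, forbidden_phrases),
        kb_grounding(tc, kb_facts),
    ]
```

A call that books the interview but promises the candidate the job would pass `scenario_outcome` and fail `capability_assertions`, which is exactly the class of failure a single pass/fail score would hide.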

What We Built

IVR Modernisation
Replaced touch-tone menu systems with natural language call handling. Callers speak naturally; the agent understands intent and routes accordingly — no more "press 1 for..."

Automated Inbound Call Handling
AI agent handles incoming calls, collects structured information, routes to human agents when needed, or resolves the call independently. After-hours calls no longer go to voicemail.

Outbound Call Automation
Structured outbound flows for candidate screening (recruitment) and appointment reminders and patient follow-ups (healthcare). Agent calls down a list, completes the interaction, logs the outcome to the CRM.

Vertical Use Cases

Recruitment: Candidate availability screening, role qualification, interview scheduling
Healthcare: Appointment reminders, patient follow-up calls, front desk overflow

AI Agent Quality Framework
Systematic evaluation pipeline covering scenario coverage, capability assertion testing, and KB grounding evaluation — enabling repeatable quality measurement across agent versions and use cases.

Outcome

40% reduction in agent handling time across 10+ enterprise accounts in healthcare and recruitment.

The structured, repeatable call types we targeted in Phase 1 — candidate screening, appointment reminders, IVR modernisation — were exactly the ones where AI handled the most volume with the least human escalation. Recruiters and healthcare staff shifted time from routine calling to higher-value work.

The quality framework also changed how we shipped. Instead of testing manually before each release, we had a repeatable pipeline that caught regressions in agent behaviour before they reached customers.

What I'd Do Differently

I'd define success metrics per use case earlier. "40% reduction in agent handling time" is a portfolio metric. It doesn't tell you which use cases are working and which aren't. Recruitment screening and healthcare appointment reminders have different success shapes — one is about qualification accuracy, the other is about show rates. I'd instrument per-use-case from the start so we could double down on what's working and fix what isn't, rather than optimising for the blended number.
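
A small Python sketch shows how a blended number can mask very different per-use-case results. The use case names and minute totals in the usage note are invented purely for illustration and are not project data; `reduction` and `report` are hypothetical helpers.

```python
def reduction(baseline_min: float, current_min: float) -> float:
    """Fractional reduction in agent handling time."""
    return (baseline_min - current_min) / baseline_min


def report(use_cases: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Per-use-case reduction plus the blended figure across all use cases.

    use_cases maps a name to (baseline minutes, current minutes).
    """
    out = {name: reduction(b, c) for name, (b, c) in use_cases.items()}
    total_baseline = sum(b for b, _ in use_cases.values())
    total_current = sum(c for _, c in use_cases.values())
    out["blended"] = reduction(total_baseline, total_current)
    return out
```

With made-up inputs of `{"recruitment_screening": (1000, 400), "appointment_reminders": (1000, 800)}`, the blended reduction is 40% while the two use cases sit at 60% and 20%. Use-case-specific measures such as qualification accuracy or show rate would be tracked alongside this, since handling time alone does not capture them.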

I'd invest more in the handoff experience. When a voice AI call escalates to a human agent, the transition is a critical moment. If the agent doesn't pass context cleanly — what was asked, what was said, where in the flow the caller is — the customer has to repeat themselves and trust collapses. We solved for it, but I underestimated how much friction that handoff could still create in real deployments. I'd treat escalation quality as a first-class product requirement from day one, not a polish task.
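
One way to treat escalation quality as a requirement is to make the handoff payload explicit and testable. The Python sketch below is a hypothetical shape, not the schema we shipped: the `HandoffContext` fields and the `missing_fields` check are assumptions chosen to mirror the three things a human agent needs (what was asked, what was said, where in the flow the caller is).

```python
from dataclasses import dataclass, field


@dataclass
class HandoffContext:
    caller_request: str     # what was asked
    flow_step: str          # where in the flow the caller is
    escalation_reason: str  # why the AI agent handed off
    collected: dict[str, str] = field(default_factory=dict)  # what was said so far

    def briefing(self) -> str:
        """Plain-text summary a human agent can read before picking up."""
        lines = [
            f"Request: {self.caller_request}",
            f"Flow step: {self.flow_step}",
            f"Escalated because: {self.escalation_reason}",
        ]
        lines += [f"{key}: {value}" for key, value in self.collected.items()]
        return "\n".join(lines)


def missing_fields(ctx: HandoffContext, required: list[str]) -> list[str]:
    """Fields the human agent would have to ask the caller for again."""
    return [key for key in required if not ctx.collected.get(key)]
```

Counting `missing_fields` per escalation would give a direct measure of how often callers have to repeat themselves.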