NEXA 247

Technical Deep Dive — Voice AI Infrastructure for Real-Time Call Orchestration at Scale
For engineering leaders, CTOs, and technical evaluators
SLIDE 1

System Overview — What We've Built

Platform scope and core value proposition

Nexa 247 is a multi-tenant voice AI orchestration platform that dynamically provisions and manages AI receptionists for trade businesses. Each inbound call triggers a real-time pipeline: telephony ingestion, tenant resolution, dynamic prompt assembly, tool-augmented conversation, and post-call intelligence extraction. The live-call stages run at sub-second latency.


This is not a chatbot with a phone number. It is a stateful, tool-calling voice agent that books appointments, qualifies leads, and escalates emergencies in real time, mid-conversation.

Real-Time Voice AI

  • STT → LLM → TTS in <300ms
  • Streaming turn-by-turn pipeline
  • Deepgram + Claude Sonnet 4 + ElevenLabs

Tool-Augmented Actions

  • Live calendar booking mid-call
  • Lead capture & urgency scoring
  • Emergency escalation via SMS
  • Knowledge base retrieval (RAG)

Multi-Tenant at Core

  • N tenants, one VAPI account
  • Per-tenant prompt, tools, voice
  • PostgreSQL RLS + Redis isolation
  • Metered Stripe billing per tenant

Intelligent Knowledge

  • Web scrape → embed → pgvector
  • Hybrid BM25 + semantic search
  • PDF, URL, raw text sources
  • Auto-service extraction (Claude Haiku)
SLIDE 2

Architecture — The Call Lifecycle

End-to-end call flow from PSTN to post-call intelligence
INBOUND CALL (PSTN)
       │
       ▼
┌─────────────┐      GSM Conditional Forwarding
│   Twilio    │◄───  (*61* no-answer, *67* busy, *62* unreachable)
│   Gateway   │      Tenant keeps their existing number
└──────┬──────┘
       │
       ▼
┌─────────────┐      Real-time voice pipeline
│    VAPI     │      STT (Deepgram) → LLM (Claude Sonnet 4) → TTS (ElevenLabs)
│  Voice AI   │      Streaming turn-by-turn, <300ms end-to-end latency
└──────┬──────┘
       │  assistant-request / function-call webhooks
       ▼
┌──────────────────────────────────────────────────────┐
│            ORCHESTRATION LAYER (FastAPI)             │
│                                                      │
│  ┌──────────────────┐    ┌────────────────────────┐  │
│  │ Tenant Resolver  │    │ Prompt Assembler       │  │
│  │ phone_number_id  │───▶│ skills + KB block +    │  │
│  │ → tenant context │    │ config → system prompt │  │
│  └──────────────────┘    └────────────────────────┘  │
│                                                      │
│  ┌──────────────────┐    ┌────────────────────────┐  │
│  │ Tool Router      │    │ Post-Call Pipeline     │  │
│  │ • calendar       │    │ transcript → summary   │  │
│  │ • sms            │    │ → sentiment → lead     │  │
│  │ • lead_capture   │    │ → usage billing        │  │
│  │ • kb_retrieval ✦ │    │                        │  │
│  └──────────────────┘    └────────────────────────┘  │
│                                                      │
└──────────────────────┬───────────────────────────────┘
                       │
       ┌───────────────┼──────────────┬────────────────┐
       ▼               ▼              ▼                ▼
┌────────────┐  ┌────────────┐  ┌──────────┐  ┌──────────────┐
│ PostgreSQL │  │   Redis    │  │  Stripe  │  │  pgvector ✦  │
│ (Supabase) │  │  (Cache)   │  │ (Billing)│  │  KB chunks   │
│ tenants,   │  │ tenant cfg │  │ metered  │  │  HNSW index  │
│ calls,     │  │ ingest lock│  │ usage    │  │  384-dim     │
│ leads      │  │ sessions   │  │ tracking │  │  embeddings  │
└────────────┘  └────────────┘  └──────────┘  └──────────────┘
SLIDE 3

The Critical Path — assistant-request (<1s)

The make-or-break webhook that bootstraps every call
Key invariant: This must return in <1 second or the caller hears silence. No DB hit on the hot path — everything is Redis-cached.
# Hot path: phone_number_id → full assistant config
async def handle_assistant_request(payload):
    phone_id = payload["phone_number_id"]

    # 1. Tenant resolution: Redis-cached (<0.5 ms lookup, >99% hit rate)
    tenant = await cache.get_tenant_by_phone(phone_id)

    # 2. Dynamic prompt assembly — skills-based + KB block
    system_prompt = build_prompt(
        base_skills=["greeting", "booking", "lead_qualification"],
        tenant_config=tenant.ai_config,
        services=tenant.services,
        business_hours=tenant.hours,
        personality=tenant.voice_personality,
        smb_knowledge=tenant.smb_knowledge,  # KB mode routing
    )

    # 3. Tool injection — only tools this tenant has enabled
    tools = resolve_tools(tenant)  # calendar, sms, lead_capture, kb_retrieval

    # 4. Return complete assistant config
    return AssistantConfig(
        model="claude-sonnet-4-20250514",
        voice={"provider": "11labs", "voice_id": "australian_male"},
        system_prompt=system_prompt,
        tools=tools,
        metadata={"tenant_id": tenant.id}
    )

Key design decisions

SLIDE 4

Multi-Tenant Isolation at the Voice Layer

N:1 architecture — one VAPI account, thousands of tenants
┌──────────────────────────────────────────┐
│       VAPI Platform (Single Acct)        │
│                                          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐  │
│  │ Phone #1 │ │ Phone #2 │ │ Phone #N │  │
│  │ (Twilio) │ │ (Twilio) │ │ (Twilio) │  │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘  │
│       └────────────┼────────────┘        │
│        assistant-request webhook         │
└────────────────────┼─────────────────────┘
                     ▼
         ┌───────────────────────┐
         │     Nexa Backend      │
         │                       │
         │ phone_number          │
         │ → tenant_id           │
         │ → unique prompt       │
         │ → unique KB context ✦ │
         │ → unique tools        │
         │ → unique voice        │
         └───────────────────────┘

Isolation guarantees

Layer | Mechanism
Data isolation | Row-level security in PostgreSQL (tenant_id on every table)
Prompt isolation | System prompts assembled per-tenant, never shared
KB isolation | smb_knowledge_chunks filtered by tenant_id on every query
Tool isolation | OAuth credentials encrypted per-tenant (Fernet symmetric encryption)
Billing isolation | Stripe customer/subscription per-tenant, metered independently
Call isolation | Each call tagged with tenant_id in metadata, traced in Langfuse
Ingest isolation | Redis distributed lock per-tenant prevents concurrent KB ingests
SLIDE 5

Tool-Augmented Conversation — Mid-Call Actions

The AI doesn't just talk — it acts

check_availability

  • Trigger: caller asks about times
  • Google Calendar FreeBusy API
  • ~800ms (cached 5min in Redis)
  • Filters by business hours + rules

create_booking

  • Trigger: caller confirms a slot
  • Calendar event + SMS to owner
  • Stores booking in PostgreSQL
  • Extracts name/phone/service/address

capture_lead

  • Trigger: interested but not booking
  • Hot / Warm / Cold urgency scoring
  • Lead creation + notification
  • Emergency escalation path

answer_product_service_question NEW

  • Trigger: caller asks about services/pricing
  • Hybrid BM25 + vector search on KB chunks
  • Returns top-5 relevant passages
  • Only injected when tenant has vector-mode KB
  • Grounded by design: retrieves first, then answers from the retrieved passages
// answer_product_service_question — RAG retrieval mid-call
{
  "trigger": "Caller asks about pricing, services, or specific products",
  "flow": [
    "1. LLM detects information question (not a booking action)",
    "2. VAPI sends function-call webhook to Nexa backend",
    "3. Backend runs hybrid_search(tenant_id, question, top_k=5)",
    "   ├─ BM25 leg: PostgreSQL tsvector FTS (GIN index)",
    "   ├─ Vector leg: pgvector cosine ANN (HNSW index)",
    "   └─ Fuse both legs with Reciprocal Rank Fusion (k=60)",
    "4. Return top-5 chunk contents to VAPI",
    "5. LLM synthesises answer from retrieved context",
    "6. AI responds to caller with accurate, grounded information"
  ],
  "latency": "~50–120ms (both legs run concurrently via asyncio.gather)",
  "fallback": "If KB empty or inactive, AI answers from form-only profile"
}
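
The latency note above depends on both retrieval legs running concurrently. A minimal sketch of that pattern follows; `bm25_leg`, `vector_leg`, and `run_both_legs` are hypothetical stand-ins with simulated delays and made-up chunk IDs, not the backend's real queries.

```python
import asyncio


async def bm25_leg(tenant_id: str, question: str) -> list[str]:
    """Stand-in for the full-text query; returns chunk IDs, best first."""
    await asyncio.sleep(0.01)  # simulated database round trip
    return ["c3", "c1", "c7"]


async def vector_leg(tenant_id: str, question: str) -> list[str]:
    """Stand-in for the pgvector ANN query; returns chunk IDs, best first."""
    await asyncio.sleep(0.01)  # simulated embed + database round trip
    return ["c1", "c4", "c3"]


async def run_both_legs(tenant_id: str, question: str) -> tuple[list[str], list[str]]:
    # gather() schedules both coroutines at once, so the total wait is roughly
    # the slower leg rather than the sum of the two.
    keyword_ids, vector_ids = await asyncio.gather(
        bm25_leg(tenant_id, question),
        vector_leg(tenant_id, question),
    )
    return keyword_ids, vector_ids
```

`asyncio.gather` returns results in the order the coroutines were passed, so the two ranked lists can be unpacked directly and handed to the fusion step.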
SLIDE 6

Prompt Engineering — Skills-Based Composition

Modular markdown skill files assembled dynamically per-tenant
backend/skills/
├── greeting.md           # G'day opener, business intro
├── booking.md            # Calendar availability + booking flow
├── lead_qualification.md # Urgency scoring, service matching
├── emergency_handling.md # Escalation protocols
├── closing.md            # Call wrap-up, confirmation
└── personality/
    ├── tradie_casual.md  # Warm, Australian, mate-oriented
    ├── professional.md   # Formal business tone
    └── custom/           # Per-tenant personality overrides
def build_system_prompt(tenant: Tenant, smb_knowledge: dict) -> str:
    sections = []

    # Core identity
    sections.append(f"You are the AI receptionist for {tenant.business_name}.")

    # Load skill modules
    for skill in tenant.enabled_skills:
        sections.append(load_skill(skill))  # Cached markdown

    # Inject business context
    sections.append(format_services(tenant.services))
    sections.append(format_hours(tenant.business_hours))

    # ── KB block — mode-routed ────────────────────────────────
    kb_block = _build_knowledge_base_block(smb_knowledge, tenant.vertical)
    if kb_block:
        sections.append(kb_block)
    # markdown mode → full text embedded in <knowledge_base> tags
    # vector mode   → advisory + "call answer_product_service_question"
    # form_only     → no block; AI answers from services list only

    # Personality layer
    sections.append(load_personality(tenant.voice_personality))

    return "\n\n".join(sections)

KB block modes

Mode | When | Prompt injection
form_only | No KB ingested | No <knowledge_base> block. AI answers from <services> only.
markdown | ≤ 4,000 tokens | Full text embedded inline. AI reads directly; no tool call needed.
vector | > 4,000 tokens | Advisory block + instruction to call answer_product_service_question.
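
The routing rule in the table can be written as a small pure function. This is a sketch under assumptions: `select_kb_mode` and `MARKDOWN_TOKEN_LIMIT` are hypothetical names, and treating a count of `None` or 0 as "no KB ingested" is a guess at how the empty case is represented.

```python
from typing import Optional

MARKDOWN_TOKEN_LIMIT = 4_000  # threshold taken from the KB block modes table


def select_kb_mode(token_count: Optional[int]) -> str:
    """Map a KB token count to a prompt-injection mode.

    None or 0 is treated as "no KB ingested" (an assumption).
    """
    if not token_count:
        return "form_only"
    if token_count <= MARKDOWN_TOKEN_LIMIT:
        return "markdown"
    return "vector"
```

The boundary is inclusive on the markdown side: exactly 4,000 tokens stays inline, and 4,001 switches to vector mode.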
SLIDE 7

Knowledge Base RAG Architecture NEW SLIDE

Hybrid retrieval pipeline powering contextual answers during live calls
Fully built, not yet in the pitch deck. This is a core differentiator: the AI answers detailed questions about any tenant's services, pricing, and business info from retrieved KB content rather than guessing. Retrieval runs during live calls in ~50–120ms.

Ingest Pipeline

1. Source Ingestion: Multi-modal

Accepts URLs, PDFs (parsed with pdfplumber, with an OCR fallback via pytesseract for scanned files), and raw text. Sources accumulate: new sources merge with existing ones, replacing only entries whose identity matches.

2. Web Scraping: 3-Layer Static + Playwright Fallback

Layer 1: trafilatura (clean article/prose extraction). Layer 2: custom HTML visible-text parser (recovers Elementor/Divi service grids that trafilatura drops as "boilerplate"). Layer 3: JSON-LD/OpenGraph/meta structured data parser. If static extraction yields <800 chars → escalates to Playwright for JS-heavy SPAs.

3. Security: SSRF Guard + Prompt Injection Sanitization

Two-layer SSRF check: string-pattern match on private ranges + DNS resolution to block rebinding attacks. All content sanitized against prompt injection before storage.

4. Token Count → Mode Routing

tiktoken cl100k_base counts tokens across all sources. ≤ 4,000 → markdown mode (full text in JSONB). > 4,000 → vector mode (chunk + embed + pgvector).

5. Chunking + Embedding (vector mode only)

Text split into 256-token chunks (tiktoken). Each chunk embedded with all-MiniLM-L6-v2 via sentence-transformers (384-dim, L2-normalised). Singleton embedder with ThreadPoolExecutor(max_workers=1) — async-safe, serialised sync encode() calls.

6. Persist to pgvector

Atomic transaction: DELETE old chunks → INSERT new chunks with vector(384) embeddings → UPDATE smb_knowledge JSONB. HNSW index (m=16, ef_construction=64, cosine ops) for fast ANN search. Generated tsvector column (GENERATED ALWAYS … STORED) + GIN index for the keyword (BM25-style) leg, ranked with ts_rank_cd (PostgreSQL's cover-density ranking rather than true BM25).

7. Service Auto-Suggestion (tradies only)

Haiku forced-tool-use extracts a clean list of bookable services from KB text. Pre-fills the onboarding services form — the tradie never has to type their services manually.

8. Cache Invalidation

After every ingest, assistant_cache.invalidate(tenant_id) ensures the next call gets the updated KB block in its system prompt.
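
The resolve-then-check half of the SSRF guard in step 3 can be sketched with the standard library alone. This is an illustration, not the production guard: `is_safe_url` and `_resolve` are hypothetical names, the resolver is injectable so the check can be exercised without network access, and the string-pattern layer is omitted.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlparse


def _resolve(host: str) -> list[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_safe_url(url: str, resolve: Callable[[str], Iterable[str]] = _resolve) -> bool:
    """Return True only if every address the host resolves to is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = list(resolve(parsed.hostname))
    except OSError:
        return False  # unresolvable host: refuse rather than guess
    if not addresses:
        return False
    for addr in addresses:
        try:
            ip = ipaddress.ip_address(addr.split("%")[0])  # drop IPv6 scope id
        except ValueError:
            return False
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified):
            return False
    return True
```

A check like this only narrows the rebinding window. To close it, the fetcher would also need to connect to the address that was validated rather than resolving the hostname a second time.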


Hybrid Search at Query Time

Caller asks: "Do you do hot water installations?"
       │
       ▼
answer_product_service_question(question=...)
       │
       ├──────────────────────┐
       ▼                      ▼
┌─────────────┐   ┌─────────────────────────┐
│  BM25 Leg   │   │       Vector Leg        │
│             │   │                         │
│  tsvector   │   │  embed question         │
│  FTS query  │   │  (all-MiniLM-L6-v2)     │
│  GIN index  │   │  384-dim cosine ANN     │
│  ts_rank_cd │   │  HNSW index (pgvector)  │
│             │   │                         │
│  top-20 IDs │   │  top-20 IDs             │
└──────┬──────┘   └───────────┬─────────────┘
       └──────────┬───────────┘
                  ▼
    Reciprocal Rank Fusion (k=60)
    score = Σ 1/(60 + rank)
                  │
                  ▼
        Top-5 chunk contents
                  │
                  ▼
    LLM synthesises grounded answer
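
The fusion step, score = Σ 1/(60 + rank), is short enough to show in full. The function below is a sketch of Reciprocal Rank Fusion over ranked ID lists; `reciprocal_rank_fusion` is a hypothetical name, the chunk IDs are made up, and breaking ties by ID is an assumption added for deterministic output.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Fuse several best-first rankings into one list of chunk IDs."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


keyword_hits = ["c3", "c1", "c7"]  # illustrative keyword-leg output
vector_hits = ["c1", "c4", "c3"]   # illustrative vector-leg output
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

With these inputs `c1` and `c3` appear in both lists and therefore outrank `c4` and `c7`, which each appear in only one.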

Why hybrid over pure vector?

Scenario | BM25 leg | Vector leg
"What's your ABN?" | Exact keyword match on the "ABN" token | May not find exact short string
"Do you fix leaky taps?" | Misses if website says "tap repairs" | Semantic match on "tap repairs" ↔ "leaky taps"
"Hot water system" | Hits exact match on page copy | Catches "HWS", "water heater", "hot water unit"
RRF fusion | Both legs contribute; chunks appearing in both rank highest. Best of both worlds.

Infrastructure & concurrency controls

Control | Mechanism | Value
Distributed lock | Redis SETNX on kb_ingest_lock:{tenant_id} | 5-min TTL
Rate limit | Redis key on first ingest only | 24h cooldown for first-ever scrape
Stale-on-failure | On re-ingest failure, marks may_be_stale | Prior KB preserved
Entitlement gate | kb_for_call(tier, trial_expires_at) | Trial window or paid tier required
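
The distributed lock row can be illustrated with the set-if-absent-with-expiry pattern. The sketch below assumes a client whose `set(name, value, nx=..., ex=...)` call behaves like redis-py's; `InMemoryRedis` is a tiny hypothetical stand-in so the example runs without a server, and `try_acquire_ingest_lock` is an invented helper name.

```python
import time
import uuid
from typing import Optional

LOCK_TTL_S = 300  # 5-minute TTL from the controls table


class InMemoryRedis:
    """Tiny stand-in that mimics SET with NX and EX; not a real Redis client."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[str, float]] = {}

    def set(self, name: str, value: str, nx: bool = False,
            ex: Optional[int] = None) -> Optional[bool]:
        now = time.monotonic()
        current = self._data.get(name)
        if current is not None and current[1] <= now:
            current = None  # an expired entry behaves as absent
        if nx and current is not None:
            return None  # mirrors redis-py: None when NX blocks the write
        expires = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expires)
        return True


def try_acquire_ingest_lock(client, tenant_id: str) -> Optional[str]:
    """Return a lock token if this caller won the lock, else None."""
    token = uuid.uuid4().hex  # unique owner token, useful for safe release
    acquired = client.set(f"kb_ingest_lock:{tenant_id}", token,
                          nx=True, ex=LOCK_TTL_S)
    return token if acquired else None
```

The TTL means a crashed ingest cannot hold the lock forever. Releasing early would need a compare-and-delete on the token so that one worker never removes another worker's lock; that part is left out here.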
SLIDE 8

Post-Call Intelligence Pipeline

Every call generates structured intelligence
Call Ends (VAPI webhook: call-ended)
│
├─▶ Store raw transcript + recording URL
│
├─▶ AI Summary Generation (Claude)
│   └─ 2-3 sentence summary of call purpose & outcome
│
├─▶ Sentiment Analysis
│   └─ positive / neutral / negative + confidence score
│
├─▶ Outcome Classification
│   └─ booking_created | lead_qualified | information_provided |
│      callback_requested | wrong_number | no_answer
│
├─▶ Usage Metering
│   └─ Stripe usage record (increment call counter)
│   └─ Overage detection + metered billing
│
└─▶ Langfuse Trace
    └─ Full call trace for quality evaluation
    └─ Latency metrics per turn
    └─ Tool call success/failure rates (incl. KB retrieval hits)
SLIDE 9

Evaluation Framework — Continuous Quality Assurance

Automated eval framework measuring AI conversation quality
Dimension | What It Measures | Method
Greeting Quality | Did it use tenant's business name? Natural opener? | LLM-as-judge
Information Capture | Did it collect name, phone, service, address? | Field extraction check
Booking Accuracy | Did it check real availability? Book correct slot? | Calendar API verification
Lead Qualification | Correct urgency scoring? Appropriate follow-up? | Rubric-based scoring
Emergency Handling | Did it escalate correctly? Appropriate urgency? | Scenario-based testing
Conversation Flow | Natural transitions? No hallucinations? | LLM-as-judge + human review
Australian English | Natural Aussie phrasing? No Americanisms? | Linguistic pattern matching
KB Retrieval Accuracy (NEW) | Did RAG return relevant chunks? Did AI answer correctly? | Ground-truth QA set per tenant
KB Grounding (NEW) | Did AI avoid hallucinating info not in KB? | Faithfulness check (LLM-as-judge)
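
Of the methods in this table, the field extraction check is the simplest to make concrete. The sketch below is one possible shape for it: `information_capture_score` and `REQUIRED_FIELDS` are hypothetical names, and scoring as a plain fraction of captured fields is an assumption.

```python
REQUIRED_FIELDS = ("name", "phone", "service", "address")


def information_capture_score(extracted: dict) -> tuple[float, list[str]]:
    """Return the fraction of required fields captured and the missing ones."""
    missing = [
        field for field in REQUIRED_FIELDS
        # A field counts as missing if it is absent, None, or only whitespace.
        if not str(extracted.get(field) or "").strip()
    ]
    score = (len(REQUIRED_FIELDS) - len(missing)) / len(REQUIRED_FIELDS)
    return score, missing


# Illustrative extraction result for one call (made-up values).
sample = {"name": "Sam", "phone": "0400 000 000", "service": "tap repair", "address": ""}
score, missing = information_capture_score(sample)
```

Returning the missing field names alongside the score makes failed evals easier to triage than a bare number.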

Eval infrastructure

SLIDE 10

Tech Stack — Production-Grade Choices

Every layer chosen for async-native, type-safe, production reliability

Backend (Python Async)

Layer | Choice | Rationale
Framework | FastAPI | Async-native, Pydantic validation, OpenAPI docs
Runtime | Python 3.11+ / Uvicorn | Async/await throughout, zero blocking I/O
ORM | SQLAlchemy 2.0 (async) | Async session management, type-safe queries
DB Driver | asyncpg | Native PostgreSQL async driver, connection pooling
Migrations | Alembic | Version-controlled schema evolution
Cache | Redis (aioredis) | Tenant config, calendar availability, KB ingest locks
Encryption | Fernet (cryptography) | OAuth token encryption at rest
Observability | Sentry + Langfuse | Error tracking + AI conversation tracing
Vector DB | pgvector (PostgreSQL) | HNSW index, cosine similarity, no separate vector service needed
Embeddings | sentence-transformers all-MiniLM-L6-v2 | 384-dim, fast, runs in-process; no external API call or cost per embed
Tokenizer | tiktoken cl100k_base | Accurate token counting for mode routing & chunking
Web scraping | trafilatura + httpx + Playwright (fallback) | Static-first (cheap); Playwright only for JS-heavy SPAs
PDF extraction | pdfplumber + OCR fallback | Structured PDFs direct; scanned PDFs via pytesseract

Frontend (React SPA)

Layer | Choice | Rationale
Framework | React 19 + TypeScript | Type-safe, hooks-first architecture
Build | Vite 8 | Sub-second HMR, optimised production builds
State | TanStack Query | Server-state caching, auto-refetch, stale-while-revalidate
Styling | Tailwind CSS 4 | Utility-first, zero runtime CSS overhead
Animation | Framer Motion | GPU-accelerated, declarative transitions
Auth | Supabase JS SDK | JWT session management, auto-refresh

External Services

Service | Role | Integration
VAPI | Voice AI orchestration | Webhook-driven, assistant-request pattern
Claude Sonnet 4 | Conversation intelligence | Via VAPI (primary) + direct API (post-call)
Claude Haiku | KB service extraction | Forced tool-use on ingested content
Twilio | Telephony infrastructure | Phone number provisioning, SMS delivery
Google Calendar | Appointment management | OAuth 2.0, FreeBusy + Events API
Stripe | Billing & metering | Checkout, webhooks, metered subscriptions
Supabase | Auth + PostgreSQL + RLS | Managed Postgres + pgvector extension
Langfuse | AI observability | Trace collection, eval scoring, latency analysis

Infrastructure

Component | Platform | Config
Backend | Railway.app | Auto-deploy from git, horizontal scaling
Frontend | Vercel | Edge-deployed SPA, CDN-cached
Database + pgvector | Supabase (managed PG) | pgvector extension enabled; connection pooling via PgBouncer
Cache / Ingest Locks | Redis | Tenant config TTL, calendar cache, KB distributed locks
DNS | GoDaddy + Vercel | Split: www → Astro marketing, app → React portal
SLIDE 11

Security & Compliance Model

Data protection, Australian compliance, and API hardening

Data Protection

Australian Compliance

API Security

SLIDE 12

Scaling Characteristics

Current headroom and known bottlenecks

Current architecture supports

Scaling bottlenecks & mitigations

Bottleneck | Current | At Scale
Tenant resolution | Redis single-instance | Redis Cluster
Calendar API | Per-tenant rate limits | Request batching + aggressive caching
Stripe webhooks | Single endpoint | Queue-based processing (SQS/Bull)
Database writes | Single PG instance | Read replicas + write sharding by tenant
Langfuse traces | Self-hosted | Managed Langfuse Cloud
Embedding inference (NEW) | In-process sentence-transformers, 1 worker thread | Dedicated embedding service (GPU) or OpenAI batch API for ingest
pgvector HNSW build (NEW) | Inline on ingest, <1s for typical KB | Async background reindex for very large KBs (>100K chunks)
KB web scraping (NEW) | Sequential URL fetch per ingest | Parallel scrape workers + Playwright pool

Cost at scale (per 1,000 tenants)

Component | Monthly Cost (est.)
VAPI (voice AI) | $3,000–$8,000
Twilio (phone numbers + SMS) | $1,500–$3,000
Railway (backend compute) | $200–$500
Supabase (database + pgvector) | $75–$200
Redis | $50–$100
Embeddings (in-process, no API cost) | $0 additional (CPU on Railway instance)
Total COGS | ~$5K–$12K
Revenue (1K tenants @ $120 ARPU) | $120,000
Gross margin | ~90–95%
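
The totals in the cost table can be reproduced from its line items. The short calculation below uses only the table's own estimates; the variable names are illustrative.

```python
# (low, high) monthly estimates in dollars, copied from the cost table
COSTS = {
    "VAPI": (3_000, 8_000),
    "Twilio": (1_500, 3_000),
    "Railway": (200, 500),
    "Supabase": (75, 200),
    "Redis": (50, 100),
}
REVENUE = 1_000 * 120  # 1K tenants at $120 ARPU

cogs_low = sum(low for low, _ in COSTS.values())     # 4,825
cogs_high = sum(high for _, high in COSTS.values())  # 11,800

# Best case uses the low cost estimate, worst case the high one.
margin_best = (REVENUE - cogs_low) / REVENUE
margin_worst = (REVENUE - cogs_high) / REVENUE
```

The line items sum to $4,825–$11,800, which rounds to the stated ~$5K–$12K, and the resulting margin falls between roughly 90% and 96%.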
SLIDE 13

What's Next — Technical Roadmap

Near, medium, and long-term engineering priorities

Near-Term (Q2–Q3 2026)

Medium-Term (Q4 2026–Q1 2027)

Long-Term Vision