NEXA 247 — Technical Deep Dive

SLIDE 1

System Overview — What We've Built

Platform scope and core value proposition

Nexa 247 is a multi-tenant voice AI orchestration platform that dynamically provisions and manages AI receptionists for trade businesses. Each inbound call triggers a real-time pipeline: telephony ingestion, tenant resolution, dynamic prompt assembly, tool-augmented conversation, and post-call intelligence extraction, with the in-call stages held to sub-second latency.

This is not a chatbot with a phone number. This is a stateful, tool-calling voice agent that books appointments, qualifies leads, and escalates emergencies in real time, mid-conversation.

Real-Time Voice AI

STT → LLM → TTS in <300ms
Streaming turn-by-turn pipeline
Deepgram + Claude Sonnet 4 + ElevenLabs

Tool-Augmented Actions

Live calendar booking mid-call
Lead capture & urgency scoring
Emergency escalation via SMS
Knowledge base retrieval (RAG)

Multi-Tenant at Core

N tenants, one VAPI account
Per-tenant prompt, tools, voice
PostgreSQL RLS + Redis isolation
Metered Stripe billing per tenant

Intelligent Knowledge

Web scrape → embed → pgvector
Hybrid BM25 + semantic search
PDF, URL, raw text sources
Auto-service extraction (Claude Haiku)

SLIDE 2

Architecture — The Call Lifecycle

End-to-end call flow from PSTN to post-call intelligence

INBOUND CALL (PSTN)
       │
       ▼
┌─────────────┐     GSM Conditional Forwarding
│   Twilio    │◄─── (*61* no-answer, *67* busy, *62* unreachable)
│   Gateway   │     Tenant keeps their existing number
└──────┬──────┘
       │
       ▼
┌─────────────┐     Real-time voice pipeline
│    VAPI     │     STT (Deepgram) → LLM (Claude Sonnet 4) → TTS (ElevenLabs)
│  Voice AI   │     Streaming turn-by-turn, <300ms end-to-end latency
└──────┬──────┘
       │ assistant-request / function-call webhooks
       ▼
┌──────────────────────────────────────────────────────┐
│            ORCHESTRATION LAYER (FastAPI)             │
│                                                      │
│  ┌──────────────────┐    ┌────────────────────────┐  │
│  │ Tenant Resolver  │    │ Prompt Assembler       │  │
│  │ phone_number_id  │───▶│ skills + KB block +    │  │
│  │ → tenant context │    │ config → system prompt │  │
│  └──────────────────┘    └────────────────────────┘  │
│                                                      │
│  ┌──────────────────┐    ┌────────────────────────┐  │
│  │ Tool Router      │    │ Post-Call Pipeline     │  │
│  │ • calendar       │    │ transcript → summary   │  │
│  │ • sms            │    │ → sentiment → lead     │  │
│  │ • lead_capture   │    │ → usage billing        │  │
│  │ • kb_retrieval ✦ │    │                        │  │
│  └──────────────────┘    └────────────────────────┘  │
│                                                      │
└──────────────────────┬───────────────────────────────┘
                       │
       ┌───────────────┼──────────────┬────────────────┐
       ▼               ▼              ▼                ▼
┌────────────┐ ┌────────────┐ ┌──────────┐ ┌──────────────┐
│ PostgreSQL │ │   Redis    │ │  Stripe  │ │  pgvector ✦  │
│ (Supabase) │ │  (Cache)   │ │ (Billing)│ │  KB chunks   │
│ tenants,   │ │ tenant cfg │ │ metered  │ │  HNSW index  │
│ calls,     │ │ ingest lock│ │ usage    │ │  384-dim     │
│ leads      │ │ sessions   │ │ tracking │ │  embeddings  │
└────────────┘ └────────────┘ └──────────┘ └──────────────┘

SLIDE 3

The Critical Path — assistant-request (<1s)

The make-or-break webhook that bootstraps every call

Key invariant: This must return in <1 second or the caller hears silence. No DB hit on the hot path — everything is Redis-cached.

# Hot path: phone_number_id → full assistant config
async def handle_assistant_request(payload):
    phone_id = payload["phone_number_id"]

    # 1. Tenant resolution: Redis-cached (<0.5ms lookup, >99% hit rate)
    tenant = await cache.get_tenant_by_phone(phone_id)

    # 2. Dynamic prompt assembly — skills-based + KB block
    system_prompt = build_prompt(
        base_skills=["greeting", "booking", "lead_qualification"],
        tenant_config=tenant.ai_config,
        services=tenant.services,
        business_hours=tenant.hours,
        personality=tenant.voice_personality,
        smb_knowledge=tenant.smb_knowledge,  # KB mode routing
    )

    # 3. Tool injection — only tools this tenant has enabled
    tools = resolve_tools(tenant)  # calendar, sms, lead_capture, kb_retrieval

    # 4. Return complete assistant config
    return AssistantConfig(
        model="claude-sonnet-4-20250514",
        voice={"provider": "11labs", "voice_id": "australian_male"},
        system_prompt=system_prompt,
        tools=tools,
        metadata={"tenant_id": tenant.id}
    )

Key design decisions

No DB hit on the hot path — tenant config cached in Redis with TTL invalidation
Skills-based prompt composition — markdown skill files loaded at boot, composed per-tenant
KB mode routing — markdown KB embedded inline; vector KB injects advisory + retrieval tool
Tool injection — tenants only get tools they've configured
Stateless webhook handler — any backend instance can serve any call
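
The "no DB hit on the hot path" decision is a cache-aside lookup. The sketch below is a minimal illustration, not the production code: a plain dict stands in for Redis, and `TenantCache`, `load_from_db`, and the 300-second TTL are hypothetical names and values chosen for the example. It assumes a miss or an expired entry falls back to a loader, which the deck implies (hit rate >99%) but does not spell out.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class TenantCache:
    """Cache-aside lookup keyed by phone_number_id (a dict stands in for Redis)."""

    def __init__(self, loader: Callable[[str], Awaitable[Optional[dict]]],
                 ttl_seconds: float = 300.0):
        self._loader = loader      # hypothetical DB fallback, used only on a miss
        self._ttl = ttl_seconds    # assumed TTL, not taken from the deck
        self._store: dict[str, tuple[float, dict]] = {}

    async def get_tenant_by_phone(self, phone_id: str) -> Optional[dict]:
        entry = self._store.get(phone_id)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]        # hot path: no database round trip
        tenant = await self._loader(phone_id)  # cold path: miss or expired entry
        if tenant is not None:
            self._store[phone_id] = (time.monotonic() + self._ttl, tenant)
        return tenant

    def invalidate(self, phone_id: str) -> None:
        """Drop the cached entry so the next call reloads fresh config."""
        self._store.pop(phone_id, None)


async def demo() -> int:
    db_calls = 0

    async def load_from_db(phone_id: str) -> Optional[dict]:
        nonlocal db_calls
        db_calls += 1
        return {"id": "tenant-1", "phone_number_id": phone_id}

    cache = TenantCache(load_from_db)
    await cache.get_tenant_by_phone("pn_123")  # miss: loads from the DB
    await cache.get_tenant_by_phone("pn_123")  # hit: served from the cache
    cache.invalidate("pn_123")
    await cache.get_tenant_by_phone("pn_123")  # miss again after invalidation
    return db_calls


print(asyncio.run(demo()))
```

Three lookups cost two loader calls here: only the first lookup and the one after invalidation reach the loader.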

SLIDE 4

Multi-Tenant Isolation at the Voice Layer

N:1 architecture — one VAPI account, thousands of tenants

┌──────────────────────────────────────────┐
│       VAPI Platform (Single Acct)        │
│                                          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐  │
│  │ Phone #1 │ │ Phone #2 │ │ Phone #N │  │
│  │ (Twilio) │ │ (Twilio) │ │ (Twilio) │  │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘  │
│       └────────────┼────────────┘        │
│        assistant-request webhook         │
└────────────────────┼─────────────────────┘
                     ▼
         ┌───────────────────────┐
         │     Nexa Backend      │
         │                       │
         │ phone_number          │
         │  → tenant_id          │
         │  → unique prompt      │
         │  → unique KB context ✦│
         │  → unique tools       │
         │  → unique voice       │
         └───────────────────────┘

Isolation guarantees

Layer	Mechanism
Data isolation	Row-level security in PostgreSQL (`tenant_id` on every table)
Prompt isolation	System prompts assembled per-tenant, never shared
KB isolation	`smb_knowledge_chunks` filtered by `tenant_id` on every query
Tool isolation	OAuth credentials encrypted per-tenant (Fernet symmetric encryption)
Billing isolation	Stripe customer/subscription per-tenant, metered independently
Call isolation	Each call tagged with `tenant_id` in metadata, traced in Langfuse
Ingest isolation	Redis distributed lock per-tenant prevents concurrent KB ingests

SLIDE 5

Tool-Augmented Conversation — Mid-Call Actions

The AI doesn't just talk — it acts

check_availability

Trigger: caller asks about times
Google Calendar FreeBusy API
~800ms uncached; results cached in Redis for 5 min
Filters by business hours + rules

create_booking

Trigger: caller confirms a slot
Calendar event + SMS to owner
Stores booking in PostgreSQL
Extracts name/phone/service/address

capture_lead

Trigger: interested but not booking
Hot / Warm / Cold urgency scoring
Lead creation + notification
Emergency escalation path

answer_product_service_question NEW

Trigger: caller asks about services/pricing
Hybrid BM25 + vector search on KB chunks
Returns top-5 relevant passages
Only injected when tenant has vector-mode KB
Grounded by design: retrieves first, then answers from the retrieved passages

// answer_product_service_question — RAG retrieval mid-call
{
  "trigger": "Caller asks about pricing, services, or specific products",
  "flow": [
    "1. LLM detects information question (not a booking action)",
    "2. VAPI sends function-call webhook to Nexa backend",
    "3. Backend runs hybrid_search(tenant_id, question, top_k=5)",
    "   ├─ BM25 leg: PostgreSQL tsvector FTS (GIN index)",
    "   ├─ Vector leg: pgvector cosine ANN (HNSW index)",
    "   └─ Fuse both legs with Reciprocal Rank Fusion (k=60)",
    "4. Return top-5 chunk contents to VAPI",
    "5. LLM synthesises answer from retrieved context",
    "6. AI responds to caller with accurate, grounded information"
  ],
  "latency": "~50–120ms (both legs run concurrently via asyncio.gather)",
  "fallback": "If KB empty or inactive, AI answers from form-only profile"
}
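
For check_availability, the Google Calendar FreeBusy API returns busy intervals rather than free ones, so the business-hours filter amounts to subtracting busy time from the open window. The sketch below shows that subtraction under simple assumptions: fixed one-hour slots, a single calendar, and naive datetimes. The function name `free_slots` and the slot length are illustrative, not taken from the codebase.

```python
from datetime import datetime, timedelta


def free_slots(busy, day_start, day_end, slot=timedelta(hours=1)):
    """Return slot start times inside business hours that overlap no busy interval."""
    slots = []
    cursor = day_start
    while cursor + slot <= day_end:
        end = cursor + slot
        # Two intervals overlap when each one starts before the other ends.
        if not any(b_start < end and cursor < b_end for b_start, b_end in busy):
            slots.append(cursor)
        cursor = end
    return slots


day = datetime(2026, 3, 2)
busy = [(day.replace(hour=10), day.replace(hour=11, minute=30))]  # one existing job
open_slots = free_slots(busy, day.replace(hour=9), day.replace(hour=13))
print([s.strftime("%H:%M") for s in open_slots])  # ['09:00', '12:00']
```

The 10:00 and 11:00 slots drop out because the existing job runs until 11:30. Tenant-specific rules such as travel buffers would be extra filters on the returned list.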

SLIDE 6

Prompt Engineering — Skills-Based Composition

Modular markdown skill files assembled dynamically per-tenant

backend/skills/
├── greeting.md           # G'day opener, business intro
├── booking.md            # Calendar availability + booking flow
├── lead_qualification.md # Urgency scoring, service matching
├── emergency_handling.md # Escalation protocols
├── closing.md            # Call wrap-up, confirmation
└── personality/
    ├── tradie_casual.md  # Warm, Australian, mate-oriented
    ├── professional.md   # Formal business tone
    └── custom/           # Per-tenant personality overrides

def build_system_prompt(tenant: Tenant, smb_knowledge: dict) -> str:
    sections = []

    # Core identity
    sections.append(f"You are the AI receptionist for {tenant.business_name}.")

    # Load skill modules
    for skill in tenant.enabled_skills:
        sections.append(load_skill(skill))  # Cached markdown

    # Inject business context
    sections.append(format_services(tenant.services))
    sections.append(format_hours(tenant.business_hours))

    # ── KB block — mode-routed ────────────────────────────────
    kb_block = _build_knowledge_base_block(smb_knowledge, tenant.vertical)
    if kb_block:
        sections.append(kb_block)
    # markdown mode → full text embedded in <knowledge_base> tags
    # vector mode   → advisory + "call answer_product_service_question"
    # form_only     → no block; AI answers from services list only

    # Personality layer
    sections.append(load_personality(tenant.voice_personality))

    return "\n\n".join(sections)

KB block modes

Mode	When	Prompt injection
form_only	No KB ingested	No `<knowledge_base>` block. AI answers from `<services>` only.
markdown	≤ 4,000 tokens	Full text embedded inline. AI reads directly — no tool call needed.
vector	> 4,000 tokens	Advisory block + instruction to call `answer_product_service_question`.
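
The routing rule in the table, plus the 256-token chunk size the ingest pipeline uses for vector mode, can be sketched as below. This is an illustration under stated assumptions: whitespace splitting stands in for tiktoken `cl100k_base`, chunks do not overlap, and a token count of zero is treated as "no KB ingested". The function names are hypothetical.

```python
MARKDOWN_TOKEN_LIMIT = 4_000  # threshold from the table above
CHUNK_TOKENS = 256            # chunk size used in vector mode


def route_kb_mode(token_count: int) -> str:
    """Pick the KB mode from the total token count across all sources."""
    if token_count == 0:
        return "form_only"    # assumption: zero tokens means nothing was ingested
    return "markdown" if token_count <= MARKDOWN_TOKEN_LIMIT else "vector"


def chunk_tokens(tokens: list, size: int = CHUNK_TOKENS) -> list:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Whitespace tokens stand in for real tokenizer output in this example.
tokens = ("hot water system repairs " * 1500).split()  # 6,000 stand-in tokens
mode = route_kb_mode(len(tokens))
chunks = chunk_tokens(tokens) if mode == "vector" else []
print(mode, len(chunks), len(chunks[-1]))  # vector 24 112
```

A 6,000-token KB exceeds the threshold, so it is split into 23 full chunks and one final chunk of 112 tokens.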

SLIDE 7

Knowledge Base RAG Architecture NEW SLIDE

Hybrid retrieval pipeline powering contextual answers during live calls

Ingest Pipeline

1. Source Ingestion: Multi-modal

Accepts URLs, PDFs (pdfplumber, with a pytesseract OCR fallback for scanned files), and raw text. Sources accumulate: new sources merge with existing ones, replacing only entries with a matching identity.

2. Web Scraping: 3-Layer Static + Playwright Fallback

Layer 1: trafilatura (clean article/prose extraction). Layer 2: custom HTML visible-text parser (recovers Elementor/Divi service grids that trafilatura drops as "boilerplate"). Layer 3: JSON-LD/OpenGraph/meta structured data parser. If static extraction yields <800 chars → escalates to Playwright for JS-heavy SPAs.

3. Security: SSRF Guard + Prompt Injection Sanitization

Two-layer SSRF check: a string-pattern match on private ranges, plus DNS resolution so that hostnames resolving to private addresses (including rebinding attempts) are rejected. All content is sanitized against prompt injection before storage.

4. Token Count → Mode Routing

tiktoken cl100k_base counts tokens across all sources. ≤ 4,000 → markdown mode (full text in JSONB). > 4,000 → vector mode (chunk + embed + pgvector).

5. Chunking + Embedding (vector mode only)

Text split into 256-token chunks (tiktoken). Each chunk embedded with all-MiniLM-L6-v2 via sentence-transformers (384-dim, L2-normalised). Singleton embedder with ThreadPoolExecutor(max_workers=1) — async-safe, serialised sync encode() calls.

6. Persist to pgvector

Atomic transaction: DELETE old chunks → INSERT new chunks with vector(384) embeddings → UPDATE smb_knowledge JSONB. HNSW index (m=16, ef_construction=64, cosine ops) for fast ANN search. Generated tsvector column (GENERATED ALWAYS … STORED) + GIN index for the keyword leg (called BM25 in this deck; ts_rank_cd is PostgreSQL's cover-density ranking, not a true BM25 score).

7. Service Auto-Suggestion (tradies only)

Haiku forced-tool-use extracts a clean list of bookable services from KB text. Pre-fills the onboarding services form — the tradie never has to type their services manually.

8. Cache Invalidation

After every ingest, assistant_cache.invalidate(tenant_id) ensures the next call gets the updated KB block in its system prompt.
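
The two-layer SSRF guard from the security step can be sketched with the standard library alone. This is a minimal illustration, not the production guard: `is_url_allowed` and the injectable `resolver` parameter are hypothetical names, and the literal check uses `ipaddress` instead of string patterns.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_ip(ip: str) -> bool:
    """True for loopback, private, link-local and other non-public addresses."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return True  # unparseable address: fail closed
    return (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_multicast or addr.is_reserved or addr.is_unspecified
    )


def resolve_host(host: str) -> list[str]:
    """Default resolver: every address the hostname currently maps to."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def is_url_allowed(url: str, resolver=resolve_host) -> bool:
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme not in ("http", "https") or not host:
        return False

    # Layer 1: cheap literal checks, no network involved.
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ipaddress.ip_address(host)
        return not is_blocked_ip(host)  # host is an IP literal
    except ValueError:
        pass                            # a hostname: continue to layer 2

    # Layer 2: resolve the name and reject if any address is non-public.
    try:
        addresses = resolver(host)
    except OSError:
        return False
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)


print(is_url_allowed("http://169.254.169.254/latest/meta-data"))  # False
print(is_url_allowed("https://intranet.example", resolver=lambda h: ["10.0.0.5"]))  # False
```

A resolve-then-check guard narrows the rebinding window but does not close it on its own: the fetch should connect to the address that was vetted, otherwise a second DNS lookup can return a different answer.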

Hybrid Search at Query Time

Caller asks: "Do you do hot water installations?"
       │
       ▼
answer_product_service_question(question=...)
       │
       ├───────────────────────┐
       ▼                       ▼
┌─────────────┐    ┌─────────────────────────┐
│  BM25 Leg   │    │       Vector Leg        │
│             │    │                         │
│  tsvector   │    │  embed question         │
│  FTS query  │    │  (all-MiniLM-L6-v2)     │
│  GIN index  │    │  384-dim cosine ANN     │
│  ts_rank_cd │    │  HNSW index (pgvector)  │
│             │    │                         │
│  top-20 IDs │    │  top-20 IDs             │
└──────┬──────┘    └───────────┬─────────────┘
       └───────────┬───────────┘
                   ▼
     Reciprocal Rank Fusion (k=60)
        score = Σ 1/(60 + rank)
                   │
                   ▼
         Top-5 chunk contents
                   │
                   ▼
    LLM synthesises grounded answer
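
The fusion step is small enough to show in full. The sketch below implements score = Σ 1/(k + rank) over the ranked ID lists from both legs, assuming ranks start at 1 and ties are broken by chunk ID; the function name and the sample IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked ID lists; each list contributes 1/(k + rank) per ID."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # assumed 1-based ranks
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; chunk ID breaks ties deterministically.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


bm25_ids = ["c7", "c2", "c9"]    # keyword leg, best match first
vector_ids = ["c2", "c4", "c7"]  # vector leg, best match first
fused = reciprocal_rank_fusion([bm25_ids, vector_ids])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

`c2` and `c7` appear in both lists and so outrank the chunks found by a single leg. Because RRF uses ranks only, the raw ts_rank_cd and cosine scores never need to be put on a common scale.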

Why hybrid over pure vector?

Scenario	BM25 leg	Vector leg
"What's your ABN?"	Exact keyword match — "ABN" token	May not find exact short string
"Do you fix leaky taps?"	Misses if website says "tap repairs"	Semantic match on "tap repairs" ↔ "leaky taps"
"Hot water system"	Hits exact match on page copy	Catches "HWS", "water heater", "hot water unit"
RRF fusion	Both legs contribute — chunks appearing in both rank highest. Best of both worlds.

Infrastructure & concurrency controls

Control	Mechanism	Value
Distributed lock	Redis SETNX on `kb_ingest_lock:{tenant_id}`	5-min TTL
Rate limit	Redis key on first ingest only	24h cooldown for first-ever scrape
Stale-on-failure	On re-ingest failure, marks `may_be_stale`	Prior KB preserved
Entitlement gate	`kb_for_call(tier, trial_expires_at)`	Trial window or paid tier required
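
The per-tenant ingest lock follows the usual "set if absent, with expiry" pattern. In the sketch below an in-memory class stands in for Redis so the example runs anywhere; `FakeRedis`, `acquire_ingest_lock`, and `release_ingest_lock` are hypothetical names. Against a real server the acquire step corresponds to `SET key token NX EX 300`, and the release would be a compare-and-delete (typically a small Lua script) so that one worker cannot remove another worker's lock.

```python
import time
import uuid


class FakeRedis:
    """In-memory stand-in for the two Redis operations the lock needs."""

    def __init__(self):
        self._data = {}

    def set_nx_ex(self, key, value, ttl):
        """Store value only if the key is absent or expired (like SET NX EX)."""
        now = time.monotonic()
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False
        self._data[key] = (value, now + ttl)
        return True

    def delete_if_equal(self, key, value):
        """Delete the key only when it still holds the caller's token."""
        current = self._data.get(key)
        if current is not None and current[0] == value:
            del self._data[key]
            return True
        return False


def acquire_ingest_lock(redis, tenant_id, ttl=300):
    """Return a lock token, or None if another ingest holds the lock."""
    token = uuid.uuid4().hex
    acquired = redis.set_nx_ex(f"kb_ingest_lock:{tenant_id}", token, ttl)
    return token if acquired else None


def release_ingest_lock(redis, tenant_id, token):
    return redis.delete_if_equal(f"kb_ingest_lock:{tenant_id}", token)


r = FakeRedis()
first = acquire_ingest_lock(r, "tenant-1")
second = acquire_ingest_lock(r, "tenant-1")  # concurrent ingest is refused
print(first is not None, second is None)     # True True
```

The 5-minute TTL is the safety net: if a worker dies mid-ingest, the lock expires by itself instead of blocking that tenant indefinitely.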

SLIDE 8

Post-Call Intelligence Pipeline

Every call generates structured intelligence

Call Ends (VAPI webhook: call-ended)
  │
  ├─▶ Store raw transcript + recording URL
  │
  ├─▶ AI Summary Generation (Claude)
  │     └─ 2-3 sentence summary of call purpose & outcome
  │
  ├─▶ Sentiment Analysis
  │     └─ positive / neutral / negative + confidence score
  │
  ├─▶ Outcome Classification
  │     └─ booking_created | lead_qualified | information_provided |
  │        callback_requested | wrong_number | no_answer
  │
  ├─▶ Usage Metering
  │     ├─ Stripe usage record (increment call counter)
  │     └─ Overage detection + metered billing
  │
  └─▶ Langfuse Trace
        ├─ Full call trace for quality evaluation
        ├─ Latency metrics per turn
        └─ Tool call success/failure rates (incl. KB retrieval hits)

SLIDE 9

Evaluation Framework — Continuous Quality Assurance

Automated eval framework measuring AI conversation quality

Dimension	What It Measures	Method
Greeting Quality	Did it use tenant's business name? Natural opener?	LLM-as-judge
Information Capture	Did it collect name, phone, service, address?	Field extraction check
Booking Accuracy	Did it check real availability? Book correct slot?	Calendar API verification
Lead Qualification	Correct urgency scoring? Appropriate follow-up?	Rubric-based scoring
Emergency Handling	Did it escalate correctly? Appropriate urgency?	Scenario-based testing
Conversation Flow	Natural transitions? No hallucinations?	LLM-as-judge + human review
Australian English	Natural Aussie phrasing? No Americanisms?	Linguistic pattern matching
KB Retrieval Accuracy NEW	Did RAG return relevant chunks? Did AI answer correctly?	Ground-truth QA set per tenant
KB Grounding NEW	Did AI avoid hallucinating info not in KB?	Faithfulness check (LLM-as-judge)

Eval infrastructure

Langfuse integration — trace-level evaluation including KB tool call outcomes
Automated test scenarios — synthetic callers with scripted personas
Regression suite — every prompt change triggers eval re-run
Per-tenant quality scores — identify underperforming configurations
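
Of the dimensions above, Information Capture is the simplest to automate because it is a plain field check. A minimal sketch, assuming the extracted call data arrives as a dict keyed by the four field names from the table; the scoring rule (fraction of fields present) and the function name are illustrative choices, not the framework's actual rubric.

```python
REQUIRED_FIELDS = ("name", "phone", "service", "address")


def information_capture_score(extracted: dict) -> tuple[float, list[str]]:
    """Return the fraction of required fields captured and the missing ones."""
    missing = [
        field for field in REQUIRED_FIELDS
        if not str(extracted.get(field) or "").strip()  # None and blanks count as missing
    ]
    score = 1 - len(missing) / len(REQUIRED_FIELDS)
    return score, missing


score, missing = information_capture_score(
    {"name": "Sam", "phone": "0400 000 000", "service": "Tap repair", "address": ""}
)
print(score, missing)  # 0.75 ['address']
```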

SLIDE 10

Tech Stack — Production-Grade Choices

Every layer chosen for async-native, type-safe, production reliability

Backend (Python Async)

Layer	Choice	Rationale
Framework	FastAPI	Async-native, Pydantic validation, OpenAPI docs
Runtime	Python 3.11+ / Uvicorn	Async/await throughout, zero blocking I/O
ORM	SQLAlchemy 2.0 (async)	Async session management, type-safe queries
DB Driver	asyncpg	Native PostgreSQL async driver, connection pooling
Migrations	Alembic	Version-controlled schema evolution
Cache	Redis (aioredis)	Tenant config, calendar availability, KB ingest locks
Encryption	Fernet (cryptography)	OAuth token encryption at rest
Observability	Sentry + Langfuse	Error tracking + AI conversation tracing
Vector DB	pgvector (PostgreSQL)	HNSW index, cosine similarity, no separate vector service needed
Embeddings	sentence-transformers `all-MiniLM-L6-v2`	384-dim, fast, runs in-process; no external API call or cost per embed
Tokenizer	tiktoken `cl100k_base`	Accurate token counting for mode routing & chunking
Web scraping	trafilatura + httpx + Playwright (fallback)	Static-first (cheap); Playwright only for JS-heavy SPAs
PDF extraction	pdfplumber + OCR fallback	Structured PDFs direct; scanned PDFs via pytesseract

Frontend (React SPA)

Layer	Choice	Rationale
Framework	React 19 + TypeScript	Type-safe, hooks-first architecture
Build	Vite 8	Sub-second HMR, optimised production builds
State	TanStack Query	Server-state caching, auto-refetch, stale-while-revalidate
Styling	Tailwind CSS 4	Utility-first, zero runtime CSS overhead
Animation	Framer Motion	GPU-accelerated, declarative transitions
Auth	Supabase JS SDK	JWT session management, auto-refresh

External Services

Service	Role	Integration
VAPI	Voice AI orchestration	Webhook-driven, assistant-request pattern
Claude Sonnet 4	Conversation intelligence	Via VAPI (primary) + direct API (post-call)
Claude Haiku	KB service extraction	Forced tool-use on ingested content
Twilio	Telephony infrastructure	Phone number provisioning, SMS delivery
Google Calendar	Appointment management	OAuth 2.0, FreeBusy + Events API
Stripe	Billing & metering	Checkout, webhooks, metered subscriptions
Supabase	Auth + PostgreSQL + RLS	Managed Postgres + pgvector extension
Langfuse	AI observability	Trace collection, eval scoring, latency analysis

Infrastructure

Component	Platform	Config
Backend	Railway.app	Auto-deploy from git, horizontal scaling
Frontend	Vercel	Edge-deployed SPA, CDN-cached
Database + pgvector	Supabase (managed PG)	pgvector extension enabled; connection pooling via PgBouncer
Cache / Ingest Locks	Redis	Tenant config TTL, calendar cache, KB distributed locks
DNS	GoDaddy + Vercel	Split: www → Astro marketing, app → React portal

SLIDE 11

Security & Compliance Model

Data protection, Australian compliance, and API hardening

Data Protection

OAuth credential encryption: Fernet symmetric encryption for all stored tokens
Webhook verification: HMAC-SHA256 signature validation on all inbound webhooks (VAPI, Stripe)
Row-level security: PostgreSQL RLS ensures tenant data isolation (including smb_knowledge_chunks)
JWT authentication: Supabase-issued JWTs with session refresh
No PII in logs: Call transcripts stored in database only, not in application logs
SSRF protection: Two-layer SSRF guard on all KB URL ingests (string-pattern + DNS resolution)
Prompt injection sanitization: All scraped/uploaded KB content sanitized before storage
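
Webhook verification comes down to recomputing the HMAC over the raw request body and comparing it in constant time. The sketch below is generic: each provider defines its own header name and signed payload format (Stripe, for example, includes a timestamp in the signed string), so `sign` and `verify_signature` are illustrative helpers rather than either provider's exact scheme.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Constant-time comparison avoids leaking how many characters matched."""
    return hmac.compare_digest(sign(secret, body), received_signature)


secret = b"example-webhook-secret"  # placeholder value for the example
body = b'{"type": "assistant-request"}'
signature = sign(secret, body)

print(verify_signature(secret, body, signature))         # True
print(verify_signature(secret, body + b" ", signature))  # False: body was altered
```

The check must run on the raw bytes as received. Re-serialising parsed JSON can change whitespace or key order and break an otherwise valid signature.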

Australian Compliance

GST handling: Built-in Australian GST calculation for all invoices
ABN support: Tenant ABN storage for tax invoice generation
Data residency: Supabase region selection for Australian data sovereignty
Privacy Act alignment: Call recording consent handling in AI greeting
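
Australian GST is a flat 10%, so the invoice calculation is short. The sketch below uses `Decimal` to avoid float rounding on money; rounding half-up to the cent and the $120.00 amount are assumptions for the example, not the platform's documented rule.

```python
from decimal import Decimal, ROUND_HALF_UP

GST_RATE = Decimal("0.10")  # Australian GST is 10%


def add_gst(net: Decimal) -> tuple[Decimal, Decimal]:
    """Return (gst, total) for a GST-exclusive amount, rounded to the cent."""
    gst = (net * GST_RATE).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return gst, net + gst


gst, total = add_gst(Decimal("120.00"))
print(gst, total)  # 12.00 132.00
```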

API Security

Rate limiting: Redis-backed rate limiting on all public endpoints (including KB ingest)
CORS configuration: Strict origin allowlisting (app.nexa247.ai only)
Input validation: Pydantic models on every endpoint — no raw dict access
Dependency scanning: Automated vulnerability checks in CI

SLIDE 12

Scaling Characteristics

Current headroom and known bottlenecks

Current architecture supports

Concurrent calls: Limited only by VAPI account tier (not our backend)
Webhook throughput: Async FastAPI handles 10K+ req/s per instance
Tenant count: Horizontal scaling — add backend instances, no shared state
KB retrieval: pgvector HNSW — sub-linear query time, scales to millions of chunks
Database: Supabase managed PG with connection pooling (PgBouncer)

Scaling bottlenecks & mitigations

Bottleneck	Current	At Scale
Tenant resolution	Redis single-instance	Redis Cluster
Calendar API	Per-tenant rate limits	Request batching + aggressive caching
Stripe webhooks	Single endpoint	Queue-based processing (SQS/Bull)
Database writes	Single PG instance	Read replicas + write sharding by tenant
Langfuse traces	Self-hosted	Managed Langfuse Cloud
Embedding inference NEW	In-process sentence-transformers, 1 worker thread	Dedicated embedding service (GPU) or OpenAI batch API for ingest
pgvector HNSW build NEW	Inline on ingest, <1s for typical KB	Async background reindex for very large KBs (>100K chunks)
KB web scraping NEW	Sequential URL fetch per ingest	Parallel scrape workers + Playwright pool

Cost at scale (per 1,000 tenants)

Component	Monthly Cost (est.)
VAPI (voice AI)	$3,000–$8,000
Twilio (phone numbers + SMS)	$1,500–$3,000
Railway (backend compute)	$200–$500
Supabase (database + pgvector)	$75–$200
Redis	$50–$100
Embeddings (in-process, no API cost)	$0 additional (CPU on Railway instance)
Total COGS	~$5K–$12K
Revenue (1K tenants @ $120 ARPU)	$120,000
Gross margin	~90–95%
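
The totals in the table can be reproduced from its own line items. The snippet below sums the low and high ends of each range and derives the margin at 1,000 tenants and $120 ARPU; it uses only figures from the table.

```python
# (low, high) monthly cost estimates per 1,000 tenants, in dollars
cogs = {
    "VAPI": (3000, 8000),
    "Twilio": (1500, 3000),
    "Railway": (200, 500),
    "Supabase": (75, 200),
    "Redis": (50, 100),
    "Embeddings": (0, 0),
}

revenue = 1000 * 120  # 1K tenants at $120 ARPU

low = sum(lo for lo, _ in cogs.values())
high = sum(hi for _, hi in cogs.values())
margin_worst = (revenue - high) / revenue
margin_best = (revenue - low) / revenue

print(low, high)  # 4825 11800
print(round(margin_worst * 100, 1), round(margin_best * 100, 1))  # 90.2 96.0
```

The unrounded totals are $4,825 to $11,800, which gives a gross margin of about 90% in the worst case and 96% in the best.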

SLIDE 13

What's Next — Technical Roadmap

Near, medium, and long-term engineering priorities

Near-Term (Q2–Q3 2026)

Outbound calling engine: Appointment reminders, follow-up calls, no-show re-engagement
Real-time WebSocket dashboard: Live call monitoring with transcript streaming
Multi-calendar support: Outlook, Calendly, ServiceM8 integration
Conversation memory: Cross-call context (repeat callers recognised)
KB re-scrape scheduling: Auto-refresh KB on configurable interval (weekly/monthly)

Medium-Term (Q4 2026–Q1 2027)

Fine-tuned voice model: Custom Australian voice model trained on trade conversations
Predictive scheduling: ML-based optimal time slot recommendations
Quote generation: AI extracts job scope from call → generates preliminary quote
Multi-language support: Mandarin, Vietnamese, Arabic for diverse Australian communities
KB analytics: Which questions hit the KB? Which miss? Drive content gap suggestions to tenants

Long-Term Vision

Agentic workflow orchestration: Voice AI as the entry point to a full trade business automation platform
Edge inference: On-device STT for <100ms first-word latency
Federated learning: Cross-tenant conversation quality improvements without sharing data