Nexa 247 is a multi-tenant voice AI orchestration platform that dynamically provisions and manages AI receptionists for trade businesses. Each inbound call triggers a real-time pipeline: telephony ingestion, tenant resolution, dynamic prompt assembly, tool-augmented conversation, and post-call intelligence extraction — all at sub-second latency.
This is not a chatbot with a phone number. This is a stateful, tool-calling voice agent that books appointments, qualifies leads, and escalates emergencies in real time, mid-conversation.
# Hot path: phone_number_id → full assistant config
async def handle_assistant_request(payload):
    phone_id = payload["phone_number_id"]
    # 1. Tenant resolution: Redis-cached, <0.5 ms lookups, >99% hit rate
    tenant = await cache.get_tenant_by_phone(phone_id)
    # 2. Dynamic prompt assembly: skills-based + KB block
    system_prompt = build_prompt(
        base_skills=["greeting", "booking", "lead_qualification"],
        tenant_config=tenant.ai_config,
        services=tenant.services,
        business_hours=tenant.hours,
        personality=tenant.voice_personality,
        smb_knowledge=tenant.smb_knowledge,  # KB mode routing
    )
    # 3. Tool injection: only tools this tenant has enabled
    tools = resolve_tools(tenant)  # calendar, sms, lead_capture, kb_retrieval
    # 4. Return complete assistant config
    return AssistantConfig(
        model="claude-sonnet-4-20250514",
        voice={"provider": "11labs", "voice_id": "australian_male"},
        system_prompt=system_prompt,
        tools=tools,
        metadata={"tenant_id": tenant.id},
    )
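Step 3 of the hot path filters tools per tenant. A minimal sketch of how that filtering might look follows; the `Tenant` fields, the registry, and every tool name except `answer_product_service_question` are hypothetical stand-ins, and gating the retrieval tool on vector mode is an assumption drawn from the KB modes described later.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    """Stand-in for the real tenant model; only the fields used here."""
    id: str
    enabled_tools: set[str] = field(default_factory=set)
    kb_mode: str = "form_only"  # form_only | markdown | vector


# Hypothetical registry keyed by the tool names listed in the hot-path comment.
TOOL_REGISTRY = {
    "calendar": {"name": "book_appointment"},
    "sms": {"name": "send_sms"},
    "lead_capture": {"name": "capture_lead"},
    "kb_retrieval": {"name": "answer_product_service_question"},
}


def resolve_tools(tenant: Tenant) -> list[dict]:
    """Return definitions only for the tools this tenant has enabled."""
    tools = []
    for key, definition in TOOL_REGISTRY.items():
        if key not in tenant.enabled_tools:
            continue
        # Assumption: the retrieval tool is only exposed when the KB is in vector mode.
        if key == "kb_retrieval" and tenant.kb_mode != "vector":
            continue
        tools.append(definition)
    return tools
```

Keeping the filter on the request path means a tenant never sees a tool definition it has not enabled, which is one way to back the tool-isolation claim.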
Key design decisions
Isolation guarantees
| Layer | Mechanism |
|---|---|
| Data isolation | Row-level security in PostgreSQL (tenant_id on every table) |
| Prompt isolation | System prompts assembled per-tenant, never shared |
| KB isolation | smb_knowledge_chunks filtered by tenant_id on every query |
| Tool isolation | OAuth credentials encrypted per-tenant (Fernet symmetric encryption) |
| Billing isolation | Stripe customer/subscription per-tenant, metered independently |
| Call isolation | Each call tagged with tenant_id in metadata, traced in Langfuse |
| Ingest isolation | Redis distributed lock per-tenant prevents concurrent KB ingests |
// answer_product_service_question: RAG retrieval mid-call
{
  "trigger": "Caller asks about pricing, services, or specific products",
  "flow": [
    "1. LLM detects information question (not a booking action)",
    "2. VAPI sends function-call webhook to Nexa backend",
    "3. Backend runs hybrid_search(tenant_id, question, top_k=5)",
    "   ├─ BM25 leg: PostgreSQL tsvector FTS (GIN index)",
    "   ├─ Vector leg: pgvector cosine ANN (HNSW index)",
    "   └─ Fuse both legs with Reciprocal Rank Fusion (k=60)",
    "4. Return top-5 chunk contents to VAPI",
    "5. LLM synthesises answer from retrieved context",
    "6. AI responds to caller with accurate, grounded information"
  ],
  "latency": "~50–120 ms (both legs run concurrently via asyncio.gather)",
  "fallback": "If KB empty or inactive, AI answers from form-only profile"
}
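The latency note says both retrieval legs run concurrently via `asyncio.gather`. The sketch below illustrates only that concurrency pattern; `keyword_leg` and `vector_leg` are hypothetical placeholders for the real database queries, and their results are hard-coded.

```python
import asyncio


async def keyword_leg(tenant_id: str, question: str, top_k: int) -> list[str]:
    """Placeholder for the full-text query; returns chunk IDs best-first."""
    await asyncio.sleep(0.01)  # simulates database latency
    return ["c1", "c2", "c3"][:top_k]


async def vector_leg(tenant_id: str, question: str, top_k: int) -> list[str]:
    """Placeholder for the pgvector ANN query; returns chunk IDs best-first."""
    await asyncio.sleep(0.01)  # simulates database latency
    return ["c2", "c4", "c1"][:top_k]


async def run_both_legs(tenant_id: str, question: str, top_k: int = 5):
    # gather() schedules both coroutines at once, so total latency is roughly
    # that of the slower leg rather than the sum of the two.
    keyword_hits, vector_hits = await asyncio.gather(
        keyword_leg(tenant_id, question, top_k),
        vector_leg(tenant_id, question, top_k),
    )
    return keyword_hits, vector_hits
```

The two ranked lists would then be passed to the fusion step described in the flow.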
backend/skills/
├── greeting.md              # G'day opener, business intro
├── booking.md               # Calendar availability + booking flow
├── lead_qualification.md    # Urgency scoring, service matching
├── emergency_handling.md    # Escalation protocols
├── closing.md               # Call wrap-up, confirmation
└── personality/
    ├── tradie_casual.md     # Warm, Australian, mate-oriented
    ├── professional.md      # Formal business tone
    └── custom/              # Per-tenant personality overrides
def build_system_prompt(tenant: Tenant, smb_knowledge: dict) -> str:
    sections = []
    # Core identity
    sections.append(f"You are the AI receptionist for {tenant.business_name}.")
    # Load skill modules
    for skill in tenant.enabled_skills:
        sections.append(load_skill(skill))  # Cached markdown
    # Inject business context
    sections.append(format_services(tenant.services))
    sections.append(format_hours(tenant.business_hours))
    # ── KB block: mode-routed ──
    kb_block = _build_knowledge_base_block(smb_knowledge, tenant.vertical)
    if kb_block:
        sections.append(kb_block)
    # markdown mode → full text embedded in <knowledge_base> tags
    # vector mode → advisory + "call answer_product_service_question"
    # form_only → no block; AI answers from services list only
    # Personality layer
    sections.append(load_personality(tenant.voice_personality))
    return "\n\n".join(sections)
KB block modes
| Mode | When | Prompt injection |
|---|---|---|
| form_only | No KB ingested | No <knowledge_base> block. AI answers from <services> only. |
| markdown | ≤ 4,000 tokens | Full text embedded inline. AI reads directly — no tool call needed. |
| vector | > 4,000 tokens | Advisory block + instruction to call answer_product_service_question. |
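The routing in the table above can be sketched as two small functions: one that picks a mode from the total token count, and one that renders the prompt block for that mode. The function names and the wording of the advisory text are assumptions; only the 4,000-token threshold, the `<knowledge_base>` tag, and the tool name come from this document.

```python
MARKDOWN_TOKEN_LIMIT = 4_000  # threshold from the KB block modes table


def choose_kb_mode(total_tokens: int) -> str:
    """Map the token count across all ingested sources to a KB mode."""
    if total_tokens <= 0:
        return "form_only"  # nothing ingested
    if total_tokens <= MARKDOWN_TOKEN_LIMIT:
        return "markdown"
    return "vector"


def build_kb_block(mode: str, markdown_text: str = "") -> str:
    """Render the prompt block for a mode; an empty string means no block."""
    if mode == "markdown":
        # Small KBs are embedded whole, so no tool call is needed mid-call.
        return f"<knowledge_base>\n{markdown_text}\n</knowledge_base>"
    if mode == "vector":
        # Hypothetical advisory wording; the real prompt text is not shown in the source.
        return (
            "<knowledge_base>\n"
            "A searchable knowledge base exists for this business. "
            "Call answer_product_service_question before answering questions "
            "about pricing, services, or products.\n"
            "</knowledge_base>"
        )
    return ""  # form_only: the AI answers from the services list only
```

For example, `build_kb_block(choose_kb_mode(1200), "We repair taps.")` yields an inline block, while a 9,000-token KB yields only the advisory.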
Ingest Pipeline
Accepts URLs, PDFs (with OCR fallback via pdfplumber/pytesseract), and raw text. All sources accumulate — new sources merge with existing, replacing only matching identities.
Layer 1: trafilatura (clean article/prose extraction). Layer 2: custom HTML visible-text parser (recovers Elementor/Divi service grids that trafilatura drops as "boilerplate"). Layer 3: JSON-LD/OpenGraph/meta structured data parser. If static extraction yields <800 chars → escalates to Playwright for JS-heavy SPAs.
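A rough sketch of the layered extraction with browser escalation follows. The extractors are passed in as plain callables so the control flow can be shown without the real libraries; merging the layers' output and re-running them on the rendered page are assumptions about how the pipeline combines results.

```python
from typing import Callable

MIN_STATIC_CHARS = 800  # escalation threshold from the text above

Extractor = Callable[[str], str]


def run_static_layers(html: str, layers: list[Extractor]) -> str:
    """Run every extraction layer and merge the distinct, non-empty results."""
    parts: list[str] = []
    for layer in layers:
        text = layer(html).strip()
        if text and text not in parts:
            parts.append(text)
    return "\n\n".join(parts)


def extract_page_text(
    html: str,
    layers: list[Extractor],
    render_with_browser: Callable[[], str],
) -> str:
    """Static-first extraction; render in a browser only when static output is thin."""
    text = run_static_layers(html, layers)
    if len(text) < MIN_STATIC_CHARS:
        # Assumption: the rendered DOM is fed back through the same layers.
        text = run_static_layers(render_with_browser(), layers)
    return text
```

Because the browser step is a callable invoked only on the thin-output path, the expensive Playwright launch is skipped for pages that extract well statically.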
Two-layer SSRF check: string-pattern match on private ranges + DNS resolution to block rebinding attacks. All content sanitized against prompt injection before storage.
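The two-layer check might look roughly like the sketch below, using only the standard library. The function names and the blocked hostname suffixes are illustrative assumptions, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Optional
from urllib.parse import urlsplit


def _parse_ip(text: str) -> Optional[ipaddress._BaseAddress]:
    """Return an IP address object, or None if the text is not an IP literal."""
    try:
        return ipaddress.ip_address(text)
    except ValueError:
        return None


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to every address it maps to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_url_allowed(url: str, resolver: Callable[[str], list[str]] = resolve_host) -> bool:
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in ("http", "https") or not host:
        return False
    # Layer 1: cheap checks on the literal hostname (suffix list is illustrative).
    if host == "localhost" or host.endswith((".localhost", ".local", ".internal")):
        return False
    literal = _parse_ip(host)
    if literal is not None:
        return literal.is_global  # rejects private, loopback, link-local ranges
    # Layer 2: resolve the name and require every address to be public.
    try:
        addresses = resolver(host)
    except OSError:
        return False
    resolved = [_parse_ip(address) for address in addresses]
    return bool(resolved) and all(ip is not None and ip.is_global for ip in resolved)
```

A check like this only helps against DNS rebinding if the subsequent fetch connects to the addresses that were validated, rather than resolving the name a second time.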
tiktoken cl100k_base counts tokens across all sources. ≤ 4,000 → markdown mode (full text in JSONB). > 4,000 → vector mode (chunk + embed + pgvector).
Text split into 256-token chunks (tiktoken). Each chunk embedded with all-MiniLM-L6-v2 via sentence-transformers (384-dim, L2-normalised). Singleton embedder with ThreadPoolExecutor(max_workers=1) — async-safe, serialised sync encode() calls.
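The chunking and the single-worker embedder can be sketched as follows. `fake_encode` is a stand-in for the real sentence-transformers `encode()` call, the class name is hypothetical, and non-overlapping chunks are an assumption because the text does not mention overlap.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

CHUNK_TOKENS = 256  # chunk size from the text above


def chunk_tokens(token_ids: list[int], size: int = CHUNK_TOKENS) -> list[list[int]]:
    """Split a token sequence into consecutive, non-overlapping windows."""
    return [token_ids[i:i + size] for i in range(0, len(token_ids), size)]


class SerialEmbedder:
    """Runs a blocking encode function on a single worker thread."""

    def __init__(self, encode):
        self._encode = encode
        # One worker means encode() calls never overlap, while the event
        # loop stays free to serve other requests.
        self._pool = ThreadPoolExecutor(max_workers=1)

    async def embed(self, texts: list[str]) -> list[list[float]]:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._pool, self._encode, texts)


def fake_encode(texts: list[str]) -> list[list[float]]:
    """Stand-in for the model: returns one 384-dim zero vector per text."""
    return [[0.0] * 384 for _ in texts]


# Module-level instance, mirroring the singleton described above.
embedder = SerialEmbedder(fake_encode)
```

Calling `await embedder.embed(["chunk one", "chunk two"])` from several coroutines queues the work on the single thread instead of running encodes in parallel.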
Atomic transaction: DELETE old chunks → INSERT new chunks with vector(384) embeddings → UPDATE smb_knowledge JSONB. HNSW index (m=16, ef_construction=64, cosine ops) for fast ANN search. Generated tsvector column (GENERATED ALWAYS … STORED) + GIN index for BM25.
Haiku forced-tool-use extracts a clean list of bookable services from KB text. Pre-fills the onboarding services form — the tradie never has to type their services manually.
After every ingest, assistant_cache.invalidate(tenant_id) ensures the next call gets the updated KB block in its system prompt.
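As an illustration of the invalidation step, here is an in-memory stand-in for the assistant cache. The real cache is Redis-backed according to this document; the class shape and the 300-second TTL are assumptions made for the sketch.

```python
import time
from typing import Optional


class AssistantCache:
    """In-memory stand-in for the Redis-backed assistant config cache."""

    def __init__(self, ttl_seconds: float = 300.0):  # TTL value is hypothetical
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, tenant_id: str) -> Optional[dict]:
        entry = self._entries.get(tenant_id)
        if entry is None:
            return None
        expires_at, config = entry
        if time.monotonic() >= expires_at:
            del self._entries[tenant_id]  # expired: treat as a miss
            return None
        return config

    def set(self, tenant_id: str, config: dict) -> None:
        self._entries[tenant_id] = (time.monotonic() + self._ttl, config)

    def invalidate(self, tenant_id: str) -> None:
        """Drop the cached config so the next call rebuilds the prompt."""
        self._entries.pop(tenant_id, None)
```

Invalidating on ingest rather than waiting for the TTL means a caller never hears answers built from a KB block that has already been replaced.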
Hybrid Search at Query Time
Why hybrid over pure vector?
| Scenario | BM25 wins | Vector wins |
|---|---|---|
| "What's your ABN?" | Exact keyword match — "ABN" token | May not find exact short string |
| "Do you fix leaky taps?" | Misses if website says "tap repairs" | Semantic match on "tap repairs" ↔ "leaky taps" |
| "Hot water system" | Hits exact match on page copy | Catches "HWS", "water heater", "hot water unit" |
| RRF fusion | Both legs contribute — chunks appearing in both rank highest. Best of both worlds. | |
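The fusion row can be made concrete with a short Reciprocal Rank Fusion sketch. Each chunk scores the sum of 1 / (k + rank) over the lists it appears in, with k=60 as stated earlier; the function name, the tie-break on chunk ID, and the sample chunk IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Fuse several best-first rankings into one list of chunk IDs."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


keyword_hits = ["abn", "taps", "hws"]    # illustrative keyword-leg ranking
vector_hits = ["taps", "heater", "abn"]  # illustrative vector-leg ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["taps", "abn", "heater", "hws"]
```

The two chunks present in both lists ("taps" and "abn") outrank the chunks found by only one leg, which is the behaviour the table describes.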
Infrastructure & concurrency controls
| Control | Mechanism | Value |
|---|---|---|
| Distributed lock | Redis SETNX on kb_ingest_lock:{tenant_id} | 5-min TTL |
| Rate limit | Redis key on first ingest only | 24h cooldown for first-ever scrape |
| Stale-on-failure | On re-ingest failure, marks may_be_stale | Prior KB preserved |
| Entitlement gate | kb_for_call(tier, trial_expires_at) | Trial window or paid tier required |
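A minimal sketch of the per-tenant ingest lock follows. It assumes a client exposing redis-py style `set(key, value, nx=True, ex=ttl)`, `get`, and `delete` calls, created with `decode_responses=True` so that `get` returns strings; `InMemoryClient` is a test double that ignores expiry, and the helper names are hypothetical.

```python
import uuid
from typing import Optional

LOCK_TTL_SECONDS = 300  # 5-minute TTL from the table above


def lock_key(tenant_id: str) -> str:
    return f"kb_ingest_lock:{tenant_id}"


def acquire_ingest_lock(client, tenant_id: str) -> Optional[str]:
    """Try to take the lock; return an ownership token, or None if it is held."""
    token = uuid.uuid4().hex
    # SET with NX and EX creates the key and its TTL in one atomic command.
    if client.set(lock_key(tenant_id), token, nx=True, ex=LOCK_TTL_SECONDS):
        return token
    return None


def release_ingest_lock(client, tenant_id: str, token: str) -> bool:
    """Release the lock only if this caller still owns it."""
    key = lock_key(tenant_id)
    # Not atomic: production code would usually do this compare-and-delete
    # inside a server-side script.
    if client.get(key) == token:
        client.delete(key)
        return True
    return False


class InMemoryClient:
    """Test double for the three client calls used above (expiry is ignored)."""

    def __init__(self):
        self._data: dict[str, str] = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0
```

The TTL bounds how long a crashed ingest can block a tenant, and the token prevents one worker from releasing a lock that has since expired and been taken by another.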
| Dimension | What It Measures | Method |
|---|---|---|
| Greeting Quality | Did it use tenant's business name? Natural opener? | LLM-as-judge |
| Information Capture | Did it collect name, phone, service, address? | Field extraction check |
| Booking Accuracy | Did it check real availability? Book correct slot? | Calendar API verification |
| Lead Qualification | Correct urgency scoring? Appropriate follow-up? | Rubric-based scoring |
| Emergency Handling | Did it escalate correctly? Appropriate urgency? | Scenario-based testing |
| Conversation Flow | Natural transitions? No hallucinations? | LLM-as-judge + human review |
| Australian English | Natural Aussie phrasing? No Americanisms? | Linguistic pattern matching |
| KB Retrieval Accuracy (new) | Did RAG return relevant chunks? Did AI answer correctly? | Ground-truth QA set per tenant |
| KB Grounding (new) | Did AI avoid hallucinating info not in KB? | Faithfulness check (LLM-as-judge) |
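Of the methods in this table, the field extraction check is the simplest to sketch. The function below assumes the post-call extraction produces a flat dict with the four fields named in the Information Capture row; the function name and the fractional scoring are illustrative.

```python
REQUIRED_FIELDS = ("name", "phone", "service", "address")


def information_capture_score(extracted: dict) -> tuple[float, list[str]]:
    """Return the fraction of required fields captured and the missing ones."""
    missing = [
        field_name
        for field_name in REQUIRED_FIELDS
        if not str(extracted.get(field_name) or "").strip()
    ]
    score = 1 - len(missing) / len(REQUIRED_FIELDS)
    return score, missing


score, missing = information_capture_score(
    {"name": "Sam", "phone": "0400 000 000", "service": "tap repair", "address": ""}
)
# score == 0.75, missing == ["address"]
```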
Eval infrastructure
Backend (Python Async)
| Layer | Choice | Rationale |
|---|---|---|
| Framework | FastAPI | Async-native, Pydantic validation, OpenAPI docs |
| Runtime | Python 3.11+ / Uvicorn | Async/await throughout, zero blocking I/O |
| ORM | SQLAlchemy 2.0 (async) | Async session management, type-safe queries |
| DB Driver | asyncpg | Native PostgreSQL async driver, connection pooling |
| Migrations | Alembic | Version-controlled schema evolution |
| Cache | Redis (aioredis) | Tenant config, calendar availability, KB ingest locks |
| Encryption | Fernet (cryptography) | OAuth token encryption at rest |
| Observability | Sentry + Langfuse | Error tracking + AI conversation tracing |
| Vector DB | pgvector (PostgreSQL) | HNSW index, cosine similarity, no separate vector service needed |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) | 384-dim, fast, runs in-process; no external API call or cost per embed |
| Tokenizer | tiktoken cl100k_base | Accurate token counting for mode routing & chunking |
| Web scraping | trafilatura + httpx + Playwright (fallback) | Static-first (cheap); Playwright only for JS-heavy SPAs |
| PDF extraction | pdfplumber + OCR fallback | Structured PDFs direct; scanned PDFs via pytesseract |
Frontend (React SPA)
| Layer | Choice | Rationale |
|---|---|---|
| Framework | React 19 + TypeScript | Type-safe, hooks-first architecture |
| Build | Vite 8 | Sub-second HMR, optimised production builds |
| State | TanStack Query | Server-state caching, auto-refetch, stale-while-revalidate |
| Styling | Tailwind CSS 4 | Utility-first, zero runtime CSS overhead |
| Animation | Framer Motion | GPU-accelerated, declarative transitions |
| Auth | Supabase JS SDK | JWT session management, auto-refresh |
External Services
| Service | Role | Integration |
|---|---|---|
| VAPI | Voice AI orchestration | Webhook-driven, assistant-request pattern |
| Claude Sonnet 4 | Conversation intelligence | Via VAPI (primary) + direct API (post-call) |
| Claude Haiku | KB service extraction | Forced tool-use on ingested content |
| Twilio | Telephony infrastructure | Phone number provisioning, SMS delivery |
| Google Calendar | Appointment management | OAuth 2.0, FreeBusy + Events API |
| Stripe | Billing & metering | Checkout, webhooks, metered subscriptions |
| Supabase | Auth + PostgreSQL + RLS | Managed Postgres + pgvector extension |
| Langfuse | AI observability | Trace collection, eval scoring, latency analysis |
Infrastructure
| Component | Platform | Config |
|---|---|---|
| Backend | Railway.app | Auto-deploy from git, horizontal scaling |
| Frontend | Vercel | Edge-deployed SPA, CDN-cached |
| Database + pgvector | Supabase (managed PG) | pgvector extension enabled; connection pooling via PgBouncer |
| Cache / Ingest Locks | Redis | Tenant config TTL, calendar cache, KB distributed locks |
| DNS | GoDaddy + Vercel | Split: www → Astro marketing, app → React portal |
Data Protection
Australian Compliance
API Security
Current architecture supports
Scaling bottlenecks & mitigations
| Bottleneck | Current | At Scale |
|---|---|---|
| Tenant resolution | Redis single-instance | Redis Cluster |
| Calendar API | Per-tenant rate limits | Request batching + aggressive caching |
| Stripe webhooks | Single endpoint | Queue-based processing (SQS/Bull) |
| Database writes | Single PG instance | Read replicas + write sharding by tenant |
| Langfuse traces | Self-hosted | Managed Langfuse Cloud |
| Embedding inference (new) | In-process sentence-transformers, 1 worker thread | Dedicated embedding service (GPU) or OpenAI batch API for ingest |
| pgvector HNSW build (new) | Inline on ingest, <1s for typical KB | Async background reindex for very large KBs (>100K chunks) |
| KB web scraping (new) | Sequential URL fetch per ingest | Parallel scrape workers + Playwright pool |
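The "parallel scrape workers" mitigation could start as bounded concurrency inside one process. The sketch below assumes an async `fetch` callable supplied by the caller; the function name and the default limit of 4 are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> list:
    """Fetch every URL with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str):
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failed URL from aborting the batch;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(
        *(fetch_one(url) for url in urls), return_exceptions=True
    )
```

Results come back in the same order as `urls`, so a caller can pair each page with its source without extra bookkeeping.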
Cost at scale (per 1,000 tenants)
| Component | Monthly Cost (est.) |
|---|---|
| VAPI (voice AI) | $3,000–$8,000 |
| Twilio (phone numbers + SMS) | $1,500–$3,000 |
| Railway (backend compute) | $200–$500 |
| Supabase (database + pgvector) | $75–$200 |
| Redis | $50–$100 |
| Embeddings (in-process, no API cost) | $0 additional (CPU on Railway instance) |
| Total COGS | ~$5K–$12K |
| Revenue (1K tenants @ $120 ARPU) | $120,000 |
| Gross margin | ~90–95% |
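The totals in this table can be checked with a few lines of arithmetic. The figures are copied from the rows above, and the dictionary and variable names are illustrative.

```python
# (low, high) monthly cost estimates in dollars, per 1,000 tenants.
COGS_RANGES = {
    "VAPI": (3000, 8000),
    "Twilio": (1500, 3000),
    "Railway": (200, 500),
    "Supabase": (75, 200),
    "Redis": (50, 100),
    "Embeddings": (0, 0),
}
REVENUE = 1000 * 120  # 1K tenants at $120 ARPU

cogs_low = sum(low for low, _ in COGS_RANGES.values())     # 4825
cogs_high = sum(high for _, high in COGS_RANGES.values())  # 11800

margin_best = 1 - cogs_low / REVENUE    # about 0.960
margin_worst = 1 - cogs_high / REVENUE  # about 0.902
```

The sums round to the table's ~$5K–$12K, and the margin works out to roughly 90% in the worst case and 96% in the best case, consistent with the rounded range shown.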
Near-Term (Q2–Q3 2026)
Medium-Term (Q4 2026–Q1 2027)
Long-Term Vision