Architecture rationale

Why we built it this way.

Every architectural decision in CareVision was made to serve one goal: give caregivers fast, trustworthy context when a resident needs attention. Here is why we chose what we chose, and what we intentionally left out.

The rules behind the choices.

These six principles guided every decision from technology selection to scope boundaries. When two approaches seemed equally valid, we used these to break the tie.

Privacy by architecture

Video analysis runs on-device. No frames are uploaded. No cloud ML. The phone does the thinking, and only structured events leave the device. Privacy is not a policy; it is how the system is built.
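To make "only structured events leave the device" concrete, here is a minimal sketch of what such an event payload could look like. The field names and types are illustrative assumptions, not CareVision's actual schema; the point is that only small, auditable metadata crosses the network, never pixels.

```typescript
// Hypothetical shape of a structured event emitted by the sensor device.
// Field names are illustrative; only metadata like this leaves the phone,
// never video frames.
interface SensorEvent {
  type: "person_detected" | "pose_classified" | "bed_exit";
  roomId: string;
  timestamp: number;  // Unix ms, set on-device
  confidence: number; // 0..1 from the on-device classifier
}

// Serialize for transport; the payload is tiny compared to a video frame.
function encodeEvent(event: SensorEvent): string {
  return JSON.stringify(event);
}

const event: SensorEvent = {
  type: "bed_exit",
  roomId: "room-101",
  timestamp: Date.now(),
  confidence: 0.92,
};
console.log(encodeEvent(event));
```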

Speed is clinical

In a care setting, latency is not a performance metric. It is a safety metric. We optimized for time-to-first-frame and event-to-alert latency because seconds matter when a resident is at risk.

Thin backend, smart edges

The backend orchestrates. It does not think. Intelligence lives on the device (Vision, state machine) and in the caregiver's hands (decision-making with live context). The server is a relay, not a brain.

Demo honesty

This is a showcase, and we never pretend otherwise. In-memory storage, demo auth tokens, and simplified detection are all clearly documented. We prove the architecture works without faking production readiness.

One app, two roles

Instead of building separate sensor and caregiver apps, we built one binary with a role picker. This makes the demo simpler, the codebase smaller, and the shared models guaranteed to stay in sync.

Prove the hard parts first

We started with the riskiest technical problems: real-time video latency, on-device CV accuracy, and SSE delivery speed. Polish and persistence come after the core loop is proven.

What we chose and why.

Each of these decisions had alternatives. We chose the option that best served the demo goal while staying architecturally honest about what a production system would need.

WebRTC for live video, not HLS

We need sub-second latency. HLS cannot deliver that.

Why WebRTC

  • Sub-second latency: Typically 200-400ms on LAN, which makes the live view feel truly live
  • Native iOS support: LiveKit provides a well-maintained Swift SDK with AVFoundation integration
  • Data channels: We can send the motion overlay data alongside video without a separate transport
  • Proven at scale: LiveKit handles the SFU complexity so we focus on the product
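As an illustration of the data-channel point: WebRTC data channels carry raw bytes, so overlay data has to be encoded and decoded on either side. The sketch below is a hypothetical encoding (names and layout are assumptions, not CareVision's wire format) for motion-overlay bounding boxes riding alongside the video.

```typescript
// Illustrative encoding of motion-overlay data for a WebRTC data channel.
// Data channels carry raw bytes, so we JSON-encode bounding boxes and
// convert to Uint8Array. All names here are hypothetical.
interface OverlayBox {
  x: number; y: number; w: number; h: number; // normalized 0..1 coordinates
}

function encodeOverlay(boxes: OverlayBox[]): Uint8Array {
  return new TextEncoder().encode(JSON.stringify({ boxes, t: Date.now() }));
}

function decodeOverlay(payload: Uint8Array): OverlayBox[] {
  return JSON.parse(new TextDecoder().decode(payload)).boxes;
}

// On the sensor side these bytes would go through the SDK's data-publish
// API; on the caregiver side the decoded boxes are drawn over the video.
const bytes = encodeOverlay([{ x: 0.1, y: 0.2, w: 0.3, h: 0.4 }]);
console.log(decodeOverlay(bytes)[0].w); // 0.3
```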

Why not HLS

  • 3-10 second latency: Even low-latency HLS adds seconds of delay, which defeats the purpose of "live context"
  • Segment-based: The chunking model is designed for broadcast, not interactive triage
  • No data channel: Would need a separate WebSocket for overlay data

Apple Vision on-device, not cloud ML

Privacy and latency both demand local processing.

Why on-device

  • No frames leave the device: Video stays on the phone. Only structured events (person detected, pose classified) are sent to the backend
  • Zero network dependency: Detection works even if the network is degraded, because it never needs the network
  • Lower latency: No round-trip to a cloud endpoint means detection happens in the same frame cycle
  • Free with iOS: Apple Vision is a system framework with no per-inference cost
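The on-device "state machine" mentioned earlier can be pictured as something like the following sketch. The states, pose labels, and transitions are illustrative assumptions rather than CareVision's actual logic: each Vision frame yields a pose classification, and the machine advances only when the evidence supports a transition.

```typescript
// Hypothetical on-device state machine for bed-exit detection.
// States and transitions are illustrative, not the real classifier.
type BedState = "in_bed" | "edge_of_bed" | "out_of_bed";
type PoseLabel = "lying" | "sitting" | "standing";

function nextState(current: BedState, pose: PoseLabel): BedState {
  switch (current) {
    case "in_bed":
      return pose === "sitting" ? "edge_of_bed" : "in_bed";
    case "edge_of_bed":
      if (pose === "standing") return "out_of_bed"; // escalate to alert
      if (pose === "lying") return "in_bed";        // settled back down
      return "edge_of_bed";
    case "out_of_bed":
      return pose === "lying" ? "in_bed" : "out_of_bed";
  }
}

let state: BedState = "in_bed";
for (const pose of ["sitting", "sitting", "standing"] as PoseLabel[]) {
  state = nextState(state, pose);
}
console.log(state); // "out_of_bed"
```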

Why not cloud ML

  • Privacy risk: Uploading resident video to a cloud endpoint is a non-starter for a care product
  • Network dependent: Cloud inference fails when connectivity is poor, which is exactly when you need it most
  • Cost at scale: Per-frame inference costs add up fast with 24/7 monitoring

SSE for alerts, not WebSockets

Alerts flow one direction. Use the simpler protocol.

Why SSE

  • Unidirectional is correct: Alerts and timeline updates only flow server to client. SSE matches the data flow exactly
  • Auto-reconnect: The EventSource API reconnects automatically on disconnect, which matters for mobile
  • Simpler server code: No upgrade handshake, no ping/pong, no frame parsing. Just write to the response stream
  • HTTP-native: Works with standard proxies, load balancers, and CDNs without special configuration

Why not WebSockets

  • Bidirectional overhead: We do not need client-to-server push on this channel. Actions go through REST
  • Connection management: WebSockets require explicit ping/pong and reconnect logic
  • Proxy complexity: Some corporate/hospital networks handle WebSocket upgrades poorly
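The "simpler server code" claim is easy to demonstrate. The following is a minimal SSE endpoint sketch (not CareVision's actual backend; route and field names are assumed): the server just writes framed text to an open HTTP response, with no upgrade handshake or binary framing, and the client gets reconnect behavior for free from `EventSource`.

```typescript
import { createServer } from "node:http";

// SSE wire format: optional "event:" line, "data:" line, blank-line terminator.
function formatSseMessage(eventName: string, data: unknown): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Minimal SSE endpoint: keep the response open and write events as they occur.
const server = createServer((req, res) => {
  if (req.url === "/alerts") {
    res.writeHead(200, {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    });
    // In a real system this write would be driven by detection events.
    res.write(formatSseMessage("alert", { roomId: "room-101", kind: "bed_exit" }));
  } else {
    res.writeHead(404).end();
  }
});
// server.listen(3000);
// Client side: new EventSource("http://host:3000/alerts") reconnects automatically.
console.log(formatSseMessage("alert", { kind: "bed_exit" }));
```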

Native iOS, not React Native

Camera, Vision, and WebRTC all demand platform-level control.

Why native Swift

  • AVFoundation access: Direct camera control with no bridge layer. Frame processing at the buffer level
  • Vision framework: Native VNRequest pipeline with zero serialization overhead
  • LiveKit Swift SDK: First-class WebRTC integration without React Native bridging
  • SwiftUI: Modern declarative UI with good performance for real-time updates

Context

In a production app, React Native could own the standard product screens (settings, history, profile) while native iOS handles the camera, Vision, and media-critical paths. This showcase demonstrates the native side specifically because that is where the hard engineering problems live.

In-memory storage, not a database

This is a demo. Durability is a solved problem; we are proving the real-time loop.

Why in-memory

  • Zero setup: No database to install, configure, or migrate. Clone and run
  • Fast iteration: Change the data model and restart; the change takes effect immediately. No migration scripts
  • Honest about scope: In-memory storage forces everyone to understand this is a demo, not a deployable product

Production path

A production system would use PostgreSQL (matching Inspiren's actual stack) with WAL-based replication for durability, plus Redis for the pub/sub fanout layer. The store interface is already abstracted behind a class, so swapping in a real database is a mechanical change, not an architectural one.
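The "abstracted behind a class" claim can be sketched like this (interface and class names are hypothetical, not the actual codebase): callers depend only on the store interface, so the in-memory demo implementation can be swapped for a PostgreSQL-backed one without touching the rest of the backend.

```typescript
// Hypothetical store abstraction illustrating the swap-in point.
interface Alert { id: string; roomId: string; createdAt: number; }

interface AlertStore {
  save(alert: Alert): Promise<void>;
  listByRoom(roomId: string): Promise<Alert[]>;
}

// Demo implementation: everything lives in process memory.
class InMemoryAlertStore implements AlertStore {
  private alerts: Alert[] = [];
  async save(alert: Alert): Promise<void> { this.alerts.push(alert); }
  async listByRoom(roomId: string): Promise<Alert[]> {
    return this.alerts.filter((a) => a.roomId === roomId);
  }
}

// A PostgresAlertStore implementing the same interface is the "mechanical
// change": same method signatures, different persistence underneath.
(async () => {
  const store: AlertStore = new InMemoryAlertStore();
  await store.save({ id: "a1", roomId: "room-101", createdAt: Date.now() });
  console.log((await store.listByRoom("room-101")).length); // 1
})();
```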

What we accepted to move fast.

Every prototype involves trade-offs. We made them explicitly and documented the path to production for each one.

Accepted: State resets on restart

All alerts, timelines, and sessions are lost when the backend restarts. This is fine for a demo that runs for 10 minutes. Production would persist to PostgreSQL and recover state from the database on startup.

Accepted: No push notifications

Alerts arrive only when the app is open and connected via SSE. A production system would use APNs for background delivery so caregivers are notified even when the app is not in the foreground.

Accepted: Demo authentication only

Bearer tokens are issued without real identity verification. Production would use OAuth or SAML, integrate with facility identity providers, and enforce role-based access with proper session management.

Accepted: Single room, single sensor

The demo supports one room with one sensor and one caregiver. Production needs multi-room orchestration, sensor health monitoring, and alert routing based on assignments and proximity.

Mitigated: CV false positives

The bed-exit detector uses persistence windows (0.6-0.8s) and cooldown windows (8-10s) to reduce noise, but clinical-grade accuracy would require a trained ML model, not just pose heuristics.
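The persistence and cooldown windows described above amount to a debounce: a raw detection must persist for a minimum duration before it raises an alert, and after an alert fires no new alert is raised until the cooldown elapses. A minimal sketch, with window values picked from the ranges stated above (the class and its API are illustrative, not the actual detector):

```typescript
// Illustrative debounce: persistence window before alerting,
// cooldown window after an alert to suppress repeats.
class BedExitDebouncer {
  private firstSeenAt: number | null = null;
  private lastAlertAt = -Infinity;

  constructor(
    private persistenceMs = 700, // within the 0.6-0.8s window from the text
    private cooldownMs = 9000,   // within the 8-10s window from the text
  ) {}

  // Call once per frame with the raw detection result and a timestamp (ms);
  // returns true only when an alert should fire.
  onFrame(detected: boolean, now: number): boolean {
    if (!detected) { this.firstSeenAt = null; return false; }
    if (this.firstSeenAt === null) this.firstSeenAt = now;
    const persisted = now - this.firstSeenAt >= this.persistenceMs;
    const cooledDown = now - this.lastAlertAt >= this.cooldownMs;
    if (persisted && cooledDown) { this.lastAlertAt = now; return true; }
    return false;
  }
}

const d = new BedExitDebouncer();
console.log(d.onFrame(true, 0));   // false: not persisted yet
console.log(d.onFrame(true, 700)); // true: persisted past the window
console.log(d.onFrame(true, 800)); // false: cooldown suppresses repeats
```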

Mitigated: No offline support

The app requires network connectivity to function. In production, critical events would be queued locally with durable storage and synced when connectivity returns, following a local-first architecture.
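The local-first production path described above can be sketched as an event queue that buffers while offline and flushes in order when connectivity returns. This is an assumed design, not existing code, and a real implementation would persist the queue durably (e.g. SQLite or Core Data on iOS) rather than in memory.

```typescript
// Hypothetical local-first queue for critical events.
type QueuedEvent = { type: string; timestamp: number };

class OfflineEventQueue {
  private pending: QueuedEvent[] = [];

  constructor(private send: (e: QueuedEvent) => Promise<void>) {}

  enqueue(event: QueuedEvent): void {
    this.pending.push(event);
  }

  // Called when connectivity is restored. Stops at the first failure so
  // ordering is preserved and the remainder is retried on the next flush.
  async flush(): Promise<number> {
    let sent = 0;
    while (this.pending.length > 0) {
      try {
        await this.send(this.pending[0]);
        this.pending.shift();
        sent++;
      } catch {
        break;
      }
    }
    return sent;
  }
}

const delivered: QueuedEvent[] = [];
const queue = new OfflineEventQueue(async (e) => { delivered.push(e); });
queue.enqueue({ type: "bed_exit", timestamp: 1 });
queue.enqueue({ type: "pose_classified", timestamp: 2 });
queue.flush().then((n) => console.log(n)); // 2
```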

What we left out on purpose.

These features are solvable problems with known solutions. We omitted them because they do not help prove the core thesis: that the real-time detection-to-response loop can work with low latency and good architecture.

Database persistence

Swap the in-memory store for PostgreSQL. The interface is already abstracted.

Push notifications

APNs integration for background alert delivery. Standard iOS capability.

Real authentication

OAuth/SAML with facility identity providers. Well-understood integration.

Video recording

Clip recording and replay for post-incident review. LiveKit supports this natively.

Multi-room management

Room registry, sensor health, caregiver assignment, and alert routing.

HIPAA infrastructure

Encryption at rest, audit logging, access controls, BAA compliance. Important but orthogonal to the demo.