Voice Feature - architectural options

Option 1: Public API with Backend Token
- Pros
  - Simplest implementation
  - Direct streaming from browser to MS Speech
  - Lowest latency
  - Token exposure is a negligible risk given the token's limitations
- Cons
  - Uses MS shared Cognitive Services
  - Access token is exposed to client
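
As a rough illustration of the backend half of Option 1, the sketch below builds the request that exchanges a Speech resource key for a short-lived authorization token, which the backend would then hand to the browser. It assumes the Azure Speech `issueToken` REST endpoint is the token source; the function names, region, and key are placeholders chosen for illustration and do not come from the original design.

```python
import urllib.request


def build_issue_token_request(region: str, subscription_key: str) -> urllib.request.Request:
    """Build (but do not send) the token-exchange request.

    Assumption: the key belongs to a Speech resource in `region`.
    """
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    return urllib.request.Request(
        url,
        data=b"",  # the endpoint expects a POST with an empty body
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
        method="POST",
    )


def fetch_speech_token(region: str, subscription_key: str, timeout: float = 5.0) -> str:
    """Send the request and return the token text (performs a network call)."""
    req = build_issue_token_request(region, subscription_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

In this sketch only the token reaches the client, never the resource key. The browser could then pass it to the Speech SDK (for example `SpeechConfig.fromAuthorizationToken(token, region)` in the JavaScript SDK) and stream audio directly. Tokens issued this way are documented as valid for 10 minutes, so the client would need to request a fresh one periodically.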

Option 2: Azure Tenant MS STT with AAD
- Pros
  - Better security with AAD authentication: ad hoc app registration with the Cognitive Services User role (read-only) plus the STT endpoint
  - Direct streaming from browser to MS Speech
  - Lowest latency
  - Data processing isolation
  - Token exposure is a negligible risk given the token's limitations
- Cons
  - More complex AAD setup required
  - Access token is exposed to client
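
One possible shape for the AAD flow in Option 2 is sketched below, assuming the backend uses the OAuth 2.0 client-credentials grant for the app registration and forwards the resulting token to the browser. The tenant, client, and resource identifiers are placeholders, and the helper names are hypothetical. The `aad#<resourceId>#<token>` format is the one the Speech SDK documentation describes for Microsoft Entra (AAD) tokens.

```python
import urllib.parse
import urllib.request

# Scope used for Cognitive Services when requesting an AAD token.
SPEECH_SCOPE = "https://cognitiveservices.azure.com/.default"


def build_aad_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request.

    Assumption: the app registration holds the Cognitive Services User role
    on the Speech resource.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": SPEECH_SCOPE,
    }).encode("ascii")
    return urllib.request.Request(url, data=form, method="POST")


def compose_speech_auth_token(resource_id: str, aad_token: str) -> str:
    """Combine the Speech resource ID and AAD token into the SDK's expected form."""
    return f"aad#{resource_id}#{aad_token}"
```

The client secret stays on the backend in this sketch. The browser only ever sees the composed, time-limited token, which matches the "access token is exposed to client" trade-off listed above.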

Option 3: Backend Processing
- Pros
  - Credentials never leave backend
  - Complete control over audio processing
- Cons
  - Significantly higher latency (full file upload)
  - Unnecessary server load
  - More complex error handling
  - No real security benefit given token limitations
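
For comparison, a minimal sketch of the Option 3 path: the backend receives the complete recording and forwards it to the Speech-to-Text REST API for short audio. It assumes 16 kHz PCM WAV input and a resource key held on the server; the function name and parameters are illustrative only.

```python
import urllib.parse
import urllib.request


def build_short_audio_request(
    region: str,
    subscription_key: str,
    wav_bytes: bytes,
    language: str = "en-US",
) -> urllib.request.Request:
    """Build (but do not send) a recognition request for a fully uploaded file."""
    query = urllib.parse.urlencode({"language": language})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    return urllib.request.Request(
        url,
        data=wav_bytes,  # the whole file must be available before this call
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            "Accept": "application/json",
        },
        method="POST",
    )
```

Because the request body is the complete file, recognition cannot begin until the upload to the backend has finished, which is where the extra latency listed above comes from. The short-audio REST API is also documented as limited to about 60 seconds of audio per request.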

Recommendation

Based on the analysis, Option 2 provides the best balance between security and performance. It keeps real-time streaming directly from the browser to MS Speech while adding AAD authentication and data processing isolation.
