Voice Feature - architectural options
Option 1: Public API with Backend Token
Pros
Simplest implementation
Direct streaming from browser to MS Speech
Lowest latency
Token exposure is a negligible risk given the token's limitations
Cons
Uses the shared MS Cognitive Services rather than a dedicated Azure tenant
Access token is exposed to the client
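A minimal sketch of the backend half of Option 1, assuming the backend holds the Speech subscription key and exchanges it for a short-lived token that is handed to the browser. The endpoint path and the Ocp-Apim-Subscription-Key header follow the Azure Speech token endpoint as publicly documented and should be verified against the current documentation; the function names, the region value, and the injectable opener parameter are hypothetical and not part of the original design.

```python
from urllib import request


def build_issue_token_request(region: str, subscription_key: str) -> request.Request:
    """Build the POST request that exchanges the subscription key for a token.

    Assumption: the regional issueToken endpoint of the Speech service.
    """
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    return request.Request(
        url,
        data=b"",  # empty body; the key travels in the header only
        method="POST",
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
    )


def fetch_speech_token(region: str, subscription_key: str, opener=request.urlopen) -> str:
    """Return the token text; only this value is sent to the browser.

    `opener` is injectable so the function can be exercised without network access.
    """
    req = build_issue_token_request(region, subscription_key)
    with opener(req, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

The subscription key stays on the backend; the browser would pass the returned token and region to the Speech SDK (for example via SpeechConfig.fromAuthorizationToken) and stream audio directly.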
Option 2: Azure Tenant MS STT with AAD
Pros
Better security with AAD authentication: ad hoc app registration with the Cognitive Services User role (read-only) plus the STT endpoint
Direct streaming from browser to MS Speech
Lowest latency
Data processing isolation
Token exposure is a negligible risk given the token's limitations
Cons
More complex AAD setup required
Access token is exposed to the client
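As an illustration of what the client would receive under Option 2, the sketch below composes the authorization string in the "aad#<resource id>#<AAD access token>" form that the Speech SDK documentation describes for AAD authentication. It assumes the backend has already obtained an AAD access token for the app registration (typically for the scope https://cognitiveservices.azure.com/.default); how that token is acquired is out of scope here. The function name and the sample resource ID are hypothetical.

```python
# Assumed scope to request when acquiring the AAD token for Cognitive Services.
COGNITIVE_SERVICES_SCOPE = "https://cognitiveservices.azure.com/.default"


def build_aad_speech_token(resource_id: str, aad_access_token: str) -> str:
    """Combine the Speech resource ID and an AAD access token for the Speech SDK.

    resource_id: full Azure resource ID of the tenant's Speech resource.
    aad_access_token: token issued to the app registration holding the
    Cognitive Services User role.
    """
    if not resource_id or not aad_access_token:
        raise ValueError("resource_id and aad_access_token are both required")
    return f"aad#{resource_id}#{aad_access_token}"


# Usage with placeholder values (not real identifiers):
example = build_aad_speech_token(
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.CognitiveServices/accounts/<speech-resource>",
    "<aad-access-token>",
)
```

The resulting string is what reaches the browser, which matches the "Access token is exposed to the client" point above: the exposed value is a scoped AAD token, not a resource key.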
Option 3: Backend Processing
Pros
Credentials never leave backend
Complete control over audio processing
Cons
Significantly higher latency (full file upload)
Unnecessary server load
More complex error handling
No real security benefit given token limitations
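To make the latency point concrete, here is a rough back-of-the-envelope model of the delay that a full file upload adds before transcription can even start. All figures are hypothetical illustrations (uncompressed 16 kHz, 16-bit mono PCM and a 1 Mbit/s uplink), not measurements from this project, and the model ignores server-side processing and protocol overhead.

```python
def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def upload_delay_seconds(duration_s: float, audio_bitrate_bps: float, uplink_bps: float) -> float:
    """Time to upload a complete recording before backend processing can begin."""
    return duration_s * audio_bitrate_bps / uplink_bps


# Hypothetical example: a 30 s recording over a 1 Mbit/s uplink.
bitrate = pcm_bitrate_bps(16_000, 16, 1)               # 256000 bit/s
delay = upload_delay_seconds(30, bitrate, 1_000_000)   # about 7.68 s
```

With streaming (Options 1 and 2), audio is sent while the user is still speaking, so this upload delay does not accumulate at the end of the recording.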
Recommendation
Based on the analysis, Option 2 provides the best balance between security and performance. It allows for real-time streaming while maintaining proper security through AAD.
© 2024 Unique AG. All rights reserved.