Voice Feature - architectural options

 

 

[Figure: image-20250120-130452.png]

 

  1. Option 1: Public API with Backend Token

    • Pros

      • Simplest implementation

      • Direct streaming from browser to MS Speech

      • Lowest latency

      • Token exposure is a negligible risk given the token's limitations

    • Cons

      • Relies on the shared MS Cognitive Services offering

      • Access token is exposed to the client
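A minimal sketch of the backend side of Option 1, assuming the backend holds the Speech subscription key and exchanges it for a short-lived token through the Azure `issueToken` REST endpoint (Azure documents these tokens as valid for 10 minutes). The function names, the `FetchLike` type, and the region value are illustrative assumptions, not part of the original design.

```typescript
// Shape of the request sent to the Azure token endpoint.
interface TokenRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
}

// Minimal fetch-like signature so the sketch does not depend on a specific runtime.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> }
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Builds the issueToken request. The subscription key stays on the backend.
function buildIssueTokenRequest(region: string, subscriptionKey: string): TokenRequest {
  return {
    url: `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
    method: "POST",
    headers: { "Ocp-Apim-Subscription-Key": subscriptionKey },
  };
}

// Exchanges the subscription key for a short-lived token to hand to the browser.
async function issueSpeechToken(
  region: string,
  subscriptionKey: string,
  fetchFn: FetchLike
): Promise<string> {
  const req = buildIssueTokenRequest(region, subscriptionKey);
  const res = await fetchFn(req.url, { method: req.method, headers: req.headers });
  if (!res.ok) {
    throw new Error(`Token request failed with status ${res.status}`);
  }
  return res.text();
}
```

The browser would then pass the returned token to the Speech SDK (for example via `SpeechConfig.fromAuthorizationToken`) and stream audio directly to MS Speech, which is why only the token, never the key, reaches the client.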

 

  2. Option 2: Azure Tenant MS STT with AAD

    • Pros

      • Better security with AAD authentication: a dedicated app registration with the Cognitive Services User role (read-only) plus the STT endpoint

      • Direct streaming from browser to MS Speech

      • Lowest latency

      • Data processing isolation

      • Token exposure is a negligible risk given the token's limitations

    • Cons

      • More complex AAD setup required

      • Access token is exposed to the client
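A small sketch of how the AAD token in Option 2 might be handed to the Speech SDK. It assumes the token is acquired for the Cognitive Services scope by the app registration and that the Speech resource accepts the `aad#<resourceId>#<token>` authorization format described in the Speech SDK documentation. The function name and the resource ID check are illustrative assumptions.

```typescript
// Scope to request when acquiring the AAD access token for Cognitive Services.
const COGNITIVE_SERVICES_SCOPE = "https://cognitiveservices.azure.com/.default";

// Combines the Speech resource ID and the AAD access token into the
// authorization token string expected by the Speech SDK.
function buildAadSpeechToken(resourceId: string, aadAccessToken: string): string {
  // Assumption: resourceId is the full Azure resource ID of the Speech resource.
  if (!resourceId.startsWith("/subscriptions/")) {
    throw new Error("resourceId must be a full Azure resource ID");
  }
  if (aadAccessToken.length === 0) {
    throw new Error("aadAccessToken must not be empty");
  }
  return `aad#${resourceId}#${aadAccessToken}`;
}
```

Because the role assigned to the app registration is read-only, the token that reaches the browser can be used for recognition but not to manage the resource, which is the basis for treating its exposure as low risk.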

 

  3. Option 3: Backend Processing

    • Pros

      • Credentials never leave backend

      • Complete control over audio processing

    • Cons

      • Significantly higher latency (full file upload)

      • Unnecessary server load

      • More complex error handling

      • No real security benefit given the token's limitations
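For comparison, a sketch of what the backend in Option 3 might do once the browser has uploaded the complete audio file. It assumes the backend forwards the file to the Speech-to-text REST API for short audio, which Azure documents as limited to about 60 seconds per request. The function names and the `UploadFetch` type are illustrative assumptions.

```typescript
// Subset of the "simple" response format returned by the REST API.
interface SttResult {
  RecognitionStatus: string;
  DisplayText?: string;
}

// Minimal fetch-like signature for posting the uploaded audio bytes.
type UploadFetch = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: Uint8Array }
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Builds the recognition URL for a given region and language.
function buildRecognitionUrl(region: string, language: string): string {
  const base = `https://${region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1`;
  return `${base}?language=${encodeURIComponent(language)}&format=simple`;
}

// Pulls the transcript out of the JSON response, failing on any non-success status.
function extractTranscript(body: string): string {
  const result = JSON.parse(body) as SttResult;
  if (result.RecognitionStatus !== "Success" || result.DisplayText === undefined) {
    throw new Error(`Recognition failed: ${result.RecognitionStatus}`);
  }
  return result.DisplayText;
}

// Sends the fully uploaded WAV file to MS Speech. The key never leaves the backend.
async function transcribeUpload(
  region: string,
  subscriptionKey: string,
  wavBytes: Uint8Array,
  fetchFn: UploadFetch
): Promise<string> {
  const res = await fetchFn(buildRecognitionUrl(region, "en-US"), {
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": subscriptionKey,
      "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
      Accept: "application/json",
    },
    body: wavBytes,
  });
  if (!res.ok) {
    throw new Error(`Speech request failed with status ${res.status}`);
  }
  return extractTranscript(await res.text());
}
```

The sketch makes the latency cost visible: recognition cannot start until `wavBytes` has been fully received, and every upload and error path runs through the backend.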

Recommendation

Based on the analysis, Option 2 provides the best balance between security and performance. It allows real-time streaming directly from the browser while securing access through AAD authentication.

 

© 2024 Unique AG. All rights reserved.