Voice Feature - Architecture

This article outlines the architecture of a voice feature that integrates Microsoft Speech-to-Text services. The feature reuses the existing communication protocols and introduces no new security requirements.

System Architecture

User Interaction: The user speaks into the microphone, and the browser records the audio natively.
Communication Flow:
- The browser transmits the audio data to the backend over HTTPS and WebSocket.
- This reuses existing protocols, so no new communication patterns are introduced.
- WebSockets over HTTPS are already used in the browser to stream the LLM answers, so they are not a new requirement.
Backend Integration:
- The backend connects to Microsoft Speech-to-Text services for transcription. It authenticates through workload identity, in the same way as for other Microsoft services such as OpenAI.
- The service is deployed internally within the Azure subscription, maintaining security and efficiency.
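The relay role of the backend described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Transcriber` interface, `FakeTranscriber`, `relay_audio`, and `browser_stream` are hypothetical names, the chunk sizes are arbitrary, and authentication through workload identity is left out so that the sketch runs without Azure access.

```python
import asyncio
from typing import AsyncIterator, Protocol


class Transcriber(Protocol):
    """Hypothetical interface wrapping the Speech-to-Text client."""

    async def push_audio(self, chunk: bytes) -> None: ...

    async def finish(self) -> str: ...


class FakeTranscriber:
    """Stand-in for the real service so the sketch runs locally."""

    def __init__(self) -> None:
        self.received = 0

    async def push_audio(self, chunk: bytes) -> None:
        # A real client would forward the bytes to Speech-to-Text here.
        self.received += len(chunk)

    async def finish(self) -> str:
        return f"<transcript of {self.received} bytes>"


async def relay_audio(chunks: AsyncIterator[bytes], transcriber: Transcriber) -> str:
    """Forward each audio chunk from the browser to the transcriber, then return the text."""
    async for chunk in chunks:
        if chunk:  # skip empty frames
            await transcriber.push_audio(chunk)
    return await transcriber.finish()


async def browser_stream() -> AsyncIterator[bytes]:
    """Simulates the WebSocket frames the browser would send."""
    for chunk in (b"\x00" * 320, b"", b"\x01" * 320):
        yield chunk


def run_demo() -> str:
    return asyncio.run(relay_audio(browser_stream(), FakeTranscriber()))
```

Keeping the transcription client behind a small interface means the browser only ever talks to the backend, which matches the point that the Microsoft service itself is not exposed.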
Security and Protocols:
- No new ports need to be opened, and security requirements remain unchanged.
- WebSocket is already utilized for communication with large language models, ensuring consistency.
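To illustrate why no new port is needed: a secure WebSocket (`wss://`) uses the same default port as HTTPS, 443. The sketch below derives a streaming URL from the application's HTTPS origin and compares the ports. The route `/voice/stream` and the host `app.example.com` are assumptions made for the example, not the real endpoint.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the schemes involved; HTTPS and WSS share 443.
DEFAULT_PORTS = {"https": 443, "wss": 443}


def websocket_url(https_origin: str, path: str = "/voice/stream") -> str:
    """Derive the wss:// URL for audio streaming from the app's HTTPS origin.

    The default path is a hypothetical route used for illustration only.
    """
    parts = urlsplit(https_origin)
    if parts.scheme != "https":
        raise ValueError("voice streaming is only offered over HTTPS/WSS")
    return urlunsplit(("wss", parts.netloc, path, "", ""))


def effective_port(url: str) -> int:
    """Return the TCP port a URL resolves to, falling back to the scheme default."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS[parts.scheme]
```

For example, `websocket_url("https://app.example.com")` yields `wss://app.example.com/voice/stream`, and `effective_port` returns 443 for both the origin and the derived URL.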

What's new:

Deployment of Microsoft Speech-to-Text.
Inside Unique's backend, a new microservice handles the communication.

What's not changing:

No new ports or communication patterns beyond those already in use.
No internal Microsoft services are directly exposed.

Author	@Andreas Hauri