Voice Infrastructure
End-user documentation can be found in Voice.
Technical product administration is covered in Voice Administration.
This feature (even though implemented with generic WebSockets) is only supported and available for clients with access to Azure AI Speech Services.
Architecture Overview
System Context
The diagram illustrates the system context for Unique AI Voice, highlighting the high-level interactions between a human end user and the underlying software systems. The end user speaks into a microphone, and this audio is streamed to the Unique AI system via secure HTTPS or WSS protocols. Unique AI, which provides the voice features of a chat application, acts as an intermediary that processes the incoming audio and forwards it to the Azure Speech to Text service. This service, part of Microsoft Azure’s AI Speech Services, converts the audio stream into text and sends the transcribed output back to Unique AI. Finally, the transcribed text is delivered from Unique AI to the end user. The entire system operates within a secure communication framework and relies on Azure’s speech recognition capabilities to power voice interactions.
Container Overview
This container diagram for Unique AI Voice depicts how audio input from an end user is processed to generate text. The end user speaks into a microphone, and the audio is streamed securely via HTTPS or WSS to the Unique AI Chat interface, a React-based frontend with voice input capabilities. This frontend communicates with a Speech backend service built using NodeJS/NestJS, which authenticates with Azure through Workload Identity (knowing only the URL, never any credentials) and manages connectivity to Azure's speech models. The Speech service retrieves an endpoint URL from Azure Key Vault, which stores configuration values for end-to-end automation (but not secrets), and streams the audio to Azure Speech to Text, a cloud-based service that transcribes the audio and returns the resulting text. The transcribed text is then routed back through the Speech service to the frontend and rendered into the prompt input field for the user. All communication between components uses secure streaming protocols, ensuring a seamless and protected voice-to-text experience.
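For orientation, the following is a minimal browser-side sketch of this flow in TypeScript. The endpoint URL and the message shape are illustrative assumptions; the actual contract is defined by the Speech backend service, and the frontend implementation may differ.

```typescript
// Minimal sketch: capture microphone audio in the browser and stream it
// over a secure WebSocket, logging transcript text as it arrives.
// The URL and message shape are illustrative assumptions, not the
// product's actual contract.
async function streamMicrophone(wsUrl: string): Promise<void> {
  // Prompts the user for microphone access.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const socket = new WebSocket(wsUrl);

  socket.onopen = () => {
    const recorder = new MediaRecorder(stream);
    recorder.ondataavailable = (event) => {
      // Forward each audio chunk to the Speech backend.
      if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
        socket.send(event.data);
      }
    };
    recorder.start(250); // emit a chunk roughly every 250 ms
  };

  // Assumed: the backend replies with plain-text transcript fragments.
  socket.onmessage = (event) => console.log("Transcript:", event.data);

  // Release the microphone when the connection ends (also saves cost).
  socket.onclose = () => stream.getTracks().forEach((track) => track.stop());
}
```

Note how the microphone is released when the socket closes; as explained under Budget planning below, an open, streaming microphone continues to incur Azure charges.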
Connectivity remarks
The Voice feature never interacts with the rest of the Unique AI backend services; it purely transforms audio from the browser into text. The resulting text is placed into the prompt field, where the user can edit it before sending.
Depending on the client's network setup, the traffic is either routed via the Internet (discouraged) or internally through a Virtual Network using Azure Private DNS, Private Endpoints, and Private Links (recommended).
Planning Instructions
The feature relies on Azure Speech Service. You will have to provide a Speech Endpoint that the Unique AI Voice backend service can connect to.
Depending on your company's policies, this can involve considerations across all OSI network layers.
Unique offers a Terraform module (azure-speech-service), including a deployment example, which you can either use directly or treat as inspiration for deploying your own resources.
Your considerations should include:
- Should the speech account and deployment use private endpoints? Unique recommends: Yes.
- Should your deployment use a custom subdomain name? Unique recommends: Yes.
- Do you want to enable diagnostic and audit settings? Unique recommends: Yes.
- Do you want to make implicit role assignments for the workload identity? Unique recommends: No; role assignments are privileged actions and should be made explicitly in a separately governed body.
You will find all of these options described in the module's README.
Once all decisions have been made and documented internally, proceed.
Budget planning
Microsoft Azure Speech-to-Text is billed based on the duration of audio processed. The rate is CHF 0.298 per hour, which breaks down as:
CHF 0.00497 per minute
CHF 0.0000828 per second
Linear Scaling
Costs scale linearly with time. For example (see the sketch after this list):
10 minutes = CHF 0.0497
100 minutes = CHF 0.497
1,000 minutes = CHF 4.97
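These figures follow directly from dividing the hourly rate by 60; as a sanity check, a minimal TypeScript helper reproduces them:

```typescript
// Azure Speech-to-Text cost estimate based on the CHF 0.298/hour rate above.
const CHF_PER_HOUR = 0.298;

function estimateCostChf(minutes: number): number {
  return (CHF_PER_HOUR / 60) * minutes;
}

console.log(estimateCostChf(10));   // ~0.0497
console.log(estimateCostChf(100));  // ~0.497
console.log(estimateCostChf(1000)); // ~4.97
```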
Important Note
Charges apply whenever the microphone is open and streaming audio to Azure, even during silence. To avoid unnecessary costs, ensure the mic is only active when needed.
For more details, refer to the official Microsoft pricing table.
Provisioning Instructions
In code and images, the Voice feature is referred to as speech.
Prerequisites
Add a new workload identity alongside your existing ones, matching the namespace and service account name of the service to deploy (by default, the service account name matches the Helm release name).
Terraform modules
Add the module to your Terraform configuration and adapt the example values as well as the backing resources.
By preference, the module stores the endpoint to connect to in the given Key Vault.
A reference deployment can be seen in https://github.com/Unique-AG/hello-azure/blob/0a6c899136bb8b99ced8964219780a5a18dbe8d9/terraform-modules/workloads/openai.tf#L62. Make sure you adapt it to your design decisions above regarding networking and security.
The available regions for fast transcription are listed at https://learn.microsoft.com/en-us/azure/ai-services/speech-service/regions?tabs=geographies.
* In some unsupported regions, the transcription result may still be returned correctly, but at slower speed.
Assign role
Once both the Azure Speech resources and the Workload Identity are known, ensure within your setup that a matching RBAC role assignment is made that grants the Client ID the Cognitive Services User role on either the resource or any ancestor resource.
If your Terraform setup does not feature a place to perform the assignment, you can leverage the module's built-in role assignment variable as well (example).
Compare your setup with the full example to cross-check the implementation.
Deploy the speech service
The complete release pull request can be seen in https://github.com/Unique-AG/hello-azure/pull/101/files.
Depending on your setup, you must add one additional deployment:
- Set environment-specific values
- Use the workload identity created with the role assignment
- Load the correct endpoint value from the Key Vault
- Extend the chat app configuration to feature the backend service
Resource Calculation Example
Resource Tuning Table
| Concurrent Users | Recommended Pods | Pod CPU/Memory |
|---|---|---|
| 10 | 1 | 100–200m / 300–400Mi |
| 100 | 1 | 500m / 512Mi |
| 1000 | 10 | 1000m / 1Gi each |
| 4000 | 40 | 1000m / 1Gi each |
You can tune the per-pod resource size up or down based on your load testing.
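For loads between the listed rows, a rough interpolation can serve as a starting point. The sketch below assumes roughly 100 concurrent users per 1000m / 1Gi pod, as the larger rows of the table suggest; treat it as an estimate to validate with load testing, not a sizing guarantee.

```typescript
// Rough pod-count estimate derived from the tuning table above:
// at scale, one 1000m / 1Gi pod serves ~100 concurrent users.
const USERS_PER_POD = 100; // assumption taken from the 1000/10 and 4000/40 rows

function recommendedPods(concurrentUsers: number): number {
  return Math.max(1, Math.ceil(concurrentUsers / USERS_PER_POD));
}

console.log(recommendedPods(10));   // 1
console.log(recommendedPods(1000)); // 10
console.log(recommendedPods(4000)); // 40
```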
Environment variables and secrets
Chat Frontend
| Variable | Description |
|---|---|
| SPEECH_BACKEND_API_URL | (Required) The full WebSocket URL the client uses to connect to the transcription backend. |
Format: `<protocol>://<service_endpoint>/ws/<version>/<namespace>`

Example: `wss://api.domain.com/speech/ws/v1/speech-to-text`

Breakdown:
- `wss://` – secure WebSocket protocol
- `api.domain.com/speech` – your service domain and optional base path
- `/ws` – path defined in the Socket adaptor (`path: '/ws'`)
- `/v1/speech-to-text` – API version and logical namespace for the transcription route

Notes:
- `protocol` should be `wss` (WebSocket Secure) in production environments.
- `<version>` allows for API versioning (e.g., `v1`).
- `<namespace>` corresponds to the logical route configured for speech (e.g., `speech-to-text`).
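For illustration, a small TypeScript helper can compose the URL from its documented parts. The parameter names and defaults here are assumptions for clarity, not part of the product:

```typescript
// Compose SPEECH_BACKEND_API_URL from its documented parts.
// All parameter names are illustrative; only the resulting shape matters.
function buildSpeechBackendApiUrl(
  serviceEndpoint: string,               // e.g. "api.domain.com/speech"
  version: string = "v1",                // API version segment
  namespace: string = "speech-to-text",  // logical transcription route
  protocol: "wss" | "ws" = "wss",        // use wss in production
): string {
  return `${protocol}://${serviceEndpoint}/ws/${version}/${namespace}`;
}

// Reproduces the documented example:
console.log(buildSpeechBackendApiUrl("api.domain.com/speech"));
// -> wss://api.domain.com/speech/ws/v1/speech-to-text
```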
Speech
Depending on the choices made above, different variables can be set on the speech backend service.
If using regional endpoint
| Variable | Description |
|---|---|
| SPEECH_SERVICE_RESOURCE_ID | (Required) Either set it manually as code or populate it automatically using a secrets provider and the mentioned Azure Key Vault. |
| SPEECH_SERVICE_REGION | (Optional) Pass it to overwrite the default region. |
If using Private endpoint
| Variable | Description |
|---|---|
| SPEECH_PRIVATE_ENDPOINT_HOST | (Required) Either set it manually as code or populate it automatically using a secrets provider and the mentioned Azure Key Vault. |
Other environment variables
| Variable | Description |
|---|---|
| LOG_LEVEL | (Optional) Sets the logging verbosity (e.g., debug, info, warn, error). |
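To illustrate how these variables could interact, the following hedged TypeScript sketch derives connection settings from them. The variable names come from the tables above, but the selection logic and the fallback log level are assumptions, not the service's actual implementation:

```typescript
// Illustrative only: how the speech backend could derive its connection
// settings from the documented environment variables. The actual service
// logic may differ; this merely mirrors the tables above.
interface SpeechConfig {
  mode: "private" | "regional";
  target: string;
  logLevel: string;
}

function loadSpeechConfig(env: NodeJS.ProcessEnv = process.env): SpeechConfig {
  const logLevel = env.LOG_LEVEL ?? "info"; // assumed fallback, for illustration
  if (env.SPEECH_PRIVATE_ENDPOINT_HOST) {
    // Assumption: a configured private endpoint takes precedence.
    return { mode: "private", target: env.SPEECH_PRIVATE_ENDPOINT_HOST, logLevel };
  }
  if (env.SPEECH_SERVICE_RESOURCE_ID) {
    return { mode: "regional", target: env.SPEECH_SERVICE_RESOURCE_ID, logLevel };
  }
  throw new Error("Set SPEECH_PRIVATE_ENDPOINT_HOST or SPEECH_SERVICE_RESOURCE_ID");
}
```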
Operational Guide
Authentication methods
No authentication methods other than Azure Entra Workload Identity are supported.
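For context, a NodeJS workload running under Azure Entra Workload Identity typically obtains tokens through the @azure/identity library without any stored secret. The sketch below shows the common pattern; whether the Speech service uses this exact call is an assumption:

```typescript
import { DefaultAzureCredential } from "@azure/identity";

// Under Azure Entra Workload Identity, the pod's federated service account
// lets DefaultAzureCredential obtain tokens without any stored secret.
// The Cognitive Services scope below is the standard one for Speech.
async function getSpeechAccessToken(): Promise<string> {
  const credential = new DefaultAzureCredential();
  const accessToken = await credential.getToken(
    "https://cognitiveservices.azure.com/.default",
  );
  return accessToken.token;
}
```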
Troubleshooting
If your speech recognition feature isn’t working, here are some steps you can follow to investigate and resolve the issue:
Check Your Internet Connection
Make sure your device is connected to the internet.
Try opening a few websites to confirm your connection is stable.
Test with Other Applications
See if your microphone works with other apps (like voice recording or video chat). This helps rule out hardware problems.
Browser or Device Permissions
Ensure your browser or device has permission to access your microphone.
You may see a prompt asking for microphone access—make sure to allow it.
Custom Security Software and Firewalls
If your company uses special security software, such as custom SSL interceptors or advanced firewalls, these can sometimes block speech services:
SSL Interceptors: These tools inspect secure traffic and can sometimes block or interfere with the secure connection needed for speech recognition.
Firewalls: Make sure the firewall allows connections to Microsoft’s speech services. The service may require outbound connections to specific domains or IPs.
Ask your IT team if any security software or settings could be interfering.
Try a Different Network
If possible, connect to a different Wi-Fi or mobile network. Some office or corporate networks may block certain services.
Check for Error Messages
If there are any error messages on the screen or in the browser console, take note of them. These can help technical support identify the problem.
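When reporting a problem, WebSocket close codes and reasons are particularly helpful. A small illustrative snippet like this (not part of the product) can surface them in the browser console:

```typescript
// Attach diagnostic listeners to the speech WebSocket to capture details
// for support. Close codes (e.g. 1006, abnormal closure) often hint at
// proxies or SSL interceptors terminating the connection.
function attachDiagnostics(socket: WebSocket): void {
  socket.addEventListener("error", (event) => {
    console.error("Speech socket error:", event);
  });
  socket.addEventListener("close", (event) => {
    console.warn(`Speech socket closed: code=${event.code} reason="${event.reason}"`);
  });
}
```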
Update Your Software
Make sure your browser and operating system are up to date.
Upgrade/Migration Notes
Since this is an initial installation, no migration or upgrade steps are currently needed.
Author: PTFCORE