Voice Infrastructure

End user documentation can be found in Voice.

Technical product administration is documented in Voice Administration.

This feature (even though it is implemented with generic WebSockets) is only supported and available for clients with access to Azure AI Speech Services.

Architecture Overview

System Context

The diagram illustrates the system context for Unique AI Voice, highlighting the high-level interactions between a human end user and the underlying software systems. The end user speaks into a microphone, and this audio is streamed to the Unique AI system via secure HTTPS or WSS protocols. Unique AI, which provides the voice features of a chat application, acts as an intermediary that processes the incoming audio and forwards it to the Azure Speech to Text service. This service, part of Microsoft Azure’s AI Speech Services, converts the audio stream into text and sends the transcribed output back to Unique AI. Finally, the transcribed text is delivered from Unique AI to the end user. The entire system operates within a secure communication framework and relies on Azure’s speech recognition capabilities to power voice interactions.

Container Overview

This container diagram for Unique AI Voice depicts how audio input from an end user is processed to generate text. The end user speaks into a microphone, and the audio is streamed securely via HTTPS or WSS to the Unique AI Chat interface, a React-based frontend with voice input capabilities. This frontend communicates with a Speech backend service built using NodeJS/NestJS, which authenticates with Azure through Workload Identity (only knowing the URL, no credentials) and manages connectivity to Azure’s speech models. The Speech service retrieves an endpoint URL from Azure Key Vault, which stores configuration values for end-to-end automation (but not secrets), and streams the audio to Azure Speech to Text, a cloud-based service that transcribes the audio and returns the resulting text. The transcribed text is then routed back through the Speech service to the frontend, where it is rendered into the prompt input field. All communication between components uses secure streaming protocols, ensuring a seamless and protected voice-to-text experience.
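For orientation, a minimal sketch of the browser-side flow is shown below; the endpoint URL, message framing, and chunking interval are illustrative assumptions, not the actual Unique AI Chat implementation.

```typescript
// Minimal sketch: capture microphone audio and stream it over WSS.
// The endpoint URL and binary framing are illustrative assumptions.
async function streamMicrophone(endpoint: string): Promise<() => void> {
  // Ask the browser for microphone access (triggers the permission prompt).
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Open a secure WebSocket to the speech backend.
  const socket = new WebSocket(endpoint);

  // Chunk the microphone input; each chunk is forwarded as a binary frame.
  const recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (event: BlobEvent) => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(event.data);
    }
  };

  // The backend returns transcribed text, which the UI would place into
  // the prompt input field.
  socket.onmessage = (event: MessageEvent<string>) => {
    console.log('transcript:', event.data);
  };

  socket.onopen = () => recorder.start(250); // emit a chunk every 250 ms

  // Stop function: closing the microphone also stops streaming (and billing).
  return () => {
    recorder.stop();
    stream.getTracks().forEach((track) => track.stop());
    socket.close();
  };
}
```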

Connectivity remarks

  • The Voice feature never interacts with the rest of the Unique AI backend services; it purely transforms audio from the browser into text and streams the result back. The resulting text is placed into the prompt field, where the user can edit it before sending.

  • Depending on the client’s network setup, traffic is routed either via the Internet (discouraged) or internally within a Virtual Network via Azure Private DNS, Private Endpoints, and Private Links (recommended).

Planning Instructions

The feature relies on Azure Speech Service. You will have to provide a Speech Endpoint that the Unique AI Voice backend service can connect to.

Depending on your company’s policies, this can involve considerations across all OSI network layers.

Unique offers a terraform module (azure-speech-service) including a deployment example, which you can either use directly or treat as inspiration for deploying your own resources.

Your considerations should include:

  • Should the speech account and deployment use private endpoints?

    • Unique recommends: Yes.

  • Would your deployment use a custom subdomain name?

    • Unique recommends: Yes.

  • Would you want to enable diagnostic and audit settings?

    • Unique recommends: Yes.

  • Would you like to make implicit role assignments for the workload identity?

    • Unique recommends: No. Role assignments are privileged actions and should be made explicitly under separate governance.

You can find all of these options and their descriptions in the README.

Once all decisions have been made and documented internally, proceed.

Budget planning

Microsoft Azure Speech-to-Text is billed based on the duration of audio processed. The rate is CHF 0.298 per hour, which breaks down as:

  • CHF 0.00497 per minute

  • CHF 0.0000828 per second

Linear Scaling
Costs scale linearly with time. For example:

  • 10 minutes = CHF 0.0497

  • 100 minutes = CHF 0.497

  • 1,000 minutes = CHF 4.97
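As a quick sanity check on these figures, the rate can be applied directly; a tiny helper like the following (illustrative only) reproduces the numbers above from the CHF 0.298 per hour rate:

```typescript
// Estimate Azure Speech-to-Text cost from the streamed audio duration,
// using the CHF 0.298 per hour rate quoted above.
const RATE_CHF_PER_HOUR = 0.298;

function estimateCostChf(minutes: number): number {
  return (minutes / 60) * RATE_CHF_PER_HOUR;
}

console.log(estimateCostChf(10)); // ≈ CHF 0.0497
console.log(estimateCostChf(100)); // ≈ CHF 0.497
console.log(estimateCostChf(1000)); // ≈ CHF 4.97
```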

Important Note
Charges apply whenever the microphone is open and streaming audio to Azure, even during silence. To avoid unnecessary costs, ensure the mic is only active when needed.

For more details, refer to the official Microsoft pricing table.

Provisioning Instructions

In code and images, the Voice feature is referred to as speech.

Pre-requisites

Add a new workload identity to your existing ones, matching the namespace and service account name of the service to deploy (by default, the service account name matches the Helm release name).

Terraform modules

Add the module to your terraform configuration and adapt the example values as well as backing resources.

By default, the module stores the endpoint to connect to in the given Key Vault.

A reference deployment can be seen in https://github.com/Unique-AG/hello-azure/blob/0a6c899136bb8b99ced8964219780a5a18dbe8d9/terraform-modules/workloads/openai.tf#L62. Make sure you adapt it to your design decisions above regarding networking and security.

Available regions for fast transcription are listed at https://learn.microsoft.com/en-us/azure/ai-services/speech-service/regions?tabs=geographies. Note that for some unsupported regions the service may still return the transcription result correctly, but at slower speed.

Assign role

Once both the Azure Speech resources and the Workload Identity are known, ensure that your setup makes a matching RBAC role assignment granting the Client ID the Cognitive Services User role on either the resource or any ancestor scope.

If your terraform setup does not feature a place to perform the assignment, you can leverage the module’s built-in role assignment variable as well (example).

Compare your setup with the full example to cross-check the implementation.

Deploy the speech service

The complete release pull request can be seen in https://github.com/Unique-AG/hello-azure/pull/101/files.

Depending on your setup, you must add one additional deployment:

  1. Define defaults

  2. Set environment specific values

    1. Use the workload identity created with the role assignment

    2. Load the correct endpoint value from the Key Vault (see the sketch after this list)

  3. Add the deployment

  4. Extend the chat app configuration to feature the backend service
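For step 2.2, the endpoint value can be read from the Key Vault using the official Azure SDK; the sketch below (with an assumed vault URL and secret name, adapt them to your module outputs) shows one way to do this under Workload Identity:

```typescript
import { DefaultAzureCredential } from '@azure/identity';
import { SecretClient } from '@azure/keyvault-secrets';

// Vault URL and secret name are illustrative assumptions; use the values
// produced by your terraform module.
const vaultUrl = 'https://my-keyvault.vault.azure.net';
const secretName = 'speech-endpoint';

async function loadSpeechEndpoint(): Promise<string> {
  // With Workload Identity configured, DefaultAzureCredential picks up the
  // federated token automatically; no client secret is involved.
  const credential = new DefaultAzureCredential();
  const client = new SecretClient(vaultUrl, credential);

  const secret = await client.getSecret(secretName);
  if (!secret.value) {
    throw new Error(`Secret ${secretName} has no value`);
  }
  return secret.value;
}
```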

Resource Calculation Example

Resource Tuning Table

| Concurrent Users | Recommended Pods | Pod CPU/Memory |
|---|---|---|
| 10 | 1 | 100–200m / 300–400Mi |
| 100 | 1 | 500m / 512Mi |
| 1000 | 10 | 1000m / 1Gi each |
| 4000 | 40 | 1000m / 1Gi each |

You can tune the per-pod resource size up or down based on your load testing.
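The table implies roughly one 1000m / 1Gi pod per 100 concurrent users at scale. A rough sizing helper derived from it (an approximation for planning, not an official formula) could look like this:

```typescript
// Rough pod-count estimate derived from the tuning table above:
// roughly one 1000m / 1Gi pod per 100 concurrent users at scale.
// This is an approximation for planning; validate with load testing.
function recommendedPods(concurrentUsers: number): number {
  return Math.max(1, Math.ceil(concurrentUsers / 100));
}

console.log(recommendedPods(10)); // 1
console.log(recommendedPods(100)); // 1
console.log(recommendedPods(1000)); // 10
console.log(recommendedPods(4000)); // 40
```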

Environment variables and secrets

Chat Frontend

| Variable | Description |
|---|---|
| SPEECH_BACKEND_API_URL | (Required) The full WebSocket URL the client uses to connect to the transcription backend. |

Format:

<protocol>://<service_endpoint>/ws/<version>/<namespace>

Example:

wss://api.domain.com/speech/ws/v1/speech-to-text

Breakdown:

  • wss:// — secure WebSocket protocol

  • api.domain.com/speech — your service domain and optional base path

  • /ws — path defined in the Socket adapter (path: '/ws')

  • /v1/speech-to-text — API version and logical namespace for the transcription route

Notes:

  • The protocol should be wss (WebSocket Secure) in production environments.

  • <version> allows for API versioning (e.g., v1).

  • <namespace> corresponds to the logical route configured for speech (e.g., speech-to-text).
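Putting the format together, a client could assemble the URL from its parts as sketched below (host and base path values are illustrative):

```typescript
// Assemble SPEECH_BACKEND_API_URL following the
// <protocol>://<service_endpoint>/ws/<version>/<namespace> format above.
// The host and base path are illustrative.
function buildSpeechBackendUrl(
  serviceEndpoint: string, // e.g. 'api.domain.com/speech'
  version = 'v1',
  namespace = 'speech-to-text',
): string {
  return `wss://${serviceEndpoint}/ws/${version}/${namespace}`;
}

console.log(buildSpeechBackendUrl('api.domain.com/speech'));
// wss://api.domain.com/speech/ws/v1/speech-to-text
```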

Speech

Depending on the choices made above, different variables can be set on the speech backend service.

If using a regional endpoint

| Variable | Description |
|---|---|
| SPEECH_SERVICE_RESOURCE_ID | (Required) Either set it manually as code or populate it automatically using a secrets provider and the mentioned Azure Key Vault. |
| SPEECH_SERVICE_REGION | (Optional) Pass it to override the default region. Defaults to switzerlandnorth. |

If using a private endpoint

| Variable | Description |
|---|---|
| SPEECH_PRIVATE_ENDPOINT_HOST | (Required) Either set it manually as code or populate it automatically using a secrets provider and the mentioned Azure Key Vault. |
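A sketch of how the backend could resolve these options at startup is shown below; the variable names come from the tables above, while the precedence logic (private endpoint over regional) is an illustrative assumption:

```typescript
// Resolve speech connection settings from the environment. Variable names
// come from the tables above; the precedence (private endpoint wins over
// the regional endpoint) is an illustrative assumption.
interface SpeechConfig {
  mode: 'private-endpoint' | 'regional';
  host?: string;
  resourceId?: string;
  region: string;
}

function resolveSpeechConfig(env = process.env): SpeechConfig {
  const region = env.SPEECH_SERVICE_REGION ?? 'switzerlandnorth';

  if (env.SPEECH_PRIVATE_ENDPOINT_HOST) {
    return { mode: 'private-endpoint', host: env.SPEECH_PRIVATE_ENDPOINT_HOST, region };
  }
  if (env.SPEECH_SERVICE_RESOURCE_ID) {
    return { mode: 'regional', resourceId: env.SPEECH_SERVICE_RESOURCE_ID, region };
  }
  throw new Error('Set SPEECH_PRIVATE_ENDPOINT_HOST or SPEECH_SERVICE_RESOURCE_ID');
}
```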

Other environment variables

| Variable | Description |
|---|---|
| LOG_LEVEL | (Optional) Sets the logging verbosity (e.g., debug, info, warn, error). Defaults to info. |

Debugging WebSocket connectivity issues is cumbersome and tricky. To debug them, you can enable the following flags. They log sensitive and classified data and are therefore only recommended on test or development setups.

| Variable | Description |
|---|---|
| ENABLE_LOG_TRAFFIC_CID | (Optional) Enables logging of WebSocket traffic per client ID for debugging purposes. |
| MICROSOFT_SDK_DEBUG_MODE_ENABLED_CID | (Optional) Enables detailed debug mode of the Microsoft SDK for a specific client ID. |

Operational Guide

Authentication methods

Azure Entra Workload Identity is the only supported authentication method.
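For orientation, acquiring a token under Workload Identity with the official @azure/identity package looks roughly like the sketch below; the scope shown is the standard Cognitive Services scope, and the snippet is an illustration, not the service’s actual code:

```typescript
import { DefaultAzureCredential } from '@azure/identity';

// Under Workload Identity, DefaultAzureCredential exchanges the federated
// service-account token for an Entra access token; no client secret is
// stored anywhere.
async function getSpeechAccessToken(): Promise<string> {
  const credential = new DefaultAzureCredential();
  const accessToken = await credential.getToken(
    'https://cognitiveservices.azure.com/.default',
  );
  return accessToken.token;
}
```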

Troubleshooting

If your speech recognition feature isn’t working, here are some steps you can follow to investigate and resolve the issue:

Check Your Internet Connection

  • Make sure your device is connected to the internet.

  • Try opening a few websites to confirm your connection is stable.

Test with Other Applications

  • See if your microphone works with other apps (like voice recording or video chat). This helps rule out hardware problems.

Browser or Device Permissions

  • Ensure your browser or device has permission to access your microphone.

  • You may see a prompt asking for microphone access—make sure to allow it.

Custom Security Software and Firewalls

If your company uses special security software, such as custom SSL interceptors or advanced firewalls, these can sometimes block speech services:

  • SSL Interceptors: These tools inspect secure traffic and can sometimes block or interfere with the secure connection needed for speech recognition.

  • Firewalls: Make sure the firewall allows connections to Microsoft’s speech services. The service may require outbound connections to specific domains or IPs.

  • Ask your IT team if any security software or settings could be interfering.

Try a Different Network

If possible, connect to a different Wi-Fi or mobile network. Some office or corporate networks may block certain services.

Check for Error Messages

If there are any error messages on the screen or in the browser console, take note of them. These can help technical support identify the problem.
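When reproducing the issue, a snippet like the following (illustrative; run it in the browser console with your actual SPEECH_BACKEND_API_URL) surfaces the close code and reason, which are valuable for support:

```typescript
// Illustrative: surface WebSocket failures in the browser console so the
// close code and reason can be shared with technical support.
const ws = new WebSocket('wss://api.domain.com/speech/ws/v1/speech-to-text');

ws.onerror = (event) => console.error('WebSocket error event:', event);
ws.onclose = (event) =>
  console.warn(`WebSocket closed: code=${event.code} reason=${event.reason}`);
```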

Update Your Software

Make sure your browser and operating system are up to date.

Upgrade/Migration Notes

Since this is an initial installation, no migration or upgrade steps are currently needed.

 


Author

PTFCORE

 
