Document Translator Service

The document translator translates documents from one language to another. Currently, the following file formats are supported:

Microsoft Word (.docx)
Microsoft Excel (.xlsx)

Additionally, a glossary can be configured via the GlossaryService, and different post-processors can be applied via a TextPipeLine.

Configuration

The document translator service has two configurations: one for the prompts and one for the settings.

Default Settings

{
    "languageModelName": "AZURE_GPT_4_0613",
    "maxTokensPerTranlationRequest": 1000,
    "maxTokenPerMinute": 40000,
    "allowedInputLanguages": [
            "Afrikaans", "Albanian", "Arabic", "Aragonese", "Armenian", "Azeri", "Bashkir",
            "Basque", "Belarusian", "Bengali", "Bislama", "Bosnian", "Breton", "Bulgarian",
            "Burmese", "Catalan", "Chamorro", "Chechen", "Chinese", "Cornish", "Corsican",
            "Croatian", "Czech", "Danish", "Dutch", "English", "Esperanto", "Estonian", "Ewe",
            "Faroese", "Fijian", "Finnish", "French", "Galician", "Georgian", "German", "Greek",
            "Greenlandic", "Guaraní", "Haitian Creole", "Hausa", "Hebrew", "Hindi", "Hungarian",
            "Icelandic", "Ido", "Indonesian", "Interlingua", "Interlingue", "Inuktitut", "Irish",
            "Italian", "Japanese", "Javanese", "Kannada", "Kazakh", "Khmer", "Korean", "Kurdish",
            "Kyrgyz", "Lao", "Latin", "Latvian", "Limburgish", "Lingala", "Lithuanian", "Luxembourgish",
            "Macedonian", "Malagasy", "Malay", "Malayalam", "Maltese", "Manx", "Maori", "Marathi",
            "Marshallese", "Mongolian", "Navajo", "Nepali", "Northern Sami", "Norwegian", "Norwegian Bokmål",
            "Norwegian Nynorsk", "Occitan", "Ojibwe", "Old Church Slavonic", "Ossetian", "Pashto", "Persian",
            "Polish", "Portuguese", "Punjabi", "Quechua", "Romanian", "Romansch", "Russian", "Samoan", "Sanskrit",
            "Sardinian", "Scottish Gaelic", "Serbian", "Serbo-Croatian", "Sichuan Yi", "Sindhi", "Slovak",
            "Slovene", "Somali", "Spanish", "Sundanese", "Swahili", "Swedish", "Tagalog", "Tahitian", "Tajik",
            "Tamil", "Tatar", "Telugu", "Thai", "Tibetan", "Tongan", "Tswana", "Turkish", "Turkmen", "Ukrainian",
            "Urdu", "Uyghur", "Uzbek", "Vietnamese", "Volapük", "Walloon", "Welsh", "West Frisian", "Yiddish",
            "Yoruba", "Zhuang", "Zulu"]
}
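
Before deploying a customized settings object, it can help to sanity-check it. The following is a minimal sketch, not part of the service: the function `validate_settings` is hypothetical, and the rule that a single request must fit into the per-minute budget is an assumption derived from the parameter names rather than a documented constraint.

```python
import json


def validate_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    name = settings.get("languageModelName")
    if not isinstance(name, str) or not name:
        problems.append("languageModelName must be a non-empty string")

    # Key names are copied verbatim from the default settings shown above.
    token_keys = ("maxTokensPerTranlationRequest", "maxTokenPerMinute")
    for key in token_keys:
        value = settings.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")

    if not problems:
        # Assumption: one request should never exceed the per-minute budget.
        if settings[token_keys[0]] > settings[token_keys[1]]:
            problems.append(f"{token_keys[0]} exceeds {token_keys[1]}")

    languages = settings.get("allowedInputLanguages")
    if not isinstance(languages, list) or not all(
        isinstance(lang, str) for lang in languages
    ):
        problems.append("allowedInputLanguages must be a list of strings")
    elif len(set(languages)) != len(languages):
        problems.append("allowedInputLanguages contains duplicates")

    return problems


# Abbreviated settings for illustration; the real default list is longer.
example = json.loads("""
{
    "languageModelName": "AZURE_GPT_4_0613",
    "maxTokensPerTranlationRequest": 1000,
    "maxTokenPerMinute": 40000,
    "allowedInputLanguages": ["English", "French", "German"]
}
""")
print(validate_settings(example))
```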

Parameter Description

Parameter	Description	Default Value
`languageModelName`	The model that will be used to translate between languages.	"AZURE_GPT_4_0613"
`maxTokensPerTranlationRequest`	The maximum number of tokens that will be translated in a single request. If the text to translate exceeds this limit, it is split into multiple requests.	1000
`maxTokenPerMinute`	The maximum number of tokens available for translation tasks per minute.	40000
`allowedInputLanguages`	Input languages that can be recognized so that the correspondingly configured few-shot examples, glossary, and text post-processing can be applied.	See below
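
The interplay of the two token limits can be illustrated with a small sketch. It is not the service's implementation: the function names are hypothetical, tokens are approximated by whitespace-separated words instead of the model's tokenizer, and the sketch assumes that a single piece larger than the limit is sent on its own rather than being cut further.

```python
from typing import Callable, List


def count_tokens_naive(text: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def split_into_requests(
    pieces: List[str],
    max_tokens_per_request: int = 1000,
    count_tokens: Callable[[str], int] = count_tokens_naive,
) -> List[List[str]]:
    """Greedily pack text pieces into requests that stay within the limit."""
    requests: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for piece in pieces:
        n = count_tokens(piece)
        # Start a new request if adding this piece would exceed the limit.
        if current and current_tokens + n > max_tokens_per_request:
            requests.append(current)
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        requests.append(current)
    return requests


def max_full_requests_per_minute(
    max_token_per_minute: int, max_tokens_per_request: int
) -> int:
    # Rough upper bound; ignores prompt overhead and output tokens.
    return max_token_per_minute // max_tokens_per_request


batches = split_into_requests(["a b c", "d e", "f g h i"], max_tokens_per_request=5)
print(batches)
print(max_full_requests_per_minute(40000, 1000))
```

With the default values, this rough estimate gives at most 40 completely filled requests per minute.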

allowedInputLanguages

This parameter is relevant when using the GlossaryService and PostProcessingService, because these services only work if the input language is recognized unambiguously.

The following languages are supported:

"Afrikaans", "Albanian", "Arabic", "Aragonese", "Armenian", "Azeri", "Bashkir",
"Basque", "Belarusian", "Bengali", "Bislama", "Bosnian", "Breton", "Bulgarian",
"Burmese", "Catalan", "Chamorro", "Chechen", "Chinese", "Cornish", "Corsican",
"Croatian", "Czech", "Danish", "Dutch", "English", "Esperanto", "Estonian", "Ewe",
"Faroese", "Fijian", "Finnish", "French", "Galician", "Georgian", "German", "Greek",
"Greenlandic", "Guaraní", "Haitian Creole", "Hausa", "Hebrew", "Hindi", "Hungarian",
"Icelandic", "Ido", "Indonesian", "Interlingua", "Interlingue", "Inuktitut", "Irish",
"Italian", "Japanese", "Javanese", "Kannada", "Kazakh", "Khmer", "Korean", "Kurdish",
"Kyrgyz", "Lao", "Latin", "Latvian", "Limburgish", "Lingala", "Lithuanian", "Luxembourgish",
"Macedonian", "Malagasy", "Malay", "Malayalam", "Maltese", "Manx", "Maori", "Marathi",
"Marshallese", "Mongolian", "Navajo", "Nepali", "Northern Sami", "Norwegian", "Norwegian Bokmål",
"Norwegian Nynorsk", "Occitan", "Ojibwe", "Old Church Slavonic", "Ossetian", "Pashto", "Persian",
"Polish", "Portuguese", "Punjabi", "Quechua", "Romanian", "Romansch", "Russian", "Samoan", "Sanskrit",
"Sardinian", "Scottish Gaelic", "Serbian", "Serbo-Croatian", "Sichuan Yi", "Sindhi", "Slovak",
"Slovene", "Somali", "Spanish", "Sundanese", "Swahili", "Swedish", "Tagalog", "Tahitian", "Tajik",
"Tamil", "Tatar", "Telugu", "Thai", "Tibetan", "Tongan", "Tswana", "Turkish", "Turkmen", "Ukrainian",
"Urdu", "Uyghur", "Uzbek", "Vietnamese", "Volapük", "Walloon", "Welsh", "West Frisian", "Yiddish",
"Yoruba", "Zhuang", "Zulu"
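
As an illustration of how such an allow-list might be applied, the sketch below maps a detected language name onto the configured list and returns `None` when there is no match. The function `resolve_input_language` is hypothetical, and the case-insensitive comparison is an assumption; the service may match names differently.

```python
from typing import Iterable, Optional

# Abbreviated list for illustration; see the full list above.
ALLOWED_INPUT_LANGUAGES = ["English", "French", "German", "Norwegian Bokmål"]


def resolve_input_language(
    detected: Optional[str], allowed: Iterable[str]
) -> Optional[str]:
    """Return the configured spelling of the language, or None if not allowed."""
    if detected is None:
        return None
    wanted = detected.strip().casefold()
    for language in allowed:
        if language.casefold() == wanted:
            return language
    # Unknown language: glossary and post-processing cannot be matched.
    return None


print(resolve_input_language("german", ALLOWED_INPUT_LANGUAGES))
print(resolve_input_language("Klingon", ALLOWED_INPUT_LANGUAGES))
```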

Prompt configuration

❗Only adjust prompts if you are fully familiar with the code logic. Small changes can break the module or reduce the output quality.

Parameter Description

Parameter	Description	Default Value
`systemPromptInstruction`	System prompt instruction for the document translation service.	See below
`userMessageTemplate`	A Jinja2 template for the user message.	See below

systemPromptInstruction


"You are a helpful AI designed to to translate text to a specified language.
Do it even if the target language is the same as the source language.
Make sure the translated text contains the same amount of carriage returns '\\n' as the original text block.
Try to keep the translated text as close to the original as possible and having approximately the same lenght.",

userMessageTemplate

"Please translate the following text pieces in {{format_style}} {% if input_language %}from {{input_language}} {% endif %}to {{output_language}}

{% if glossary %}Use the following translation rules 

{{ glossary_text }}{% endif %}

{{formatted_text_pieces}}"

Prompting instructions

The userMessageTemplate is a Jinja2 template that the code renders with a fixed set of variables. The table below lists these variables so that a user-defined template can use them as needed.

Parameter	Description
`input_language`	The input language if it was detectable else `None`
`output_language`	The output language as a string
`glossary`	Boolean indicating whether a glossary is available
`glossary_text`	The glossary text
`format_style`	The style in which the text pieces of a document are presented to the LLM
`formatted_text_pieces`	The text pieces, formatted for example as an HTML structure
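
To preview how a customized template behaves, it can be rendered locally with the `jinja2` package. The sketch below uses the default userMessageTemplate and the variable names from the table; the helper `render_user_message`, the sample values, and the way `glossary` is derived from `glossary_text` are assumptions for illustration, not the service's code.

```python
from typing import Optional

from jinja2 import Template

# Default userMessageTemplate from above, written as a Python string.
USER_MESSAGE_TEMPLATE = (
    "Please translate the following text pieces in {{format_style}} "
    "{% if input_language %}from {{input_language}} {% endif %}"
    "to {{output_language}}\n"
    "\n"
    "{% if glossary %}Use the following translation rules\n"
    "\n"
    "{{ glossary_text }}{% endif %}\n"
    "\n"
    "{{formatted_text_pieces}}"
)


def render_user_message(
    output_language: str,
    format_style: str,
    formatted_text_pieces: str,
    input_language: Optional[str] = None,
    glossary_text: str = "",
) -> str:
    """Render the user message with the variables listed in the table."""
    return Template(USER_MESSAGE_TEMPLATE).render(
        input_language=input_language,
        output_language=output_language,
        glossary=bool(glossary_text),  # assumption: true if any text is given
        glossary_text=glossary_text,
        format_style=format_style,
        formatted_text_pieces=formatted_text_pieces,
    )


with_glossary = render_user_message(
    output_language="German",
    format_style="HTML",
    formatted_text_pieces="<p>Hello</p>",
    input_language="English",
    glossary_text="invoice -> Rechnung",
)
without_language = render_user_message(
    output_language="German",
    format_style="HTML",
    formatted_text_pieces="<p>Hello</p>",
)
print(with_glossary)
print(without_language)
```

When `input_language` is `None`, the "from ..." clause is omitted, and when no glossary is available the translation-rules block is skipped entirely.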
