Smile news

From multilingual speech to subtitles

  • Event date: Jan. 18, 2024

How to build an automatic subtitle system for multilingual videos

Recently, we participated in a public tender related to the need for innovative automatic video transcription and subtitle generation.


A toolchain for automatically creating a speech transcription is quite common today, and many solutions exist.


However, this project came with a special requirement that adds a significant challenge: the videos to transcribe and subtitle are multilingual and include Luxembourgish.


Luxembourgish is mainly spoken in Luxembourg, by around 400,000 speakers worldwide. It is quite close to German, but is characterized by a high degree of multilingualism: words and phrases, mainly from French and German, are part of the regular lexicon and occur quite frequently.


This is a challenge because the common tools available are not always designed to support multilingual audio, but above all because Luxembourgish is far less widely spoken than English or French.


And to top it all off, we had a budget constraint for this proof-of-concept phase: about 10 days to work on a concept and validate our ideas.

State of the art in speech-to-text

For this project, we looked at open source speech-to-text models that can handle such tasks. To date, two candidates clearly stand out:

  • Whisper, from OpenAI
  • wav2vec 2.0 (XLS-R), from Meta

Both models use an encoder/decoder model architecture, and are therefore comparable. Their training data and training methods are different, which definitely affects the accuracy and supported languages.


They both offer different model sizes, letting you choose the accuracy/throughput trade-off suited to your needs. They also support many languages, with good accuracy for widely spoken ones. Luxembourgish is among the supported languages, but with poor accuracy due to the very limited amount of training data for this particular language.


Regarding the optimization of these models to better support Luxembourgish, we can mention several local initiatives. They all come from the academic world, as they require collecting and organizing specific training data (recordings of Luxembourgish speech with their text ground truth). To cite two, both based on the Meta model (wav2vec):

 

Their results are interesting: all of them demonstrated a significant reduction in the error rate on Luxembourgish audio transcription. However, for unknown reasons, the available fine-tuned models completely lack support for other languages.


While it is easy to see that the success of such a project will require this type of fine-tuning to reach appropriate accuracy, it is also important to consider the constraints of subtitles: the transcription must be synchronized with the video stream and organized into sentences, not just a list of words without any punctuation.


All these criteria must be taken into account to obtain an appropriate result. In this context, it is important to see what the models are capable of.


In our testing, we found that the Meta model was only able to give word-level timestamps, without sentence structure or punctuation. In comparison, the Whisper model is able to accurately transcribe sentences with their timestamps.
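
As an illustration, here is a minimal sketch (not the project's exact code) of how Whisper exposes these sentence-level segments; the model size and file name below are placeholders:

import whisper

model = whisper.load_model("medium")           # pick a size for the accuracy/speed trade-off
result = model.transcribe("speaker_03.wav")    # hypothetical audio file

print(result["language"])                      # detected language code, e.g. "de" or "lb"
for segment in result["segments"]:
    print(f'{segment["start"]:.2f}s -> {segment["end"]:.2f}s  {segment["text"]}')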

Which model should you choose for this proof of concept?

In summary, we need a model with multilingual transcription capabilities, supporting Luxembourgish, and providing transcription with sentence structures and timestamps.


At the time of writing this article, there is no model that meets all of these criteria. As we faced a short time frame to build this proof of concept, and as the training dataset for Luxembourgish already exists to some extent, we chose to focus on the most suitable model to generate the subtitles rather than obtaining the best Luxembourgish transcription accuracy.


Fine-tuning a model is time-consuming and can be expensive, so we chose to build a complete toolchain to validate each step rather than focusing on early accuracy. Fine-tuning could be done later, with a validated toolchain architecture.


Additionally, Luxembourgish is not the only language to be supported in this proof of concept. English, German and French are also expected to be common, and these languages already have very good accuracy without any fine-tuning.


So, the Whisper model approach is interesting:

  • We can get a good transcription of the sentences.
  • Major languages have a very good word error rate.
  • The fine-tuning stage for better support of Luxembourgish will be possible.

 

An alternative solution could have been to continue with the Meta XLS-R model, fine-tune it, and build an algorithm to reconstruct sentences from the word timestamps. We tried this method, but the results were either unsatisfactory or would have required yet other AI models.


In addition to this, you also need to consider the aspect of multilingual content.


This is why we chose to go with the Whisper model.

Challenge and our idea

Beyond the compromise made on the choice of a model, the main challenge of this project is to process multilingual video/sound content.


The Whisper model is capable of understanding different languages, but its internal logic is not capable of switching from one language to another within the same audio transcription.


With multilingual content, this model will transcribe everything using the first language it detects at the very beginning of the audio stream. Its multilingual capabilities may still catch language differences and produce a good-enough transcription, but we lose the context information (the current language), and the word error rate will certainly increase.


One idea we explored was to process each speaker's speech individually: the Whisper model would treat each of them as a separate audio file and repeat the language detection and transcription for each one.


The detection and division of an audio stream by speaker is what is called "speaker diarization".


Several solutions exist to apply this speaker diarization. In our case, we chose an open source solution based on an AI model: Pyannote audio.


This Python library is a toolkit specifically designed for diarization tasks:

  • voice activity detection,
  • speaker change detection,
  • overlapped speech detection,
  • speaker embedding.

 

In our case, we used this tool to divide an audio stream into several parts, each representing a speech segment from an identified speaker, with timestamps relative to the original audio file.
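
As a rough sketch of this step (assuming pyannote.audio 3.x and a Hugging Face access token; the model version and file names below are placeholders):

from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline (requires accepting the model's terms on Hugging Face)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",           # placeholder
)

diarization = pipeline("original_audio.wav")  # hypothetical input file

# Each turn gives a speaker label and its start/end times in the original audio
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s -> {turn.end:.2f}s")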

After this step, we then obtain a list of audio files, representing one track for each speaker.

For each of these, we can then use the Whisper model which will return a verbatim transcription of the audio, in sentence form, with timestamps.


The final step will be to apply the offset of each track to each transcription timestamp to obtain a complete transcription in sync with the original audio file.
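
Concretely, this re-alignment is a simple time shift. A minimal sketch (the function and variable names are illustrative, not the project's exact code):

def shift_segments(segments, chunk_start):
    """Re-align Whisper segments (relative to a speaker chunk) to the original audio timeline."""
    return [
        {**seg, "start": seg["start"] + chunk_start, "end": seg["end"] + chunk_start}
        for seg in segments
    ]

# Example: a speaker chunk that starts at 42.5 s in the original audio
# aligned = shift_segments(result["segments"], chunk_start=42.5)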

Transcription into subtitles

The last step is to use all this information to produce a subtitle file, according to the expected syntax (for example: SRT, VTT).
Since we get structured information from Pyannote and Whisper (JSON or CSV), it is quite easy to use a regular programming language to transform this information into a subtitle file.
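
For instance, a minimal sketch of that straightforward route (not the LLM-based approach described next), assuming a list of (speaker, start, end, text) entries with times in seconds:

def to_srt_time(seconds):
    # Convert seconds to the SRT time format HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(entries):
    blocks = []
    for i, (speaker, start, end, text) in enumerate(entries, start=1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{speaker}: {text}\n")
    return "\n".join(blocks)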


But here we wanted to go a little further and introduce automation and transcription adjustments, which are essential and often forgotten parts of the caption creation job:

  • To correct possible typos in the transcription or rephrase a sentence more concisely
  • To rearrange sentences
  • To automatically format the transcript in the desired format
  • To apply any automatic translation

 

A perfect tool for such a task is a large language model (LLM). For example, we can create a request asking OpenAI GPT to transform our transcription (structured as JSON or CSV) into a subtitle file, reviewing and adjusting the text along the way, and translating it into any target language if necessary.
 

An example query we used:
 

You are an AI assistant that automates CSV to ${format} conversion.
To remember you some languages codes:
- lb = Luxembourgish
- en = English
- fr = French
- de = German
For the conversion from the CSV content:
- For each line transcribe the text in "Text" column as a whole text, using the language code available in the language column of the same line
- Convert time code in the Start and End columns into dubbing time code.
- Use the speaker column to add the speaker identification in the output format (${format})
CSV content:
Speaker;language;Start;End;Text
${content}
${format} content:
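
A minimal sketch of how such a query can be sent programmatically (the model name, file paths and variable names are placeholders, not the project's exact setup):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

query_template = open("subtitle_prompt.txt").read()  # the query above, stored as a template (placeholder path)
csv_transcript = open("transcript.csv").read()       # CSV produced by the previous steps (placeholder path)
prompt = query_template.replace("${format}", "SRT").replace("${content}", csv_transcript)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the conversion deterministic
)
srt_content = response.choices[0].message.content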

What we built

To demonstrate our idea, we built a minimal toolchain based on the technologies we just described.

Our prototype was based on a simple web application capable of:

  • Upload or record an audio file
  • Send the audio file to a remote API to perform diarization, transcription, and subtitle formatting steps
  • Show the result of each step

The remote API ran on a server equipped with a GPU (Nvidia T4/16GB) to accelerate the AI steps.
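
For illustration, the call from the web application to such an API could look like the following sketch (the endpoint URL and parameters are hypothetical, not our actual interface):

import requests

# Hypothetical endpoint exposed by the GPU-backed API
API_URL = "https://example.com/api/transcribe"

with open("recording.wav", "rb") as audio:
    response = requests.post(
        API_URL,
        files={"audio": audio},
        data={"diarization": "true", "subtitle_format": "SRT"},  # hypothetical options
        timeout=600,
    )

result = response.json()  # diarization, transcription and subtitle outputs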

Does this work?

We have carried out numerous tests to understand the performance of this prototype, in particular to compare the accuracy of transcription with and without diarization. Of course, we knew that the word error rate for Luxembourgish would not be good, because our Whisper model was not trained enough for this language.


However, we observed a clear improvement with diarization, both for multilingual and monolingual content.

 

Audio with only Luxembourgish

For this test, we used the audio track of this video: GovJobs - Job presentation: Ministry of Justice.

 

Without diarization:


The detected language is German, and therefore the full audio is transcribed into German, with average accuracy.

 

With diarization:


Diarization allows Whisper to obtain better results, both in language detection and in transcription accuracy. Language detection is still not perfect: even though all speakers speak Luxembourgish, not all their dialogues are identified as Luxembourgish; some are detected as German or Dutch.


An interesting point is that the language detection is consistent with the speaker (e.g. speaker "03" is always detected as Luxembourgish).

We are certainly reaching the limit of the Whisper model when it comes to Luxembourgish: its training data is probably too small and does not represent enough variation in accent or pronunciation.


Creating subtitles

No surprise here: with the query described earlier, the large language model (LLM) produces the contents of the SRT file without any problem.


Multilingual content

We tested several multilingual audio streams. In this example, we show the transcript of a live recording made by our client during the demo of our prototype.


The recording features 4 people, speaking respectively German, English, French and Luxembourgish.


Without diarization:


Without diarization, we of course have the same problem: everything is transcribed in German because it is the first language used in the recording, and the word error rate suffers accordingly.


With diarization:


If we activate diarization, the 4 speakers are identified, with their respective languages. The word error rate improves for each of them, but remains quite high for Luxembourgish.


Creating subtitles

Subtitle generation also works well, whether we keep the original languages or force a translation into a specific target language.

 

Some numbers

Word error rate

For our example video, the ground truth is the SRT file provided by the department responsible for the video.

 

                                                   Without diarization   With diarization
  Automatic transcription (DE detected, wrongly)   93.86%                88.60%
  With diarization                                 92.60%                82.41%

 

Real-time factor

Our solution involves several computation steps, so the speed of transcription must also be taken into account. These workloads require intensive computation and specific hardware to run efficiently.


In our case, to host the API (Pyannote and Whisper), we used an AWS EC2 instance equipped with a dedicated GPU. This is an entry-level GPU, but it is still not the cheapest hardware. The API can also be hosted on a classic CPU instance, but performance is heavily impacted (between 6 and 10 times slower).


With this instance (EC2 g4dn.xlarge, 1x Nvidia T4 GPU 16 GB), we were able to achieve a real-time factor of 3: for 3 minutes of audio, this solution requires 1 minute of total processing time.
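
As a worked example, a 60-minute recording would therefore need roughly 20 minutes of processing on this instance, and in the order of 2 to 3.5 hours on a CPU-only instance given the 6x to 10x slowdown mentioned above.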


This performance factor can indeed be increased by using a better GPU, but of course this will also increase the total cost of the solution.

Conclusion

In conclusion, combining diarization with transcription significantly improved the quality of subtitle generation. Our chosen Whisper model excels in sentence timestamp accuracy, which is crucial for creating synchronized captions. Despite its effectiveness, the model's language detection and word error rates for Luxembourgish remain insufficient compared to more widely spoken languages. This was an expected trade-off, given the model's current training data.


Moving forward, our refinement strategy involves accumulating a substantial corpus of Luxembourgish audio samples. Our goal is to collect data that not only exceeds the volume used in previous initiatives, such as the University of Luxembourg's work with the Meta model, but also covers a diverse spectrum of dialects and pronunciations. This diversity is crucial to reduce the model's confusion with similar languages, such as Dutch and German. By doing so, we can improve the robustness of the model and ensure that Luxembourgish is represented with the same accuracy as other languages.


In addition to improving technical metrics, our project has the potential to have a significant impact on Luxembourg's multilingual community by providing more accessible and accurate information. Additionally, it will contribute to the growing field of computational linguistics and automatic caption generation, paving the way for more inclusive technology that overcomes language barriers. As we embark on this journey, we welcome collaboration with language experts and the technology community to create a model that serves as a benchmark for smaller language groups globally.