The Whisper models
There are five models available in Whishper, each more accurate than the previous one: tiny, base, small, medium and large-v2. The larger the model, the more space it takes on disk.
The size difference comes from the number of parameters. The more parameters a model has, the better it can “understand” what it is “listening” to, so it makes fewer errors. Smaller models make more errors (e.g. confusing one word for another).
Note that larger models also increase transcription time and memory usage.
English models
The tiny, base, small and medium models are also available in a reduced, English-only version: tiny.en, base.en, small.en and medium.en.
Using these language-specific models reduces transcription time and memory usage, but they only work with English audio.
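The naming rule described here (append .en to tiny, base, small or medium) can be sketched in a few lines of Python. The `model_name` helper and its constants are hypothetical and written only for illustration; they are not part of Whishper or Whisper.

```python
# Hypothetical helper, not part of Whishper: maps a model size to the
# model name described in this section.
MODELS = ("tiny", "base", "small", "medium", "large-v2")
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def model_name(size: str, english_only: bool = False) -> str:
    """Return the model name for a size, optionally the English-only variant."""
    if size not in MODELS:
        raise ValueError(f"unknown model size: {size!r}")
    if not english_only:
        return size
    if size not in ENGLISH_ONLY_SIZES:
        # large-v2 has no reduced English-only version.
        raise ValueError(f"{size!r} has no English-only variant")
    return f"{size}.en"
```

For example, `model_name("small", english_only=True)` gives `small.en`, while asking for an English-only `large-v2` raises an error because no such variant is listed.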
Languages and accuracy
Whisper is not equally accurate in all languages. Please take a look at the following graphic to see the languages and their WER (Word Error Rate). The lower the WER, the better the model understands the language.
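To make the metric concrete: WER is commonly computed as the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and the model output, divided by the number of words in the reference. The function below is a minimal sketch of that definition, assuming simple whitespace tokenization and no text normalization. It is not the exact evaluation procedure used to produce the published Whisper figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the output "the cat sat on a mat", one of six words is wrong, so the WER is 1/6. Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.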
Large-v3
The large-v3 model shows improved performance over a wide variety of languages. The plot below includes all languages where Whisper large-v3 achieves an error rate below 60% on Common Voice 15 and Fleurs, showing a 10% to 20% reduction of errors compared to large-v2: