Differences

Below are the differences between two revisions of the page.

--- ressource:logiciel:vosk:start [2021/10/31 16:42]
gweltaz [Transcription from a microphone]
+++ ressource:logiciel:vosk:start [2021/12/29 22:53]
gweltaz [Training a new language model]
@@ Line 49: / Line 49: @@
 ===== Transcription from a microphone =====
-To use the script below, run it with the ''-l'' argument to list the audio devices connected to your machine:
+To use the script below, first run it with the ''-l'' argument to list the audio devices connected to your machine:
   $ python3 test_microphone.py -l
+Then run it again, where ''n'' is the number of the audio device found in the previous step:
+  $ python3 test_microphone.py -d n
 <accordion>
@@ Line 149: / Line 151: @@
 </code></panel></accordion>
-===== Training a new language model =====
+===== Training a new language and acoustic model =====
 ==== Tutorials ====
+https://towardsdatascience.com/how-to-start-with-kaldi-and-speech-recognition-a9b7670ffff6
 http://kaldi-asr.org/doc/kaldi_for_dummies.html
@@ Line 159: / Line 163: @@
 https://www.eleanorchodroff.com/tutorial/kaldi/training-acoustic-models.html
-https://towardsdatascience.com/how-to-start-with-kaldi-and-speech-recognition-a9b7670ffff6
+https://web.stanford.edu/class/cs224s/assignments/a3/
 ==== Installing Kaldi and initializing the project ====
 **Kaldi** is a toolkit for building speech recognition models. VOSK then uses these models, which makes them easier to apply to speech recognition.
@@ Line 172: / Line 176: @@
 Installing the tools Kaldi needs:
-  cd tools
+  $ cd tools
-  make
+  $ make
+Installing the Intel Math Kernel Library (optimized linear algebra operations):
+  $ sudo ./tools/extras/install_mkl.sh
+Installing Kaldi:
+  $ cd src
+  $ ./configure
+  $ make -j clean depend
+  $ make -j <NCPU>    # where <NCPU> is the number of CPU cores to use for compilation
 Create a new folder for the project in the ''egs'' folder (''mycorpus'' in the example below)
@@ Line 207: / Line 221: @@
   sw02001-A_002736-002893 AND IS
-The first element on each line is the utterance-id, which is an arbitrary text string, but if you have speaker information in your setup, you should make the speaker-id a prefix of the utterance id; this is important for reasons relating to the sorting of these files. The rest of the line is the transcription of each sentence. You don't have to make sure that all words in this file are in your vocabulary; out of vocabulary words will get mapped to a word specified in the file data/lang/oov.txt.
+The first element on each line is the ''utterance-id'', which is an arbitrary text string, but if you have speaker information in your setup, you should make the ''speaker-id'' a prefix of the utterance id; this is important for reasons relating to the sorting of these files. The rest of the line is the transcription of each sentence. You don't have to make sure that all words in this file are in your vocabulary; out of vocabulary words will get mapped to a word specified in the file data/lang/oov.txt.
-It needs to be the case that when you sort both the utt2spk and spk2utt files, the orders "agree", e.g. the list of speaker-ids extracted from the utt2spk file is the same as the string sorted order. The easiest way to make this happen is to make the speaker-ids a prefix of the utterance-ids. Although in this particular example we have used an underscore to separate the "speaker" and "utterance" parts of the utterance-id, in general it is probably safer to use a dash ("-"). This is because it has a lower ASCII value; if the speaker-ids vary in length, in certain cases the speaker-ids and their corresponding utterance ids can end up being sorted in different orders when using the standard "C"-style ordering on strings, which will lead to a crash. Another important file is wav.scp. In the Switchboard example,
+It needs to be the case that when you sort both the ''utt2spk'' and ''spk2utt'' files, the orders "agree", e.g. the list of speaker-ids extracted from the ''utt2spk'' file is the same as the string sorted order. The easiest way to make this happen is to make the speaker-ids a prefix of the utterance-ids. Although in this particular example we have used an underscore to separate the "speaker" and "utterance" parts of the utterance-id, in general it is probably safer to use a dash ("-"). This is because it has a lower ASCII value; if the speaker-ids vary in length, in certain cases the speaker-ids and their corresponding utterance ids can end up being sorted in different orders when using the standard "C"-style ordering on strings, which will lead to a crash. Another important file is ''wav.scp''. In the Switchboard example,
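The sorting issue described in this hunk can be illustrated with a small sketch. It is not part of Kaldi: the helper names (''orders_agree'', ''make_utt2spk'') and the speaker-ids ''adam'' and ''adam2'' are made up for the illustration. It assumes ASCII ids, for which Python's code-point string comparison behaves like the "C"-style ordering mentioned above.

```python
def speakers_in_utterance_order(utt2spk):
    """Return speaker-ids in the order they appear once utterance-ids are sorted."""
    order = []
    # sorted() compares strings by code point, like "C"-style ordering for ASCII.
    for utt in sorted(utt2spk):
        spk = utt2spk[utt]
        if spk not in order:
            order.append(spk)
    return order


def orders_agree(utt2spk):
    """True if sorting by utterance-id also yields the speaker-ids in sorted order."""
    return speakers_in_utterance_order(utt2spk) == sorted(set(utt2spk.values()))


def make_utt2spk(speakers, separator):
    """Build a toy utt2spk mapping with one utterance per speaker (hypothetical ids)."""
    return {f"{spk}{separator}utt1": spk for spk in speakers}


# Two speaker-ids of different lengths, one being a prefix of the other.
speakers = ["adam", "adam2"]

underscore = make_utt2spk(speakers, "_")  # adam_utt1, adam2_utt1
dash = make_utt2spk(speakers, "-")        # adam-utt1, adam2-utt1

print(orders_agree(underscore))  # False: '_' (0x5F) sorts after '2' (0x32)
print(orders_agree(dash))        # True: '-' (0x2D) sorts before '2' (0x32)
```

With the underscore, ''adam2_utt1'' sorts before ''adam_utt1'', so the speakers come out as ''adam2'', ''adam'', which disagrees with the sorted speaker list. With the dash, both orders match for alphanumeric speaker-ids.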
@@ Line 315: / Line 329: @@
   $ utils/prepare_lang.sh data/local/dict '<UNK>' data/local/lang data/lang
+==== Models based on a deep neural network ====
+  * https://kaldi-asr.org/doc/dnn.html
+  * http://www.cs.cmu.edu/~ymiao/kaldipdnn.html
+==== Using the model after training ====
+https://medium.com/@nithinraok_/decoding-an-audio-file-using-a-pre-trained-model-with-kaldi-c1d7d2fe3dc5
+==== Training a VOSK-compatible model ====
+A concise and complete tutorial that I wish I had found earlier: https://github.com/matteo-39/vosk-build-model
+Using VOSK greatly simplifies decoding an audio file with a Kaldi model. Note, however, that VOSK only accepts models in a particular format.
+According to the page [[https://alphacephei.com/vosk/models]], it is recommended to use the ''mini_librispeech'' recipe, found under the ''egs'' folder of the Kaldi installation directory. You will need to modify the ''cmd.sh'' and ''run.sh'' scripts to adapt them to your configuration and your data.
+You will also need to replace the last script run by ''run.sh'', ''local/chain2/run_tdnn.sh'', with the script linked from the VOSK page: [[https://github.com/kaldi-asr/kaldi/blob/master/egs/mini_librispeech/s5/local/chain/tuning/run_tdnn_1j.sh]]. This script will also need changes to fit your setup (in my case, reducing the number of parallel jobs with the $nj option and disabling GPU use).
+==== Troubleshooting ====
+  * Do not leave special characters in the names of the corpus data files, or spaces in folder names (an '&' in the name of an audio archive can crash the neural network training phase).
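
One way to catch such names before training is to scan the corpus folder first. The sketch below is only an illustration: the function names are made up, and the set of "safe" characters (ASCII letters, digits, dot, underscore, dash) is an assumption, since the note above only rules out spaces and special characters such as '&'.

```python
import re
from pathlib import Path

# Assumed whitelist: ASCII letters, digits, dot, underscore and dash.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_name(name):
    """True if the file or folder name contains only whitelisted characters."""
    return SAFE_NAME.fullmatch(name) is not None


def find_unsafe_paths(root):
    """List every file or folder under root whose own name is not safe."""
    return sorted(str(p) for p in Path(root).rglob("*") if not is_safe_name(p.name))
```

Calling ''find_unsafe_paths'' on the corpus directory returns the paths to rename; an empty list means no file or folder name contains a space or a character outside the assumed whitelist.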