''LOG (VoskAPI:ReadDataFiles():model.cc:307) Loading RNNLM model from model/vosk-model-en-us-0.22/rnnlm/final.raw''
  
One solution (not very satisfying) is to delete the ''rnnlm'' folder...
  
===== Transcription from a microphone =====

To use the script below, first run it with the ''-l'' argument to get the list of audio devices connected to your machine:
  $ python3 test_microphone.py -l

Then (where ''n'' is the number of the audio interface obtained above):
  $ python3 test_microphone.py -d n
  
<accordion>
</code></panel></accordion>
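
For reference, a minimal sketch of what such a script boils down to (VOSK plus the ''sounddevice'' library; the model path and block size below are assumptions, and the official ''test_microphone.py'' example from vosk-api is more complete):

<code python>
# Minimal sketch of microphone transcription with VOSK + sounddevice.
# Assumption: a VOSK model has been unpacked into ./model
import queue
import sys

import sounddevice as sd
from vosk import Model, KaldiRecognizer

q = queue.Queue()

def callback(indata, frames, time, status):
    # Collect the raw audio blocks pushed by the input stream
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

samplerate = 16000
rec = KaldiRecognizer(Model("model"), samplerate)

with sd.RawInputStream(samplerate=samplerate, blocksize=8000, dtype="int16",
                       channels=1, callback=callback):
    while True:
        data = q.get()
        if rec.AcceptWaveform(data):
            print(rec.Result())          # final result for the segment (JSON)
        else:
            print(rec.PartialResult())   # partial hypothesis (JSON)
</code>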
  
===== Training a new language and acoustic model =====
  
==== Tutorials ====

https://towardsdatascience.com/how-to-start-with-kaldi-and-speech-recognition-a9b7670ffff6
  
http://kaldi-asr.org/doc/kaldi_for_dummies.html
https://www.eleanorchodroff.com/tutorial/kaldi/training-acoustic-models.html
  
https://web.stanford.edu/class/cs224s/assignments/a3/
==== Installing Kaldi and initializing the project ====
**Kaldi** is a toolkit for building speech recognition models. The models are then used by VOSK, which makes them easier to use for speech recognition.
The installation instructions are in the file ''tools/INSTALL''.
  
Clone the Kaldi repo: https://github.com/kaldi-asr/kaldi
  $ git clone https://github.com/kaldi-asr/kaldi
  
Check the dependencies:
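In recent Kaldi checkouts the dependency check script is ''extras/check_dependencies.sh'', run from the ''tools'' directory (path given from memory, verify against your checkout):
  $ cd tools
  $ extras/check_dependencies.sh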
    
Install the tools Kaldi needs:
  $ cd tools
  $ make

Install the Intel Math Kernel Library (optimized linear algebra routines):
  $ sudo ./tools/extras/install_mkl.sh

Install SRILM (a tool for building language models):
  $ ./tools/extras/install_srilm.sh

Build Kaldi itself:
  $ cd src
  $ ./configure
  $ make -j clean depend
  $ make -j <NCPU>    # where <NCPU> is the number of CPU cores to use for the compilation
  
Create a new folder for the project inside the ''egs'' folder (''mycorpus'' in the example below).
  
==== Processing the audio files ====

Convert to mono, 16-bit WAV with a 16000 Hz sampling rate:

  $ ffmpeg -i in.mp3 -acodec pcm_s16le -ac 1 -ar 16000 out.wav
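
If the whole corpus needs converting, a small batch script can wrap the same ffmpeg command; a minimal sketch (the folder names are assumptions):

<code python>
# Batch-convert every .mp3 in a folder to 16 kHz, 16-bit, mono WAV files,
# using the same ffmpeg options as the command above.
import subprocess
from pathlib import Path

src_dir = Path("corpus/mp3")     # input folder (assumption)
dst_dir = Path("corpus/wav")     # output folder (assumption)
dst_dir.mkdir(parents=True, exist_ok=True)

for mp3 in sorted(src_dir.glob("*.mp3")):
    wav = dst_dir / (mp3.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(mp3),
         "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )
</code>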
  
Detecting silence and non-silence segments with Python
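
A minimal sketch of one way to do this, using the ''pydub'' library (the library choice and the thresholds are assumptions, not necessarily what the original script uses):

<code python>
# Find non-silent regions in a WAV file with pydub.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_wav("out.wav")

nonsilent = detect_nonsilent(
    audio,
    min_silence_len=500,              # ms of silence that separates two regions
    silence_thresh=audio.dBFS - 16,   # 16 dB below average loudness counts as silence
)

for start_ms, end_ms in nonsilent:
    print(f"speech from {start_ms / 1000:.2f}s to {end_ms / 1000:.2f}s")
</code>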
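
=== 'text' file ===

The ''text'' file contains the transcription of each utterance, one per line (example line from the Switchboard recipe):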
  sw02001-A_002736-002893 AND IS
  
The first element on each line is the ''utterance-id'', which is an arbitrary text string, but if you have speaker information in your setup, you should make the ''speaker-id'' a prefix of the utterance id; this is important for reasons relating to the sorting of these files. The rest of the line is the transcription of each sentence. You don't have to make sure that all words in this file are in your vocabulary; out of vocabulary words will get mapped to a word specified in the file ''data/lang/oov.txt''.
  
It needs to be the case that when you sort both the ''utt2spk'' and ''spk2utt'' files, the orders "agree", e.g. the list of speaker-ids extracted from the ''utt2spk'' file is the same as the string sorted order. The easiest way to make this happen is to make the speaker-ids a prefix of the utterance ids. Although, in this particular example we have used an underscore to separate the "speaker" and "utterance" parts of the utterance-id, in general it is probably safer to use a dash ("-"). This is because it has a lower ASCII value; if the speaker-ids vary in length, in certain cases the speaker-ids and their corresponding utterance ids can end up being sorted in different orders when using the standard "C"-style ordering on strings, which will lead to a crash.
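
A quick illustration of that sorting point, with hypothetical speaker and utterance names:

<code python>
# '_' (ASCII 95) sorts after the digits, '-' (ASCII 45) sorts before them,
# so a dash keeps utterance order consistent with speaker order.
speakers        = ["spk1", "spk10"]
utts_underscore = ["spk1_utt2", "spk10_utt1"]
utts_dash       = ["spk1-utt2", "spk10-utt1"]

print(sorted(speakers))         # ['spk1', 'spk10']
print(sorted(utts_underscore))  # ['spk10_utt1', 'spk1_utt2']  -> disagrees
print(sorted(utts_dash))        # ['spk1-utt2', 'spk10-utt1']  -> agrees
</code>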
  
  
Another important file is ''wav.scp''. In the Switchboard example, the format of each line is:

''<recording-id> <extended-filename>''
  
where the "extended-filename" may be an actual filename, or as in this case, a command that extracts a wav-format file. The pipe symbol on the end of the extended-filename specifies that it is to be interpreted as a pipe. We will explain what ''recording-id'' is below, but we would first like to point out that if the ''segments'' file does not exist, the first token on each line of the ''wav.scp'' file is just the utterance id. The files in ''wav.scp'' must be single-channel (mono); if the underlying wav files have multiple channels, then a sox command must be used in the ''wav.scp'' to extract a particular channel.
  
  
=== 'segments' file ===
  
In the Switchboard setup we have the ''segments'' file, so we'll discuss this next.
  
  s5# head -3 data/train/segments
  sw02001-A_002736-002893 sw02001-A 27.36 28.93
  
The format of the ''segments'' file is:
  
''<utterance-id> <recording-id> <segment-begin> <segment-end>''
  
where the ''segment-begin'' and ''segment-end'' are measured in seconds. These specify time offsets into a recording. The ''recording-id'' is the same identifier as is used in the ''wav.scp'' file; again, this is an arbitrary identifier that you can choose.
  
  
=== 'utt2spk' file ===
  
The last file you need to create yourself is the ''utt2spk'' file. This says, for each utterance, which speaker spoke it.
  
  s5# head -3 data/train/utt2spk
''<utterance-id> <speaker-id>''
  
Note that the speaker-ids don't need to correspond in any very accurate sense to the names of actual speakers; they simply need to represent a reasonable guess. In this case we assume each conversation side (each side of the telephone conversation) corresponds to a single speaker. This is not entirely true (sometimes one person may hand the phone to another person, or the same person may be speaking in multiple calls) but it's good enough for our purposes. **If you have no information at all about the speaker identities, you can just make the speaker-ids the same as the utterance-ids**, so the format of the file would be just ''<utterance-id> <utterance-id>''. We have made the previous sentence bold because we have encountered people creating a "global" speaker-id. This is a bad idea because it makes cepstral mean normalization ineffective in training (since it's applied globally), and because it will create problems when you use ''utils/split_data_dir.sh'' to split your data into pieces.
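
If you only write ''utt2spk'' by hand, the matching ''spk2utt'' file can be generated with Kaldi's ''utils/utt2spk_to_spk2utt.pl'' script, e.g.:
  $ utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt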
  
  
=== 'reco2file_and_channel' file (optional) ===
  
The file ''reco2file_and_channel'' is only used when scoring (measuring error rates) with NIST's ''sclite'' tool:
  
  s5# head -3 data/train/reco2file_and_channel
''<recording-id> <filename> <recording-side (A or B)>''
  
The filename is typically the name of the .sph file, without the suffix, but in general it's whatever identifier you have in your ''stm'' file. The recording side is a concept that relates to telephone conversations where there are two channels, and if not, it's probably safe to use "A". If you don't have an ''stm'' file or you have no idea what this is all about, then you don't need the ''reco2file_and_channel'' file.
  
  
  $ utils/prepare_lang.sh data/local/dict '<UNK>' data/local/lang data/lang
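
The arguments are, in order: the source dictionary directory, the word that out-of-vocabulary words are mapped to (here ''<UNK>''), a temporary working directory, and the output ''lang'' directory.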
  
==== About unknown words ====
This is an explanation of how Kaldi deals with unknown words (words not in the vocabulary); we are putting it on the "data preparation" page for lack of a more obvious location.

In many setups, <unk> or something similar will be present in the LM as long as the data that you used to train the LM had words that were not in the vocabulary you used to train the LM, because language modeling toolkits tend to map those all to a single special word, usually called <unk> or <UNK>. You can look at the arpa file to figure out what it's called; it will usually be one of those two.

During training, if there are words in the text file in your data directory that are not in the words.txt in the lang directory that you are using, Kaldi will map them to a special word that's specified in the lang directory in the file data/lang/oov.txt; it will usually be either <unk>, <UNK> or maybe <SPOKEN_NOISE>. This word will have been chosen by the user (i.e., you), and supplied to prepare_lang.sh as a command-line argument. If this word has nonzero probability in the language model (which you can test by looking at the arpa file), then it will be possible for Kaldi to recognize this word at test time. This will often be the case if you call this word <unk>, because as we mentioned above, language modeling toolkits will often use this spelling for the ''unknown word'' (a special word that all out-of-vocabulary words get mapped to). Decoding output will always be limited to the intersection of the words in the language model with the words in the lexicon.txt (or whatever file format you supplied the lexicon in, e.g. lexiconp.txt); these words will all be present in the words.txt in your lang directory. So if Kaldi's "unknown word" doesn't match the LM's "unknown word", you will simply never decode this word. In any case, even when allowed to be decoded, this word typically won't be output very often and in practice it doesn't tend to have much impact on WERs.

Of course a single phone isn't a very good, or accurate, model of OOV words. In some Kaldi setups we have example scripts with names like local/run_unk_model.sh: e.g., see the file tedlium/s5_r2/local/run_unk_model.sh. These scripts replace the unk phone with a phone-level LM on phones. They make it possible to get access to the sequence of phones in a hypothesized unknown word. Note: unknown words should be considered an "advanced topic" in speech recognition and we discourage beginners from looking into this topic too closely.

==== Models based on deep neural networks ====
  * https://kaldi-asr.org/doc/dnn.html
  * http://www.cs.cmu.edu/~ymiao/kaldipdnn.html

==== Using the model after training ====

https://medium.com/@nithinraok_/decoding-an-audio-file-using-a-pre-trained-model-with-kaldi-c1d7d2fe3dc5

==== Training a VOSK-compatible model ====
A concise and complete tutorial I wish I had discovered earlier: https://github.com/matteo-39/vosk-build-model

Using VOSK greatly simplifies decoding an audio file with a Kaldi model. Note, however, that VOSK only accepts models in a specific format.
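
As an illustration, a minimal sketch of decoding a 16 kHz mono WAV file with a VOSK model (the model path and file name are assumptions):

<code python>
# Decode a 16-bit mono PCM WAV file with a VOSK model.
import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("out.wav", "rb")           # must be 16-bit mono PCM
rec = KaldiRecognizer(Model("model"), wf.getframerate())

text = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        text.append(json.loads(rec.Result())["text"])
text.append(json.loads(rec.FinalResult())["text"])

print(" ".join(text))
</code>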

According to the page [[https://alphacephei.com/vosk/models]], it is recommended to use the ''mini_librispeech'' recipe, found under the ''egs'' folder of the Kaldi installation directory. You will need to modify the ''cmd.sh'' and ''run.sh'' scripts to adapt them to your configuration and your data.

You will also need to replace the last script executed by ''run.sh'', ''local/chain2/run_tdnn.sh'', with the script provided on the VOSK page: [[https://github.com/kaldi-asr/kaldi/blob/master/egs/mini_librispeech/s5/local/chain/tuning/run_tdnn_1j.sh]]. That script will also need some modifications for your setup (in my case, reducing the number of parallel jobs (the ''$nj'' option) and disabling GPU use).

==== Troubleshooting ====
  * Do not leave special characters in the names of the corpus data files, or spaces in folder names (an '&' in the name of an audio archive can crash the neural network training phase).