同步操作将从 PaddlePaddle/DeepSpeech 强制同步,此操作会覆盖自 Fork 仓库以来所做的任何修改,且无法恢复!!!
确定后同步将在后台操作,完成时将刷新页面,请耐心等待。
Several shell scripts provided in ./examples/tiny/local
will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference and model evaluation, with a few public dataset (e.g. LibriSpeech, Aishell). Reading these examples will also help you to understand how to make it work with your own data.
Some of the scripts in ./examples
are not configured with GPUs. If you want to train with 8 GPUs, please modify CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
. If you don't have any GPU available, please set CUDA_VISIBLE_DEVICES=
to use CPUs instead. Besides, if out-of-memory problem occurs, just reduce batch_size
to fit.
Let's take a tiny sampled subset of LibriSpeech dataset for instance.
Go to directory
cd examples/tiny
Notice that this is only a toy example with a tiny sampled subset of LibriSpeech. If you would like to try with the complete dataset (would take several days for training), please go to examples/librispeech
instead.
Source env
source path.sh
Must do this before starting do anything.
Set MAIN_ROOT
as project dir. Using defualt deepspeech2
model as default, you can change this in the script.
Main entrypoint
bash run.sh
This just a demo, please make sure every step
is work fine when do next step
.
More detailed information are provided in the following sections. Wish you a happy journey with the DeepSpeech on PaddlePaddle ASR engine!
The key steps of training for Mandarin language are same to that of English language and we have also provided an example for Mandarin training with Aishell in examples/aishell/local
. As mentioned above, please execute sh data.sh
, sh train.sh
, sh test.sh
and sh infer.sh
to do data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by local/download_model.sh) for users to try with sh infer_golden.sh
and sh test_golden.sh
. Notice that, different from English LM, the Mandarin LM is character-based and please run local/tune.sh
to find an optimal setting.
An inference module caller infer.py
is provided to infer, decode and visualize speech-to-text results for several given audio clips. It might help to have an intuitive and qualitative evaluation of the ASR model's performance.
CUDA_VISIBLE_DEVICES=0 bash local/infer.sh
We provide two types of CTC decoders: CTC greedy decoder and CTC beam search decoder. The CTC greedy decoder is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The CTC beam search decoder otherwise utilizes a heuristic breadth-first graph search for reaching a near global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with argument decoding_method
.
To evaluate a model's performance quantitatively, please run:
CUDA_VISIBLE_DEVICES=0 bash local/test.sh
The error rate (default: word error rate; can be set with error_rate_type
) will be printed.
For more help on arguments:
The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertion weight) for the CTC beam search decoder often have a significant impact on the decoder's performance. It would be better to re-tune them on the validation set when the acoustic model is renewed.
tune.py
performs a 2-D grid search over the hyper-parameter $\alpha$ and $\beta$. You must provide the range of $\alpha$ and $\beta$, as well as the number of their attempts.
CUDA_VISIBLE_DEVICES=0 bash local/tune.sh
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameters space, and draw the error surface optionally. A proper hyper-parameters range should include the global minima of the error surface for WER/CER, as illustrated in the following figure.
An example error surface for tuning on the dev-clean set of LibriSpeech
Usually, as the figure shows, the variation of language model weight ($\alpha$) significantly affect the performance of CTC beam search decoder. And a better procedure is to first tune on serveral data batches (the number can be specified) to find out the proper range of hyper-parameters, then change to the whole validation set to carray out an accurate tuning.
After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation modules to see if they really help improve the ASR performance. For more help
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。