Question

I want to get the timestamps of every word in my 'audio.wav' using Python pocketsphinx 0.1.15. I reproduced the official example code from https://pypi.org/project/pocketsphinx/, which works well for 'goforward.raw':

# ----------------------------
# | start |  end  |   word   |
# ----------------------------
# |  0.0s | 0.24s | <s>      |
# | 0.25s | 0.45s | <sil>    |
# | 0.46s | 0.63s | go       |
# | 0.64s | 1.16s | forward  |
# | 1.17s | 1.52s | ten      |
# | 1.53s | 2.11s | meters   |
# | 2.12s |  2.6s | </s>     |
# ----------------------------

With my 'audio.wav', the output of ps.segments(detailed=True) is not bad, but with the AudioFile class (as in the official example) the result is very inaccurate. It is not even close, neither in the timestamps (the audio is 2.52 s long) nor in the number of segments.

What is wrong? What should I do to get correct timestamps?

rate 16000 frames 40371 2.5231875

[('<s>', 1, 84, 91), ('que', 1, 92, 166),
 ('<sil>', -1443, 167, 169), ('la', -355, 170, 180),
 ('voz', -323, 181, 201), ('del', -3028, 202, 216),
 ('postulante', 0, 217, 279), ('en', -1, 280, 323),
 ('</s>', 0, 324, 327)]
----------------------------
| start |  end  |   word   |
----------------------------
| 0.07s | 0.14s |      <s> |
| 0.15s | 0.79s |      que |
|  0.8s | 0.82s |     </s> |
----------------------------

Here is my Python code:

import os.path
# This is just to have audio info
import wave
import contextlib

from pocketsphinx import (Pocketsphinx, AudioFile, LiveSpeech)
# my own ps model and other resources
from utils.utilities import (get_mexconf, get_data_path)

# get the file and print audio properties
wav = os.path.join(get_data_path(), 'audio.wav')
with contextlib.closing(wave.open(wav,'r')) as f:
   rate = f.getframerate()
   frames = f.getnframes()
   duration = frames / float(rate)
   print('rate', rate, 'frames', frames, 'duration', duration)

# This part seems to work getting segments 
segments = get_segments(wav)
print(segments)

# set up my asr models and my audio
config = get_mexconf()
config['audio_file'] = wav
audio = AudioFile(**config)

# This part is copy paste from official example #
# Frames per Second
fps = 100
config['frate'] = fps

for phrase in audio:
    print('-' * 28)
    print('| %5s |  %3s  |   %4s   |' % ('start', 'end', 'word'))
    print('-' * 28)
    for s in phrase.seg():
        print('| %4ss | %4ss | %8s |' % (s.start_frame / fps, s.end_frame / fps, s.word))
    print('-' * 28)
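
For reference, the conversion in the loop above is just the frame index divided by the frame rate (fps = 100 here). Below is a minimal sketch of that arithmetic applied to the tuples printed by ps.segments(detailed=True). The helper name frames_to_seconds is hypothetical, and the tuple layout (word, score, start frame, end frame) is assumed from the printed output. Whether the frame indices are relative to the start of the file depends on the decoder's silence handling, which this sketch does not address.

```python
def frames_to_seconds(segments, frate=100):
    """Convert (word, score, start_frame, end_frame) tuples to (word, start_s, end_s)."""
    return [(word, start / frate, end / frate)
            for word, _score, start, end in segments]


# Tuples copied from the output shown in the question
segments = [('<s>', 1, 84, 91), ('que', 1, 92, 166),
            ('<sil>', -1443, 167, 169), ('la', -355, 170, 180),
            ('voz', -323, 181, 201), ('del', -3028, 202, 216),
            ('postulante', 0, 217, 279), ('en', -1, 280, 323),
            ('</s>', 0, 324, 327)]

# Fixed two-decimal formatting keeps the columns aligned
for word, start, end in frames_to_seconds(segments):
    print('| %5.2fs | %5.2fs | %10s |' % (start, end, word))
```

Formatting with %.2f avoids the uneven widths seen in the table above, where 0.8 prints with one decimal and 0.79 with two.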

Update

This config:

config = {
    'hmm': os.path.join(model_path, 'LKE_T29.cd_cont_6000'),
    'lm': os.path.join(model_path, 'LKE_T29.lm.bin'),
    'dict': os.path.join(model_path, 'LKE_T29.dic'),
    'verbose': True,
    'backtrace' : True
}

produces this output:

/home/amolina/repo/audiotranscriptor/data/audio.wav
INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/feat.params
Current configuration:
[NAME]          [DEFLT]     [VALUE]
-agc            none        none
-agcthresh      2.0     2.000000e+00
-allphone               
-allphone_ci        yes     yes
-alpha          0.97        9.700000e-01
-ascale         20.0        2.000000e+01
-aw         1       1
-backtrace      no      yes
-beam           1e-48       1.000000e-48
-bestpath       yes     yes
-bestpathlw     9.5     9.500000e+00
-ceplen         13      13
-cmn            live        batch
-cmninit        40,3,-1     40,3,-1
-compallsen     no      no
-dict                   /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.dic
-dictcase       no      no
-dither         no      no
-doublebw       no      no
-ds         1       1
-fdict                  
-feat           1s_c_d_dd   1s_c_d_dd
-featparams             
-fillprob       1e-8        1.000000e-08
-frate          100     100
-fsg                    
-fsgusealtpron      yes     yes
-fsgusefiller       yes     yes
-fwdflat        yes     yes
-fwdflatbeam        1e-64       1.000000e-64
-fwdflatefwid       4       4
-fwdflatlw      8.5     8.500000e+00
-fwdflatsfwin       25      25
-fwdflatwbeam       7e-29       7.000000e-29
-fwdtree        yes     yes
-hmm                    /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000
-input_endian       little      little
-jsgf                   
-keyphrase              
-kws                    
-kws_delay      10      10
-kws_plp        1e-1        1.000000e-01
-kws_threshold      1e-30       1.000000e-30
-latsize        5000        5000
-lda                    
-ldadim         0       0
-lifter         0       22
-lm                 /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.lm.bin
-lmctl                  
-lmname                 
-logbase        1.0001      1.000100e+00
-logfn                  
-logspec        no      no
-lowerf         133.33334   1.300000e+02
-lpbeam         1e-40       1.000000e-40
-lponlybeam     7e-29       7.000000e-29
-lw         6.5     6.500000e+00
-maxhmmpf       30000       30000
-maxwpf         -1      -1
-mdef                   
-mean                   
-mfclogdir              
-min_endfr      0       0
-mixw                   
-mixwfloor      0.0000001   1.000000e-07
-mllr                   
-mmap           yes     yes
-ncep           13      13
-nfft           512     512
-nfilt          40      25
-nwpen          1.0     1.000000e+00
-pbeam          1e-48       1.000000e-48
-pip            1.0     1.000000e+00
-pl_beam        1e-10       1.000000e-10
-pl_pbeam       1e-10       1.000000e-10
-pl_pip         1.0     1.000000e+00
-pl_weight      3.0     3.000000e+00
-pl_window      5       5
-rawlogdir              
-remove_dc      no      no
-remove_noise       yes     yes
-remove_silence     yes     yes
-round_filters      yes     yes
-samprate       16000       1.600000e+04
-seed           -1      -1
-sendump                
-senlogdir              
-senmgau                
-silprob        0.005       5.000000e-03
-smoothspec     no      no
-svspec                 
-tmat                   
-tmatfloor      0.0001      1.000000e-04
-topn           4       4
-topn_beam      0       0
-toprule                
-transform      legacy      dct
-unit_area      yes     yes
-upperf         6855.4976   6.800000e+03
-uw         1.0     1.000000e+00
-vad_postspeech     50      50
-vad_prespeech      20      20
-vad_startspeech    10      10
-vad_threshold      3.0     3.000000e+00
-var                    
-varfloor       0.0001      1.000000e-04
-varnorm        no      no
-verbose        no      no
-warp_params                
-warp_type      inverse_linear  inverse_linear
-wbeam          7e-29       7.000000e-29
-wip            0.65        6.500000e-01
-wlen           0.025625    2.562500e-02

INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
INFO: mdef.c(518): Reading model definition: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/mdef
INFO: bin_mdef.c(181): Allocating 79833 * 8 bytes (623 KiB) for CD tree
INFO: tmat.c(149): Reading HMM transition probability matrices: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/transition_matrices
INFO: acmod.c(113): Attempting to use PTM computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/means
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/variances
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(304): 79 variance values floored
INFO: ptm_mgau.c(803): Number of codebooks exceeds 256: 6090
INFO: acmod.c(115): Attempting to use semi-continuous computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/means
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/variances
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(304): 79 variance values floored
INFO: acmod.c(117): Falling back to general multi-stream GMM computation
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/means
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/variances
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(304): 79 variance values floored
INFO: ms_senone.c(149): Reading senone mixture weights: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/mixture_weights
INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits
INFO: ms_senone.c(207): Not transposing mixture weights in memory
INFO: ms_senone.c(268): Read mixture weights for 6090 senones: 1 features x 32 codewords
INFO: ms_senone.c(320): Mapping senones to individual codebooks
INFO: ms_mgau.c(144): The value of topn: 4
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 270112 * 32 bytes (8441 KiB) for word entries
INFO: dict.c(333): Reading main dictionary: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.dic
INFO: dict.c(213): Dictionary size 266013, allocated 2217 KiB for strings, 4260 KiB for phones
INFO: dict.c(336): 266013 words read
INFO: dict.c(358): Reading filler dictionary: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/noisedict
INFO: dict.c(213): Dictionary size 266016, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 3 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 30^3 * 2 bytes (52 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 21840 bytes (21 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 21840 bytes (21 KiB) for single-phone word triphones
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_search_fwdtree.c(74): Initializing search tree
INFO: ngram_search_fwdtree.c(101): 675 unique initial diphones
INFO: ngram_search_fwdtree.c(186): Creating search channels
INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 580308
INFO: ngram_search_fwdtree.c(333): Created 675 root, 580180 non-root channels, 75 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: cmn_live.c(120): Update from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
INFO: cmn_live.c(138): Update to   < 48.64  8.26  9.46 15.53  0.73 10.14 -16.38 -7.88  2.97 -27.40 19.44 -3.48 -3.37 >
INFO: ngram_search_fwdtree.c(1550):     2069 words recognized (8/fr)
INFO: ngram_search_fwdtree.c(1552):   660566 senones evaluated (2696/fr)
INFO: ngram_search_fwdtree.c(1556):  3277888 channels searched (13379/fr), 106535 1st, 54408 last
INFO: ngram_search_fwdtree.c(1559):     4296 words for which last channels evaluated (17/fr)
INFO: ngram_search_fwdtree.c(1561):   254434 candidate words for entering last phone (1038/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 2.07 CPU 0.844 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 2.07 wall 0.844 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 118 words
INFO: ngram_search_fwdflat.c(948):     1097 words recognized (4/fr)
INFO: ngram_search_fwdflat.c(950):    95729 senones evaluated (391/fr)
INFO: ngram_search_fwdflat.c(952):    72888 channels searched (297/fr)
INFO: ngram_search_fwdflat.c(954):     6586 words searched (26/fr)
INFO: ngram_search_fwdflat.c(957):     6611 word transitions (26/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.18 CPU 0.074 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.18 wall 0.074 xRT
INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.240
INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
INFO: ngram_search.c(1381): Lattice has 228 nodes, 280 links
INFO: ps_lattice.c(1374): Bestpath score: -8058
INFO: ps_lattice.c(1378): Normalizer P(O) = alpha(</s>:240:243) = -661594
INFO: ps_lattice.c(1435): Joint P(O,S) = -678760 P(S|O) = -17166
INFO: ngram_search.c(872): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(875): bestpath 0.00 wall 0.000 xRT
INFO: pocketsphinx.c(1170): que la voz del postulante en (-8196)
word                 start end   pprob ascr       lscr       lback
INFO: ngram_search.c(1027): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(1030): bestpath 0.00 wall 0.000 xRT
<s>                  84    91    1.000 -306176    0          1  
que                  92    166   1.000 -1845248   -223       2  
<sil>                167   169   0.866 -306176    -524288    2  
la                   170   180   0.965 -253952    -320       1  
voz                  181   201   0.968 -272384    -406       2  
del                  202   216   0.739 -475136    -185       3  
postulante           217   279   1.000 -1587200   -168       3  
en                   280   323   1.000 -807936    -170       2  
</s>                 324   327   1.000 -807936    -189       2  
INFO: ngram_search.c(1027): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(1030): bestpath 0.00 wall 0.000 xRT
[('<s>', 1, 84, 91), ('que', 1, 92, 166), ('<sil>', -1443, 167, 169), ('la', -355, 170, 180), ('voz', -323, 181, 201), ('del', -3028, 202, 216), ('postulante', 0, 217, 279), ('en', -1, 280, 323), ('</s>', 0, 324, 327)]
INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/feat.params
Current configuration:
[NAME]          [DEFLT]     [VALUE]
-agc            none        none
-agcthresh      2.0     2.000000e+00
-allphone               
-allphone_ci        yes     yes
-alpha          0.97        9.700000e-01
-ascale         20.0        2.000000e+01
-aw         1       1
-backtrace      no      yes
-beam           1e-48       1.000000e-48
-bestpath       yes     yes
-bestpathlw     9.5     9.500000e+00
-ceplen         13      13
-cmn            live        batch
-cmninit        40,3,-1     40,3,-1
-compallsen     no      no
-dict                   /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.dic
-dictcase       no      no
-dither         no      no
-doublebw       no      no
-ds         1       1
-fdict                  
-feat           1s_c_d_dd   1s_c_d_dd
-featparams             
-fillprob       1e-8        1.000000e-08
-frate          100     100
-fsg                    
-fsgusealtpron      yes     yes
-fsgusefiller       yes     yes
-fwdflat        yes     yes
-fwdflatbeam        1e-64       1.000000e-64
-fwdflatefwid       4       4
-fwdflatlw      8.5     8.500000e+00
-fwdflatsfwin       25      25
-fwdflatwbeam       7e-29       7.000000e-29
-fwdtree        yes     yes
-hmm                    /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000
-input_endian       little      little
-jsgf                   
-keyphrase              
-kws                    
-kws_delay      10      10
-kws_plp        1e-1        1.000000e-01
-kws_threshold      1e-30       1.000000e-30
-latsize        5000        5000
-lda                    
-ldadim         0       0
-lifter         0       22
-lm                 /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.lm.bin
-lmctl                  
-lmname                 
-logbase        1.0001      1.000100e+00
-logfn                  
-logspec        no      no
-lowerf         133.33334   1.300000e+02
-lpbeam         1e-40       1.000000e-40
-lponlybeam     7e-29       7.000000e-29
-lw         6.5     6.500000e+00
-maxhmmpf       30000       30000
-maxwpf         -1      -1
-mdef                   
-mean                   
-mfclogdir              
-min_endfr      0       0
-mixw                   
-mixwfloor      0.0000001   1.000000e-07
-mllr                   
-mmap           yes     yes
-ncep           13      13
-nfft           512     512
-nfilt          40      25
-nwpen          1.0     1.000000e+00
-pbeam          1e-48       1.000000e-48
-pip            1.0     1.000000e+00
-pl_beam        1e-10       1.000000e-10
-pl_pbeam       1e-10       1.000000e-10
-pl_pip         1.0     1.000000e+00
-pl_weight      3.0     3.000000e+00
-pl_window      5       5
-rawlogdir              
-remove_dc      no      no
-remove_noise       yes     yes
-remove_silence     yes     yes
-round_filters      yes     yes
-samprate       16000       1.600000e+04
-seed           -1      -1
-sendump                
-senlogdir              
-senmgau                
-silprob        0.005       5.000000e-03
-smoothspec     no      no
-svspec                 
-tmat                   
-tmatfloor      0.0001      1.000000e-04
-topn           4       4
-topn_beam      0       0
-toprule                
-transform      legacy      dct
-unit_area      yes     yes
-upperf         6855.4976   6.800000e+03
-uw         1.0     1.000000e+00
-vad_postspeech     50      50
-vad_prespeech      20      20
-vad_startspeech    10      10
-vad_threshold      3.0     3.000000e+00
-var                    
-varfloor       0.0001      1.000000e-04
-varnorm        no      no
-verbose        no      no
-warp_params                
-warp_type      inverse_linear  inverse_linear
-wbeam          7e-29       7.000000e-29
-wip            0.65        6.500000e-01
-wlen           0.025625    2.562500e-02

INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
INFO: mdef.c(518): Reading model definition: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/mdef
INFO: bin_mdef.c(181): Allocating 79833 * 8 bytes (623 KiB) for CD tree
INFO: tmat.c(149): Reading HMM transition probability matrices: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/transition_matrices
INFO: acmod.c(113): Attempting to use PTM computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/means
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/variances
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(304): 79 variance values floored
INFO: ptm_mgau.c(803): Number of codebooks exceeds 256: 6090
INFO: acmod.c(115): Attempting to use semi-continuous computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/means
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/variances
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(304): 79 variance values floored
INFO: acmod.c(117): Falling back to general multi-stream GMM computation
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/means
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/variances
INFO: ms_gauden.c(242): 6090 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  32x39
INFO: ms_gauden.c(304): 79 variance values floored
INFO: ms_senone.c(149): Reading senone mixture weights: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/mixture_weights
INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits
INFO: ms_senone.c(207): Not transposing mixture weights in memory
INFO: ms_senone.c(268): Read mixture weights for 6090 senones: 1 features x 32 codewords
INFO: ms_senone.c(320): Mapping senones to individual codebooks
INFO: ms_mgau.c(144): The value of topn: 4
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 270112 * 32 bytes (8441 KiB) for word entries
INFO: dict.c(333): Reading main dictionary: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.dic
INFO: dict.c(213): Dictionary size 266013, allocated 2217 KiB for strings, 4260 KiB for phones
INFO: dict.c(336): 266013 words read
INFO: dict.c(358): Reading filler dictionary: /home/amolina/repo/audiotranscriptor/modelSphinx/LKE_T29.cd_cont_6000/noisedict
INFO: dict.c(213): Dictionary size 266016, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 3 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 30^3 * 2 bytes (52 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 21840 bytes (21 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 21840 bytes (21 KiB) for single-phone word triphones
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_search_fwdtree.c(74): Initializing search tree
INFO: ngram_search_fwdtree.c(101): 675 unique initial diphones
INFO: ngram_search_fwdtree.c(186): Creating search channels
INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 580308
INFO: ngram_search_fwdtree.c(333): Created 675 root, 580180 non-root channels, 75 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: cmn_live.c(120): Update from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
INFO: cmn_live.c(138): Update to   < 51.42  6.44  5.63 41.21  5.63  4.06 -30.89 -11.26 11.47 -38.96 22.43 -14.16  5.74 >
INFO: ngram_search_fwdtree.c(1550):      437 words recognized (6/fr)
INFO: ngram_search_fwdtree.c(1552):   136376 senones evaluated (1771/fr)
INFO: ngram_search_fwdtree.c(1556):   477102 channels searched (6196/fr), 27215 1st, 9054 last
INFO: ngram_search_fwdtree.c(1559):      902 words for which last channels evaluated (11/fr)
INFO: ngram_search_fwdtree.c(1561):    29178 candidate words for entering last phone (378/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 0.36 CPU 0.471 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 0.36 wall 0.471 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 13 words
INFO: ngram_search_fwdflat.c(948):      341 words recognized (4/fr)
INFO: ngram_search_fwdflat.c(950):    13931 senones evaluated (181/fr)
INFO: ngram_search_fwdflat.c(952):     9837 channels searched (127/fr)
INFO: ngram_search_fwdflat.c(954):      923 words searched (11/fr)
INFO: ngram_search_fwdflat.c(957):      395 word transitions (5/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.02 CPU 0.031 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.02 wall 0.031 xRT
INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.73
INFO: ngram_search.c(1276): Eliminated 1 nodes before end node
INFO: ngram_search.c(1381): Lattice has 48 nodes, 32 links
INFO: ps_lattice.c(1374): Bestpath score: -2486
INFO: ps_lattice.c(1378): Normalizer P(O) = alpha(</s>:73:75) = -148657
INFO: ps_lattice.c(1435): Joint P(O,S) = -155501 P(S|O) = -6844
INFO: ngram_search.c(872): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(875): bestpath 0.00 wall 0.000 xRT
INFO: pocketsphinx.c(1170): que (-2547)
word                 start end   pprob ascr       lscr       lback
INFO: ngram_search.c(1027): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(1030): bestpath 0.00 wall 0.000 xRT
<s>                  7     14    1.000 -306176    0          1  
que                  15    79    1.000 -1390592   -223       2  
</s>                 80    82    1.000 -1390592   -345       3  
----------------------------
| start |  end  |   word   |
----------------------------
INFO: ngram_search.c(1027): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(1030): bestpath 0.00 wall 0.000 xRT
| 0.07s | 0.14s |      <s> |
| 0.15s | 0.79s |      que |
|  0.8s | 0.82s |     </s> |
----------------------------
INFO: cmn_live.c(120): Update from < 51.42  6.44  5.63 41.21  5.63  4.06 -30.89 -11.26 11.47 -38.96 22.43 -14.16  5.74 >
INFO: cmn_live.c(138): Update to   < 48.82  8.28  9.59 15.19  0.65 10.14 -16.30 -7.60  2.97 -27.46 19.32 -3.37 -3.41 >
INFO: ngram_search_fwdtree.c(1550):     4085 words recognized (25/fr)
INFO: ngram_search_fwdtree.c(1552):   465826 senones evaluated (2806/fr)
INFO: ngram_search_fwdtree.c(1556):  1701760 channels searched (10251/fr), 73994 1st, 78396 last
INFO: ngram_search_fwdtree.c(1559):     6445 words for which last channels evaluated (38/fr)
INFO: ngram_search_fwdtree.c(1561):   107024 candidate words for entering last phone (644/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 1.25 CPU 0.755 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 1.25 wall 0.755 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 266 words
INFO: ngram_search_fwdflat.c(948):     1400 words recognized (8/fr)
INFO: ngram_search_fwdflat.c(950):   115214 senones evaluated (694/fr)
INFO: ngram_search_fwdflat.c(952):   114370 channels searched (688/fr)
INFO: ngram_search_fwdflat.c(954):    11749 words searched (70/fr)
INFO: ngram_search_fwdflat.c(957):    13504 word transitions (81/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.23 CPU 0.138 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.23 wall 0.138 xRT
INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.162
INFO: ngram_search.c(1276): Eliminated 1 nodes before end node
INFO: ngram_search.c(1381): Lattice has 185 nodes, 243 links
INFO: ps_lattice.c(1374): Bestpath score: -7587
INFO: ps_lattice.c(1378): Normalizer P(O) = alpha(</s>:162:164) = -572256
INFO: ps_lattice.c(1435): Joint P(O,S) = -645037 P(S|O) = -72781
INFO: ngram_search.c(872): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(875): bestpath 0.00 wall 0.000 xRT
INFO: pocketsphinx.c(1170): la voz del postulante no (-7682)
word                 start end   pprob ascr       lscr       lback
INFO: ngram_search.c(1027): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(1030): bestpath 0.00 wall 0.000 xRT
<s>                  84    86    1.000 -133120    0          1  
la                   87    99    0.096 -362496    -257       2  
voz                  100   119   0.091 -572416    -404       3  
del                  120   132   0.015 -706560    -185       3  
<sil>                133   137   0.781 -460800    -524288    3  
postulante           138   199   1.000 -2325504   -871       1  
no                   200   245   0.501 -878592    -353       1  
</s>                 246   248   1.000 -878592    -192       2  
INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 1.62 CPU 0.670 xRT
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.62 wall 0.670 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.25 CPU 0.105 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.25 wall 0.105 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.000 xRT
INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 2.07 CPU 0.847 xRT
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 2.07 wall 0.847 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.18 CPU 0.074 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.18 wall 0.074 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.000 xRT

Nikolay Shmyrev · Answer 1 · 8 November 2018

This is a bug in pocketsphinx-python; you can report it on GitHub.

The fix would look something like this, in __init__.py:

def __iter__(self):
    with self.f:
        with self.start_utterance():
            while self.f.readinto(self.buf):
                self.process_raw(self.buf, self.no_search, self.full_utt)
                if self.keyphrase and self.hyp():
                    with self.end_utterance():
                        yield self
                elif self.in_speech != self.get_in_speech():
                    self.in_speech = self.get_in_speech()
                    if not self.in_speech and self.hyp():
                        with self.end_utterance():
                            yield self
            # After the read loop: the file ended while still in speech
            if self.in_speech and self.hyp():
                with self.end_utterance():
                    yield self # <-- Return last chunk
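
The idea behind the fix does not depend on pocketsphinx. A generator that emits a segment only on a speech-to-silence transition will drop the last segment whenever the input ends while still in speech, so it needs one final check once the input is exhausted. That matches the log above, where "la voz del postulante no" is decoded but never appears in the printed table. Here is a simplified, self-contained sketch of the pattern; split_on_silence and the boolean flags are hypothetical stand-ins for the decoder's in-speech state, not part of the pocketsphinx API.

```python
def split_on_silence(flags):
    """Yield (start, end) index ranges of consecutive True (in-speech) values."""
    start = None
    for i, in_speech in enumerate(flags):
        if in_speech and start is None:
            # silence -> speech: a new segment begins
            start = i
        elif not in_speech and start is not None:
            # speech -> silence: the segment is complete
            yield (start, i - 1)
            start = None
    # Flush: without this check, a stream that ends in speech
    # silently loses its last segment.
    if start is not None:
        yield (start, len(flags) - 1)


# The stream ends while still "in speech", like the audio in the question
flags = [False, True, True, False, True, True]
print(list(split_on_silence(flags)))
```

Removing the final check makes the same input yield only the first range.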

Pocketsphinx Python does not return the last utterance when iterating over audio
