With whisper.cpp installed, I loaded several models and tested each one in turn.

Test setup

Command to download a model:

sh ./models/download-ggml-model.sh base

Test commands:

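# Convert the recording to 16 kHz, mono, 16-bit PCM WAV
ffmpeg -i samples/meeting.m4a -ar 16000 -ac 1 -c:a pcm_s16le samples/meeting.wav
# Run the transcription
#   -f     audio file
#   -m     model to use
#   -l zh  Chinese
#   -t 8   8 threads
# (comments are kept off the continued lines: text after a trailing "\" breaks the continuation)
./build/bin/whisper-cli \
  -f samples/meeting.wav \
  -m models/ggml-base.bin \
  -l zh \
  -t 8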
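
To repeat the same test for every model, the download and run commands above can be wrapped in a loop. The following is a minimal sketch, not the exact script used for these measurements: it is a dry run that only prints the commands, and `model_path` and `AUDIO` are names made up for illustration.

```shell
#!/bin/sh
# Dry run: print the download and benchmark command for each model
# instead of executing them. Remove the leading "echo" to run for real.
AUDIO=samples/meeting.wav   # the WAV file produced by the ffmpeg step

# Hypothetical helper: downloaded models are stored as models/ggml-<name>.bin
model_path() {
  printf 'models/ggml-%s.bin\n' "$1"
}

for m in tiny base small medium large-v3 large-v3-turbo; do
  echo sh ./models/download-ggml-model.sh "$m"
  echo ./build/bin/whisper-cli -f "$AUDIO" -m "$(model_path "$m")" -l zh
done
```

Printing first makes it easy to check the model names and paths before committing to several gigabytes of downloads.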

Hardware and performance:

  • 7840HS, running on the CPU only, single process (no multi-processing), default 4 threads
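| Model | Disk | Memory | Audio length | Transcription time | Speed | Result |
|---|---|---|---|---|---|---|
| tiny | 75 MiB | ~273 MB | 34 s | 1.1 s | 31x | Needs -l zh to recognize Chinese; poor accuracy |
| base | 142 MiB | ~388 MB | 34 s | 2.2 s | 15x | Needs -l zh to recognize Chinese; poor accuracy |
| small | 466 MiB | ~852 MB | 34 s | 9.8 s | 3.5x | Needs -l zh to recognize Chinese |
| medium | 1.5 GiB | ~2.1 GB | 34 s | 21 s | 1.6x | Needs -l zh |
| large-v3 | 2.9 GiB | ~3.9 GB | 34 s | 40.7 s | 0.83x | |
| large-v3-turbo | 1.6 GB | 2.6 GB | 34 s | 33 s | 1.03x | |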
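
The speed column is the audio length divided by the transcription time, so any value above 1x is faster than real time. As a small sketch of that arithmetic (the function name `speed_factor` is made up for illustration, and awk is assumed to be available):

```shell
#!/bin/sh
# speed_factor AUDIO_SECONDS ELAPSED_SECONDS
# Prints how many times faster than real time the transcription ran.
speed_factor() {
  awk -v audio="$1" -v elapsed="$2" 'BEGIN { printf "%.2fx\n", audio / elapsed }'
}

speed_factor 34 9.8   # small: prints 3.47x
speed_factor 34 33    # large-v3-turbo: prints 1.03x
```

The table rounds these values more coarsely, which is why small appears there as 3.5x.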

tiny test

The language must be set to Chinese explicitly. Running ./build/bin/whisper-cli -f samples/test.wav -m models/ggml-tiny.bin -l zh on the same meeting recording gives:

[00:00:00.000 --> 00:00:07.680]  现在来你跟我说两句话
[00:00:07.680 --> 00:00:10.240] 我看一次
[00:00:10.240 --> 00:00:11.780] 两句话
[00:00:11.780 --> 00:00:13.320] 好
[00:00:13.320 --> 00:00:17.400] 请一下落音效果怎么样
[00:00:17.400 --> 00:00:20.480] 你这三两句话
[00:00:20.480 --> 00:00:23.040] 好
[00:00:29.180 --> 00:00:30.180] 好多吗?
whisper_print_timings: load time = 46.42 ms
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: mel time = 23.61 ms
whisper_print_timings: sample time = 82.86 ms / 282 runs ( 0.29 ms per run)
whisper_print_timings: encode time = 706.36 ms / 2 runs ( 353.18 ms per run)
whisper_print_timings: decode time = 1.88 ms / 1 runs ( 1.88 ms per run)
whisper_print_timings: batchd time = 193.39 ms / 269 runs ( 0.72 ms per run)
whisper_print_timings: prompt time = 50.64 ms / 96 runs ( 0.53 ms per run)
whisper_print_timings: total time = 1128.47 ms

base test

If no language is specified, the output looks like this:

[00:00:00.000 --> 00:00:02.000]   "What do you want to say?"
[00:00:02.000 --> 00:00:05.000] "What do you want to say to me?"
[00:00:05.000 --> 00:00:06.000] "What do you want to say to me?"
[00:00:06.000 --> 00:00:07.000] "What do you want to say to me?"
[00:00:07.000 --> 00:00:08.000] "What do you want to say to me?"
[00:00:08.000 --> 00:00:09.000] "What do you want to say to me?"
[00:00:09.000 --> 00:00:10.000] "What do you want to say to me?"
[00:00:10.000 --> 00:00:11.000] "What do you want to say to me?"
[00:00:11.000 --> 00:00:12.000] "What do you want to say to me?"
[00:00:12.000 --> 00:00:13.000] "What do you want to say to me?"
[00:00:13.000 --> 00:00:14.000] "What do you want to say to me?"
[00:00:14.000 --> 00:00:15.000] "What do you want to say to me?"
[00:00:15.000 --> 00:00:16.000] "What do you want to say to me?"
[00:00:16.000 --> 00:00:17.000] "What do you want to say to me?"
[00:00:17.000 --> 00:00:18.000] "What do you want to say to me?"
[00:00:18.000 --> 00:00:19.000] "What do you want to say to me?"
[00:00:19.000 --> 00:00:20.000] "What do you want to say to me?"
[00:00:20.000 --> 00:00:21.000] "What do you want to say to me?"
[00:00:21.000 --> 00:00:36.000] "What do you want to say to me?"

The language must be set to Chinese explicitly. Running ./build/bin/whisper-cli -f samples/test.wav -m models/ggml-base.bin -l zh on the same meeting recording gives:

[00:00:00.000 --> 00:00:07.680]  现在来你跟我说两句话
[00:00:07.680 --> 00:00:10.240] 我看一次
[00:00:10.240 --> 00:00:11.780] 两句话
[00:00:11.780 --> 00:00:13.320] 好
[00:00:13.320 --> 00:00:17.400] 请一下落音效果怎么样
[00:00:17.400 --> 00:00:20.480] 你这三两句话
[00:00:20.480 --> 00:00:23.040] 好
[00:00:29.180 --> 00:00:30.180] 好多吗?
whisper_print_timings: load time = 46.42 ms
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: mel time = 23.61 ms
whisper_print_timings: sample time = 82.86 ms / 282 runs ( 0.29 ms per run)
whisper_print_timings: encode time = 706.36 ms / 2 runs ( 353.18 ms per run)
whisper_print_timings: decode time = 1.88 ms / 1 runs ( 1.88 ms per run)
whisper_print_timings: batchd time = 193.39 ms / 269 runs ( 0.72 ms per run)
whisper_print_timings: prompt time = 50.64 ms / 96 runs ( 0.53 ms per run)
whisper_print_timings: total time = 1128.47 ms

small

Possibly because the voices in the recording are too noisy, small also needs -l zh. The transcript:

[00:00:00.000 --> 00:00:07.460] 现在来你跟我说两句话
[00:00:07.460 --> 00:00:11.640] 我看你是 你想说啥
[00:00:11.640 --> 00:00:17.280] 好 听一下录音效果怎么样
[00:00:17.280 --> 00:00:20.240] 你再说两句话
[00:00:20.240 --> 00:00:22.880] 好
[00:00:29.240 --> 00:00:30.280] 好 听一下


whisper_print_timings: load time = 208.93 ms
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: mel time = 24.02 ms
whisper_print_timings: sample time = 515.76 ms / 940 runs ( 0.55 ms per run)
whisper_print_timings: encode time = 5609.63 ms / 2 runs ( 2804.82 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 3176.09 ms / 928 runs ( 3.42 ms per run)
whisper_print_timings: prompt time = 231.88 ms / 96 runs ( 2.42 ms per run)
whisper_print_timings: total time = 9850.00 ms

medium

[00:00:00.000 --> 00:00:08.000] 你跟我說兩句話
[00:00:08.000 --> 00:00:11.000] 好你試
[00:00:11.000 --> 00:00:13.000] 你想說啥
[00:00:13.000 --> 00:00:15.000] 好
[00:00:15.000 --> 00:00:18.000] 聽一下錄音效果怎麼樣
[00:00:18.000 --> 00:00:21.000] 你再說兩句話
[00:00:21.000 --> 00:00:23.000] 好
[00:00:23.000 --> 00:00:33.000] 他怎麼了
whisper_print_timings: load time = 594.03 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 24.38 ms
whisper_print_timings: sample time = 80.54 ms / 253 runs ( 0.32 ms per run)
whisper_print_timings: encode time = 18115.92 ms / 2 runs ( 9057.96 ms per run)
whisper_print_timings: decode time = 62.90 ms / 3 runs ( 20.97 ms per run)
whisper_print_timings: batchd time = 2026.68 ms / 243 runs ( 8.34 ms per run)
whisper_print_timings: prompt time = 329.93 ms / 48 runs ( 6.87 ms per run)
whisper_print_timings: total time = 21424.55 ms

large-v3

[00:00:00.000 --> 00:00:07.600]  现在来你跟我说两句话
[00:00:07.600 --> 00:00:12.000] 我看你是你要说啥
[00:00:12.000 --> 00:00:17.400] 好听一下录音效果怎么样
[00:00:17.400 --> 00:00:20.400] 你再说两句话
[00:00:20.400 --> 00:00:34.440] 他怎么


whisper_print_timings: load time = 1698.18 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 29.04 ms
whisper_print_timings: sample time = 78.72 ms / 232 runs ( 0.34 ms per run)
whisper_print_timings: encode time = 34552.24 ms / 2 runs ( 17276.12 ms per run)
whisper_print_timings: decode time = 77.53 ms / 2 runs ( 38.77 ms per run)
whisper_print_timings: batchd time = 3319.69 ms / 223 runs ( 14.89 ms per run)
whisper_print_timings: prompt time = 547.42 ms / 43 runs ( 12.73 ms per run)
whisper_print_timings: total time = 40676.03 ms

ggml-large-v3-turbo

[00:00:00.000 --> 00:00:07.440] 现在来 你跟我说两句话
[00:00:07.440 --> 00:00:17.240] 你是 你要说啥 好 听一下录音效果怎么样
[00:00:17.240 --> 00:00:22.800] 你再说两句话 好
[00:00:22.800 --> 00:00:34.440] 他怎么


whisper_print_timings: load time = 1131.77 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 27.35 ms
whisper_print_timings: sample time = 68.56 ms / 208 runs ( 0.33 ms per run)
whisper_print_timings: encode time = 31061.01 ms / 2 runs ( 15530.50 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 527.58 ms / 201 runs ( 2.62 ms per run)
whisper_print_timings: prompt time = 102.62 ms / 44 runs ( 2.33 ms per run)
whisper_print_timings: total time = 32967.73 ms
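
In the timing logs above, the encode step accounts for most of the total time on the larger models. A minimal sketch for pulling that share out of a log, assuming the whisper_print_timings lines have been saved or piped in the format shown in this post (the function name `encode_share` is made up for illustration):

```shell
#!/bin/sh
# encode_share: read whisper_print_timings lines on stdin and print
# encode time as a percentage of total time.
encode_share() {
  awk '$1 == "whisper_print_timings:" && $2 == "encode" { enc = $5 }
       $1 == "whisper_print_timings:" && $2 == "total"  { tot = $5 }
       END { if (tot > 0) printf "%.1f%%\n", 100 * enc / tot }'
}

# Example with the large-v3 figures from this post: prints 84.9%
encode_share <<'EOF'
whisper_print_timings: encode time = 34552.24 ms / 2 runs ( 17276.12 ms per run)
whisper_print_timings: total time = 40676.03 ms
EOF
```

Matching on field position keeps the sketch short, but it depends on the log layout staying exactly as shown here.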