Table 1. Prosody experiment under clean and masked audio. Values are mean [95% CI].

| Judge | Measure | Clean | Masked (WER = 1.64) |
|---|---|---|---|
| MERaLiON | Sensitivity | -0.055 [-0.065, -0.045] | -0.088 [-0.098, -0.077] |
| MERaLiON | Specificity | -0.004 [-0.006, -0.002] | -0.007 [-0.013, -0.002] |
| Qwen | Sensitivity | 0.005 [-0.004, 0.019] | -0.007 [-0.016, 0.004] |
| Qwen | Specificity | -0.087 [-0.096, -0.077] | -0.076 [-0.084, -0.067] |
| Flamingo | Sensitivity | -0.001 [-0.015, 0.012] | 0.001 [-0.005, 0.006] |
| Flamingo | Specificity | -0.015 [-0.022, -0.008] | -0.015 [-0.020, -0.012] |
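Every table in this section reports entries of the form mean [95% CI]. The exact resampling scheme is not specified here, so the following is only an illustrative sketch of a standard percentile bootstrap over hypothetical per-dialogue scores:

```python
import random

# Illustrative percentile bootstrap for a "mean [95% CI]" table entry.
# `scores` below is hypothetical per-dialogue data; the tables' actual
# resampling scheme is not specified in this section.
def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, record the mean of each resample.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    point = sum(scores) / n
    lo = means[int(n_boot * alpha / 2)]        # 2.5th percentile
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return point, lo, hi

point, lo, hi = bootstrap_ci([-0.06, -0.05, -0.07, -0.04, -0.05, -0.06] * 20)
print(f"{point:.3f} [{lo:.3f}, {hi:.3f}]")  # table-style "mean [lo, hi]" string
```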
Table 2A. Sensitivity across transcript sources (GT, Whisper-Large, Whisper-Base) for all judges under text-only and multimodal settings.

| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | text only | +0.080 | +0.015 | -0.009 | Degrades (-0.089) |
| Gemini-2.5-Flash | audio+text | +0.067 | +0.015 | — | Degrades (-0.052) |
| Qwen2.5-Omni-7B | text only | +0.029 | +0.037 | +0.020 | Stable |
| Qwen2.5-Omni-7B | audio+text | +0.090 | +0.127 | +0.143 | Improves (+0.053) |
| MiniCPM-o-4.5 | text only | -0.127 | +0.226 | +0.213 | Inverts |
| MiniCPM-o-4.5 | audio+text | -0.109 | -0.149 | -0.150 | Degrades (-0.041) |
Table 2B. Specificity across transcript sources (GT, Whisper-Large, Whisper-Base) for all judges under text-only and multimodal settings.

| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | text only | 1.0 | 0.4 | 0.5 | Collapses |
| Gemini-2.5-Flash | audio+text | 1.0 | 0.8 | — | Degrades |
| Qwen2.5-Omni-7B | text only | 0.4 | 0.9 | 0.8 | Improves |
| Qwen2.5-Omni-7B | audio+text | 0.0 | 0.2 | 0.2 | Improves |
| MiniCPM-o-4.5 | text only | 0.0 | 0.9 | 0.8 | Improves |
| MiniCPM-o-4.5 | audio+text | 0.0 | 0.0 | 0.0 | Unchanged |
Table 3. Gain decomposition and rescue values for judges evaluated under multiple transcript sources (WL = Whisper-Large, WB = Whisper-Base).

| Judge | Measure | gain_GT | gain_WL | gain_WB | rescue_WL | rescue_WB |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | Sensitivity | +0.061 | +0.090 | +0.122 | +0.029 | +0.061 |
| Qwen2.5-Omni-7B | Specificity | -0.010 | +0.090 | +0.122 | +0.100 | +0.132 |
| Gemini-2.5-Flash | Sensitivity | -0.013 | 0.000 | — | +0.013 | — |
| MiniCPM-o-4.5 | Sensitivity | +0.017 | -0.375 | -0.363 | -0.392 | -0.380 |
| MiniCPM-o-4.5 | Specificity | +0.021 | -0.374 | -0.356 | -0.395 | -0.377 |
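The gain and rescue columns of Table 3 are consistent with a simple decomposition over the Table 2A sensitivities: gain_X is the change from adding audio on top of transcript source X, and rescue_X = gain_X − gain_GT measures how much more (or less) audio helps when the transcript is ASR-generated rather than ground truth. A minimal sketch reproducing the Qwen2.5-Omni-7B sensitivity row (within one unit in the last decimal, since the Table 2A inputs are themselves rounded):

```python
# Gain/rescue decomposition behind Table 3, reconstructed from Table 2A.
# Inputs: Qwen2.5-Omni-7B sensitivities per transcript source.
text_only = {"GT": 0.029, "WL": 0.037, "WB": 0.020}
audio_text = {"GT": 0.090, "WL": 0.127, "WB": 0.143}

# gain_X = sensitivity(audio+text, X) - sensitivity(text only, X)
gain = {src: round(audio_text[src] - text_only[src], 3) for src in text_only}
# rescue_X = gain_X - gain_GT
rescue = {src: round(gain[src] - gain["GT"], 3) for src in ("WL", "WB")}

print(gain)    # GT: 0.061, WL: 0.090, WB: 0.123 (Table 3 shows +0.122)
print(rescue)  # WL: 0.029, WB: 0.062 (Table 3 shows +0.061)
```

The residual differences in the last decimal (0.123 vs. +0.122, 0.062 vs. +0.061) suggest Table 3 was computed from unrounded sensitivities rather than the rounded Table 2A entries.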
Table 4. AAPB of Sensitivity across transcript sources for judges with multi-source evaluations.

| Judge | Modality | GT | Whisper-Large | Whisper-Base |
|---|---|---|---|---|
| Gemini-2.5-Flash | text only | 0.027 | 0.025 | 0.028 |
| Gemini-2.5-Flash | audio+text | 0.042 | — | — |
| Qwen2.5-Omni-7B | text only | 0.059 | 0.159 | 0.205 |
| Qwen2.5-Omni-7B | audio+text | 0.097 | 0.106 | 0.087 |
| MiniCPM-o-4.5 | text only | 0.414 | 0.087 | 0.075 |
| MiniCPM-o-4.5 | audio+text | 0.325 | 0.350 | 0.360 |
Table 5A. Category-based comparison between MERaLiON (MER) and Audio Flamingo (AF) under audio-only and audio+text settings, by transcript source (GT = ground truth, WL = Whisper-Large, WB = Whisper-Base). Values are mean [95% CI].

| Setting | Category | MER Sensitivity | AF Sensitivity | MER Specificity AUC | AF Specificity AUC |
|---|---|---|---|---|---|
| audio only | harassment | 0.181 [0.160, 0.200] | -0.002 [-0.037, 0.033] | 0.765 [0.731, 0.801] | 0.765 [0.724, 0.801] |
| audio only | hate | 0.332 [0.305, 0.364] | 0.071 [0.023, 0.120] | 0.790 [0.756, 0.825] | 0.629 [0.580, 0.675] |
| audio only | violence | 0.162 [0.137, 0.190] | 0.035 [-0.001, 0.071] | 0.674 [0.628, 0.720] | 0.685 [0.639, 0.729] |
| audio+text / GT | harassment | 0.125 [0.106, 0.146] | 0.052 [-0.221, 0.344] | 0.843 [0.821, 0.865] | 0.727 [0.690, 0.763] |
| audio+text / GT | hate | 0.249 [0.224, 0.277] | 0.076 [-0.022, 0.170] | 0.818 [0.789, 0.845] | 0.707 [0.672, 0.742] |
| audio+text / GT | violence | 0.165 [0.138, 0.196] | -0.190 [-0.426, 0.057] | 0.823 [0.791, 0.852] | 0.733 [0.694, 0.770] |
| audio+text / WL | harassment | 0.142 [0.120, 0.162] | -0.004 [-0.046, 0.029] | 0.861 [0.841, 0.881] | 0.725 [0.685, 0.759] |
| audio+text / WL | hate | 0.256 [0.225, 0.283] | -0.037 [-0.089, 0.010] | 0.825 [0.795, 0.855] | 0.707 [0.669, 0.742] |
| audio+text / WL | violence | 0.157 [0.130, 0.186] | -0.020 [-0.062, 0.014] | 0.810 [0.774, 0.844] | 0.686 [0.637, 0.731] |
| audio+text / WB | harassment | 0.140 [0.119, 0.161] | 0.033 [-0.263, 0.286] | 0.850 [0.830, 0.872] | 0.723 [0.687, 0.759] |
| audio+text / WB | hate | 0.249 [0.220, 0.277] | -0.045 [-0.142, 0.046] | 0.809 [0.778, 0.844] | 0.694 [0.649, 0.735] |
| audio+text / WB | violence | 0.169 [0.139, 0.199] | -0.204 [-0.464, 0.075] | 0.818 [0.784, 0.849] | 0.694 [0.646, 0.738] |
Table 5B. Sensitivity comparison for the dangerous, self_harm, and sexual categories under text-only and audio+text settings. Values are mean [95% CI].

| Category | MER (text only) | AF (text only) | MER (audio+text) | AF (audio+text) |
|---|---|---|---|---|
| dangerous | 0.169 [0.135, 0.201] | 0.360 [0.316, 0.404] | 0.152 [0.138, 0.170] | 0.481 [0.424, 0.537] |
| self_harm | 0.058 [0.018, 0.096] | 0.580 [0.523, 0.629] | 0.091 [0.077, 0.107] | 0.413 [0.071, 0.682] |
| sexual | 0.193 [0.152, 0.235] | 0.651 [0.615, 0.683] | 0.278 [0.251, 0.306] | 0.433 [0.357, 0.505] |
Table 6. Safety scores for severity-0 dialogues.

| Model, modality | Safety score |
|---|---|
| Qwen, Audio+Transcriptions | 0.824 |
| Qwen, Audio | 0.800 |
| Flamingo, Audio+Transcriptions | 0.711 |
| Flamingo, Audio | 0.739 |
| MERaLiON, Audio+Transcriptions | 0.654 |
| MERaLiON, Audio | 0.672 |