Why is Word Information Lost (WIL) calculated the way it is?

Question

Word Information Lost (WIL) is a measure of the performance of an automated speech recognition (ASR) service (e.g. AWS Transcribe, Google Speech-to-Text, etc.) against a gold standard (usually human-generated) transcript, and is generally considered a more sophisticated measure than Word Error Rate (WER).

The formula for WIL is as follows:

$$\mathrm{WIL} = 1 - \frac{H^2}{N \cdot P} = 1 - \frac{H^2}{(H + S + D)(H + S + I)}$$

...where:

  • H = hits (matching words between the gold standard and ASR transcripts)
  • N = total words in the gold standard transcript
  • P = total words in the ASR transcript
  • S = substitutions (one word replaced with another)
  • D = deletions (a word in the gold standard transcript not present in the ASR one)
  • I = insertions (a word in the ASR transcript not present in the gold standard one)
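As a minimal sketch of the definitions above, WIL can be computed directly from the four alignment counts, using N = H + S + D and P = H + S + I. The function name and the handling of empty transcripts are assumptions made for illustration; they do not come from any particular library.

```python
def word_information_lost(hits: int, substitutions: int,
                          deletions: int, insertions: int) -> float:
    """Return WIL = 1 - H^2 / (N * P) from alignment counts."""
    n_ref = hits + substitutions + deletions    # N: words in the gold standard
    n_hyp = hits + substitutions + insertions   # P: words in the ASR transcript
    if n_ref == 0 or n_hyp == 0:
        # Assumption: two empty transcripts lose nothing; if only one side
        # is empty, all information is treated as lost.
        return 0.0 if n_ref == n_hyp else 1.0
    return 1.0 - (hits * hits) / (n_ref * n_hyp)


# 8 hits, 1 substitution, 1 deletion, 0 insertions: N = 10, P = 9
print(round(word_information_lost(8, 1, 1, 0), 4))  # 0.2889
```

Written this way, the "squared" term is just the product of two ratios, H/N and H/P: the share of gold-standard words that were recognised, and the share of ASR words that were correct.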

My question is: why is it calculated this way?

I'm not grasping what exactly WIL is supposed to represent here, especially its exponential nature (in both the numerator and denominator).

It seems like a simpler, more immediately understandable version could be something like:

[Image not preserved in the source: the asker's proposed simpler formula.]

How would you describe what WIL means, in layman's terms?

Answer 1

Score: 1

The Word Information Lost (WIL) metric is an approximation of a different metric, Relative Information Lost (RIL), which was used in the early 2000s to measure the accuracy of automatic speech recognition (ASR) systems.

RIL depends on knowing the "similarity" between the words that were inserted, deleted or substituted. This is often called "mutual information". For example, if the ASR system transcribes "piece" instead of "pieced", there is likely to be a high degree of mutual information. But if the ASR transcribes, say "peace" instead of "piece", the mutual information is lower. Similarly, "bred" for "bread" would have less mutual information.

At the heart of RIL is a desire to "rank" errors that are made in transcription so that some types of errors matter less to the overall accuracy score of the ASR system. This is how human speech works; we are better able to "fill the gaps" if we mishear or misunderstand speech because we have a much better grasp of context - we know it's "bread" if we're speaking about breakfast, and we know it's "piece" rather than "peace" if we're talking about pie.

WIL does not depend on knowing the statistical relationship between the words that were "hit" and those that were "inserted", "deleted" or "substituted". Instead, WIL tries to approximate RIL by "weighting the hits" and "weighting the misses". What I would expect to see with WIL is that it "dampens out" extremely low and extremely high WER, but raises the WER where multiple insertions, deletions or substitutions are made. This allows researchers to target efforts in re-training or fine-tuning the ASR system on particular words, phrases or n-grams.
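One way to probe this intuition is to compute both metrics for a few hand-picked error profiles. The sketch below assumes the usual definitions WER = (S + D + I) / N and WIL = 1 - (H / N)(H / P), with N = H + S + D and P = H + S + I; the profiles themselves are made up for illustration.

```python
def wer(h: int, s: int, d: int, i: int) -> float:
    """Word Error Rate: errors divided by the number of reference words."""
    return (s + d + i) / (h + s + d)


def wil(h: int, s: int, d: int, i: int) -> float:
    """Word Information Lost: 1 - (H / N) * (H / P)."""
    return 1.0 - (h / (h + s + d)) * (h / (h + s + i))


# (hits, substitutions, deletions, insertions); illustrative counts only
profiles = {
    "perfect": (10, 0, 0, 0),
    "2 substitutions": (8, 2, 0, 0),
    "2 deletions": (8, 0, 2, 0),
    "2 insertions": (10, 0, 0, 2),
    "12 insertions": (10, 0, 0, 12),
}

for name, counts in profiles.items():
    print(f"{name:16s} WER={wer(*counts):.2f}  WIL={wil(*counts):.2f}")
```

With these counts, WER is 0.20 for each of the three two-error profiles, whereas WIL is 0.36 for substitutions, 0.20 for deletions and about 0.17 for insertions: a substitution lowers both H/N and H/P, so it is penalised twice. In the heavy-insertion profile WER climbs to 1.20, while WIL stays below 1 (about 0.55), because it is bounded to the range 0 to 1.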

A particularly interesting application of WIL is used in Whisper:

> Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.

The researchers behind Whisper realised that small mistakes in transcription - periods, single letters - can inflate the WER, so they normalised the text before scoring and essentially ignored these types of errors. They are using a form of WIL, without calling it WIL.
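To illustrate how much formatting-only differences can move WER, here is a minimal sketch that scores the same pair of transcripts with and without a normalisation step. The `normalize` function is a hypothetical, heavily simplified stand-in (lowercasing and stripping punctuation) and is not Whisper's actual normaliser; the example sentences are made up.

```python
import re


def normalize(text: str) -> list[str]:
    # Hypothetical simplification: lowercase, drop punctuation except
    # apostrophes, then split on whitespace.
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or hit
        prev = cur
    return prev[-1]


def wer(ref: list[str], hyp: list[str]) -> float:
    return word_errors(ref, hyp) / len(ref)


ref = "Give peace a chance."
hyp = "give peace a chance"

raw_wer = wer(ref.split(), hyp.split())           # "Give"/"give", "chance."/"chance"
norm_wer = wer(normalize(ref), normalize(hyp))    # identical after normalisation
print(raw_wer, norm_wer)  # 0.5 0.0
```

Two of the four reference words differ only in casing or punctuation, so the raw WER is 0.5 even though a listener would call the transcript perfect.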

huangapple
  • Posted on 2023-08-09 02:56:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76862465.html