Google Proposes a Second Annotation Pass to Improve the MQM Framework and the Reliability of Human Translation Evaluation

Editor's Quick Take

Google researchers point out that single-pass human translation evaluation is prone to "noise" that affects quality comparisons between models. To address this, the team introduces a re-annotation step into the MQM framework, in which another rater reviews an existing annotation. Experiments show the method significantly improves scoring consistency and reliability, and in human-machine collaborative workflows it can balance quality and cost. The study also cautions against raters over-relying on the initial annotations; expert oversight remains indispensable.

Google Wants to Improve Human Translation Evaluation with This Simple Step

As AI translation quality becomes better and better, a crucial question has emerged: can human evaluation keep up? A new study from Google researchers argues that it can and explains how.

In their October 28, 2025, paper, researchers Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, and Markus Freitag proposed a refinement to the Multidimensional Quality Metrics (MQM) framework: re-annotation.

Rather than relying on a single pass, a second human rater — either the same or a different one — reviews an existing annotation — whether human- or machine-generated — correcting, deleting, or adding error spans.
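
To make the protocol concrete, here is a minimal sketch of such a second pass over span-level annotations. It assumes a simple representation in which each error is a span with a category and a severity; the class, field names, and example spans below are illustrative and not taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import Dict, List

@dataclass(frozen=True)
class ErrorSpan:
    start: int      # character offsets into the translation
    end: int
    category: str   # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str   # e.g. "minor" or "major"

def re_annotate(prior: List[ErrorSpan],
                deletions: List[ErrorSpan],
                corrections: Dict[ErrorSpan, ErrorSpan],
                additions: List[ErrorSpan]) -> List[ErrorSpan]:
    """Apply a second rater's decisions to an existing annotation:
    delete rejected spans, correct others, and add newly found errors."""
    revised = []
    for span in prior:
        if span in deletions:
            continue                                  # reviewer rejected this span
        revised.append(corrections.get(span, span))   # corrected or kept as-is
    revised.extend(additions)                         # errors missed in the first pass
    return revised

# Example: the first pass flagged one minor accuracy error; the reviewer
# upgrades its severity and adds a grammar error that had been missed.
first_pass = [ErrorSpan(10, 18, "accuracy/mistranslation", "minor")]
second_pass = re_annotate(
    prior=first_pass,
    deletions=[],
    corrections={first_pass[0]: replace(first_pass[0], severity="major")},
    additions=[ErrorSpan(40, 47, "fluency/grammar", "minor")],
)
print(second_pass)
```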

Adjusting the State-of-the-Art

The researchers describe the MQM framework as “the current state-of-the-art human evaluation framework.” Under this framework, raters mark translation errors by type and severity across dimensions such as fluency, accuracy, and terminology.
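
For readers less familiar with MQM scoring, a segment's score is typically obtained by summing severity-based penalties over the marked error spans. The sketch below uses a weighting convention common in published MQM evaluations (minor = 1, major = 5, minor punctuation = 0.1); the paper's exact weights and scoring details are not given in this article, so treat these as assumptions.

```python
# Hypothetical MQM scoring: sum severity-based penalties over marked errors.
# The weights follow a commonly used MQM convention and are an assumption here.
WEIGHTS = {"minor": 1.0, "major": 5.0}

def mqm_penalty(errors):
    """errors: list of (category, severity) pairs marked by a rater."""
    total = 0.0
    for category, severity in errors:
        if severity == "minor" and category == "fluency/punctuation":
            total += 0.1          # minor punctuation errors are weighted lightly
        else:
            total += WEIGHTS[severity]
    return total                  # higher = worse translation quality

# One rater's annotation of a segment: one major and two minor errors.
print(mqm_penalty([("accuracy/mistranslation", "major"),
                   ("fluency/grammar", "minor"),
                   ("fluency/punctuation", "minor")]))   # 5 + 1 + 0.1 = 6.1
```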

A single annotation pass, however, is prone to missed errors and rater disagreement. As AI translation systems keep improving, this "evaluation noise" may blur real quality differences between models and lead to wrong decisions.

随着人工智能翻译系统的持续优化,这种“评估噪音”可能会模糊不同模型之间的真实质量差异,进而导致错误的决策。

A Single Annotation Pass Is Insufficient

To test whether a second pass actually helps, the team ran a series of experiments. Here’s what they found:

When raters re-annotated their own work, they consistently found new errors they had missed the first time.

Reviewing another human’s annotation led to even more changes. Reviewing automatic annotations produced the most.

The second round of reviews made the results more consistent and reliable overall, even when the first pass came from automatic annotation — like GEMBA-MQM (prompted GPT-4) or AutoMQM (a fine-tuned Gemini 1.0).

Across all scenarios, re-annotation led to stronger agreement between raters — underscoring the method’s potential to boost evaluation reliability.
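
The article does not say which agreement statistic the authors report, so the snippet below only illustrates the general idea with a Pearson correlation of segment-level scores from two raters; the numbers are invented for the example.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Segment-level MQM penalties from two raters over the same five segments
# (made-up values, purely for illustration).
rater_a = [0.0, 1.0, 5.0, 1.1, 6.0]
rater_b = [0.0, 1.0, 6.0, 0.1, 5.0]

# Higher correlation means the raters score segments more consistently;
# re-annotation is reported to push this kind of agreement upward.
print(round(correlation(rater_a, rater_b), 3))
```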

The researchers highlighted, however, that there is still a risk that raters are influenced by the first set of annotations — overly trusting prior errors and focusing mainly on adding new ones. They inserted a few fake error marks into the data, and they found that while most raters spotted and deleted them, a minority kept them “at concerningly high rates” — around 80%.
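
A sketch of how such an attention check can be scored, assuming decoy spans are tracked by identifier and retention is measured per rater; the data and the 4-out-of-5 example are invented, chosen only to reproduce the roughly 80% figure mentioned above.

```python
def decoy_retention_rate(injected_decoys, revised_annotation):
    """Fraction of deliberately fake error spans a rater failed to delete
    during re-annotation. Both arguments are sets of span identifiers,
    e.g. (start, end, category) tuples."""
    if not injected_decoys:
        return 0.0
    kept = injected_decoys & revised_annotation
    return len(kept) / len(injected_decoys)

# A rater who keeps 4 of 5 planted decoys retains them at an 80% rate,
# the kind of over-trust in prior annotations the authors flag as a risk.
decoys = {(3, 9, "accuracy"), (12, 20, "fluency"), (25, 30, "style"),
          (41, 47, "accuracy"), (55, 60, "terminology")}
after_review = decoys - {(12, 20, "fluency")}   # only one decoy was deleted
print(decoy_retention_rate(decoys, after_review))  # -> 0.8
```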

Operational Takeaways for the Language Industry

For Language Solutions Integrators (LSIs) and enterprise buyers, the findings carry clear operational relevance.

As AI translation systems converge in quality, evaluation reliability has become the new bottleneck. A two-stage, collaborative process could strengthen benchmarks, vendor comparisons, and model selection.

The results also support hybrid workflows in which automatic MQM annotations are reviewed by human experts — improving consistency while controlling costs and turnaround times.

“Providing raters with prior annotations from high-quality LLM-based automatic systems improves rating quality over from-scratch annotation, at no additional human annotation cost,” the researchers said.
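
As a sketch of such a hybrid pipeline only: an automatic first pass produces candidate error spans, which are then handed to a human expert for the second pass. The callables below are placeholders; the article does not describe the actual interfaces of GEMBA-MQM, AutoMQM, or the raters' tooling.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str, str]   # (start, end, category, severity)

def hybrid_mqm(segment: str,
               auto_annotate: Callable[[str], List[Span]],
               human_review: Callable[[str, List[Span]], List[Span]]) -> List[Span]:
    """Two-stage MQM evaluation: automatic first pass, human second pass."""
    first_pass = auto_annotate(segment)        # cheap and scalable, but noisy
    return human_review(segment, first_pass)   # expert corrects, deletes, adds

# Toy stand-ins, purely for illustration.
def toy_auto_annotator(segment: str) -> List[Span]:
    return [(0, 5, "accuracy/mistranslation", "minor")]

def toy_human_review(segment: str, spans: List[Span]) -> List[Span]:
    return spans + [(10, 14, "fluency/grammar", "minor")]

print(hybrid_mqm("an example machine-translated segment",
                 toy_auto_annotator, toy_human_review))
```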

The findings also underline that training and calibration — keeping evaluators aligned on how they apply quality criteria — remain essential. Some raters were clearly influenced by earlier annotations, showing that re-annotation improves consistency but doesn’t replace expert oversight or quality control.

Source: Google Wants to Improve Human Translation Evaluation with This Simple Step - Slator
