Japanese OCR for GCP Document AI custom processor

Question

I am training a GCP Document AI custom processor for my project, but it does not seem to recognize Japanese text at all. Is there an option to enable Japanese language support?

Answer 1

Score: 2

Custom Document Extractor does not currently support Japanese (language code ja).

If you want Japanese support added to Custom Document Extractor, you can open a new feature request on the issue tracker describing your requirement.

For more information about custom processors, you can refer to this documentation.

Answer 2

Score: 2

This comment is accurate. Custom Document Extractor currently doesn't support Japanese, but it is on the product roadmap for H1 2023. There is a workaround that could work for you until the feature is implemented.

Note: This is not intended to be a permanent solution, but it can increase language capabilities for Document AI Workbench for the time being.

  1. Pre-process your training documents with the Document OCR processor, which supports Japanese.
  2. Save the output ProcessResponse JSON files, then remove the HumanReviewStatus and unwrap the Document object.
    • (i.e. the JSON should start with uri: "").
  3. Import the Document JSON files you have created into a Document AI Workbench Dataset and label the documents.
    • Note: Schema Labels can only be defined in English.
  4. During prediction, pre-process your documents with the Document OCR processor, then send the output into the Custom Document Extractor for prediction.
    • Note: This only works for online processing, not batch processing.
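
The steps above could be sketched in Python roughly as follows. This is a minimal sketch, not an official recipe: the function names, processor IDs and file paths are hypothetical. It assumes the saved ProcessResponse JSON holds the Document under a top-level "document" key, and that the google-cloud-documentai client library is installed for the prediction part.

```python
import json


def unwrap_process_response(response: dict) -> dict:
    """Step 2: return the bare Document from a ProcessResponse-shaped dict.

    Keeping only the "document" value drops the sibling human review
    status field, so the result starts at the Document's own fields.
    """
    if "document" not in response:
        raise ValueError("expected a ProcessResponse with a 'document' key")
    return response["document"]


def convert_file(src_path: str, dst_path: str) -> None:
    """Rewrite a saved ProcessResponse JSON file as a Document JSON file."""
    with open(src_path, encoding="utf-8") as src:
        document = unwrap_process_response(json.load(src))
    with open(dst_path, "w", encoding="utf-8") as dst:
        # ensure_ascii=False keeps Japanese text readable in the output.
        json.dump(document, dst, ensure_ascii=False)


def ocr_then_extract(project_id, location, ocr_processor_id,
                     extractor_processor_id, content, mime_type):
    """Step 4 (online only): run Document OCR, then the custom extractor.

    All IDs are placeholders supplied by the caller. The library is
    imported here so the helpers above work without it installed.
    """
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # First pass: OCR the raw file with the processor that supports Japanese.
    ocr_response = client.process_document(
        request=documentai.ProcessRequest(
            name=client.processor_path(project_id, location, ocr_processor_id),
            raw_document=documentai.RawDocument(
                content=content, mime_type=mime_type
            ),
        )
    )

    # Second pass: send the OCR'd Document inline to the custom extractor.
    extract_response = client.process_document(
        request=documentai.ProcessRequest(
            name=client.processor_path(
                project_id, location, extractor_processor_id
            ),
            inline_document=ocr_response.document,
        )
    )
    return extract_response.document
```

For training data, convert_file would be run once per saved OCR response before importing the results into the Workbench dataset. For prediction, ocr_then_extract chains the two online calls.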

huangapple
  • Posted on March 8, 2023 at 16:50:33
  • Please keep this link when reposting: https://go.coder-hub.com/75670977.html