Splitting Japanese text into words in Java using BreakIterator

Question

We are trying to break Japanese sentences into words using BreakIterator, following the code in this question. The code works only for the text given in that question; when we try a different text, e.g. "速い茶色のキツネは怠惰な犬を飛び越えます", it fails to split the text into words.

What could be the issue?

Answer 1

Score: 1

BreakIterator.getSentenceInstance(Locale.JAPAN), as used in this question, breaks Japanese text into sentences rather than words. Japanese is normally written without spaces or other delimiters between words, so word boundaries cannot be read off the punctuation.
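To see the difference, here is a minimal sketch that uses only the JDK. The class name BreakIteratorDemo and the helper split are hypothetical names chosen for illustration. It runs the sample text through both the sentence iterator and the word iterator and prints whatever pieces each one reports.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorDemo {
    // Collects the substrings between consecutive boundaries reported by the iterator.
    static List<String> split(BreakIterator it, String text) {
        List<String> parts = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "速い茶色のキツネは怠惰な犬を飛び越えます";

        // The text has no sentence-ending punctuation, so the sentence
        // iterator returns it as a single piece.
        System.out.println(split(BreakIterator.getSentenceInstance(Locale.JAPAN), text));

        // The word iterator does report boundaries, but how closely they match
        // real Japanese words depends on the JDK's break rules, so no
        // particular output is claimed here.
        System.out.println(split(BreakIterator.getWordInstance(Locale.JAPAN), text));
    }
}
```

Comparing the two printed lists on your own input is a quick way to check whether the iterator you picked is the one you meant to use.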

You have to use a morphological analyzer to break a sentence into words. For example, you can use a Java port of TinySegmenter.

import java.util.List;
// Third-party dependency: the Java port of TinySegmenter must be on the classpath.
import jp.toastkid.libs.tinysegmenter.TinySegmenter;

public class Test {
    public static void main(String[] args) {
        TinySegmenter ts = TinySegmenter.getInstance();
        // Split the sentence into a list of word-like tokens.
        List<String> list = ts.segment("速い茶色のキツネは怠惰な犬を飛び越えます");
        System.out.println(String.join(" | ", list));
        // You will get "速い | 茶色 | の | キツネ | は | 怠惰 | な | 犬 | を | 飛び越え | ます"
    }
}

huangapple
  • Posted on October 8, 2020, 16:47:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64258959.html