Scanner with Regex not reading the entire file

Question

Here's my parsing method.

public void loadInput(File fileName) throws IOException {
    try {
        Scanner s = new Scanner(fileName);
        int numWords = 0;
        while (s.hasNext("(?<!')[\\w']+")) {
            System.out.println("word:" + s.next());
            numWords++;
        }
        System.out.println("Number of words: " + numWords);
    } catch (IOException e) {
        System.out.println("Error accessing input file!");
    }
}

And here's an example input file:

Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

It only matches these words:

word:Alice
word:was
word:beginning
word:to
word:get
word:very
word:tired
word:of
word:sitting
word:by
word:her
word:sister
word:on
word:the
Number of words: 14

Somehow, the scanner thinks it has reached the end of the file, which is not true. Any ideas on why this is happening? I checked my regex and it does seem to work (a word contains letters a-z and apostrophes). Thanks!

Answer 1

Score: 1

Scanner divides the text into "tokens". The default token separator is whitespace. When your program stops, the current token is "bank," (including the trailing comma). When that token is checked against the regex passed to hasNext(), it does not match because of the extra comma on the end.
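To see which token trips the loop, here is a minimal sketch that prints every whitespace-delimited token alongside whether the whole token matches the regex from the question. The class and method names (TokenMatchDemo, describeTokens) are made up for illustration, and String.matches is used here as a stand-in for the whole-token check that hasNext(pattern) performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TokenMatchDemo {
    // The regex from the question: word characters or apostrophes,
    // not preceded by an apostrophe.
    static final String WORD_REGEX = "(?<!')[\\w']+";

    // Returns one entry per whitespace-delimited token, noting whether the
    // entire token matches WORD_REGEX (hypothetical helper for illustration).
    static List<String> describeTokens(String text) {
        List<String> report = new ArrayList<>();
        try (Scanner s = new Scanner(text)) {
            while (s.hasNext()) {          // plain hasNext(): any token left?
                String token = s.next();
                report.add(token + " -> " + token.matches(WORD_REGEX));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        describeTokens("sister on the bank, and of").forEach(System.out::println);
    }
}
```

For this input, "bank," is the first token reported as false, which is where the loop in the question gives up.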

One solution is to keep the scanner's default whitespace delimiter, call hasNext() and next() without a pattern, and apply the regex to each token before printing it.

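// wordPattern is assumed to be compiled from the regex in the question
Pattern wordPattern = Pattern.compile("(?<!')[\\w']+");
while (s.hasNext()) {
    Matcher m = wordPattern.matcher(s.next());
    if (m.find()) {
        System.out.println("word:" + m.group(0));
    }
}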
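For reference, a self-contained version of the loop above might look like the following sketch. The class name WordExtractor and the method extractWords are hypothetical, and the pattern is assumed to be the one from the question. It reads from a String rather than a File so that it can be run directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // Assumed definition of wordPattern, compiled from the asker's regex.
    private static final Pattern WORD_PATTERN = Pattern.compile("(?<!')[\\w']+");

    // Collects the first word-like run found inside each whitespace token.
    static List<String> extractWords(Scanner s) {
        List<String> words = new ArrayList<>();
        while (s.hasNext()) {                       // true while any token remains
            Matcher m = WORD_PATTERN.matcher(s.next());
            if (m.find()) {                         // search inside the token
                words.add(m.group());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Scanner s = new Scanner("on the bank, and of having nothing to do:  once");
        List<String> words = extractWords(s);
        words.forEach(w -> System.out.println("word:" + w));
        System.out.println("Number of words: " + words.size());
    }
}
```

Note that a single find() call returns only the first run in a token, so "daisy-chain" would yield just "daisy". Looping with while (m.find()) would pick up both halves, if that is the behaviour you want.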

Answer 2

Score: 1

Scanner's hasNext is mostly useless.

Scanner works like this:

  1. Whenever relevant (on any next() / nextX() call or any hasNext call, but not nextLine()), the scanner makes sure it knows the 'next token in the queue'. If there isn't one already, it reads another token from the input. It does this by completely disregarding what was asked for and instead scanning for either end-of-stream or the 'delimiter' (which, by default, is 'any whitespace'). Everything up to that point is the next token.
  2. hasX() checks the token that is next in line and returns true or false depending on whether it matches. It has nothing to do with whether there is any data left to read.
  3. nextLine() ignores all this and doesn't work well with anything else in Scanner.
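The difference between "the next token matches" and "there is a next token" can be checked with a small sketch like this one. HasNextDemo and probe are illustrative names, not part of any API, and the input string is an excerpt chosen so that the first token is "bank,".

```java
import java.util.Arrays;
import java.util.Scanner;

public class HasNextDemo {
    // Returns {hasNext(regex), hasNext(), next()} for the first token of text.
    // Hypothetical helper; it assumes text contains at least one token.
    static String[] probe(String text, String regex) {
        try (Scanner s = new Scanner(text)) {
            boolean matchesPattern = s.hasNext(regex); // does the next token match?
            boolean anyTokenLeft = s.hasNext();        // is there a next token at all?
            String token = s.next();                   // the token both calls inspected
            return new String[] {
                String.valueOf(matchesPattern), String.valueOf(anyTokenLeft), token
            };
        }
    }

    public static void main(String[] args) {
        // prints [false, true, bank,]
        System.out.println(Arrays.toString(probe("bank, and of", "(?<!')[\\w']+")));
    }
}
```

Neither hasNext call consumes the token, which is why next() still returns "bank," afterwards.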

So, you're calling hasNext, and hasNext is faithfully reporting: well, the next token in line is "bank," (with the comma), and that doesn't match the regexp, so it returns false. Just as the docs say.

Solution

Forget hasX; you don't want those. You also never want nextLine. Scanner works best if you change the delimiter when the default one is no good (i.e. never invoke nextLine(); invoke useDelimiter("\r?\n") and next() instead) and call the .nextX() methods. That is all you ever need to do with it.

So, just invoke next(), check that it matches or not, and keep going.
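As a variation on the "change the delimiter" advice, one possible sketch is to make every run of non-word characters the delimiter, so that each token is already a word. The delimiter regex, the class name DelimiterDemo and the method words are assumptions for illustration, not something the answer spells out. The sketch also keeps a plain hasNext() (no pattern) purely as an end-of-input check, which is a small departure from the advice above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Splits text into words by treating anything that is not a word
    // character or an apostrophe as a delimiter (assumed definition of "word").
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(text)) {
            s.useDelimiter("[^\\w']+");
            while (s.hasNext()) {   // no pattern: only asks whether a token remains
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> found = words("on the bank, and of having nothing to do:  once");
        found.forEach(w -> System.out.println("word:" + w));
        System.out.println("Number of words: " + found.size());
    }
}
```

With this delimiter, punctuation never ends up inside a token, so no per-token regex check is needed. Hyphenated words such as "daisy-chain" are split into two tokens.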

huangapple
  • Posted on August 26, 2020, 02:05:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63584722.html