How to read a file starting from a specific line number using Scanner?


Question


I am new to Go and I am trying to write a simple script that reads a file line by line. I also want to save the progress (i.e. the last line number that was read) on the filesystem somewhere so that if the same file was given as the input to the script again, it starts reading the file from the line where it left off. Following is what I have started off with.

package main

// Package Imports
import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"os"
)

// Variable Declaration
var (
	ConfigFile = flag.String("configfile", "../config.json", "Path to json configuration file.")
)

// The main function that reads the file and parses the log entries
func main() {
	flag.Parse()
	settings := NewConfig(*ConfigFile)

	inputFile, err := os.Open(settings.Source)
	if err != nil {
		log.Fatal(err)
	}
	defer inputFile.Close()

	scanner := bufio.NewScanner(inputFile)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}

	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

// Saves the current progress
func SaveProgress() {

}

// Get the line count from the progress to make sure
func GetCounter() {

}

I could not find any methods on bufio.Scanner that deal with line numbers. I know I can declare an integer, say counter := 0, and increment it each time a line is read with counter++. But how do I tell the scanner to start from a specific line the next time? For example, if I read up to line 30, how can I make the scanner start reading from line 31 the next time I run the script with the same input file?

Update

One solution I can think of here is to use the counter as I stated above and use an if condition like the following.

	scanner := bufio.NewScanner(inputFile)
	for scanner.Scan() {
		counter++ // number of the line just read
		if counter > progress {
			fmt.Println(scanner.Text())
		}
	}

I am pretty sure something like this would work, but it is still going to loop over the lines that we have already read. Please suggest a better way.

Answer 1

Score: 29


If you want to skip the lines you read previously without reading them again, you need to record the byte position where you left off.

The different solutions are presented in the form of a function which takes the input to read from and the start position (byte position) to start reading lines from, e.g.:

func solution(input io.ReadSeeker, start int64) error

A special io.Reader input is used which also implements io.Seeker, the common interface which allows skipping data without having to read it. *os.File implements this, so you can pass a *File to these functions. The "merged" interface of both io.Reader and io.Seeker is io.ReadSeeker.

If you want a clean start (to start reading from the beginning of the file), simply pass start = 0. If you want to resume previous processing, pass the byte position where the last run was stopped or aborted. This position is the value of the pos local variable in the functions (solutions) below.

All the examples below with their testing code can be found on the Go Playground.

1. With bufio.Scanner

bufio.Scanner does not maintain the position, but we can very easily extend it to maintain the position (the read bytes), so when we want to restart next, we can seek to this position.

To do this with minimal effort, we can use a new split function which splits the input into tokens (lines). We can use Scanner.Split() to set the split function (the logic that decides where the boundaries of tokens/lines are). The default split function is bufio.ScanLines().

Let's take a look at the split function declaration: bufio.SplitFunc

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

It returns the number of bytes to advance: advance. Exactly what we need to maintain the file position. So we can create a new split function using the builtin bufio.ScanLines(), so we don't even have to implement its logic, just use the advance return value to maintain position:

func withScanner(input io.ReadSeeker, start int64) error {
	fmt.Println("--SCANNER, start:", start)
	if _, err := input.Seek(start, 0); err != nil {
		return err
	}
	scanner := bufio.NewScanner(input)

	pos := start
	scanLines := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		advance, token, err = bufio.ScanLines(data, atEOF)
		pos += int64(advance)
		return
	}
	scanner.Split(scanLines)

	for scanner.Scan() {
		fmt.Printf("Pos: %d, Scanned: %s\n", pos, scanner.Text())
	}
	return scanner.Err()
}

2. With bufio.Reader

In this solution we use the bufio.Reader type instead of the Scanner. bufio.Reader already has a ReadBytes() method which is very similar to the "read a line" functionality if we pass the '\n' byte as the delimiter.

This solution is similar to JimB's, with the addition of handling all valid line terminator sequences and also stripping them off from the read line (they are rarely needed); in regular expression notation, the terminator is \r?\n.

func withReader(input io.ReadSeeker, start int64) error {
	fmt.Println("--READER, start:", start)
	if _, err := input.Seek(start, 0); err != nil {
		return err
	}

	r := bufio.NewReader(input)
	pos := start
	for {
		data, err := r.ReadBytes('\n')
		pos += int64(len(data))
		if err == nil || err == io.EOF {
			if len(data) > 0 && data[len(data)-1] == '\n' {
				data = data[:len(data)-1]
			}
			if len(data) > 0 && data[len(data)-1] == '\r' {
				data = data[:len(data)-1]
			}
			fmt.Printf("Pos: %d, Read: %s\n", pos, data)
		}
		if err != nil {
			if err != io.EOF {
				return err
			}
			break
		}
	}
	return nil
}

Note: If the content ends with an empty line (line terminator), this solution will process an empty line. If you don't want this, you can simply check it like this:

if len(data) != 0 {
    fmt.Printf("Pos: %d, Read: %s\n", pos, data)
} else {
    // Last line is empty, omit it
}

Testing the solutions:

The testing code simply uses the content "first\r\nsecond\nthird\nfourth", which contains multiple lines with varying line terminators. We will use strings.NewReader() to obtain an io.ReadSeeker whose source is a string.

The test code first calls withScanner() and withReader() with a start position of 0: a clean start. In the next round we pass a start position of start = 14, which is the position of the third line, so we won't see the first 2 lines processed (printed): a simulated resume.

func main() {
	const content = "first\r\nsecond\nthird\nfourth"

	if err := withScanner(strings.NewReader(content), 0); err != nil {
		fmt.Println("Scanner error:", err)
	}
	if err := withReader(strings.NewReader(content), 0); err != nil {
		fmt.Println("Reader error:", err)
	}

	if err := withScanner(strings.NewReader(content), 14); err != nil {
		fmt.Println("Scanner error:", err)
	}
	if err := withReader(strings.NewReader(content), 14); err != nil {
		fmt.Println("Reader error:", err)
	}
}

Output:

--SCANNER, start: 0
Pos: 7, Scanned: first
Pos: 14, Scanned: second
Pos: 20, Scanned: third
Pos: 26, Scanned: fourth
--READER, start: 0
Pos: 7, Read: first
Pos: 14, Read: second
Pos: 20, Read: third
Pos: 26, Read: fourth
--SCANNER, start: 14
Pos: 20, Scanned: third
Pos: 26, Scanned: fourth
--READER, start: 14
Pos: 20, Read: third
Pos: 26, Read: fourth

Try the solutions and testing code on the Go Playground.

Answer 2

Score: 4


Instead of using a Scanner, use a bufio.Reader, specifically the ReadBytes or ReadString methods. This way you can read up to each line termination, and still receive the full line with line endings.

r := bufio.NewReader(inputFile)

var (
	line []byte
	err  error
)
fPos := int64(0) // or saved position

for i := 1; ; i++ {
	line, err = r.ReadBytes('\n')
	fmt.Printf("[line:%d pos:%d] %q\n", i, fPos, line)

	if err != nil {
		break
	}
	fPos += int64(len(line)) // offset of the next line
}

if err != io.EOF {
	log.Fatal(err)
}

You can store the combination of file position and line number however you choose, and the next time you start, you use inputFile.Seek(fPos, io.SeekStart) (os.SEEK_SET in older code, now deprecated) to move to where you left off.

Answer 3

Score: 3


If you want to use Scanner, you have to go through the beginning of the file until you have passed GetCounter() end-of-line symbols.

scanner := bufio.NewScanner(inputFile)
// context line above

// skip first GetCounter() lines
for i := 0; i < GetCounter(); i++ {
	scanner.Scan()
}

// context line below
for scanner.Scan() {
	fmt.Println(scanner.Text())
}

Alternatively, you could store the offset instead of the line number in the counter. Remember, though, that the termination token is stripped when using Scanner, and for a new line the token is \r?\n (regexp notation), so it isn't clear whether you should add 1 or 2 to the text length:

// Not clear how to store offset unless custom SplitFunc provided
inputFile.Seek(GetCounter(), 0)
scanner := bufio.NewScanner(inputFile)

So it is better to use the previous solution, or not to use Scanner at all.

Answer 4

Score: 2


The other answers contain a lot of words but not really reusable code, so here is a reusable function that seeks to the given line number and returns that line together with the offset where the line starts.

func SeekToLine(r io.Reader, lineNo int) (line []byte, offset int, err error) {
	s := bufio.NewScanner(r)

	var pos int

	s.Split(func(data []byte, atEof bool) (advance int, token []byte, err error) {
		advance, token, err = bufio.ScanLines(data, atEof)
		pos += advance
		return advance, token, err
	})

	for i := 0; i < lineNo; i++ {
		offset = pos

		if !s.Scan() {
			return nil, 0, io.EOF
		}
	}

	// offset holds the position recorded before the last Scan, i.e. the line's start.
	return s.Bytes(), offset, nil
}

huangapple
  • Posted on January 7, 2016 at 19:54:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/34654514.html