Read lines from a file with variable line endings in Go

Question


How can I read lines from a file whose line endings are carriage return (CR), line feed (LF), or both (CRLF)?

The PDF specification allows lines to end with CR, LF, or CRLF.

  • bufio.Reader.ReadString() and bufio.Reader.ReadBytes() allow a single delimiter byte.

  • bufio.Scanner.Scan() handles \n optionally preceded by \r, but not a lone \r.
    > The end-of-line marker is one optional carriage return followed by one mandatory newline.

Do I need to write my own function that uses bufio.Reader.ReadByte()?

Answer 1

Score: 5


You can write a custom bufio.SplitFunc for bufio.Scanner. For example:

// Mostly the code of bufio.ScanLines; needs the "bytes" package.
func ScanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			// We have a line terminated by single newline.
			return i + 1, data[0:i], nil
		}
		// The line ends with a CR; also consume an LF that directly follows it.
		advance = i + 1
		if len(data) > i+1 && data[i+1] == '\n' {
			advance += 1
		}
		return advance, data[0:i], nil
	}
	// If we're at EOF, we have a final, non-terminated line. Return it.
	if atEOF {
		return len(data), data, nil
	}
	// Request more data.
	return 0, nil, nil
}

And use it like:

scan := bufio.NewScanner(r)
scan.Split(ScanPDFLines)

Answer 2

Score: 0


While reading an older Mac-generated file with only CR line endings, I ran into a regression in one edge case: if a CRLF pair is split across the buffer boundary, the accepted answer treats the CR and the LF as separate line terminators. You need to exit early and request more data if the buffer ends with a CR. This seems to solve it.

func scanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			// We have a line terminated by single newline.
			return i + 1, data[0:i], nil
		}
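		// The CR is the last byte in the buffer and more input may follow:
		// request more data so a CRLF split across reads stays one terminator.
		if !atEOF && len(data) == i+1 {
			return 0, nil, nil
		}
		// Consume the CR, plus an LF that directly follows it.
		advance = i + 1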
		if len(data) > i+1 && data[i+1] == '\n' {
			advance += 1
		}
		return advance, data[0:i], nil
	}
	// If we're at EOF, we have a final, non-terminated line. Return it.
	if atEOF {
		return len(data), data, nil
	}
	// Request more data.
	return 0, nil, nil
}

huangapple
  • Posted on 2017-01-03 05:20:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/41433422.html