How can a scanner be implemented with a custom split function?

Question

I have a log file, and I need to parse each record in it using Go.
Each record begins with "#", and a record can span one or more lines:

# Line1
# Line2
Continued line2
Continued line2
# line3
.....

Here is my code so far (I'm a beginner):

f, _ := os.Open(mylog)
scanner := bufio.NewScanner(f)
var queryRec string

for scanner.Scan() {
    line := scanner.Text()

    if strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
        queryRec = line
    } else if !strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
        fmt.Println("There is a big problem!!!")
    } else if !strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
        queryRec += line
    } else if strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
        queryRec = line
    }
}

Thanks,

Answer 1

Score: 17

The Scanner type has a method called Split, which lets you pass a SplitFunc that determines how the scanner splits its input into tokens. The default SplitFunc is ScanLines, whose implementation you can read in the bufio source code. From there you can write your own SplitFunc that breaks up the reader's content according to your specific format.

func crunchSplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error) {

    // Return nothing if at end of file and no data passed
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }

    // Find the index of a newline followed by a pound sign.
    // Advance past the newline only, so the next record keeps its "#".
    if i := strings.Index(string(data), "\n#"); i >= 0 {
        return i + 1, data[0:i], nil
    }

    // If at end of file with data return the data
    if atEOF {
        return len(data), data, nil
    }

    // No complete record yet: return zero values to request more data
    return
}

You can see the full implementation of the example at https://play.golang.org/p/ecCYkTzme4. The documentation provides all the insight needed to implement something like this.
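Since the playground link may not always be reachable, here is a minimal, self-contained sketch of how the split function above could be wired into a scanner. It runs against the sample log from the question. The helper name readRecords is made up for this illustration, and trimming the trailing newline from the last record is an extra step that is not part of the original answer.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// crunchSplitFunc is the split function from the answer: it ends a token
// at every newline that is immediately followed by '#'.
func crunchSplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := strings.Index(string(data), "\n#"); i >= 0 {
		// Skip the newline only, so the next record still starts with '#'.
		return i + 1, data[0:i], nil
	}
	if atEOF {
		return len(data), data, nil
	}
	// Request more data.
	return 0, nil, nil
}

// readRecords is a hypothetical helper that returns one string per record.
func readRecords(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(crunchSplitFunc)

	var records []string
	for scanner.Scan() {
		// The final record keeps the file's trailing newline, so trim it here.
		records = append(records, strings.TrimRight(scanner.Text(), "\r\n"))
	}
	return records, scanner.Err()
}

func main() {
	input := "# Line1\n# Line2\nContinued line2\nContinued line2\n# line3\n"

	records, err := readRecords(input)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for i, r := range records {
		fmt.Printf("record %d: %q\n", i+1, r)
	}
}
```

To read from a real log file, pass the *os.File returned by os.Open to bufio.NewScanner instead of the strings.Reader, and check the error from os.Open rather than discarding it.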

Answer 2

Score: 10

This is a slightly optimized version of the solutions by Ben Campbell and sto-b-doo.

Converting a byte slice to a string on every call appears to be a fairly expensive operation.

In my log-processing app it became a bottleneck.

Keeping the data as bytes gave my app a performance boost of roughly 1500%.

func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {
    searchBytes := []byte(substring)
    searchLen := len(searchBytes)
    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        dataLen := len(data)

        // Return nothing if at end of file and no data passed
        if atEOF && dataLen == 0 {
            return 0, nil, nil
        }

        // Find next separator and return token
        if i := bytes.Index(data, searchBytes); i >= 0 {
            return i + searchLen, data[0:i], nil
        }

        // If we're at EOF, we have a final, non-terminated line. Return it.
        if atEOF {
            return dataLen, data, nil
        }

        // Request more data.
        return 0, nil, nil
    }
}
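One detail worth seeing in action: this version advances past the whole separator, so the separator is dropped from the output. The sketch below assumes a separator of "\n# " and a hypothetical helper named scanRecords. Only the first record still carries its "# " prefix, so the helper strips it to make all records uniform. It declares the return type as bufio.SplitFunc, which is the same function type written out in the answer.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// SplitAt mirrors the byte-based function above.
func SplitAt(substring string) bufio.SplitFunc {
	searchBytes := []byte(substring)
	searchLen := len(searchBytes)
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if atEOF && len(data) == 0 {
			return 0, nil, nil
		}
		if i := bytes.Index(data, searchBytes); i >= 0 {
			// The separator itself is skipped and never appears in a token.
			return i + searchLen, data[0:i], nil
		}
		if atEOF {
			return len(data), data, nil
		}
		// Request more data.
		return 0, nil, nil
	}
}

// scanRecords is a hypothetical helper that returns the records without
// their "# " marker.
func scanRecords(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(SplitAt("\n# "))

	var records []string
	for scanner.Scan() {
		rec := scanner.Text()
		if len(records) == 0 {
			// Only the first record is not preceded by the separator,
			// so it is the only one that still starts with "# ".
			rec = strings.TrimPrefix(rec, "# ")
		}
		records = append(records, strings.TrimRight(rec, "\n"))
	}
	return records, scanner.Err()
}

func main() {
	records, err := scanRecords("# Line1\n# Line2\nContinued line2\n# line3\n")
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for _, r := range records {
		fmt.Printf("%q\n", r)
	}
}
```

If the records should keep their "#" marker, the approach from Answer 1 (advancing past the newline only) avoids the special case for the first record.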

Answer 3

Score: 2

Ben Campbell's answer, wrapped in a function that returns a SplitFunc for a given substring:

Demo on play.golang.org

Suggestions for improvement are welcome.

// SplitAt returns a bufio.SplitFunc closure, splitting at a substring
// scanner.Split(SplitAt("\n# "))
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {

	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {

		// Return nothing if at end of file and no data passed
		if atEOF && len(data) == 0 {
			return 0, nil, nil
		}

		// Find the index of the input of the separator substring
		if i := strings.Index(string(data), substring); i >= 0 {
			return i + len(substring), data[0:i], nil
		}

		// If at end of file with data return the data
		if atEOF {
			return len(data), data, nil
		}

		return
	}
}
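One practical caveat when using scanner.Split(SplitAt("\n# ")) on multi-line records: by default a Scanner refuses tokens larger than bufio.MaxScanTokenSize (64 KiB) and stops with bufio.ErrTooLong. The sketch below shows how Scanner.Buffer can raise that limit. The helpers countRecords and bigInput and the 1 MiB limit are assumptions chosen for illustration only.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// SplitAt mirrors the function above.
func SplitAt(substring string) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if atEOF && len(data) == 0 {
			return 0, nil, nil
		}
		if i := strings.Index(string(data), substring); i >= 0 {
			return i + len(substring), data[0:i], nil
		}
		if atEOF {
			return len(data), data, nil
		}
		return 0, nil, nil
	}
}

// bigInput is a hypothetical helper: one short record followed by one
// record of roughly 100 KiB.
func bigInput() string {
	return "# short\n# " + strings.Repeat("x", 100*1024) + "\n"
}

// countRecords is a hypothetical helper that counts records. A
// maxTokenSize of 0 keeps the scanner's default limit.
func countRecords(input string, maxTokenSize int) (int, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	if maxTokenSize > 0 {
		// Buffer must be called before the first call to Scan.
		scanner.Buffer(make([]byte, 0, 64*1024), maxTokenSize)
	}
	scanner.Split(SplitAt("\n# "))

	n := 0
	for scanner.Scan() {
		n++
	}
	return n, scanner.Err()
}

func main() {
	// With the default limit the large record cannot be buffered.
	_, err := countRecords(bigInput(), 0)
	fmt.Println("default limit, too long:", errors.Is(err, bufio.ErrTooLong))

	// With a 1 MiB limit both records are scanned.
	n, err := countRecords(bigInput(), 1<<20)
	fmt.Println("raised limit:", n, err)
}
```

Whichever limit is chosen, checking scanner.Err() after the loop is what reveals a silently truncated scan.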

Answer 4

Score: 0

Hopefully an improvement (perhaps in readability) over stu0292's version.
It also uses the final-token signal, bufio.ErrFinalToken.

// SplitAt returns a bufio.SplitFunc closure, splitting at a substring
// scanner.Split(SplitAt("\n#"))
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {

    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {

        // Find the index of the separator substring in the input
        if i := strings.Index(string(data), substring); i >= 0 {
            return i + len(substring), data[0:i], nil
        }

        // No separator yet: request more data unless the input is exhausted
        if !atEOF {
            return 0, nil, nil
        }

        // At EOF with nothing left, stop without emitting an empty token
        if len(data) == 0 {
            return 0, nil, nil
        }

        // Deliver the remaining data as the final token
        return len(data), data, bufio.ErrFinalToken
    }
}
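To illustrate what the final-token signal does: when a split function returns bufio.ErrFinalToken, the scanner delivers the accompanying token and then stops, and Err() still reports nil. The sketch below uses that to stop at a sentinel record even though more input follows. The function SplitUntil, the helper scanUntil, and the "END" marker are all hypothetical and not part of the answer above.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// SplitUntil is a hypothetical variant of SplitAt: it splits at sep, and
// when a token equals stop it delivers that token as the last one.
func SplitUntil(sep, stop string) bufio.SplitFunc {
	sepBytes := []byte(sep)
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if i := bytes.Index(data, sepBytes); i >= 0 {
			if string(data[:i]) == stop {
				// Deliver the sentinel, then end the scan early.
				return i + len(sepBytes), data[:i], bufio.ErrFinalToken
			}
			return i + len(sepBytes), data[:i], nil
		}
		if !atEOF {
			// Request more data.
			return 0, nil, nil
		}
		if len(data) == 0 {
			// Nothing left: stop without emitting an empty token.
			return 0, nil, nil
		}
		// Deliver the remainder as the final token.
		return len(data), data, bufio.ErrFinalToken
	}
}

// scanUntil is a hypothetical helper that collects the scanned tokens.
func scanUntil(input, sep, stop string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(SplitUntil(sep, stop))

	var records []string
	for scanner.Scan() {
		records = append(records, scanner.Text())
	}
	return records, scanner.Err()
}

func main() {
	// The record after "END" is never delivered.
	records, err := scanUntil("# a\n# END\n# ignored\n", "\n# ", "END")
	fmt.Printf("%q %v\n", records, err)
}
```

For simply returning the last chunk at end of input, returning it with a nil error works as well, as the earlier answers do. The sentinel is mainly useful when scanning should end before the input does.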

huangapple
  • Posted on 2015-10-12 02:34:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/33068644.html