Extracting positional offset of *html.Node in Golang

Question

How can I extract the positional offset of a specific node in an already parsed HTML document? For example, for the document <div>Hello, <b>World!</b></div> I want to be able to determine that the offset of World! is 15:21. The document may be changed while parsing.

I have a solution that renders the whole document with special markers, but it performs really badly. Any ideas?

    package main

    import (
        "bytes"
        "errors"
        "log"
        "strings"

        "golang.org/x/net/html"
        "golang.org/x/net/html/atom"
    )

    // nodeIndexOffset returns the start and end byte offsets of node's text
    // within the rendered first child of context. It renders the tree twice
    // with temporary markers, which is why it is slow.
    func nodeIndexOffset(context, node *html.Node) (int, int, error) {
        if node.Type != html.TextNode {
            node = node.FirstChild
        }
        if node == nil {
            return 0, 0, errors.New("node has no text child")
        }
        originalData := node.Data
        defer func() { node.Data = originalData }() // always restore the text

        var buf bytes.Buffer
        node.Data = "|start|" + originalData
        if err := html.Render(&buf, context.FirstChild); err != nil {
            return 0, 0, err
        }
        start := strings.Index(buf.String(), "|start|")

        buf.Reset()
        node.Data = originalData + "|end|"
        if err := html.Render(&buf, context.FirstChild); err != nil {
            return 0, 0, err
        }
        end := strings.Index(buf.String(), "|end|")
        return start, end, nil
    }

    func main() {
        s := "<div>Hello, <b>World!</b></div>"
        context := html.Node{
            Type:     html.ElementNode,
            Data:     "body",
            DataAtom: atom.Body,
        }
        nodes, err := html.ParseFragment(strings.NewReader(s), &context)
        if err != nil {
            log.Fatal(err)
        }
        for _, node := range nodes {
            context.AppendChild(node)
        }
        // <div> -> "Hello, " -> <b> -> "World!"
        world := nodes[0].FirstChild.NextSibling.FirstChild
        log.Println("target", world)
        start, end, err := nodeIndexOffset(&context, world)
        if err != nil {
            log.Fatal(err)
        }
        log.Println(start, end) // the offsets wanted for this input are 15 and 21
    }

Answer 1

Score: 3

Not an answer, but too long for a comment. The following could work to some extent:

  • Use a Tokenizer and step through each element one by one.
  • Wrap your input into a custom reader which records lines and
    column offsets as the Tokenizer reads from it.
  • Query your custom reader for the position before and after calling Next()
    to record the approximate position information you need.

This is a bit painful and not too accurate but probably the best you could do.

Answer 2

Score: 1

I came up with a solution that extends the original html package (please correct me if there is another way to do it) with an additional custom.go file containing a new exported function. This function can access the unexported data field of the Tokenizer, which holds the start and end position of the current node. The positions have to be adjusted after each buffer read; see globalBufDif.

I don't really like having to fork the package just to access a couple of fields, but it seems this is the Go way.

    // custom.go: added to a fork of golang.org/x/net/html so that it can reach
    // the package's unexported parser and Tokenizer internals.
    package html

    import (
        "errors"
        "fmt"
        "io"

        a "golang.org/x/net/html/atom" // use the same atom import path as the rest of the fork
    )

    // parseWithIndexes runs the parser loop and records, per node, the
    // [start, end] offsets taken from the tokenizer's data span.
    func parseWithIndexes(p *parser) (map[*Node][2]int, error) {
        var (
            err          error
            globalBufDif int    // offset accumulated from earlier buffer reads
            prevEndBuf   int    // data.end seen in the previous iteration
            tokenIndex   [2]int // offsets of the most recently read token
        )
        tokenMap := make(map[*Node][2]int)
        // Iterate until EOF. Any other error will cause an early return.
        for err != io.EOF {
            // CDATA sections are allowed only in foreign content.
            n := p.oe.top()
            p.tokenizer.AllowCDATA(n != nil && n.Namespace != "")

            // Record the offsets of the token read in the previous iteration
            // against the last child of the current top node, if there is one.
            if t := p.top().LastChild; t != nil {
                tokenMap[t] = tokenIndex
            }

            // A smaller data.end than before is taken to mean the buffer was
            // read again, so later offsets are shifted by the previous end.
            if prevEndBuf > p.tokenizer.data.end {
                globalBufDif += prevEndBuf
            }
            prevEndBuf = p.tokenizer.data.end

            // Read and parse the next token.
            p.tokenizer.Next()
            tokenIndex = [2]int{
                p.tokenizer.data.start + globalBufDif,
                p.tokenizer.data.end + globalBufDif,
            }
            p.tok = p.tokenizer.Token()
            if p.tok.Type == ErrorToken {
                err = p.tokenizer.Err()
                if err != nil && err != io.EOF {
                    return tokenMap, err
                }
            }
            p.parseCurrentToken()
        }
        return tokenMap, nil
    }

    // ParseFragmentWithIndexes parses a fragment of HTML and returns the nodes
    // that were found, together with the offsets recorded for them. If the
    // fragment is the InnerHTML for an existing element, pass that element in
    // context.
    func ParseFragmentWithIndexes(r io.Reader, context *Node) ([]*Node, map[*Node][2]int, error) {
        contextTag := ""
        if context != nil {
            if context.Type != ElementNode {
                return nil, nil, errors.New("html: ParseFragment of non-element Node")
            }
            // The next check isn't just context.DataAtom.String() == context.Data because
            // it is valid to pass an element whose tag isn't a known atom. For example,
            // DataAtom == 0 and Data = "tagfromthefuture" is perfectly consistent.
            if context.DataAtom != a.Lookup([]byte(context.Data)) {
                return nil, nil, fmt.Errorf("html: inconsistent Node: DataAtom=%q, Data=%q", context.DataAtom, context.Data)
            }
            contextTag = context.DataAtom.String()
        }
        p := &parser{
            tokenizer: NewTokenizerFragment(r, contextTag),
            doc: &Node{
                Type: DocumentNode,
            },
            scripting: true,
            fragment:  true,
            context:   context,
        }

        root := &Node{
            Type:     ElementNode,
            DataAtom: a.Html,
            Data:     a.Html.String(),
        }
        p.doc.AppendChild(root)
        p.oe = nodeStack{root}
        p.resetInsertionMode()

        // Remember the nearest enclosing <form> element, if any.
        for n := context; n != nil; n = n.Parent {
            if n.Type == ElementNode && n.DataAtom == a.Form {
                p.form = n
                break
            }
        }

        tokenMap, err := parseWithIndexes(p)
        if err != nil {
            return nil, nil, err
        }

        parent := p.doc
        if context != nil {
            parent = root
        }

        // Detach the parsed nodes from the temporary parent and return them.
        var result []*Node
        for c := parent.FirstChild; c != nil; {
            next := c.NextSibling
            parent.RemoveChild(c)
            result = append(result, c)
            c = next
        }
        return result, tokenMap, nil
    }

huangapple
  • Posted on 2016-01-15 21:34:27
  • When reposting, please keep this link: https://go.coder-hub.com/34812279.html