Extracting positional offset of *html.Node in Golang

Question

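How can I extract the positional offset of a specific node in an already parsed HTML document? For example, for the document <div>Hello, <b>World!</b></div>, I want to know that the offset of World! is 15:21. The document may be changed while parsing.

I have a solution that renders the whole document with special markers, but it performs very badly. Any ideas?

This is the current marker-based code: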

package main

import (
	"bytes"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
	"log"
	"strings"
)

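// nodeIndexOffset returns the start and end byte offsets of node's text
// within the HTML rendered from context.FirstChild. The offsets refer to the
// re-rendered output, not to the original input.
func nodeIndexOffset(context *html.Node, node *html.Node) (int, int) {
	// For a non-text node, use its first child (assumed to be a text node).
	if node.Type != html.TextNode {
		node = node.FirstChild
	}
	originalData := node.Data

	// First render: prefix the text with a marker and locate the marker.
	var buf bytes.Buffer
	node.Data = "|start|" + originalData
	_ = html.Render(&buf, context.FirstChild)
	start := strings.Index(buf.String(), "|start|")

	// Second render: suffix the text with a marker and locate the marker.
	buf.Reset()
	node.Data = originalData + "|end|"
	_ = html.Render(&buf, context.FirstChild)
	end := strings.Index(buf.String(), "|end|")

	// Restore the node's text before returning.
	node.Data = originalData
	return start, end
}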

func main() {
	s := "<div>Hello, <b>World!</b></div>"
	// The fragment is parsed in the context of a <body> element.
	context := html.Node{
		Type:     html.ElementNode,
		Data:     "body",
		DataAtom: atom.Body,
	}
	nodes, err := html.ParseFragment(strings.NewReader(s), &context)
	if err != nil {
		log.Fatal(err)
	}
	for _, node := range nodes {
		context.AppendChild(node)
	}
	world := nodes[0].FirstChild.NextSibling.FirstChild
	log.Println("target", world)
	log.Println(nodeIndexOffset(&context, world))
}


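Answer 1

Score: 3

Not an answer, but too long for a comment. The following could work to some extent:

  • Use a Tokenizer and step through the elements one by one.
  • Wrap your input in a custom reader that records line and column offsets as the Tokenizer reads from it.
  • Query your custom reader for the position before and after calling Next() to record the approximate position information you need.

This is a bit painful and not very accurate, but it is probably the best you can do.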


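Answer 2

Score: 1

I came up with a solution that extends the original html package (please correct me if there is another way to do it) with an additional custom.go file containing a new exported function. This function can access the unexported data field of the Tokenizer, which holds the exact start and end position of the current Node. The positions have to be adjusted after each buffer read; see globalBufDif.

I don't really like having to fork the package just to access a couple of fields, but it seems this is the Go way.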

func parseWithIndexes(p *parser) (map[*Node][2]int, error) {
    // Iterate until EOF. Any other error will cause an early return.
    var err error
    var globalBufDif int
    var prevEndBuf int
    var tokenIndex [2]int
    tokenMap := make(map[*Node][2]int)
    for err != io.EOF {
        // CDATA sections are allowed only in foreign content.
        n := p.oe.top()
        p.tokenizer.AllowCDATA(n != nil && n.Namespace != "")

        // The most recently added child of the current node gets the
        // offsets of the token that was read last.
        t := p.top().LastChild
        tokenMap[t] = tokenIndex
        if prevEndBuf > p.tokenizer.data.end {
            globalBufDif += prevEndBuf
        }
        prevEndBuf = p.tokenizer.data.end
        // Read and parse the next token.
        p.tokenizer.Next()
        tokenIndex = [2]int{p.tokenizer.data.start + globalBufDif, p.tokenizer.data.end + globalBufDif}

        p.tok = p.tokenizer.Token()
        if p.tok.Type == ErrorToken {
            err = p.tokenizer.Err()
            if err != nil && err != io.EOF {
                return tokenMap, err
            }
        }
        p.parseCurrentToken()
    }
    return tokenMap, nil
}

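// ParseFragmentWithIndexes parses a fragment of HTML and returns the nodes
// that were found. If the fragment is the InnerHTML for an existing element,
// pass that element in context.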
func ParseFragmentWithIndexes(r io.Reader, context *Node) ([]*Node, map[*Node][2]int, error) {
    contextTag := ""
    if context != nil {
        if context.Type != ElementNode {
            return nil, nil, errors.New("html: ParseFragment of non-element Node")
        }
        // The next check isn't just context.DataAtom.String() == context.Data because
        // it is valid to pass an element whose tag isn't a known atom. For example,
        // DataAtom == 0 and Data = "tagfromthefuture" is perfectly consistent.
        if context.DataAtom != a.Lookup([]byte(context.Data)) {
            return nil, nil, fmt.Errorf("html: inconsistent Node: DataAtom=%q, Data=%q", context.DataAtom, context.Data)
        }
        contextTag = context.DataAtom.String()
    }
    p := &parser{
        tokenizer: NewTokenizerFragment(r, contextTag),
        doc: &Node{
            Type: DocumentNode,
        },
        scripting: true,
        fragment:  true,
        context:   context,
    }

    root := &Node{
        Type:     ElementNode,
        DataAtom: a.Html,
        Data:     a.Html.String(),
    }
    p.doc.AppendChild(root)
    p.oe = nodeStack{root}
    p.resetInsertionMode()

    for n := context; n != nil; n = n.Parent {
        if n.Type == ElementNode && n.DataAtom == a.Form {
            p.form = n
            break
        }
    }

    tokenMap, err := parseWithIndexes(p)
    if err != nil {
        return nil, nil, err
    }

    parent := p.doc
    if context != nil {
        parent = root
    }

    var result []*Node
    for c := parent.FirstChild; c != nil; {
        next := c.NextSibling
        parent.RemoveChild(c)
        result = append(result, c)
        c = next
    }
    return result, tokenMap, nil
}

huangapple
  • Published on 2016-01-15 21:34:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/34812279.html