2016-01-15 21:34:27 · go

Extracting positional offset of *html.Node in Golang

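Question

How can I extract the positional offset of a specific node in an already parsed HTML document? For example, for the document <div>Hello, <b>World!</b></div>, I want to know that the offset of World! is 15:21. The document may be changed while parsing.

I have a solution that renders the whole document with special markers, but it is really bad for performance. Any ideas?

Here is my current Go code for extracting the offsets: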

package main

import (
	"bytes"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
	"log"
	"strings"
)

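// nodeIndexOffset returns the start and end byte offsets of node's text in
// the rendered output of context's first child. It renders the tree twice,
// each time with a marker spliced into the text, which is why it is slow.
func nodeIndexOffset(context *html.Node, node *html.Node) (int, int) {
	// For a non-text node, measure its first child instead.
	if node.Type != html.TextNode {
		node = node.FirstChild
	}
	originalData := node.Data

	// First pass: put a marker in front of the text and find it.
	var buf bytes.Buffer
	node.Data = "|start|" + originalData
	_ = html.Render(&buf, context.FirstChild)
	start := strings.Index(buf.String(), "|start|")

	// Second pass: put a marker behind the text and find it.
	buf = bytes.Buffer{}
	node.Data = originalData + "|end|"
	_ = html.Render(&buf, context.FirstChild)
	end := strings.Index(buf.String(), "|end|")

	// Restore the node so the tree is left unchanged.
	node.Data = originalData
	return start, end
}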

func main() {
	s := "<div>Hello, <b>World!</b></div>"
	var context html.Node
	context = html.Node{
		Type:     html.ElementNode,
		Data:     "body",
		DataAtom: atom.Body,
	}
	nodes, err := html.ParseFragment(strings.NewReader(s), &context)
	if err != nil {
		log.Fatal(err)
	}
	for _, node := range nodes {
		context.AppendChild(node)
	}
	world := nodes[0].FirstChild.NextSibling.FirstChild
	log.Println("target", world)
	log.Println(nodeIndexOffset(&context, world))
}

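This code parses the fragment with the golang.org/x/net/html package; nodeIndexOffset then finds the offsets by rendering the tree twice, each time with a marker inserted into the target node's text.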

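Answer 1

Score: 3

Not an answer, but too long for a comment. The following could work to some extent:

Use a Tokenizer and step through each element one by one.
Wrap your input in a custom reader that records line and column offsets as the Tokenizer reads from it.
Query your custom reader for the position before and after calling Next() to record the approximate position information you need.

This is a bit painful and not very accurate, but it is probably the best you can do.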

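Answer 2

Score: 1

I came up with a solution where we extend the original HTML package with an additional custom.go file containing a new exported function (please correct me if there is another way to do it). This function can access the unexported data field of the Tokenizer, which holds the start and end positions of the current token in the tokenizer's buffer. We have to adjust the positions after each buffer read; see globalBufDif.

I don't really like having to fork the package just to access a couple of fields, but this seems to be the Go way.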

func parseWithIndexes(p *parser) (map[*Node][2]int, error) {
    // Iterate until EOF. Any other error will cause an early return.
    var err error
    var globalBufDif int
    var prevEndBuf int
    var tokenIndex [2]int
    tokenMap := make(map[*Node][2]int)
    for err != io.EOF {
        // CDATA sections are allowed only in foreign content.
        n := p.oe.top()
        p.tokenizer.AllowCDATA(n != nil && n.Namespace != "")

        t := p.top().FirstChild
        for {
            if t != nil && t.NextSibling != nil {
                t = t.NextSibling
            } else {
                break
            }
        }
        tokenMap[t] = tokenIndex
        if prevEndBuf > p.tokenizer.data.end {
            globalBufDif += prevEndBuf
        }
        prevEndBuf = p.tokenizer.data.end
        // Read and parse the next token.
        p.tokenizer.Next()
        tokenIndex = [2]int{p.tokenizer.data.start + globalBufDif, p.tokenizer.data.end + globalBufDif}

        p.tok = p.tokenizer.Token()
        if p.tok.Type == ErrorToken {
            err = p.tokenizer.Err()
            if err != nil && err != io.EOF {
                return tokenMap, err
            }
        }
        p.parseCurrentToken()
    }
    return tokenMap, nil
}

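// ParseFragmentWithIndexes parses a fragment of HTML and returns the nodes
// that were found. If the fragment is the InnerHTML for an existing element,
// pass that element in context.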
func ParseFragmentWithIndexes(r io.Reader, context *Node) ([]*Node, map[*Node][2]int, error) {
    contextTag := ""
    if context != nil {
        if context.Type != ElementNode {
            return nil, nil, errors.New("html: ParseFragment of non-element Node")
        }
        // The next check isn't just context.DataAtom.String() == context.Data because
        // it is valid to pass an element whose tag isn't a known atom. For example,
        // DataAtom == 0 and Data = "tagfromthefuture" is perfectly consistent.
        if context.DataAtom != a.Lookup([]byte(context.Data)) {
            return nil, nil, fmt.Errorf("html: inconsistent Node: DataAtom=%q, Data=%q", context.DataAtom, context.Data)
        }
        contextTag = context.DataAtom.String()
    }
    p := &parser{
        tokenizer: NewTokenizerFragment(r, contextTag),
        doc: &Node{
            Type: DocumentNode,
        },
        scripting: true,
        fragment:  true,
        context:   context,
    }

    root := &Node{
        Type:     ElementNode,
        DataAtom: a.Html,
        Data:     a.Html.String(),
    }
    p.doc.AppendChild(root)
    p.oe = nodeStack{root}
    p.resetInsertionMode()

    for n := context; n != nil; n = n.Parent {
        if n.Type == ElementNode && n.DataAtom == a.Form {
            p.form = n
            break
        }
    }

    tokenMap, err := parseWithIndexes(p)
    if err != nil {
        return nil, nil, err
    }

    parent := p.doc
    if context != nil {
        parent = root
    }

    var result []*Node
    for c := parent.FirstChild; c != nil; {
        next := c.NextSibling
        parent.RemoveChild(c)
        result = append(result, c)
        c = next
    }
    return result, tokenMap, nil
}
