Question

We need to parse a huge XML file using Go. We'd like to use a SAX-like, event-based algorithm built on the xml.NewDecoder() and decoder.Token() library calls. We've created the appropriate struct types with XML struct tags. Everything easy peasy so far.

Now, we go through the file and detect the xml.StartElement tokens. And here comes the problem. We need to decode ONLY the attributes of this start token and then continue into its content. If we call decoder.DecodeElement(), the whole content is "decoded" or skipped in our scenario.

How can we decode only the attributes of a specific StartElement and then continue into the element's body?

Answer 1

Score: 2

I parse Wikipedia XML dumps (~50GB XML files) in go-wikiparse using plain struct/reflect decoding. It's super simple.

The strategy is basically this:

First, read the envelope token:

d := xml.NewDecoder(r)
_, err := d.Token()
if err != nil {
    return nil, err
}

E.g., for <someDocument><billions-of-other-things/></someDocument>, that call consumes the opening someDocument token.

Then, you can just decode each following element into a struct in a loop:

var i item
d.Decode(&i)

Not much RAM, and it's super easy to parse.
