Retrieving the "outerxml" of an element (like innerxml, but including the element itself)

Question

I have a Go program that reads a very large XML file from an input stream (os.Stdin), so I can't process it all at once.

I want to extract all XML elements of a certain kind for post-processing.

I have no trouble identifying the elements to extract or getting their start and end elements. However, I'm not sure how to dump the whole element as a string, as opposed to only its inner XML.

For instance, imagine I have the following XML:

<a>
  <b somethingUseful="1">
    <c>Hello</c>
    <d>world</d>
  </b>
  <e>
    <foo/>
  </e>
  <!-- Imagine there were 1 billion lines in between -
       I need to stream this! -->
  <b somethingUseful="321">
    <c>Hello again</c>
  </b>
</a>

In this example, I want to output each of the <b> elements, from start to finish.

Using innerxml with DecodeElement, I'm able to get this far, in a streaming fashion:

Here comes a B:

    <c>Hello</c>
    <d>world</d>
  
Here comes a B:

    <c>Hello again</c>

So close, but it's missing the <b> tag (and its attributes) itself. I haven't been able to figure out how to take that last step without sacrificing the streaming nature of the decoding.
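For reference, the innerxml approach described above might look roughly like the following. This is only a sketch, not the code from the playground: the names inner, innerXML, collect and sample are made up for illustration, and the billion-line comment is left out of the sample.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sample is a shortened copy of the XML from the question.
const sample = `<a>
  <b somethingUseful="1">
    <c>Hello</c>
    <d>world</d>
  </b>
  <e>
    <foo/>
  </e>
  <b somethingUseful="321">
    <c>Hello again</c>
  </b>
</a>`

// inner (hypothetical name) receives the raw bytes found between an
// element's start tag and end tag, but not the tags themselves.
type inner struct {
	XML string `xml:",innerxml"`
}

// innerXML streams r token by token and calls emit with the inner XML
// of every element whose local name is name.
func innerXML(r io.Reader, name string, emit func(string)) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == name {
			var in inner
			// DecodeElement consumes everything up to the matching end tag.
			if err := dec.DecodeElement(&in, &se); err != nil {
				return err
			}
			emit(in.XML)
		}
	}
}

// collect is a convenience wrapper for small in-memory inputs.
func collect(doc, name string) ([]string, error) {
	var out []string
	err := innerXML(strings.NewReader(doc), name, func(s string) { out = append(out, s) })
	return out, err
}

func main() {
	got, err := collect(sample, "b")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range got {
		fmt.Println("Here comes a B:")
		fmt.Println(s)
	}
}
```

The start tag is already consumed by the time DecodeElement runs, which is why the <b> tag and its attributes never appear in the captured string.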

To be clear, the output that I desire is something like:

Here comes a B:
  <b somethingUseful="1">
    <c>Hello</c>
    <d>world</d>
  </b>
Here comes a B:
  <b somethingUseful="321">
    <c>Hello again</c>
  </b>

Here's a playground that demonstrates this example and how far I've got:

https://play.golang.org/p/XqJY_1pa9j

Answer 1

Score: 2

Inspired by @nothingmuch's usage of decoder.InputOffset, I use a TeeReader to split the input Reader in two: one copy is parsed by the decoder as usual, and the other goes into a buffer from which we output the exact element (the bytes that lie between the decoder.InputOffset values taken before and after the element is encountered).

To minimise memory usage, the buffer is trimmed continuously, but only up to the point that we know cannot be part of a match. We maintain offsets to keep track of this. The added complexity is necessary because the decoder can read bytes from the reader well ahead of the token at hand, so we need to be careful not to discard something we still need.

So the additional memory usage is only as much as:

  1. The two largest tokens that may be stored in the buffer simultaneously before it is trimmed back to one.
  2. The size of the actual element being output.

Here's an updated playground with the solution:

https://play.golang.org/p/H8WVDWI57r
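The idea can be sketched roughly as follows. This is a simplified illustration and not the playground code itself: the names outerXML, collect, bufStart and sample are assumptions, and it matches elements by local name only. Note that the decoder wraps the tee'd reader in its own read-ahead buffer, so the copy usually holds more bytes than the decoder has consumed; that is why trimming is driven by InputOffset rather than by the buffer length.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sample is a shortened copy of the XML from the question.
const sample = `<a>
  <b somethingUseful="1">
    <c>Hello</c>
    <d>world</d>
  </b>
  <e>
    <foo/>
  </e>
  <b somethingUseful="321">
    <c>Hello again</c>
  </b>
</a>`

// outerXML streams r and calls emit with the raw bytes of every element
// whose local name is name, including its own start and end tags.
func outerXML(r io.Reader, name string, emit func(string)) error {
	var buf bytes.Buffer // copy of the bytes read so far and not yet discarded
	var bufStart int64   // stream offset of the first byte still held in buf
	dec := xml.NewDecoder(io.TeeReader(r, &buf))

	depth := 0      // nesting depth inside a matching element; 0 means outside
	var start int64 // stream offset at which the current match begins
	for {
		before := dec.InputOffset() // offset where the next token starts
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth > 0 {
				depth++
			} else if t.Name.Local == name {
				depth, start = 1, before
			}
		case xml.EndElement:
			if depth > 0 {
				depth--
				if depth == 0 {
					end := dec.InputOffset() // just past the closing tag
					emit(string(buf.Bytes()[start-bufStart : end-bufStart]))
				}
			}
		}
		if depth == 0 {
			// Nothing before the current offset can belong to a later
			// match, so discard it. Read-ahead bytes stay in buf.
			off := dec.InputOffset()
			buf.Next(int(off - bufStart))
			bufStart = off
		}
	}
}

// collect is a convenience wrapper for small in-memory inputs.
func collect(doc, name string) ([]string, error) {
	var out []string
	err := outerXML(strings.NewReader(doc), name, func(s string) { out = append(out, s) })
	return out, err
}

func main() {
	got, err := collect(sample, "b")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range got {
		fmt.Println("Here comes a B:")
		fmt.Println(s)
	}
}
```

For real use, outerXML could be called with os.Stdin directly; collect only exists to keep the demonstration self-contained.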

Answer 2

Score: 1

A rather crude approach is to ask the decoder for its offset before the start element and after the end element, save those offsets, and then just read the bytes between them.

See this playground example, which fans out the reader into two pipes: one goes to the XML decoder, while the other is buffered and then used for extracting the byte ranges corresponding to XML elements.

The XML decoding routine then writes pairs of offsets on a channel, which another goroutine uses to skip or output regions of interest from the copy of the reader stream. This should probably be done more seriously than the hack job I did, for example by using a stack and matching filter criteria.

This solution assumes Seek/ReadAt are not viable. In retrospect, I probably overdid it there: it would be much simpler to just open the file twice, assuming it is a file.
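A minimal sketch of that simpler variant, assuming the input can be opened twice (or otherwise supports ReadAt): a goroutine decodes one view and sends offset pairs on a channel, and the caller reads each range from a second view. It replaces the buffered pipe with io.ReaderAt, so it is not the playground code, and span, findSpans, extract, collect and sample are hypothetical names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sample is a shortened copy of the XML from the question.
const sample = `<a>
  <b somethingUseful="1">
    <c>Hello</c>
  </b>
  <e><foo/></e>
  <b somethingUseful="321"/>
</a>`

// span is a half-open byte range [start, end) in the input.
type span struct{ start, end int64 }

// findSpans decodes r and sends the byte range of each outermost element
// named name on out. It always closes out before returning.
func findSpans(r io.Reader, name string, out chan<- span) error {
	defer close(out)
	dec := xml.NewDecoder(r)
	depth := 0 // acts as a minimal stack: nesting depth inside a match
	var start int64
	for {
		before := dec.InputOffset()
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth > 0 {
				depth++
			} else if t.Name.Local == name {
				depth, start = 1, before
			}
		case xml.EndElement:
			if depth > 0 {
				depth--
				if depth == 0 {
					out <- span{start, dec.InputOffset()}
				}
			}
		}
	}
}

// extract decodes one view of the input in a goroutine and reads every
// reported range from a second, randomly accessible view of the same data.
func extract(decodeFrom io.Reader, readFrom io.ReaderAt, name string, emit func(string)) error {
	spans := make(chan span)
	errc := make(chan error, 1)
	go func() { errc <- findSpans(decodeFrom, name, spans) }()

	var readErr error
	for s := range spans {
		if readErr != nil {
			continue // keep draining so the decoding goroutine can finish
		}
		b := make([]byte, s.end-s.start)
		if n, err := readFrom.ReadAt(b, s.start); n < len(b) {
			readErr = err
			continue
		}
		emit(string(b))
	}
	if err := <-errc; err != nil {
		return err
	}
	return readErr
}

// collect stands in for "opening the file twice" with two string readers.
func collect(doc, name string) ([]string, error) {
	var out []string
	err := extract(strings.NewReader(doc), strings.NewReader(doc), name,
		func(s string) { out = append(out, s) })
	return out, err
}

func main() {
	got, err := collect(sample, "b")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range got {
		fmt.Println("Here comes a B:")
		fmt.Println(s)
	}
}
```

With a real file, the two views could be two handles from os.Open on the same path, since *os.File implements both io.Reader and io.ReaderAt. This does not help when the input is a true stream such as a pipe on os.Stdin, which is the case the buffered approach was written for.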

huangapple
  • Posted on 2016-11-10 07:46:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/40517885.html