encoding/xml Unmarshal on dynamically structured elements

Question

I'm working with EPUB files in Go, and I need to fetch the cover image from the cover.xhtml file (or whichever file the .opf file names).

My problem is the dynamic structure of the elements in cover.xhtml.

Each EPUB structures its cover.xhtml differently. For example:

<body>
    <figure id="cover-image">
        <img src="covers/9781449328030_lrg.jpg" alt="First Edition" />
    </figure>
</body>

Another EPUB's cover.xhtml file:

<body>
    <div>
        <img src="@public@vhost@g@gutenberg@html@files@54869@54869-h@images@cover.jpg" alt="Cover" />
    </div>
</body>

I need to get the src attribute of the img tag from this file, but I haven't been able to.

Here is the part of my code that unmarshals the cover.xhtml file:

type CPSRCS struct {
    Src string `xml:"src,attr"`
}

type CPIMGS struct {
    Image CPSRCS `xml:"img"`
}

XMLContent, err = ioutil.ReadFile("./uploads/moby-dick/OPS/cover.xhtml")
CheckError(err)

coverFile := CPIMGS{}
err = xml.Unmarshal(XMLContent, &coverFile)
CheckError(err)
fmt.Println(coverFile)

The output is:

{{}}

The output I'm expecting is:

{{covers/9781449328030_lrg.jpg}}

Thanks in advance!

Answer 1

Score: 1

This pulls the img element out of the file contents and then unmarshals the src attribute from that element alone. It assumes you only ever need the first img element in the file, and that the element is written in self-closing form (ending in />).

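XMLContent, err = ioutil.ReadFile("./uploads/moby-dick/OPS/cover.xhtml")
CheckError(err)

// Slice the file contents down to just the first img element.
strContent := string(XMLContent)
imgLoc := strings.Index(strContent, "<img")
if imgLoc == -1 {
    // strings.Index returns -1 when nothing matches; slicing with it would panic obscurely.
    panic("no <img element found in cover file")
}
prefixRem := strContent[imgLoc:]
endImgLoc := strings.Index(prefixRem, "/>")
if endImgLoc == -1 {
    panic("img element is not self-closing")
}
// Add 2 so the slice keeps the closing '/>'.
trimmed := prefixRem[:endImgLoc+2]

// The img element is now the root, so CPSRCS can read its src attribute directly.
var coverFile CPSRCS
err = xml.Unmarshal([]byte(trimmed), &coverFile)
CheckError(err)
fmt.Println(coverFile)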

This produces {covers/9781449328030_lrg.jpg} for the first input file and {@public@vhost@g@gutenberg@html@files@54869@54869-h@images@cover.jpg} for the second input file you provided.

huangapple
  • Posted on June 15, 2017 at 19:25:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44566297.html