2022-01-20 07:13:40 · go

How to get both the chardata and the value of the attributes of an XML tag when decoding it in Golang

Question

My XML file looks something like this:

<page>
    <title>Antoine Meillet</title>
    <ns>0</ns>
    <id>3</id>
    <revision>
      <id>178204512</id>
      <parentid>178097574</parentid>
      <timestamp>2020-12-30T10:12:14Z</timestamp>
      <contributor>
        <username>Rovo</username>
        <id>34820</id>
      </contributor>
      <minor />
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="11274" xml:space="preserve">
        a lot of text
      </text>
      <sha1>ikqy1f9ppwo8eo38a0hh817eynr40vg</sha1>
    </revision>
</page>

My goal is to drop most of those tags and keep only the page tag and three of its inner tags: title, id, and text.

So far, I have been able to extract the page tag with the correct values for title and id. This is what I get:

<page>
 <title>Antoine Meillet</title>
 <id>3</id>
 <text bytes="0" xml:space=""></text>
</page>
<page>
 <title>Algèbre linéaire</title>
 <id>7</id>
 <text bytes="0" xml:space=""></text>
</page>

As you can see, the problem is that the attributes of the text tag have the wrong values and the tag has no text in it.

This is the code I used:

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"os"
)

type Page struct {
	XMLName xml.Name `xml:"page"`
	Title   string   `xml:"title"`
	Id      int64    `xml:"id"`
	Text    struct {
		Key   float32 `xml:"bytes,attr"`
		Space string  `xml:"xml:space,attr"`
	} `xml:"text"`
}

func main() {
	frwikiXML, err := os.Open("frwiki10000.xml")
	if err != nil {
		fmt.Println(err)
	}
	cleanedWikiXML, err := os.Create("cleaned_fr_wiki.xml")
	if err != nil {
		fmt.Println(err)
	}
	cleanXMLEncoder := xml.NewEncoder(cleanedWikiXML)
	cleanXMLEncoder.Indent("", " ")
	frwikiDecoder := xml.NewDecoder(frwikiXML)
	for {
		t, tokenErr := frwikiDecoder.Token()
		if tokenErr != nil {
			if tokenErr == io.EOF {
				break
			}
			fmt.Errorf("decoding token: %w", tokenErr)
		}
		switch t := t.(type) {
		case xml.StartElement:
			if t.Name.Local == "page" {
				var page Page
				if err := frwikiDecoder.DecodeElement(&page, &t); err != nil {
					fmt.Errorf("decoding element %q: %v", t.Name.Local, err)
				}
				fmt.Println("Element was decoded successfully.")
				fmt.Printf("Page title: %v\n Page id: %d\n", page.Title, page.Id)
				fmt.Printf("Text: %v", page.Text)
				cleanXMLEncoder.Encode(page)
			}
		}
	}
	defer frwikiXML.Close()
	defer cleanedWikiXML.Close()
}

How can I solve this problem?

Thanks.

Answer 1

Score: 1

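To parse a huge XML file, use the standard library's streaming xml.Decoder.

Call Token to read tokens one by one. When a start element with the required name ("page") is found, call DecodeElement to decode that element into a struct, ready for further processing. Note that text is a child of revision, not of page, so the struct used for decoding has to mirror that nesting.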

// Page mirrors the input, where <text> is nested inside <revision>.
type Page struct {
	XMLName  xml.Name `xml:"page"`
	Title    string   `xml:"title"`
	Id       int64    `xml:"id"`
	Revision struct {
		Text struct {
			Key float32 `xml:"bytes,attr"`
			// xml:space is in the predefined XML namespace, so the tag gives
			// that namespace URI followed by the local name.
			Space string `xml:"http://www.w3.org/XML/1998/namespace space,attr"`
			// Value receives the character data inside <text>.
			Value string `xml:",chardata"`
		} `xml:"text"`
	} `xml:"revision"`
}

// PageTarget is the flattened output, with <text> directly under <page>.
// Its Text field needs exactly the same fields and tags as
// Page.Revision.Text so that the value can be assigned as-is.
type PageTarget struct {
	XMLName xml.Name `xml:"page"`
	Title   string   `xml:"title"`
	Id      int64    `xml:"id"`
	Text    struct {
		Key   float32 `xml:"bytes,attr"`
		Space string  `xml:"http://www.w3.org/XML/1998/namespace space,attr"`
		Value string  `xml:",chardata"`
	} `xml:"text"`
}

// sample is assumed to hold the XML input as a string.
dec := xml.NewDecoder(strings.NewReader(sample))
loop:
for {
	tok, err := dec.Token()
	switch {
	case err != nil && err != io.EOF:
		panic(err)
	case err == io.EOF:
		break loop
	case tok == nil:
		fmt.Println("token is nil")
	}
	switch se := tok.(type) {
	case xml.StartElement:
		if se.Name.Local == "page" {
			// Decode the whole <page> element, including its children.
			var page Page
			if err := dec.DecodeElement(&page, &se); err != nil {
				panic(err)
			}
			// Copy only the fields that should survive into the output shape.
			target := PageTarget{
				XMLName: page.XMLName,
				Id:      page.Id,
				Title:   page.Title,
				Text:    page.Revision.Text,
			}
			out, err := xml.MarshalIndent(target, " ", "  ")
			if err != nil {
				panic(err)
			}
			fmt.Println(string(out))
		}
	}
}
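
To see in isolation how one struct can hold both the attributes and the character data of a text element, here is a minimal sketch. The names Text and decodeText are made up for this illustration, and Bytes is declared as an int on the assumption that the bytes attribute is always a whole number. The chardata flag collects the element's text, and the xml:space attribute is matched through the predefined XML namespace URI.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Text is a hypothetical type for a single <text> element.
type Text struct {
	// Bytes is filled from the bytes="..." attribute.
	Bytes int `xml:"bytes,attr"`
	// Space is filled from xml:space="...". The decoder resolves the "xml"
	// prefix to this namespace URI, so the tag names the URI and then the
	// local attribute name.
	Space string `xml:"http://www.w3.org/XML/1998/namespace space,attr"`
	// Content is filled with the character data between the tags.
	Content string `xml:",chardata"`
}

// decodeText parses one <text> element from src.
func decodeText(src string) (Text, error) {
	var t Text
	err := xml.Unmarshal([]byte(src), &t)
	return t, err
}

func main() {
	t, err := decodeText(`<text bytes="11274" xml:space="preserve">a lot of text</text>`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("bytes=%d space=%q content=%q\n", t.Bytes, t.Space, t.Content)
}
```

The character data is kept as-is, including any leading or trailing whitespace, so trim it yourself if the indentation inside the element is not wanted.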

Answer 2

Score: 0

Simply decoding into a struct and then encoding it again will achieve your goal.

See this playground example: https://go.dev/play/p/69vjlve4P6p
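
The playground code is not reproduced here. As a rough sketch of what decoding into a struct and encoding again could look like, the example below reads one page and writes it back in flattened form. The names Text, inPage, outPage, and flatten are made up for this illustration. It assumes the `revision>text` path syntax of encoding/xml struct tags to reach the nested element, and an int for the bytes attribute.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Text holds the attributes and character data of <text> (hypothetical name).
type Text struct {
	Bytes   int    `xml:"bytes,attr"`
	Space   string `xml:"http://www.w3.org/XML/1998/namespace space,attr"`
	Content string `xml:",chardata"`
}

// inPage describes the input: <text> is read from inside <revision>.
type inPage struct {
	XMLName xml.Name `xml:"page"`
	Title   string   `xml:"title"`
	ID      int64    `xml:"id"`
	Text    Text     `xml:"revision>text"`
}

// outPage describes the output: <text> is written directly under <page>.
type outPage struct {
	XMLName xml.Name `xml:"page"`
	Title   string   `xml:"title"`
	ID      int64    `xml:"id"`
	Text    Text     `xml:"text"`
}

// flatten decodes one <page> element and re-encodes it with only
// title, id, and text.
func flatten(src []byte) ([]byte, error) {
	var in inPage
	if err := xml.Unmarshal(src, &in); err != nil {
		return nil, err
	}
	// The two structs differ only in their tags, so a plain conversion works.
	return xml.Marshal(outPage(in))
}

const sample = `<page>
  <title>Antoine Meillet</title>
  <ns>0</ns>
  <id>3</id>
  <revision>
    <id>178204512</id>
    <text bytes="11274" xml:space="preserve">a lot of text</text>
  </revision>
</page>`

func main() {
	out, err := flatten([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Elements with no matching field, such as ns or the id inside revision, are skipped during decoding, which is what filters them out of the result. For a file with many pages, the same structs can be passed to DecodeElement inside a Token loop instead of calling xml.Unmarshal on the whole input.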
