How to get both the chardata and the value of the attributes of an XML tag when decoding it in Golang

Question

My XML file looks something like this:

    <page>
      <title>Antoine Meillet</title>
      <ns>0</ns>
      <id>3</id>
      <revision>
        <id>178204512</id>
        <parentid>178097574</parentid>
        <timestamp>2020-12-30T10:12:14Z</timestamp>
        <contributor>
          <username>Rovo</username>
          <id>34820</id>
        </contributor>
        <minor />
        <model>wikitext</model>
        <format>text/x-wiki</format>
        <text bytes="11274" xml:space="preserve">
          a lot of text
        </text>
        <sha1>ikqy1f9ppwo8eo38a0hh817eynr40vg</sha1>
      </revision>
    </page>

My goal is to filter out most of those tags and keep only the page tag and these inner tags: title, id, and text.

So far, I have been able to extract the page tag with title and id holding the right values. This is what I get:

    <page>
      <title>Antoine Meillet</title>
      <id>3</id>
      <text bytes="0" xml:space=""></text>
    </page>
    <page>
      <title>Algèbre linéaire</title>
      <id>7</id>
      <text bytes="0" xml:space=""></text>
    </page>

As you can see, the problem is that the attributes of the text tag have the wrong values, and the tag contains no text.

I got this output with the following code:

    package main

    import (
    	"encoding/xml"
    	"fmt"
    	"io"
    	"os"
    )

    type Page struct {
    	XMLName xml.Name `xml:"page"`
    	Title   string   `xml:"title"`
    	Id      int64    `xml:"id"`
    	Text    struct {
    		Key   float32 `xml:"bytes,attr"`
    		Space string  `xml:"xml:space,attr"`
    	} `xml:"text"`
    }

    func main() {
    	frwikiXML, err := os.Open("frwiki10000.xml")
    	if err != nil {
    		fmt.Println(err)
    	}
    	cleanedWikiXML, err := os.Create("cleaned_fr_wiki.xml")
    	if err != nil {
    		fmt.Println(err)
    	}
    	cleanXMLEncoder := xml.NewEncoder(cleanedWikiXML)
    	cleanXMLEncoder.Indent("", " ")
    	frwikiDecoder := xml.NewDecoder(frwikiXML)
    	for {
    		t, tokenErr := frwikiDecoder.Token()
    		if tokenErr != nil {
    			if tokenErr == io.EOF {
    				break
    			}
    			fmt.Errorf("decoding token: %w", tokenErr)
    		}
    		switch t := t.(type) {
    		case xml.StartElement:
    			if t.Name.Local == "page" {
    				var page Page
    				if err := frwikiDecoder.DecodeElement(&page, &t); err != nil {
    					fmt.Errorf("decoding element %q: %v", t.Name.Local, err)
    				}
    				fmt.Println("Element was decoded successfully.")
    				fmt.Printf("Page title: %v\n Page id: %d\n", page.Title, page.Id)
    				fmt.Printf("Text: %v", page.Text)
    				cleanXMLEncoder.Encode(page)
    			}
    		}
    	}
    	defer frwikiXML.Close()
    	defer cleanedWikiXML.Close()
    }

How would I be able to solve this problem, please?

Thanks.

Answer 1

Score: 1

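To parse a huge XML file, use the standard xml.Decoder.

Call Token to read tokens one by one. When a start element with the required name ("page") is found, call DecodeElement to decode that element and prepare the result for the next step. Note that in the input, text is a child of revision, not of page, so the struct used for decoding has to mirror that nesting; the result is then copied into a flatter target struct for output.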

    package main

    import (
    	"encoding/xml"
    	"fmt"
    	"io"
    	"strings"
    )

    // Text holds the attributes and the character data of a <text> element.
    type Text struct {
    	// bytes is an integer count, so an int avoids float formatting on output.
    	Key int `xml:"bytes,attr"`
    	// The decoder resolves the xml: prefix to this namespace URL, so the
    	// tag "xml:space,attr" would never match the attribute.
    	Space string `xml:"http://www.w3.org/XML/1998/namespace space,attr"`
    	// Without a chardata field the text inside the element is dropped.
    	Content string `xml:",chardata"`
    }

    // Page mirrors the input, where <text> is nested inside <revision>.
    type Page struct {
    	XMLName  xml.Name `xml:"page"`
    	Title    string   `xml:"title"`
    	Id       int64    `xml:"id"`
    	Revision struct {
    		Text Text `xml:"text"`
    	} `xml:"revision"`
    }

    // PageTarget is the flattened shape that gets written out.
    type PageTarget struct {
    	XMLName xml.Name `xml:"page"`
    	Title   string   `xml:"title"`
    	Id      int64    `xml:"id"`
    	Text    Text     `xml:"text"`
    }

    // sample is a shortened copy of the input shown in the question.
    const sample = `<page>
      <title>Antoine Meillet</title>
      <id>3</id>
      <revision>
        <id>178204512</id>
        <text bytes="11274" xml:space="preserve">a lot of text</text>
      </revision>
    </page>`

    func main() {
    	dec := xml.NewDecoder(strings.NewReader(sample))
    	for {
    		tok, err := dec.Token()
    		if err == io.EOF {
    			break // end of input
    		}
    		if err != nil {
    			panic(err)
    		}
    		// Skip everything that is not the start of a <page> element.
    		se, ok := tok.(xml.StartElement)
    		if !ok || se.Name.Local != "page" {
    			continue
    		}
    		var page Page
    		if err := dec.DecodeElement(&page, &se); err != nil {
    			panic(err)
    		}
    		target := PageTarget{
    			XMLName: page.XMLName,
    			Id:      page.Id,
    			Title:   page.Title,
    			Text:    page.Revision.Text,
    		}
    		out, err := xml.MarshalIndent(target, " ", " ")
    		if err != nil {
    			panic(err)
    		}
    		fmt.Println(string(out))
    	}
    }
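
A likely reason the question's output shows `xml:space=""` is the way encoding/xml matches attribute names: the decoder resolves the `xml:` prefix to the namespace URL `http://www.w3.org/XML/1998/namespace`, so a tag spelled `xml:space,attr` is compared against the local name `space` and never matches. The sketch below isolates that behavior on a single text element and also shows a `,chardata` field picking up the element's text. The type names `prefixed` and `qualified`, the helper `decodeBoth`, and the shortened sample element are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// prefixed uses the tag spelling from the question (hypothetical type name).
type prefixed struct {
	Space string `xml:"xml:space,attr"`
}

// qualified names the attribute by its namespace URL and also captures the
// element's character data (hypothetical type name).
type qualified struct {
	Bytes   int    `xml:"bytes,attr"`
	Space   string `xml:"http://www.w3.org/XML/1998/namespace space,attr"`
	Content string `xml:",chardata"`
}

// textElem is a shortened stand-in for the <text> element in the question.
const textElem = `<text bytes="13" xml:space="preserve">a lot of text</text>`

// decodeBoth decodes the same element into both struct shapes.
func decodeBoth() (prefixed, qualified, error) {
	var p prefixed
	var q qualified
	if err := xml.Unmarshal([]byte(textElem), &p); err != nil {
		return p, q, err
	}
	if err := xml.Unmarshal([]byte(textElem), &q); err != nil {
		return p, q, err
	}
	return p, q, nil
}

func main() {
	p, q, err := decodeBoth()
	if err != nil {
		panic(err)
	}
	// p.Space stays empty because its tag never matches the attribute.
	fmt.Printf("prefixed:  space=%q\n", p.Space)
	fmt.Printf("qualified: bytes=%d space=%q content=%q\n", q.Bytes, q.Space, q.Content)
}
```

A tag of just `space,attr` should also match on decoding, since a tag without a namespace matches on the local name alone; the namespace-qualified form is used here so that the field stays tied to the `xml:` namespace.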


Answer 2

Score: 0

Simply decoding into a struct and encoding it again will achieve your goal.

Please check this: https://go.dev/play/p/69vjlve4P6p

huangapple
  • Posted on January 20, 2022, 07:13:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/70778945.html