Golang xml unmarshal html table



I have a simple HTML table and want to get all cell values, even if a cell contains HTML code.

I tried xml.Unmarshal, but couldn't work out the right struct tags, values, or attributes.

    package main

    import (
        "fmt"
        "encoding/xml"
    )

    type XMLTable struct {
        XMLName xml.Name `xml:"TABLE"`
        Row     []struct {
            Cell string `xml:"TD"`
        } `xml:"TR"`
    }

    func main() {
        raw_html_table := `
    <TABLE><TR>
    <TD>lalalal</TD>
    <TD>papapap</TD>
    <TD>fafafa</TD>
    <TD>
    <form action=\"/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method=POST>
    <input type=hidden name=acT value=\"Dev\">
    <input type=hidden name=acA value=\"Anyval\">
    <input type=submit name=submit value=Stop>
    </form>
    </TD>
    </TR>
    </TABLE>`
        table := XMLTable{}
        fmt.Printf("%q\n", []byte(raw_html_table)[:15])
        err := xml.Unmarshal([]byte(raw_html_table), &table)
        if err != nil {
            fmt.Printf("error: %v", err)
        }
    }

As additional info: if a cell contains HTML code, I don't need to parse it; I only want it as a raw []byte / string value. I could strip the cell content before unmarshaling, but that is not easy either.

Any suggestions with standard golang libs would be welcome.


Score: 4




Sticking to the standard lib

Your input is not valid XML, so even if you model it right, you won't be able to parse it.

First, you're using a raw string literal to define your input HTML, and raw string literals do not process escape sequences. For example, this:

    <form action=\"/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method=POST>

In a raw string literal, \" is not an escape: it means exactly those 2 characters, a backslash and a quotation mark. You don't need it either; just use a plain quotation mark: ".
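To make the difference concrete, here is a minimal sketch comparing the same two source characters in a raw literal and in an interpreted literal. The function name rawAndInterpreted is made up for this illustration.

```go
package main

import "fmt"

// rawAndInterpreted (hypothetical helper) returns the source text \" written
// once inside a raw string literal and once inside an interpreted one.
func rawAndInterpreted() (raw, interpreted string) {
	raw = `\"`         // raw literal: kept as-is, a backslash plus a quote
	interpreted = "\"" // interpreted literal: the escape yields a single quote
	return raw, interpreted
}

func main() {
	raw, interpreted := rawAndInterpreted()
	fmt.Printf("raw: %q (len %d)\n", raw, len(raw))
	fmt.Printf("interpreted: %q (len %d)\n", interpreted, len(interpreted))
}
```

The raw version has length 2 and the interpreted one length 1, which is why the backslashes in the question end up inside the XML input.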

Next, in XML you cannot have attributes without putting their values in quotes.

Third, each element must have a matching closing element; your <input> elements are not closed.

So for example this line:

    <input type=hidden name=acT value=\"Dev\">

Must be changed to:

    <input type="hidden" name="acT" value="Dev" />

OK, after these changes the input is valid XML.

How to model it? Simple as this:

    type XMLTable struct {
        Rows []struct {
            Cell string `xml:",innerxml"`
        } `xml:"TR>TD"`
    }

The TR>TD tag makes the decoder descend through each <TR> to its <TD> children, and ,innerxml stores the raw, unparsed XML found inside each cell.

And the full code to parse and print contents of <TD> elements:

    package main

    import (
        "encoding/xml"
        "fmt"
    )

    type XMLTable struct {
        Rows []struct {
            Cell string `xml:",innerxml"`
        } `xml:"TR>TD"`
    }

    func main() {
        raw_html_table := `
    <TABLE><TR>
    <TD>lalalal</TD>
    <TD>papapap</TD>
    <TD>fafafa</TD>
    <TD>
    <form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
    <input type="hidden" name="acT" value="Dev" />
    <input type="hidden" name="acA" value="Anyval" />
    <input type="submit" name="submit" value="Stop" />
    </form>
    </TD>
    </TR>
    </TABLE>`
        table := XMLTable{}
        err := xml.Unmarshal([]byte(raw_html_table), &table)
        if err != nil {
            fmt.Printf("error: %v\n", err)
        }
        // Rows holds one entry per <TD>, flattened across all <TR> rows.
        fmt.Println("count:", len(table.Rows))
        for _, row := range table.Rows {
            fmt.Println("TD content:", row.Cell)
        }
    }

Output (try it on the Go Playground):

    count: 4
    TD content: lalalal
    TD content: papapap
    TD content: fafafa
    TD content:
    <form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
    <input type="hidden" name="acT" value="Dev" />
    <input type="hidden" name="acA" value="Anyval" />
    <input type="submit" name="submit" value="Stop" />
    </form>
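Note that the TR>TD model flattens every cell into one slice, so row boundaries are lost. If they matter, a nested model could be sketched as below. The names RowTable, Cells, Content and parseRows are made up for this illustration and are not part of the answer above; the sample input is also invented.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// RowTable is a hypothetical variant of XMLTable that keeps cells grouped
// by the <TR> row they belong to.
type RowTable struct {
	Rows []struct {
		Cells []struct {
			Content string `xml:",innerxml"` // raw XML inside each <TD>
		} `xml:"TD"`
	} `xml:"TR"`
}

// parseRows unmarshals a well-formed XML table into a RowTable.
func parseRows(input string) (RowTable, error) {
	var tbl RowTable
	err := xml.Unmarshal([]byte(input), &tbl)
	return tbl, err
}

func main() {
	input := `<TABLE>
<TR><TD>a</TD><TD><b>bold</b></TD></TR>
<TR><TD>c</TD></TR>
</TABLE>`
	tbl, err := parseRows(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, row := range tbl.Rows {
		for j, cell := range row.Cells {
			fmt.Printf("row %d, cell %d: %s\n", i, j, cell.Content)
		}
	}
}
```

As with the flat model, this only works once the input is well-formed XML.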

Using a proper HTML parser

If you can't or don't want to change the input HTML, or you want to handle all HTML input, not just input that happens to be valid XML, you should use a proper HTML parser instead of treating the input as XML.

Check out https://godoc.org/golang.org/x/net/html for an HTML5-compliant tokenizer and parser.


Score: 1



Once your input is valid HTML (your snippet is missing quotes in attributes), you can configure an xml.Decoder with entity and auto-close maps (and make it non-strict), which will end up working:

Run my modified version here.

    package main

    import (
        "encoding/xml"
        "fmt"
        "strings"
    )

    type XMLTable struct {
        Rows []struct {
            Cell string `xml:",innerxml"`
        } `xml:"TR>TD"`
    }

    func main() {
        raw_html_table := `
    <TABLE><TR>
    <TD>lalalal</TD>
    <TD>papapap</TD>
    <TD>fafafa</TD>
    <TD>
    <form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
    <input type="hidden" name="acT" value="Dev">
    <input type="hidden" name="acA" value="Anyval">
    <input type="submit" name="submit" value="Stop">
    </form>
    </TD>
    </TR>
    </TABLE>`
        table := XMLTable{}
        decoder := xml.NewDecoder(strings.NewReader(raw_html_table))
        decoder.Entity = xml.HTMLEntity
        decoder.AutoClose = xml.HTMLAutoClose
        decoder.Strict = false
        err := decoder.Decode(&table)
        if err != nil {
            fmt.Printf("error: %v", err)
        }
        fmt.Printf("%#v\n", table)
    }
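To see what the auto-close setting does to an unclosed <input>, here is a small sketch that walks the token stream of a non-strict decoder and records the element boundaries it reports. The helper name tokenNames and the one-line sample input are made up for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenNames (hypothetical helper) reads the input with a non-strict decoder
// configured like the one above and lists each start and end element it reports.
func tokenNames(input string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(input))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var names []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return names, nil // end of input reached cleanly
		}
		if err != nil {
			return names, err
		}
		switch el := tok.(type) {
		case xml.StartElement:
			names = append(names, "start:"+el.Name.Local)
		case xml.EndElement:
			names = append(names, "end:"+el.Name.Local)
		}
	}
}

func main() {
	names, err := tokenNames(`<TD><input type="hidden" name="acT" value="Dev"></TD>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```

For this input the decoder should report an end:input token right after start:input, even though the source never closes the element; that synthetic end element is what lets the unclosed <input> tags decode.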

  • Posted on May 17, 2017 at 14:36:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44017285.html



