How to convert HTML table to array with golang

Question

I'm having trouble converting an HTML table into a Go array. I've tried both x/net/html and goquery, without success with either.

Let's say we have this HTML table:

    <html>
      <body>
        <table>
          <tr>
            <td>Row 1, Content 1</td>
            <td>Row 1, Content 2</td>
            <td>Row 1, Content 3</td>
            <td>Row 1, Content 4</td>
          </tr>
          <tr>
            <td>Row 2, Content 1</td>
            <td>Row 2, Content 2</td>
            <td>Row 2, Content 3</td>
            <td>Row 2, Content 4</td>
          </tr>
        </table>
      </body>
    </html>

And I'd like to end up with this array:

    ------------------------------------
    |Row 1, Content 1| Row 1, Content 2|
    ------------------------------------
    |Row 2, Content 1| Row 2, Content 2|
    ------------------------------------

As you can see, I'm ignoring Contents 3 and 4 of each row.

My extraction code:

    func extractValue(content []byte) {
        doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(content))
        doc.Find("table tr td").Each(func(i int, td *goquery.Selection) {
            // ...
        })
    }

I've tried adding a counter variable to skip the <td> elements that I don't want to convert, and calling

    td.NextAll()

but with no luck. Do you have any idea what I should do to accomplish this?

Thanks.

Answer 1

Score: 7

You can do this with the golang.org/x/net/html package alone.

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    var body = strings.NewReader(`
    <html>
    <body>
    <table>
    <tr>
    <td>Row 1, Content 1</td>
    <td>Row 1, Content 2</td>
    <td>Row 1, Content 3</td>
    <td>Row 1, Content 4</td>
    </tr>
    <tr>
    <td>Row 2, Content 1</td>
    <td>Row 2, Content 2</td>
    <td>Row 2, Content 3</td>
    <td>Row 2, Content 4</td>
    </tr>
    </table>
    </body>
    </html>`)

    func main() {
        z := html.NewTokenizer(body)
        content := []string{}

        // Read tokens until the input is exhausted. Stopping on ErrorToken
        // also ends the loop when the document has no closing </html> tag.
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                break // end of input (io.EOF) or a tokenizer error
            }
            if tt == html.StartTagToken {
                t := z.Token()
                if t.Data == "td" {
                    // The token right after <td> is expected to be its text.
                    inner := z.Next()
                    if inner == html.TextToken {
                        text := strings.TrimSpace(string(z.Text()))
                        content = append(content, text)
                    }
                }
            }
        }

        // Print to check the slice's content
        fmt.Println(content)
    }

This code handles only this particular HTML pattern, but refactoring it to be more general wouldn't be hard.

Answer 2

Score: 0

If you need a more structured way to extract data from HTML tables, https://github.com/nfx/go-htmltable supports rowspans and colspans.

    package main

    import (
        "fmt"
        "log"

        "github.com/nfx/go-htmltable"
    )

    // AM4 maps table columns to fields through the header tags.
    type AM4 struct {
        Model             string `header:"Model"`
        ReleaseDate       string `header:"Release date"`
        PCIeSupport       string `header:"PCIesupport[a]"`
        MultiGpuCrossFire bool   `header:"Multi-GPU CrossFire"`
        MultiGpuSLI       bool   `header:"Multi-GPU SLI"`
        USBSupport        string `header:"USBsupport[b]"`
        SATAPorts         int    `header:"Storage features SATAports"`
        RAID              string `header:"Storage features RAID"`
        AMDStoreMI        bool   `header:"Storage features AMD StoreMI"`
        Overclocking      string `header:"Processoroverclocking"`
        TDP               string `header:"TDP"`
        SupportExcavator  string `header:"CPU support[14] Excavator"`
        SupportZen        string `header:"CPU support[14] Zen"`
        SupportZenPlus    string `header:"CPU support[14] Zen+"`
        SupportZen2       string `header:"CPU support[14] Zen 2"`
        SupportZen3       string `header:"CPU support[14] Zen 3"`
        Architecture      string `header:"Architecture"`
    }

    func main() {
        am4Chipsets, err := htmltable.NewSliceFromURL[AM4]("https://en.wikipedia.org/wiki/List_of_AMD_chipsets")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(am4Chipsets[2].Model)
        fmt.Println(am4Chipsets[2].SupportZen2)

        // Output:
        // X370
        // Varies[c]
    }

Answer 3

Score: -1

Try an approach like this to build a 2D slice and handle rows of varying length:

    // body is assumed to be an io.Reader over the HTML, such as the one in Answer 1.
    z := html.NewTokenizer(body)
    table := [][]string{}
    row := []string{}

    for {
        tt := z.Next()
        if tt == html.ErrorToken {
            break // end of input (io.EOF) or a tokenizer error
        }
        if tt == html.StartTagToken {
            t := z.Token()
            if t.Data == "tr" {
                // A new row starts: store the previous one, if any.
                if len(row) > 0 {
                    table = append(table, row)
                    row = []string{}
                }
            }
            if t.Data == "td" {
                inner := z.Next()
                if inner == html.TextToken {
                    text := strings.TrimSpace(string(z.Text()))
                    row = append(row, text)
                }
            }
        }
    }

    // Store the last row.
    if len(row) > 0 {
        table = append(table, row)
    }
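
The question only wants the first two cells of each row, so a 2D slice built this way still needs trimming. A minimal sketch of that step, using only the standard library; the helper name firstN is hypothetical, and n is assumed to be non-negative:

```go
package main

import "fmt"

// firstN returns a copy of table in which every row is cut to at most n cells.
// Rows shorter than n are kept as they are. The name firstN is illustrative,
// and n is assumed to be non-negative.
func firstN(table [][]string, n int) [][]string {
	out := make([][]string, 0, len(table))
	for _, row := range table {
		if len(row) > n {
			row = row[:n]
		}
		// Copy so the result does not share backing arrays with the input.
		out = append(out, append([]string(nil), row...))
	}
	return out
}

func main() {
	table := [][]string{
		{"Row 1, Content 1", "Row 1, Content 2", "Row 1, Content 3", "Row 1, Content 4"},
		{"Row 2, Content 1", "Row 2, Content 2", "Row 2, Content 3", "Row 2, Content 4"},
	}
	for _, row := range firstN(table, 2) {
		fmt.Println(row)
	}
	// Prints:
	// [Row 1, Content 1 Row 1, Content 2]
	// [Row 2, Content 1 Row 2, Content 2]
}
```

Because short rows are left untouched, this also works when rows have different lengths.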

huangapple
  • Posted on March 13, 2016 at 02:14:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/35961491.html