How to convert HTML table to array with golang

Question

I'm having trouble converting an HTML table into a Go array. I've tried both x/net/html and goquery, without success with either.

Let's say we have this HTML table:

<html>
  <body>
    <table>
      <tr>
        <td>Row 1, Content 1</td>
        <td>Row 1, Content 2</td>
        <td>Row 1, Content 3</td>
        <td>Row 1, Content 4</td>
      </tr>
      <tr>
        <td>Row 2, Content 1</td>
        <td>Row 2, Content 2</td>
        <td>Row 2, Content 3</td>
        <td>Row 2, Content 4</td>
      </tr>
    </table>
  </body>
</html>

And I'd like to end up with this array:

------------------------------------
|Row 1, Content 1| Row 1, Content 2|
------------------------------------
|Row 2, Content 1| Row 2, Content 2|
------------------------------------

As you can see, I'm ignoring Contents 3 and 4 in each row.

My extraction code:

func extractValue(content []byte) {
  doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(content))

  doc.Find("table tr td").Each(func(i int, td *goquery.Selection) {
    // ...
  })
}

I've tried adding a counter variable to skip the <td> elements I don't want to convert, and calling

td.NextAll()

but with no luck. Does anyone have an idea of how to accomplish this?

Thanks.

Answer 1

Score: 7

You can get away with using only the golang.org/x/net/html package.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

var body = strings.NewReader(`
    <html>
    <body>
    <table>
    <tr>
    <td>Row 1, Content 1</td>
    <td>Row 1, Content 2</td>
    <td>Row 1, Content 3</td>
    <td>Row 1, Content 4</td>
    </tr>
    <tr>
    <td>Row 2, Content 1</td>
    <td>Row 2, Content 2</td>
    <td>Row 2, Content 3</td>
    <td>Row 2, Content 4</td>
    </tr>
    </table>
    </body>
    </html>`)

func main() {
    z := html.NewTokenizer(body)
    content := []string{}

    // Read tokens until the tokenizer reports html.ErrorToken, which also
    // signals the end of the input. This avoids looping forever when the
    // document has no closing </html> tag.
    for {
        tt := z.Next()
        if tt == html.ErrorToken {
            break
        }
        if tt == html.StartTagToken {
            t := z.Token()
            if t.Data == "td" {
                // The token right after <td> is expected to be the cell text.
                inner := z.Next()
                if inner == html.TextToken {
                    text := string(z.Text())
                    content = append(content, strings.TrimSpace(text))
                }
            }
        }
    }
    // Print to check the slice's content
    fmt.Println(content)
}

This code is written for this specific HTML pattern only, but refactoring it to be more general wouldn't be hard.
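The tokenizer loop in this answer collects every cell into one flat slice, while the question asks for rows holding only the first two cells. One possible follow-up step is sketched below. The helper name chunkRows and its rowWidth and keep parameters are made up for this illustration, and it assumes every row has the same number of cells (4 in the sample table), since the flat slice does not record where rows begin or end.

```go
package main

import "fmt"

// chunkRows splits a flat list of cell texts into rows of rowWidth cells
// and keeps only the first keep cells of each row.
// rowWidth is an assumption about the table shape: the flat slice itself
// carries no row boundaries.
func chunkRows(cells []string, rowWidth, keep int) [][]string {
	if rowWidth <= 0 || keep <= 0 {
		return nil
	}
	if keep > rowWidth {
		keep = rowWidth
	}
	var rows [][]string
	for start := 0; start < len(cells); start += rowWidth {
		end := start + keep
		if end > len(cells) {
			end = len(cells) // a short final row keeps whatever it has
		}
		// The three-index slice caps capacity so that appending to a row
		// cannot overwrite cells belonging to the next one.
		rows = append(rows, cells[start:end:end])
	}
	return rows
}

func main() {
	// Stand-in for the content slice built by the tokenizer loop.
	content := []string{
		"Row 1, Content 1", "Row 1, Content 2", "Row 1, Content 3", "Row 1, Content 4",
		"Row 2, Content 1", "Row 2, Content 2", "Row 2, Content 3", "Row 2, Content 4",
	}
	for _, row := range chunkRows(content, 4, 2) {
		fmt.Println(row)
	}
}
```

If rows can differ in length, this fixed-width split is not reliable, and tracking <tr> tags while tokenizing is the safer route.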

Answer 2

Score: 0

If you need a more structured way of extracting data from HTML tables, https://github.com/nfx/go-htmltable supports rowspans and colspans.

type AM4 struct {
    Model             string `header:"Model"`
    ReleaseDate       string `header:"Release date"`
    PCIeSupport       string `header:"PCIesupport[a]"`
    MultiGpuCrossFire bool   `header:"Multi-GPU CrossFire"`
    MultiGpuSLI       bool   `header:"Multi-GPU SLI"`
    USBSupport        string `header:"USBsupport[b]"`
    SATAPorts         int    `header:"Storage features SATAports"`
    RAID              string `header:"Storage features RAID"`
    AMDStoreMI        bool   `header:"Storage features AMD StoreMI"`
    Overclocking      string `header:"Processoroverclocking"`
    TDP               string `header:"TDP"`
    SupportExcavator  string `header:"CPU support[14] Excavator"`
    SupportZen        string `header:"CPU support[14] Zen"`
    SupportZenPlus    string `header:"CPU support[14] Zen+"`
    SupportZen2       string `header:"CPU support[14] Zen 2"`
    SupportZen3       string `header:"CPU support[14] Zen 3"`
    Architecture      string `header:"Architecture"`
}
am4Chipsets, _ := htmltable.NewSliceFromURL[AM4]("https://en.wikipedia.org/wiki/List_of_AMD_chipsets")
fmt.Println(am4Chipsets[2].Model)
fmt.Println(am4Chipsets[2].SupportZen2)

// Output:
// X370
// Varies[c]

Answer 3

Score: -1

Try an approach like this to build a 2D slice and handle variable row sizes:

	z := html.NewTokenizer(body)
	table := [][]string{}
	row := []string{}

	// Read tokens until html.ErrorToken, which also signals end of input.
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			break
		}
		if tt == html.StartTagToken {
			t := z.Token()

			// A new <tr> closes the previous row, if it has any cells.
			if t.Data == "tr" {
				if len(row) > 0 {
					table = append(table, row)
					row = []string{}
				}
			}

			if t.Data == "td" {
				inner := z.Next()

				if inner == html.TextToken {
					text := string(z.Text())
					row = append(row, strings.TrimSpace(text))
				}
			}

		}
	}
	// Flush the last row, which has no following <tr> to trigger the append.
	if len(row) > 0 {
		table = append(table, row)
	}
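
Once table holds one slice per row, dropping the unwanted columns can be a separate step. The sketch below is one way to do that; firstColumns is a hypothetical helper name, not part of any library, and the sample rows in main stand in for what the tokenizer loop would produce.

```go
package main

import "fmt"

// firstColumns returns a copy of table in which every row is cut down to
// at most n cells. Rows that are already shorter are kept whole, so rows
// of different lengths are handled.
func firstColumns(table [][]string, n int) [][]string {
	if n < 0 {
		n = 0
	}
	out := make([][]string, 0, len(table))
	for _, row := range table {
		if len(row) > n {
			row = row[:n]
		}
		// Copy the cells so the result does not alias the input rows.
		out = append(out, append([]string(nil), row...))
	}
	return out
}

func main() {
	// Stand-in for the table built by the tokenizer loop.
	table := [][]string{
		{"Row 1, Content 1", "Row 1, Content 2", "Row 1, Content 3", "Row 1, Content 4"},
		{"Row 2, Content 1", "Row 2, Content 2", "Row 2, Content 3", "Row 2, Content 4"},
	}
	for _, row := range firstColumns(table, 2) {
		fmt.Println(row)
	}
}
```

Keeping the trimming apart from the parsing means the same loop can serve callers that want a different number of columns.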

huangapple
  • Posted on 2016-03-13 02:14:38
  • Please keep this link when reposting: https://go.coder-hub.com/35961491.html