March 13, 2016 02:14:38 · go

How to convert HTML table to array with golang

Question

I'm having trouble converting an HTML table into a Go array. I've tried both x/net/html and goquery, without success with either.

Let's say we have this HTML table:

<html>
  <body>
    <table>
      <tr>
        <td>Row 1, Content 1</td>
        <td>Row 1, Content 2</td>
        <td>Row 1, Content 3</td>
        <td>Row 1, Content 4</td>
      </tr>
      <tr>
        <td>Row 2, Content 1</td>
        <td>Row 2, Content 2</td>
        <td>Row 2, Content 3</td>
        <td>Row 2, Content 4</td>
      </tr>
    </table>
  </body>
</html>

And I'd like to end up with this array:

------------------------------------
|Row 1, Content 1| Row 1, Content 2|
------------------------------------
|Row 2, Content 1| Row 2, Content 2|
------------------------------------

As you can see, I'm just ignoring Contents 3 and 4.

My extraction code:

func extractValue(content []byte) {
  doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(content))
  doc.Find("table tr td").Each(func(i int, td *goquery.Selection) {
    // ...
  })
}

I've tried adding a control variable to skip the <td> elements I don't want to convert, and calling

td.NextAll()

but with no luck. Do you have any idea what I should do to accomplish this?

Thanks.

Answer 1

Score: 7

You can get away with using only the golang.org/x/net/html package.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

var body = strings.NewReader(`
    <html>
    <body>
    <table>
    <tr>
    <td>Row 1, Content 1</td>
    <td>Row 1, Content 2</td>
    <td>Row 1, Content 3</td>
    <td>Row 1, Content 4</td>
    </tr>
    <tr>
    <td>Row 2, Content 1</td>
    <td>Row 2, Content 2</td>
    <td>Row 2, Content 3</td>
    <td>Row 2, Content 4</td>
    </tr>
    </table>
    </body>
    </html>`)

func main() {
    z := html.NewTokenizer(body)
    content := []string{}
    // Read tokens until the input ends; Next returns ErrorToken at EOF,
    // so the loop also terminates when the closing </html> tag is missing.
    for {
        tt := z.Next()
        if tt == html.ErrorToken {
            break
        }
        if tt != html.StartTagToken {
            continue
        }
        if t := z.Token(); t.Data != "td" {
            continue
        }
        // Assumes the cell holds plain text right after the opening <td>.
        if z.Next() == html.TextToken {
            content = append(content, strings.TrimSpace(string(z.Text())))
        }
    }
    // Print to check the slice's content
    fmt.Println(content)
}

This code only handles this particular HTML pattern, but refactoring it to be more general wouldn't be hard.

Answer 2

Score: 0

If you need a more structured way of extracting data from HTML tables, https://github.com/nfx/go-htmltable supports rowspans and colspans.

type AM4 struct {
    Model             string `header:"Model"`
    ReleaseDate       string `header:"Release date"`
    PCIeSupport       string `header:"PCIesupport[a]"`
    MultiGpuCrossFire bool   `header:"Multi-GPU CrossFire"`
    MultiGpuSLI       bool   `header:"Multi-GPU SLI"`
    USBSupport        string `header:"USBsupport[b]"`
    SATAPorts         int    `header:"Storage features SATAports"`
    RAID              string `header:"Storage features RAID"`
    AMDStoreMI        bool   `header:"Storage features AMD StoreMI"`
    Overclocking      string `header:"Processoroverclocking"`
    TDP               string `header:"TDP"`
    SupportExcavator  string `header:"CPU support[14] Excavator"`
    SupportZen        string `header:"CPU support[14] Zen"`
    SupportZenPlus    string `header:"CPU support[14] Zen+"`
    SupportZen2       string `header:"CPU support[14] Zen 2"`
    SupportZen3       string `header:"CPU support[14] Zen 3"`
    Architecture      string `header:"Architecture"`
}
am4Chipsets, _ := htmltable.NewSliceFromURL[AM4]("https://en.wikipedia.org/wiki/List_of_AMD_chipsets")
fmt.Println(am4Chipsets[2].Model)
fmt.Println(am4Chipsets[2].SupportZen2)
// Output:
// X370
// Varies[c]

Answer 3

Score: -1

Try an approach like this to make a 2d array and handle variable row sizes:

	z := html.NewTokenizer(body)
	table := [][]string{}
	row := []string{}
	// Read tokens until the input ends; Next returns ErrorToken at EOF.
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			break
		}
		if tt != html.StartTagToken {
			continue
		}
		switch z.Token().Data {
		case "tr":
			// A new row starts: store the previous one if it has cells.
			if len(row) > 0 {
				table = append(table, row)
				row = []string{}
			}
		case "td":
			// Assumes the cell holds plain text right after the opening <td>.
			if z.Next() == html.TextToken {
				row = append(row, strings.TrimSpace(string(z.Text())))
			}
		}
	}
	// Store the last row, which no following <tr> flushes.
	if len(row) > 0 {
		table = append(table, row)
	}
