Question

Is encoding/xml the best library for parsing HTML table files like the one in the code below, and are there any examples that show how to do it?

package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type Table struct {
	XMLName xml.Name `xml:"table"`
	Rows    []Row    `xml:"tbody>tr"`
}

type Row struct {
	Cells []Cell `xml:"td"`
}

type Cell struct {
	Data string `xml:",innerxml"`
}

func main() {
	html := `
	<html><head>
	<meta charset="utf-8">

	</head>
	<body>
	<a name="Test1">
	<center>
	<b>Test 1</b> <table border="0">
	  <tbody><tr>
	  <th> Type </th>
	  <th> Region </th>
	  </tr>
	  <tr>
	  <td> <table border="0">
	  <thead>
	  <tr>
	    <th><b>Type</b></th>
	    <th> &nbsp; </th>
	    <th> Count </th>
	    <th> Percent </th>
	  </tr>
	  </thead>
	  <tbody><tr>
	    <td> <b>T1</b> </td>
	    <th> &nbsp; </th>
	    <td class="numeric" bgcolor="#ff0000"> 34,314 </td>
	    <td class="numeric" bgcolor="#ff0000"> 31.648% </td>
	  </tr>
	  <tr>
	    <td> <b>T2</b> </td>
	    <th> &nbsp; </th>
	    <td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
	    <td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
	  </tr>
	  <tr>
	    <td> <b>T3</b> </td>
	    <th> &nbsp; </th>
	    <td class="numeric" bgcolor="#24da00"> 4,871 </td>
	    <td class="numeric" bgcolor="#24da00"> 4.493% </td>
	  </tr>

	</tbody></table><br>
	</td>
	  <td> <table border="0">
	  <thead>
	  <tr>
	    <th><b> Type</b></th>
	    <th> &nbsp; </th>
	    <th> Count </th>
	    <th> Percent </th>
	  </tr>
	  </thead>
	  <tbody><tr>
	    <td> <b>T4</b> </td>
	    <th> &nbsp; </th>
	    <td class="numeric" bgcolor="#ff0000"> 34,314 </td>
	    <td class="numeric" bgcolor="#ff0000"> 31.648% </td>
	  </tr>
	  <tr>
	    <td> <b>T5</b> </td>
	    <th> &nbsp; </th>
	    <td class="numeric" bgcolor="#53ab00"> 11,187 </td>
	    <td class="numeric" bgcolor="#53ab00"> 10.318% </td>
	  </tr>
	  <tr>
	    <td> <b>T6</b> </td>
	    <th> &nbsp; </th>
	    <td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
	    <td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
	  </tr>

	</tbody></table><br>
	</td>
	  </tr>
	</tbody></table>
	</center>

	  </a>
	</body></html>
	`

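	decoder := xml.NewDecoder(strings.NewReader(html))
	// The input is HTML, not well-formed XML: <meta> and <br> are never
	// closed and &nbsp; is not a predefined XML entity, so relax the decoder.
	decoder.Strict = false
	decoder.AutoClose = xml.HTMLAutoClose
	decoder.Entity = xml.HTMLEntity

	// The document root is <html>, so skip ahead to the first <table>
	// and decode only that element.
	var table Table
	for {
		tok, err := decoder.Token()
		if err != nil {
			fmt.Println("Error decoding XML:", err)
			return
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "table" {
			if err := decoder.DecodeElement(&table, &se); err != nil {
				fmt.Println("Error decoding XML:", err)
				return
			}
			break
		}
	}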

	for _, row := range table.Rows {
		for _, cell := range row.Cells {
			fmt.Println(cell.Data)
		}
		fmt.Println()
	}
}

The code above is a sample that parses the HTML table with the encoding/xml library.


Thank you in advance.

Answer 1

Score: 4

It depends on your HTML.

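Strictly speaking, the only kind of HTML that a conforming XML parser is guaranteed to parse is XHTML. Although XHTML was once expected to become the HTML standard, it never really took off, and these days it is considered obsolete (in favor of the much-hyped "HTML5" and the ecosystem around it). The basic problem with HTML is that, while it looks like XML, it follows different rules. One glaring difference is that <br> is perfectly legal HTML but an unterminated element in XML (where it has to be spelled <br/>), and there are many more differences.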
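To make the <br> point concrete, here is a minimal sketch that feeds small snippets to encoding/xml in its default strict mode. The helper name wellFormed and the sample snippets are illustrative assumptions, not part of the original question or answer.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wellFormed (hypothetical helper) reads every token of doc with a strict
// XML decoder and returns the first syntax error, or nil if there is none.
func wellFormed(doc string) error {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return nil // reached the end without a syntax error
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	for _, doc := range []string{
		"<p>a<br>b</p>",  // HTML-style void element, never closed
		"<p>a<br/>b</p>", // XML-style self-closing element
		"<p>&nbsp;</p>",  // HTML entity that XML does not predefine
	} {
		fmt.Printf("%q: %v\n", doc, wellFormed(doc))
	}
}
```

Only the second snippet should come back without an error; the other two are valid HTML fragments that a strict XML parser rejects.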

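On the other hand, your particular example looks quite XML-ish, so if you can guarantee that your data, while being HTML, will always be well-formed XML at the same time, you can just use the encoding/xml package. Otherwise go for go.net/html (now published as golang.org/x/net/html), as suggested by @elithrar, or find some other package.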
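If the markup is only nearly XML, encoding/xml can still be tried with its relaxed settings (Strict, AutoClose and Entity are documented fields of xml.Decoder). The sketch below is one possible approach under stated assumptions, not a general HTML parser: the function name tableCells and the sample input are made up for illustration, and it assumes tables are not nested, unlike the table in the question.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tableCells (hypothetical helper) returns the trimmed text of every <td>,
// grouped by <tr>. Assumption: tables are not nested.
func tableCells(doc string) ([][]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	dec.Strict = false                // tolerate HTML-style markup
	dec.AutoClose = xml.HTMLAutoClose // treat <br>, <meta>, ... as self-closing
	dec.Entity = xml.HTMLEntity       // resolve &nbsp; and other HTML entities

	var rows [][]string
	var row []string
	var cell strings.Builder
	inCell := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "tr":
				row = nil // start a fresh row
			case "td":
				inCell = true
				cell.Reset()
			}
		case xml.CharData:
			if inCell {
				cell.Write(t) // copy: the bytes are reused by the next Token call
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "td":
				inCell = false
				row = append(row, strings.TrimSpace(cell.String()))
			case "tr":
				if len(row) > 0 { // skip header rows that contain only <th>
					rows = append(rows, row)
				}
			}
		}
	}
}

func main() {
	// Illustrative input modeled on one row of the table in the question.
	doc := `<body><table>
	<tr><th>Type</th><th> &nbsp; </th><th>Count</th></tr>
	<tr><td> <b>T1</b> </td><th> &nbsp; </th><td class="numeric"> 34,314 </td></tr>
	</table><br></body>`

	rows, err := tableCells(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range rows {
		fmt.Println(strings.Join(r, " | "))
	}
}
```

Walking the token stream instead of unmarshalling into structs keeps the sketch independent of the exact nesting (tbody, thead, center), at the cost of tracking state by hand. Markup that is not even close to XML, such as unquoted attributes or mismatched tags, is where a real HTML parser becomes the safer choice.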

