How to convert HTML table to array with golang
Question
I'm having trouble converting an HTML table into a Go array. I've tried both x/net/html and goquery, without success with either.
Let's say we have this HTML table:
<html>
<body>
<table>
<tr>
<td>Row 1, Content 1</td>
<td>Row 1, Content 2</td>
<td>Row 1, Content 3</td>
<td>Row 1, Content 4</td>
</tr>
<tr>
<td>Row 2, Content 1</td>
<td>Row 2, Content 2</td>
<td>Row 2, Content 3</td>
<td>Row 2, Content 4</td>
</tr>
</table>
</body>
</html>
And I'd like to end up with this array:
------------------------------------
|Row 1, Content 1| Row 1, Content 2|
------------------------------------
|Row 2, Content 1| Row 2, Content 2|
------------------------------------
As you can see, I'm simply ignoring Contents 3 and 4.
My extraction code:
func extractValue(content []byte) {
doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(content))
doc.Find("table tr td").Each(func(i int, td *goquery.Selection) {
// ...
})
}
I've tried adding a counter variable to skip the <td> elements that I don't want to convert, and calling td.NextAll(), but with no luck. Do you have any idea how I can accomplish this?
Thanks.
Answer 1
Score: 7
You can get by with the golang.org/x/net/html package alone.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

var body = strings.NewReader(`
<html>
<body>
<table>
<tr>
<td>Row 1, Content 1</td>
<td>Row 1, Content 2</td>
<td>Row 1, Content 3</td>
<td>Row 1, Content 4</td>
</tr>
<tr>
<td>Row 2, Content 1</td>
<td>Row 2, Content 2</td>
<td>Row 2, Content 3</td>
<td>Row 2, Content 4</td>
</tr>
</table>
</body>
</html>`)
func main() {
	z := html.NewTokenizer(body)
	content := []string{}

	// Read tokens until the input is exhausted (or a parse error occurs).
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			break
		}
		if tt != html.StartTagToken {
			continue
		}
		if t := z.Token(); t.Data != "td" {
			continue
		}
		// Assumes the cell's text immediately follows the opening <td>.
		if z.Next() == html.TextToken {
			content = append(content, strings.TrimSpace(string(z.Text())))
		}
	}

	// Print to check the slice's content
	fmt.Println(content)
}
This code is written for this specific HTML pattern only, and it collects every cell into one flat slice, but refactoring it to be more general wouldn't be hard.
Answer 2
Score: 0
If you need a more structured way of extracting data from HTML tables, https://github.com/nfx/go-htmltable supports rowspans and colspans.
type AM4 struct {
Model string `header:"Model"`
ReleaseDate string `header:"Release date"`
PCIeSupport string `header:"PCIesupport[a]"`
MultiGpuCrossFire bool `header:"Multi-GPU CrossFire"`
MultiGpuSLI bool `header:"Multi-GPU SLI"`
USBSupport string `header:"USBsupport[b]"`
SATAPorts int `header:"Storage features SATAports"`
RAID string `header:"Storage features RAID"`
AMDStoreMI bool `header:"Storage features AMD StoreMI"`
Overclocking string `header:"Processoroverclocking"`
TDP string `header:"TDP"`
SupportExcavator string `header:"CPU support[14] Excavator"`
SupportZen string `header:"CPU support[14] Zen"`
SupportZenPlus string `header:"CPU support[14] Zen+"`
SupportZen2 string `header:"CPU support[14] Zen 2"`
SupportZen3 string `header:"CPU support[14] Zen 3"`
Architecture string `header:"Architecture"`
}
am4Chipsets, _ := htmltable.NewSliceFromURL[AM4]("https://en.wikipedia.org/wiki/List_of_AMD_chipsets")
fmt.Println(am4Chipsets[2].Model)
fmt.Println(am4Chipsets[2].SupportZen2)
// Output:
// X370
// Varies[c]
Answer 3
Score: -1
Try an approach like this to build a 2D slice and handle variable row sizes:
z := html.NewTokenizer(body)
table := [][]string{}
row := []string{}

// Read tokens until the input is exhausted (or a parse error occurs).
for {
	tt := z.Next()
	if tt == html.ErrorToken {
		break
	}
	if tt != html.StartTagToken {
		continue
	}
	switch z.Token().Data {
	case "tr":
		// A new row begins: store the previous one, if any.
		if len(row) > 0 {
			table = append(table, row)
			row = []string{}
		}
	case "td":
		// Assumes the cell's text immediately follows the opening <td>.
		if z.Next() == html.TextToken {
			row = append(row, strings.TrimSpace(string(z.Text())))
		}
	}
}

// Store the final row.
if len(row) > 0 {
	table = append(table, row)
}