Golang xml unmarshal html table
Question
I have a simple HTML table and want to get the value of every cell, even when a cell contains HTML markup.
I tried xml.Unmarshal, but could not work out the right struct tags to capture the values or attributes.
import (
	"fmt"
	"encoding/xml"
)

type XMLTable struct {
	XMLName xml.Name `xml:"TABLE"`
	Row     []struct {
		Cell string `xml:"TD"`
	} `xml:"TR"`
}

func main() {
	raw_html_table := `
<TABLE><TR>
<TD>lalalal</TD>
<TD>papapap</TD>
<TD>fafafa</TD>
<TD>
<form action=\"/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method=POST>
<input type=hidden name=acT value=\"Dev\">
<input type=hidden name=acA value=\"Anyval\">
<input type=submit name=submit value=Stop>
</form>
</TD>
</TR>
</TABLE>`
	table := XMLTable{}
	fmt.Printf("%q\n", []byte(raw_html_table)[:15])
	err := xml.Unmarshal([]byte(raw_html_table), &table)
	if err != nil {
		fmt.Printf("error: %v", err)
	}
}
As additional information: I don't care about a cell's content when it is HTML markup (I only need plain []byte / string values). So I could delete that content before unmarshaling, but that is not so easy either.
Any suggestions that use the standard Go libraries would be welcome.
Answer 1
Score: 4
Sticking to the standard lib
Your input is not valid XML, so even if you model it right, you won't be able to parse it.
First, you're using a raw string literal to define your input HTML as a string, and escape sequences are not interpreted inside raw string literals. For example, take this line:
<form action=\"/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method=POST>
Inside a raw string literal, \" is not an escape: it means exactly those two characters, a backslash followed by a quotation mark. You don't need it either; just use a plain quotation mark: ".
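To make the difference concrete, here is a minimal sketch comparing the same source text written as a raw string literal and as an interpreted string literal. The helper name rawAndInterpreted is made up for this illustration.

```go
package main

import "fmt"

// rawAndInterpreted (hypothetical helper) returns the source text \" written
// once as a raw string literal and once as an interpreted string literal.
func rawAndInterpreted() (raw, interpreted string) {
	raw = `\"`         // kept verbatim: a backslash and a quotation mark (2 bytes)
	interpreted = "\"" // the escape sequence yields a single quotation mark (1 byte)
	return raw, interpreted
}

func main() {
	raw, interpreted := rawAndInterpreted()
	fmt.Printf("raw literal:         %q, %d bytes\n", raw, len(raw))
	fmt.Printf("interpreted literal: %q, %d bytes\n", interpreted, len(interpreted))
}
```

So the backslashes in the question's table end up in the data handed to the XML parser.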
Next, in XML attribute values must always be enclosed in quotes.
Third, every element must either have a matching end tag or be self-closing; your <input> elements are never closed.
So for example this line:
<input type=hidden name=acT value=\"Dev\">
Must be changed to:
<input type="hidden" name="acT" value="Dev" />
After these changes, the input is valid XML.
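As a rough check of the last two points, the sketch below feeds small fragments to encoding/xml in its default (strict) mode. The helper name parses and the shortened fragments are made up for this illustration; unmarshaling into an empty struct is used here only to test whether the input is well-formed.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// parses (hypothetical helper) returns nil if encoding/xml accepts the
// fragment in strict mode, or the parse error otherwise.
func parses(fragment string) error {
	var v struct{} // no fields: child elements are skipped, syntax is still checked
	return xml.Unmarshal([]byte(fragment), &v)
}

func main() {
	fragments := []string{
		// Unquoted attribute values: rejected by the strict parser.
		`<input type=hidden name=acT value="Dev">`,
		// Quoted attributes, but <input> is never closed: rejected.
		`<form><input type="hidden" name="acT" value="Dev"></form>`,
		// Quoted attributes and a self-closing <input />: accepted.
		`<form><input type="hidden" name="acT" value="Dev" /></form>`,
	}
	for _, f := range fragments {
		fmt.Printf("%s\n  error: %v\n", f, parses(f))
	}
}
```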
How to model it? Simple as this:
type XMLTable struct {
	Rows []struct {
		Cell string `xml:",innerxml"`
	} `xml:"TR>TD"`
}
And the full code to parse and print the contents of the <TD> elements:
raw_html_table := `
<TABLE><TR>
<TD>lalalal</TD>
<TD>papapap</TD>
<TD>fafafa</TD>
<TD>
<form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
<input type="hidden" name="acT" value="Dev" />
<input type="hidden" name="acA" value="Anyval" />
<input type="submit" name="submit" value="Stop" />
</form>
</TD>
</TR>
</TABLE>`
table := XMLTable{}
err := xml.Unmarshal([]byte(raw_html_table), &table)
if err != nil {
	fmt.Printf("error: %v\n", err)
}
fmt.Println("count:", len(table.Rows))
for _, row := range table.Rows {
	fmt.Println("TD content:", row.Cell)
}
Output:
count: 4
TD content: lalalal
TD content: papapap
TD content: fafafa
TD content:
<form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
<input type="hidden" name="acT" value="Dev" />
<input type="hidden" name="acA" value="Anyval" />
<input type="submit" name="submit" value="Stop" />
</form>
Using a proper HTML parser
If you can't or don't want to change the input HTML, or you want to handle arbitrary HTML input rather than only input that happens to be valid XML, you should use a proper HTML parser instead of treating the input as XML.
Check out https://godoc.org/golang.org/x/net/html for an HTML5-compliant tokenizer and parser.
Answer 2
Score: 1
Once your input is valid HTML (your snippet is missing quotes around attribute values), you can configure an xml.Decoder with the HTML entity and auto-close maps (and make it non-strict), which ends up working:
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type XMLTable struct {
	Rows []struct {
		Cell string `xml:",innerxml"`
	} `xml:"TR>TD"`
}

func main() {
	raw_html_table := `
<TABLE><TR>
<TD>lalalal</TD>
<TD>papapap</TD>
<TD>fafafa</TD>
<TD>
<form action="/addedUrl/;jsessionid=KJHSDFKJLSDF293847odhf" method="POST">
<input type="hidden" name="acT" value="Dev">
<input type="hidden" name="acA" value="Anyval">
<input type="submit" name="submit" value="Stop">
</form>
</TD>
</TR>
</TABLE>`
	table := XMLTable{}
	decoder := xml.NewDecoder(strings.NewReader(raw_html_table))
	decoder.Entity = xml.HTMLEntity
	decoder.AutoClose = xml.HTMLAutoClose
	decoder.Strict = false
	err := decoder.Decode(&table)
	if err != nil {
		fmt.Printf("error: %v", err)
	}
	fmt.Printf("%#v\n", table)
}
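The question mentions that cells holding HTML markup can be ignored. One possible way to get there with the same non-strict decoder settings is to walk the token stream and keep only the text that sits directly inside each TD. This is a sketch rather than a tested solution for arbitrary HTML; the helper name cellTexts and the shortened input are made up for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// cellTexts (hypothetical helper) returns, for every TD element, only the
// character data found directly inside it; nested elements are skipped.
func cellTexts(input string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(input))
	decoder.Strict = false
	decoder.AutoClose = xml.HTMLAutoClose
	decoder.Entity = xml.HTMLEntity

	var cells []string
	var text strings.Builder
	depth := 0 // 0: outside any TD, 1: directly inside a TD, >1: inside nested markup
	for {
		tok, err := decoder.Token()
		if err == io.EOF {
			return cells, nil
		}
		if err != nil {
			return cells, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth > 0 {
				depth++ // an element nested inside the current cell
			} else if strings.EqualFold(t.Name.Local, "td") {
				depth = 1
				text.Reset()
			}
		case xml.EndElement:
			if depth == 1 { // this end tag closes the cell itself
				cells = append(cells, strings.TrimSpace(text.String()))
			}
			if depth > 0 {
				depth--
			}
		case xml.CharData:
			if depth == 1 {
				text.Write(t)
			}
		}
	}
}

func main() {
	cells, err := cellTexts(`<TABLE><TR><TD>lalalal</TD><TD><form action="/x"><input type="hidden" name="acT" value="Dev"></form></TD></TR></TABLE>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", cells)
}
```

A cell that contains only markup comes back as an empty string, so the slice still has one entry per TD.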