How to get '<' and '>' in XML string?
Question
Is it possible to get the '<' and '>' characters out of this XML string? Unmarshalling fails, and I can't change the strings. Can anyone help me with this? Here is my code:
package main

import (
	"encoding/xml"
	"fmt"
)

func main() {
	type Example struct {
		XMLName  xml.Name `xml:"Shop"`
		ShopName string   `xml:"ShopName"`
	}

	myString1 := `<Shop>
<ShopName>Fresh Fruit <Fruit Shop></ShopName>
</Shop>`
	myString2 := `<Shop>
<ShopName>Fresh Fruit < Fruit Shop ></ShopName>
</Shop>`

	// example 1
	var example1 Example
	err := xml.Unmarshal([]byte(myString1), &example1)
	if err != nil {
		// %v prints the error message; Println does not interpret format verbs
		fmt.Printf("error in example 1: %v\n", err)
	} else {
		fmt.Println(example1.ShopName)
	}

	// example 2
	var example2 Example
	err = xml.Unmarshal([]byte(myString2), &example2)
	if err != nil {
		fmt.Printf("error in example 2: %v\n", err)
		return
	}
	fmt.Println(example2.ShopName)
}
I get the errors below:
error in example 1: XML syntax error on line 2: attribute name without = in element
error in example 2: XML syntax error on line 2: expected element name after <
What I want to get:
Fresh Fruit <Fruit Shop>
Fresh Fruit < Fruit Shop >
Answer 1
Score: 1
The input you have is definitely invalid XML. There is a bug in the routine that creates the XML: a literal '<' or '>' in text content has to be escaped as &lt; or &gt;.
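To make the difference concrete, here is a minimal sketch that compares the raw string with a correctly escaped one. It reuses the `Example` struct from the question; `parseShop` is a helper name introduced here for illustration only.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Example mirrors the struct from the question.
type Example struct {
	XMLName  xml.Name `xml:"Shop"`
	ShopName string   `xml:"ShopName"`
}

// parseShop is a hypothetical helper: it unmarshals one <Shop> document
// and returns the decoded shop name.
func parseShop(doc string) (string, error) {
	var e Example
	if err := xml.Unmarshal([]byte(doc), &e); err != nil {
		return "", err
	}
	return e.ShopName, nil
}

func main() {
	// Raw '<' in text content: the decoder tries to read a tag and fails.
	_, err := parseShop(`<Shop><ShopName>Fresh Fruit <Fruit Shop></ShopName></Shop>`)
	fmt.Println("raw:", err)

	// Escaped as &lt; and &gt;: valid XML, decoded back to '<' and '>'.
	name, err := parseShop(`<Shop><ShopName>Fresh Fruit &lt;Fruit Shop&gt;</ShopName></Shop>`)
	fmt.Println("escaped:", name, err)
}
```

The escaped variant decodes to exactly the text the question asks for, so the remaining task is to turn the broken input into the escaped form before unmarshalling.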
Idea
Since you say you have to deal with it the way it is, here is a suggestion:
- Replace all closing tags via regex with something that will basically never appear in your input (e.g. @#lt#@/tagname@#gt#@). While doing that, save all the distinct tag names to a slice.
- Use the slice of tag names to replace the start tags in the same way.
- Now escape all remaining < and >.
- Last but not least, turn the placeholders back into the original tags: @#lt#@ to < and @#gt#@ to >.
Now you should have valid XML that is parseable.
Proof of Concept
package main

import (
	"bytes"
	"fmt"
	"log"
	"regexp"
	"sort"
)

var (
	rlt = []byte("@#lt#@") // placeholder for the '<' of a real tag
	rgt = []byte("@#gt#@") // placeholder for the '>' of a real tag
	lt  = []byte("&lt;")
	gt  = []byte("&gt;")
)

// ByLength is used for sorting strings by length.
type ByLength []string

func (s ByLength) Len() int           { return len(s) }
func (s ByLength) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }
func (s ByLength) Less(i, j int) bool { return len(s[i]) < len(s[j]) }

func main() {
	s := `<Shop>
<ShopName>Fresh Fruit <Fruit Shop></ShopName>
<ShopName attr="val1">Fresh Fruit <Shop test></ShopName>
</Shop>`

	// matches closing tags such as </ShopName>
	r1, err := regexp.Compile("</([^<>]*)>")
	if err != nil {
		log.Fatal(err)
	}

	names := []string{}
	seen := map[string]bool{}
	out := r1.ReplaceAllFunc([]byte(s), func(b []byte) []byte {
		name := b[2 : len(b)-1]
		// only append the name if it is not already in the list
		if !seen[string(name)] {
			seen[string(name)] = true
			names = append(names, string(name))
		}
		// keep the slash so the tag is still a closing tag after restoring:
		// </name> becomes @#lt#@/name@#gt#@
		buf := make([]byte, 0, len(name)+13)
		buf = append(buf, rlt...)
		buf = append(buf, '/')
		buf = append(buf, name...)
		buf = append(buf, rgt...)
		return buf
	})

	// sort names descending by length, otherwise we risk replacing parts of
	// names, as with <Shop and <ShopName
	sort.Sort(sort.Reverse(ByLength(names)))

	for _, name := range names {
		// replace only exact start tags
		out = bytes.Replace(out, []byte(fmt.Sprintf("<%s>", name)), []byte(fmt.Sprintf("@#lt#@%s@#gt#@", name)), -1)

		// replace start tags with attributes; QuoteMeta keeps regex
		// metacharacters in a tag name (e.g. '.') from being interpreted
		r3, err := regexp.Compile(fmt.Sprintf("<%s( [^<>=]+=\"[^<>]+)>", regexp.QuoteMeta(name)))
		if err != nil {
			log.Fatal(err)
		}
		out = r3.ReplaceAll(out, []byte(fmt.Sprintf("@#lt#@%s${1}@#gt#@", name)))
	}

	// escape everything that was not recognised as a tag
	out = bytes.Replace(out, []byte{'<'}, lt, -1)
	out = bytes.Replace(out, []byte{'>'}, gt, -1)

	// turn the placeholders back into real tag delimiters
	out = bytes.Replace(out, rlt, []byte{'<'}, -1)
	out = bytes.Replace(out, rgt, []byte{'>'}, -1)

	fmt.Println(string(out))
}
Notes
- This is a proof of concept. It is not optimised for performance.
- You might still run into content that is not escaped properly, and then you will need to refine the approach further. If the content contains something like <tagname> or <tagname something ="something>, it will be falsely considered a tag. Therefore expect some XML to still be invalid. Log invalid XML so you can improve the algorithm.
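One way to find the inputs worth logging is to run the repaired text through a decoder before using it. The sketch below assumes nothing beyond the standard library; `wellFormed` is a hypothetical helper name, and the second sample document illustrates the false-positive case from the note above.

```go
package main

import (
	"encoding/xml"
	"errors"
	"io"
	"log"
	"strings"
)

// wellFormed is a hypothetical helper: it walks every token of doc and
// returns the first syntax error, or nil if the decoder reaches the end.
func wellFormed(doc string) error {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := dec.Token()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	docs := []string{
		`<Shop><ShopName>Fresh Fruit &lt;Fruit Shop&gt;</ShopName></Shop>`,
		// Content that looks like a real tag slips through the escaping step.
		`<Shop><ShopName>Fresh Fruit <ShopName></ShopName></Shop>`,
	}
	for _, doc := range docs {
		if err := wellFormed(doc); err != nil {
			log.Printf("invalid XML after repair: %v; input: %q", err, doc)
		}
	}
}
```

Only the second document is logged, which gives you a concrete sample to extend the replacement rules with.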