Unmarshal XML with variable child types while preserving order of children

Question

I have an XML structure with a variable set of child elements. I want to unmarshal these child objects into a structure while preserving their order. I am currently using encoding/xml to unmarshal the xml, but this is not a strict requirement.

Sample XML structure:

<protocol>
    <!-- ... more packet elements -->
    <packet family="Npc" action="Player">
        <comment>Main NPC update message</comment>
        <array name="positions" type="NpcUpdatePosition"/>
        <break/>
        <array name="attacks" type="NpcUpdateAttack"/>
        <break/>
        <array name="chats" type="NpcUpdateChat"/>
        <break/>
        <field name="hp" type="short" optional="true"/>
        <field name="tp" type="short" optional="true"/>
    </packet>
    <!-- ... more packet elements -->
</protocol>

The variable elements to which I'm referring are child elements of the packet elements.

My models look like this:

type Protocol struct {
	Packets []ProtocolPacket `xml:"packet"`
}

type ProtocolPacket struct {
	Family       string                `xml:"family,attr"`
	Action       string                `xml:"action,attr"`
	Instructions /* ??? */             `xml:",any"`
	Comment      string                `xml:"comment"`
}

In this XML spec, there are a number of different elements such as array, break, and field, shown in the above sample, that need to be coalesced into a single slice while maintaining their order. These are referred to generally as "instructions". (comment in the example is a special case that should only ever be seen once).

I'm completely stumped on how to model the list of "instructions". One idea I had was to create an interface ProtocolInstruction with a custom unmarshaller that assigned an implementation depending on the element type, but I don't think this pattern would work, as you need to know the receiver type ahead of time for the unmarshal function to satisfy the appropriate interface.

I came across this question, but the suggested answer does not preserve the order of elements between different element names. Another idea I had was to use this method but write custom unmarshallers for each type that increment a counter and store the element index - that way even if order isn't preserved, it can at least be retrieved. However, this seems like a lot of work and a messy implementation, so I'm searching for alternatives.

Is there any way to unmarshal variable child XML elements while preserving their order in Go?

Answer 1

Score: 1

Solution 1

Drawing on the highest-rated answer (so far) to unmarshal extra attributes, you could create the simple structs:

type Protocol struct {
	Packets []Packet `xml:"packet"`
}

type Packet struct {
	Family  string `xml:"family,attr"`
	Action  string `xml:"action,attr"`
	Comment string `xml:"comment"`

	Instructions []Instruction `xml:",any"`
}

type Instruction struct {
	XMLName xml.Name
	Attrs   []xml.Attr `xml:",any,attr"`
}

Any elements in a packet not handled by the more precise rules at the top of the Packet struct will be passed to Instruction, which decodes the element into its name and a slice of its attributes. The name field is called XMLName because encoding/xml records the element name only in a field with that name.

Unmarshalling your sample XML will produce a protocol variable whose packets each hold Instructions containing rather raw XML values (which you can see me handling in the String method, later):

var protocol Protocol
if err := xml.Unmarshal([]byte(opXML), &protocol); err != nil {
	log.Fatal(err)
}

for _, it := range protocol.Packets[0].Instructions {
	fmt.Println(it)
}

Output:

{name:array attrs:{name:positions type:NpcUpdatePosition}}
{name:break attrs:{}}
{name:array attrs:{name:attacks type:NpcUpdateAttack}}
{name:break attrs:{}}
{name:array attrs:{name:chats type:NpcUpdateChat}}
{name:break attrs:{}}
{name:field attrs:{name:hp type:short optional:true}}
{name:field attrs:{name:tp type:short optional:true}}

The String method for Instruction:

func (it Instruction) String() (s string) {
	s += fmt.Sprintf("{name:%s", it.XMLName.Local)
	s += " attrs:{"
	sep := ""
	for _, attr := range it.Attrs {
		s += fmt.Sprintf("%s%s:%s", sep, attr.Name.Local, attr.Value)
		sep = " "
	}
	s += "}}"
	return
}
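Since Solution 1 leaves every attribute as a raw xml.Attr, a small lookup helper can make the values easier to consume. The following is only a sketch: the Attr method, the parse function, and the shortened sample document are hypothetical names introduced for illustration, and the element name is stored in a field called XMLName, which is where encoding/xml documents that it records the element name.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Protocol struct {
	Packets []Packet `xml:"packet"`
}

type Packet struct {
	Family       string        `xml:"family,attr"`
	Action       string        `xml:"action,attr"`
	Comment      string        `xml:"comment"`
	Instructions []Instruction `xml:",any"`
}

// Instruction keeps the element name and its raw attributes.
type Instruction struct {
	XMLName xml.Name
	Attrs   []xml.Attr `xml:",any,attr"`
}

// Attr is a hypothetical helper: it returns the value of the named
// attribute and whether that attribute was present on the element.
func (it Instruction) Attr(name string) (string, bool) {
	for _, a := range it.Attrs {
		if a.Name.Local == name {
			return a.Value, true
		}
	}
	return "", false
}

// sample is a shortened version of the XML from the question.
const sample = `<protocol>
	<packet family="Npc" action="Player">
		<comment>Main NPC update message</comment>
		<array name="positions" type="NpcUpdatePosition"/>
		<break/>
		<field name="hp" type="short" optional="true"/>
	</packet>
</protocol>`

// parse unmarshals a protocol document and reports any decoding error.
func parse(doc string) (Protocol, error) {
	var p Protocol
	err := xml.Unmarshal([]byte(doc), &p)
	return p, err
}

func main() {
	p, err := parse(sample)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	// Instructions come back in document order.
	for _, it := range p.Packets[0].Instructions {
		typ, ok := it.Attr("type")
		fmt.Println(it.XMLName.Local, typ, ok)
	}
}
```

The trade-off is that attribute values stay strings, so anything like optional="true" still has to be converted by the caller.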

Solution 2

The accepted answer to the same question exemplifies writing your own unmarshaller, as you suggested. I don't know what kind of structure you expect, and I don't know generics (maybe there's a cleaner solution with generics), so I came up with the following. The Protocol and Packet structs remain the same; the big change comes with Instruction:

type Instruction struct {
	name string

	// At most one of these is non-nil, depending on name.
	array *Array
	field *Field
}

type Array struct {
	name, type_ string
}

type Field struct {
	name, type_ string
	optional    bool
}

and its unmarshaller:

func (it *Instruction) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	it.name = start.Name.Local

	// Copy the attributes that matter for each kind of instruction.
	switch it.name {
	case "array":
		it.array = &Array{}
		for _, attr := range start.Attr {
			value := attr.Value
			switch attr.Name.Local {
			case "name":
				it.array.name = value
			case "type":
				it.array.type_ = value
			}
		}
	case "field":
		it.field = &Field{}
		for _, attr := range start.Attr {
			value := attr.Value
			switch attr.Name.Local {
			case "name":
				it.field.name = value
			case "type":
				it.field.type_ = value
			case "optional":
				vb, err := strconv.ParseBool(value)
				if err != nil {
					return err
				}
				it.field.optional = vb
			}
		}
	}

	// Consume the rest of the element, including its end tag.
	return d.Skip()
}

func (it Instruction) String() (s string) {
	switch it.name {
	case "array":
		s = fmt.Sprintf("{array: {name:%s type:%s}}", it.array.name, it.array.type_)
	case "break":
		s = "{break: {}}"
	case "field":
		s = fmt.Sprintf("{field: {name:%s type:%s optional:%t}}", it.field.name, it.field.type_, it.field.optional)
	}
	return
}

Using the same unmarshalling code in main (from above):

{array: {name:positions type:NpcUpdatePosition}}
{break: {}}
{array: {name:attacks type:NpcUpdateAttack}}
{break: {}}
{array: {name:chats type:NpcUpdateChat}}
{break: {}}
{field: {name:hp type:short optional:true}}
{field: {name:tp type:short optional:true}}

Solution 3

Drawing inspiration from the RawMessage (Unmarshal) example in the JSON documentation, it looks like embracing the any type can allow the simplest struct representation I've tried so far:

type Protocol struct {
	Packets []Packet `xml:"packet"`
}

type Packet struct {
	Family  string `xml:"family,attr"`
	Action  string `xml:"action,attr"`
	Comment string `xml:"comment"`

	Instructions []any `xml:",any"`
}

type Array struct {
	Name string `xml:"name,attr"`
	Type string `xml:"type,attr"`
}

type Break struct{}

type Field struct {
	Name     string `xml:"name,attr"`
	Type     string `xml:"type,attr"`
	Optional bool   `xml:"optional,attr"`
}

which makes using the structs look more straightforward (for my sensibilities):

var p Protocol
must(xml.Unmarshal([]byte(sXML), &p))
for _, it := range p.Packets[0].Instructions {
	fmt.Printf("%T: %+v\n", it, it)
}

to get:

*main.Array: &{Name:positions Type:NpcUpdatePosition}
*main.Break: &{}
*main.Array: &{Name:attacks Type:NpcUpdateAttack}
*main.Break: &{}
*main.Array: &{Name:chats Type:NpcUpdateChat}
*main.Break: &{}
*main.Field: &{Name:hp Type:short Optional:true}
*main.Field: &{Name:tp Type:short Optional:true}
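Once the instructions are stored as []any holding pointers to the concrete structs, downstream code would typically pick them apart with a type switch. Here is a minimal sketch of that; describe is a hypothetical consumer I made up for illustration, and the values are built by hand rather than decoded, so the snippet stands on its own.

```go
package main

import "fmt"

type Array struct{ Name, Type string }

type Break struct{}

type Field struct {
	Name, Type string
	Optional   bool
}

// describe is a hypothetical consumer that handles each instruction
// according to its concrete type.
func describe(instr any) string {
	switch v := instr.(type) {
	case *Array:
		return fmt.Sprintf("array %s of %s", v.Name, v.Type)
	case *Break:
		return "break"
	case *Field:
		if v.Optional {
			return fmt.Sprintf("optional field %s (%s)", v.Name, v.Type)
		}
		return fmt.Sprintf("field %s (%s)", v.Name, v.Type)
	default:
		// Anything that is not one of the known instruction types.
		return fmt.Sprintf("unknown instruction %T", v)
	}
}

func main() {
	instructions := []any{
		&Array{Name: "positions", Type: "NpcUpdatePosition"},
		&Break{},
		&Field{Name: "hp", Type: "short", Optional: true},
	}
	for _, it := range instructions {
		fmt.Println(describe(it))
	}
}
```

If you would rather have the compiler help you, a small interface implemented by Array, Break, and Field could replace any, at the cost of one extra method per type.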

So, I guess that means that UnmarshalXML must carry the balance of logic and work:

func (p *Packet) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	// Copy the attributes of the <packet> element itself.
	for _, attr := range start.Attr {
		switch attr.Name.Local {
		case "family":
			p.Family = attr.Value
		case "action":
			p.Action = attr.Value
		}
	}

	for {
		t, err := d.Token()
		if err != nil {
			// Propagate decoding errors instead of ignoring them.
			return err
		}

		// Stop once the end tag matching the start element has been consumed.
		if ee, ok := t.(xml.EndElement); ok {
			if ee.Name.Local == start.Name.Local {
				break
			}
		}

		se, ok := t.(xml.StartElement)
		if !ok {
			continue // character data, comments, and so on
		}

		if se.Name.Local == "comment" {
			var s string
			if err := d.DecodeElement(&s, &se); err != nil {
				return err
			}
			p.Comment = s
			continue
		}

		// Pick the concrete type for this instruction by element name.
		var dst any
		switch se.Name.Local {
		default:
			continue
		case "array":
			dst = new(Array)
		case "break":
			dst = new(Break)
		case "field":
			dst = new(Field)
		}
		if err := d.DecodeElement(dst, &se); err != nil {
			return err
		}

		p.Instructions = append(p.Instructions, dst)
	}

	return nil
}

I still don't understand the implementation notes in the documentation for the xml.Unmarshaler type:

> UnmarshalXML decodes a single XML element beginning with the given start element. If it returns an error, the outer call to Unmarshal stops and returns that error. UnmarshalXML must consume exactly one XML element. One common implementation strategy is to unmarshal into a separate value with a layout matching the expected XML using d.DecodeElement, and then to copy the data from that value into the receiver. Another common strategy is to use d.Token to process the XML object one token at a time. UnmarshalXML may not use d.RawToken.

One thing I learned through trial and error was the meaning of "UnmarshalXML must consume exactly one XML element". To satisfy that constraint, I added a check to see whether the decoder has encountered an end element whose name matches the starting element:

if ee, ok := t.(xml.EndElement); ok {
	if ee.Name.Local == start.Name.Local {
		break
	}
}

However, I now realize this wouldn't work for a nested element that has the same name as the starting element. A simple depth counter/tracker should clear that up.
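One way to avoid tracking depth by hand, sketched below under my own assumptions, is to make sure every child element is consumed whole: known children through d.DecodeElement and unknown ones through d.Skip. Then the first end element the loop sees can only be the packet's own. The parse function is a hypothetical wrapper, and the struct tags on Packet are dropped here because the custom method does all the work.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Protocol struct {
	Packets []Packet `xml:"packet"`
}

type Packet struct {
	Family, Action, Comment string
	Instructions            []any
}

type Array struct {
	Name string `xml:"name,attr"`
	Type string `xml:"type,attr"`
}

type Break struct{}

type Field struct {
	Name     string `xml:"name,attr"`
	Type     string `xml:"type,attr"`
	Optional bool   `xml:"optional,attr"`
}

func (p *Packet) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	for _, attr := range start.Attr {
		switch attr.Name.Local {
		case "family":
			p.Family = attr.Value
		case "action":
			p.Action = attr.Value
		}
	}
	for {
		t, err := d.Token()
		if err != nil {
			return err
		}
		switch tok := t.(type) {
		case xml.EndElement:
			// Every child is consumed whole below, so this can only
			// be the end tag of the packet itself.
			return nil
		case xml.StartElement:
			var dst any
			switch tok.Name.Local {
			case "comment":
				dst = &p.Comment
			case "array":
				dst = new(Array)
			case "break":
				dst = new(Break)
			case "field":
				dst = new(Field)
			default:
				// Unknown child: discard it together with anything nested in it.
				if err := d.Skip(); err != nil {
					return err
				}
				continue
			}
			if err := d.DecodeElement(dst, &tok); err != nil {
				return err
			}
			if tok.Name.Local != "comment" {
				p.Instructions = append(p.Instructions, dst)
			}
		}
	}
}

// parse is a hypothetical wrapper around xml.Unmarshal.
func parse(doc string) (Protocol, error) {
	var p Protocol
	err := xml.Unmarshal([]byte(doc), &p)
	return p, err
}

func main() {
	p, err := parse(`<protocol><packet family="Npc" action="Player"><array name="positions" type="NpcUpdatePosition"/><break/></packet></protocol>`)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	for _, it := range p.Packets[0].Instructions {
		fmt.Printf("%T: %+v\n", it, it)
	}
}
```

Whether silently skipping unknown elements is acceptable depends on the spec; returning an error from the default case would be the stricter choice.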

huangapple
  • Published on July 13, 2023 at 05:59:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76674710.html