HXT: Handling byte order mark from HTTP response body

Question

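When parsing the XML response body of an HTTP call with HXT, the body may begin with an XML byte order mark (BOM), which causes an error message. The question is how to parse the XML when a BOM may be present without that error message being printed.

Two approaches are worth trying:

  1. Strip the BOM: before parsing, check whether the response body starts with a BOM and remove it if so. This amounts to comparing the first few bytes against the BOM; the cleaned body is then passed to the XML parser.

  2. Configure the XML parser: some XML parsers can be configured to handle a BOM. Check the HXT documentation for a comparable option that lets the XML be parsed without an error message being printed.

Either approach can avoid printing a BOM-related error message while parsing.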
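
The first approach can be sketched at the byte level, before the body is decoded to a String. This is a minimal sketch that assumes the body is UTF-8; the names `utf8Bom` and `stripUtf8Bom` are illustrative and not part of HXT or http-conduit.

```haskell
import qualified Data.ByteString as B
import Data.Maybe (fromMaybe)

-- | The three bytes that encode U+FEFF (the byte order mark) in UTF-8.
utf8Bom :: B.ByteString
utf8Bom = B.pack [0xEF, 0xBB, 0xBF]

-- | Drop a leading UTF-8 BOM if present; otherwise return the input unchanged.
-- (Hypothetical helper, assuming the response body is UTF-8 encoded.)
stripUtf8Bom :: B.ByteString -> B.ByteString
stripUtf8Bom bytes = fromMaybe bytes (B.stripPrefix utf8Bom bytes)
```

In a pipeline like the one in the question, such a helper would be applied to the strict ByteString (the result of `toStrict $ getResponseBody response`) before converting it with `Data.ByteString.UTF8.toString`.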

Using HXT I'm parsing the XML response body of an HTTP call that was made using http-conduit.

val <- runX $ readString [withValidate no] (Data.ByteString.UTF8.toString . toStrict $ getResponseBody response) >>> getChildren >>> ...

Depending on the version of the API, I found that the response body includes a byte order mark before the XML:

error: ""\65279<?xml version=\"1.0\" encoding=\"utf-8\"?><Enume..."" (line 1, column 1):
unexpected "\65279"
expecting xml declaration, comment, processing instruction, "<!DOCTYPE" or "<"

Since the BOM may or may not be there, I did the following:

...
let resBody = Data.ByteString.UTF8.toString . toStrict $ getResponseBody response
    parseBody body = runX $ readString [withValidate no] body >>> getChildren >>> ...
xs <- parseBody resBody
val <- case xs of
  x : _ -> pure x
  _ -> head <$> (parseBody $ drop 1 resBody)
...

It works, but it's printing the error message when the BOM is present. What are the options for parsing the XML with a possible BOM so that it's not printing error messages?

Answer 1

Score: 0

Okay, given that you're willing to assume the encoding is UTF-8 as you do here, then probably the simplest is to just pattern match to discard a BOM:

case toString ... of
    '\65279':s -> s
    s -> s
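
As a self-contained illustration of that pattern match, the following sketch wraps it in a function. The name `stripBom` is made up for this example; the only fact relied on is that the BOM is the code point U+FEFF, which is 65279 in decimal.

```haskell
-- | Drop a leading U+FEFF (decimal 65279) from an already-decoded String.
-- Only the first character is inspected; a String without a BOM is
-- returned unchanged.
stripBom :: String -> String
stripBom ('\xFEFF' : rest) = rest
stripBom s = s
```

With the names from the question, calling `parseBody (stripBom resBody)` once could replace the parse-then-retry logic, so the parser never sees the BOM and has nothing to report.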

As an aside, having just looked through the XML spec to see how encodings are supposed to be handled, let me just say: eugh, gross. There appears to be no encoding-agnostic way to specify what encoding to use, so the only really correct, robust thing to do is try a bunch during parsing and hope one succeeds.
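
One cheap first step before "trying a bunch" is to look at the leading bytes for a BOM, since a BOM identifies the UTF-8 and UTF-16 variants directly. The sketch below is an assumption-laden illustration, not something HXT provides: the type `BomEncoding` and the function `sniffBom` are hypothetical names, and only three BOMs are recognised.

```haskell
import qualified Data.ByteString as B

-- | Encodings this sketch can recognise from a leading BOM.
data BomEncoding = Utf8 | Utf16BE | Utf16LE
  deriving (Eq, Show)

-- | Inspect the first bytes and, if they form a known BOM, return the
-- encoding together with the remaining bytes. 'Nothing' means no BOM was
-- found, so the caller has to fall back to another strategy.
-- Limitation: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and
-- would be reported as Utf16LE here.
sniffBom :: B.ByteString -> Maybe (BomEncoding, B.ByteString)
sniffBom bytes
  | Just rest <- B.stripPrefix (B.pack [0xEF, 0xBB, 0xBF]) bytes = Just (Utf8, rest)
  | Just rest <- B.stripPrefix (B.pack [0xFE, 0xFF]) bytes = Just (Utf16BE, rest)
  | Just rest <- B.stripPrefix (B.pack [0xFF, 0xFE]) bytes = Just (Utf16LE, rest)
  | otherwise = Nothing
```

When no BOM is found, the input still has to be decoded some other way, for example by assuming UTF-8 as the question does.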

Posted by huangapple on May 25, 2023 at 04:46:20
Link: https://go.coder-hub.com/76327292.html