How to determine an XML file's encoding and convert non-standard characters to a readable (UTF-8?) format?

Question

I am rather new to XML file encoding and am trying to wrap my head around a few things.

My company receives two data feeds (both in XML format) from two separate external vendors. The data from these XML files is run through our internal ETL software and then pushed to some of our internal software.

The issues I am having appear to be encoding-related and the presence of non-standard characters.

XML FILE 1

  • The vendor's software is ancient, and so the process they use to generate the XML files likely is too.
  • The XML file has no XML declaration tag at the top.
  • When I open the XML file in Sublime the console shows: "unable to auto detect encoding for C:\file1.xml, using fallback encoding Western (Windows 1252)"
  • Opening the XML file in a browser results in: "error on line 500 at column 30: Encoding error"
  • Our ETL software jobs fail when attempting to process any nodes with non-standard characters.
  • Below is a screenshot of the expected character, plus what I see when opening the XML file in various text editors:

[Screenshot: xml_file_1]

XML FILE 2

  • The vendor's software is ancient, and so the process they use to generate the XML files likely is too.
  • The XML file has this declaration tag at the top: <?xml version="1.0" standalone="yes"?>
  • When I open the XML file in Sublime the console shows no error of any kind.
  • Opening the XML file in a browser results in no visible errors.
  • Our ETL software jobs run fine, but insert the improperly encoded character rather than the expected character.
  • Below is a screenshot of the expected character, plus what I see when opening the XML file in various text editors:

[Screenshot: xml_file_2]

QUESTIONS

  1. How can I determine the true encoding used in an XML file?

  2. Our ETL software lets me specify character encoding for source files (in this case the XML files) as well as the eventual targets (our internal software), so I assume once I am able to determine the encoding used on the file I will be able to transform these non-standard characters to their UTF-8 equivalents?

  3. Would I be stupid to ask the vendors what encoding they are using when creating XML files?

Thanks in advance for any guidance!

Answer 1

Score: 1

> How can I determine the true encoding used in an XML file?

In the general case, the answer is that you can't. For example, if all the octets in the binary file are in the ASCII range x20-x7F, then the file might be ASCII, or it might be any of the national variants of ASCII that assign some of the codepoints differently (for example the UK variant has £ where US ASCII has #). In fact national variants are rather a rarity nowadays, especially with XML, but if your XML is created on some ancient mainframe that's configured that way, then it's by no means impossible. However, I use this example primarily just to demonstrate that in the general case, there is no way of inferring the encoding by simply looking at the octets in the file. The same reasoning applies to detecting regional variants of ISO 8859, for example the Cyrillic variant, which most definitely is in current use.
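To make the point above concrete, here is a minimal Python sketch. The three octets and the list of candidate encodings are chosen purely for illustration; they are not taken from the vendor files.

```python
# Three octets, all above the ASCII range (illustrative values only).
raw = bytes([0xE9, 0xF1, 0xE4])

# Each single-byte encoding below accepts these octets without error,
# yet each one maps them to different characters.
candidates = ("latin-1", "iso8859-5", "cp437")
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(f"{name:10} -> {text}")
```

None of the three decodings fails, so nothing in the octets themselves says which one the author intended.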

Fortunately it's not quite that bleak in practice; there are tools that are able to make a fairly reasonable guess, or at least to find a candidate decoding that doesn't crash, even if it decodes some characters incorrectly.
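A rough sketch of how such a guess could work, using only the Python standard library: try a list of candidate encodings in order and keep the first one that decodes without error. The function name `guess_decode` and the candidate order are assumptions made for this example, not a description of any particular tool.

```python
def guess_decode(data: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid input.
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


# Valid UTF-8 is accepted by the first candidate.
print(guess_decode("café".encode("utf-8")))
# A lone 0xE9 is not valid UTF-8, so the guess falls through to cp1252.
print(guess_decode(b"caf\xe9"))
```

Putting UTF-8 first is deliberate: text in a legacy single-byte encoding rarely happens to be valid UTF-8, so a clean UTF-8 decode is a fairly strong signal. The last candidate, latin-1, accepts every possible octet, which is exactly the "doesn't crash, but may be wrong" case.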

In your situation, however, the best clue to inferring an encoding might be knowing who sent you the file. There's a good chance that all the files from a particular origin use the same encoding. Standard tools won't know that, but you probably do.
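Once the vendor's encoding is known, converting a file to UTF-8 is mechanical. The sketch below assumes the answer turns out to be Windows-1252 (`cp1252`), which is only a guess based on the Sublime fallback message in the question; the function name and file paths are hypothetical.

```python
from pathlib import Path


def transcode_to_utf8(src: str, dst: str, source_encoding: str) -> None:
    """Rewrite src as UTF-8 in dst, decoding with the vendor's known encoding."""
    # Work on raw bytes so that line endings are left untouched.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


# Hypothetical usage; "cp1252" is an assumption until the vendor confirms it:
# transcode_to_utf8("file1.xml", "file1.utf8.xml", "cp1252")
```

If the source file carries an XML declaration with an `encoding` attribute, that attribute would also need updating after conversion. Neither file described in the question declares one, so a UTF-8 copy would match the default a parser assumes.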

> Would I be stupid to ask the vendors what encoding they are using when creating XML files?

Absolutely not. It's a perfectly reasonable expectation that if they're prepared to send you data, they should be prepared to tell you how to read it. Whether the vendors actually have the brain-power to understand and answer the question is another matter. My only experience of this is a vendor who had no understanding at all of XML or encodings, and was grateful for my advice and used it to fix their software.

Answer 2

Score: 0

There are a variety of libraries, both open source and not, which detect encodings. Posting recommendations here is not encouraged, so the best I can do is recommend that you search for 'encoding detection', not mentioning XML. You might find something adequate built into Java or Windows.
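Before reaching for a detection library, it may help to check what the file itself claims. The sketch below looks for a byte order mark or an `encoding` attribute in the XML declaration, and otherwise falls back to UTF-8, which is the default XML 1.0 prescribes when neither is present. The function name `declared_encoding` is made up for this example, and the check is deliberately simplified: it ignores UTF-32 and assumes an ASCII-compatible declaration.

```python
import codecs
import re

# Byte order marks to check (UTF-32 is ignored in this simplified sketch).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Matches an encoding attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(head: bytes) -> str:
    """Return the encoding a file claims, given its first few hundred bytes."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"  # XML 1.0 default when nothing is declared


print(declared_encoding(b'<?xml version="1.0" standalone="yes"?><root/>'))
```

For both files described in the question this would report `utf-8`, since neither has a byte order mark or an `encoding` attribute. That is what a conforming parser will assume, not necessarily what the vendor actually wrote.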

huangapple
  • Posted on February 24, 2023, 08:39:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75551649.html