CSVHelper - Trouble with special characters
Question

For our current project I am using the CSVHelper NuGet package, and everything works perfectly except when a field contains special characters (ä, ü, ...). How can I make it read them correctly instead of showing ? in place of the letter? (I tried the Current and Invariant cultures, but it made no difference.)

I tried changing the culture when reading the byte stream from the file, and I tried using different cultures when parsing the CSV.

Answer 1

Score: 1

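I often run into this when someone saves an Excel file as CSV (Comma delimited)(*.csv) rather than as CSV UTF-8 (Comma delimited)(*.csv). Depending on the country it is saved in, this usually means the file was written in Windows-1252 encoding. The culture passed to CsvReader only affects things like number and date parsing, not how bytes are decoded into characters, which is why changing it made no difference. In most cases you can get away with ISO-8859-1, also known as Latin-1, when reading the file with StreamReader. If some characters are still not read correctly, you may have to use the exact encoding that was used to save the file.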
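
To make the cause concrete, here is a minimal sketch of why the ? shows up. The byte array is a made-up stand-in for a file saved by Excel as plain CSV (it is not taken from the question), and `Encoding.Latin1` assumes .NET 5 or later.

```csharp
using System;
using System.IO;
using System.Text;

class WhyQuestionMarks
{
    static void Main()
    {
        // Hypothetical file content: "Müller" in a single-byte Western encoding,
        // where 'ü' is stored as the single byte 0xFC.
        byte[] fileBytes = { 0x4D, 0xFC, 0x6C, 0x6C, 0x65, 0x72 };

        // StreamReader assumes UTF-8 unless told otherwise. 0xFC is not valid UTF-8,
        // so the decoder substitutes U+FFFD, which many consoles render as '?'.
        using var utf8Reader = new StreamReader(new MemoryStream(fileBytes));
        string wrong = utf8Reader.ReadToEnd();

        // Passing the encoding explicitly maps 0xFC back to 'ü'.
        using var latin1Reader = new StreamReader(new MemoryStream(fileBytes), Encoding.Latin1);
        string right = latin1Reader.ReadToEnd();

        Console.WriteLine(wrong.Contains('\uFFFD')); // True
        Console.WriteLine(right == "M\u00FCller");   // True
    }
}
```

The damage happens while the bytes are decoded, before CsvHelper sees any text, so the fix belongs on the StreamReader rather than on the CsvReader.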

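> ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also called CP1252) except for the code points 128-159 (0x80-0x9F). ISO-8859-1 assigns several control codes in this range. Windows-1252 has several characters, punctuation, arithmetic and business symbols assigned to these code points. https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html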

In .NET Core, the set of encodings available out of the box appears to be fairly limited.

> The example produces the following output when run on .NET Core:

Info.CodePage  Info.Name   Info.DisplayName
1200           utf-16      Unicode
1201           utf-16BE    Unicode (Big-Endian)
12000          utf-32      Unicode (UTF-32)
12001          utf-32BE    Unicode (UTF-32 Big-Endian)
20127          us-ascii    US-ASCII
28591          iso-8859-1  Western European (ISO)
65000          utf-7       Unicode (UTF-7)
65001          utf-8       Unicode (UTF-8)
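
Reading the file as Latin-1 with CsvHelper:

// Requires the CsvHelper NuGet package. Encoding.Latin1 needs .NET 5 or later.
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;
using CsvHelper;

// LINQPad-style entry point.
void Main()
{
	// Decode the file as Latin-1 instead of StreamReader's default UTF-8.
	using var reader = new StreamReader(@"C:\Users\myName\Documents\TestUmlauts.csv",
		Encoding.Latin1);
	using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

	// GetRecords is lazy; ToList() reads every row before the reader is disposed.
	var records = csv.GetRecords<Foo>().ToList();
}

public class Foo
{
	public int Id { get; set; }
	public string Name { get; set; }
}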
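

If Latin-1 is not enough (for example, the file contains a euro sign or curly quotes from the 0x80-0x9F range), the exact Windows-1252 encoding can be made available on .NET Core by registering the code pages provider. The following is only a sketch: the byte array is an invented sample, and it assumes .NET Core 3.0 or later, where, as far as I know, `CodePagesEncodingProvider` ships with the framework (older targets need the System.Text.Encoding.CodePages NuGet package).

```csharp
using System;
using System.IO;
using System.Text;

class Windows1252Reader
{
    static void Main()
    {
        // Makes code-page encodings such as Windows-1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Invented sample "5€;ä": 0x80 is the euro sign in Windows-1252
        // but a control code in ISO-8859-1; 0xE4 is 'ä' in both.
        byte[] fileBytes = { 0x35, 0x80, 0x3B, 0xE4 };

        using var reader = new StreamReader(new MemoryStream(fileBytes), windows1252);
        string text = reader.ReadToEnd();

        using var latin1Reader = new StreamReader(new MemoryStream(fileBytes), Encoding.Latin1);
        string latin1Text = latin1Reader.ReadToEnd();

        Console.WriteLine(text == "5\u20AC;\u00E4");       // True
        Console.WriteLine(latin1Text == "5\u0080;\u00E4"); // True
    }
}
```

For a real file, the same `windows1252` encoding would be passed to the StreamReader that is handed to CsvReader, in place of Encoding.Latin1.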

Posted on May 14, 2023 at 18:38:17
Source: https://go.coder-hub.com/76247011.html