CSVHelper - Trouble with special Characters
Question
For our current project, I am using the CSVHelper NuGet package, and everything works perfectly with it, with the only exception being when a field contains special characters (ä, ü, ...). How can I change it so that these characters are read correctly instead of showing "?" as a replacement? (I tried CurrentCulture and InvariantCulture, but it made no difference.)
I tried changing the culture when reading the byte stream from the file, and I tried using different cultures when parsing the CSV.
Answer 1
Score: 1
I often have this issue when someone saves an Excel file as CSV (Comma delimited)(*.csv) rather than as CSV UTF-8 (Comma delimited)(*.csv). Depending on the country it is saved in, this often means it was saved with the Windows-1252 encoding. In most cases, you can get away with using ISO-8859-1 encoding, also known as Latin-1 encoding, when reading the file with StreamReader. If you still have some characters that are not getting read correctly, you may have to use the exact encoding that was used to save the file.
> ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also called CP1252) except for the code points 128-159 (0x80-0x9F). ISO-8859-1 assigns several control codes in this range. Windows-1252 has several characters, punctuation, arithmetic and business symbols assigned to these code points. (Source: https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html)
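To see what that difference looks like in practice, here is a minimal sketch, assuming .NET Core with the System.Text.Encoding.CodePages NuGet package installed so that code page 1252 can be requested, that decodes the same byte with both encodings:

```csharp
using System;
using System.Text;

// Register the Windows code-page encodings (such as 1252), which are not
// available by default on .NET Core.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

byte[] data = { 0x80 };

// Windows-1252 maps 0x80 to the euro sign.
Console.WriteLine(Encoding.GetEncoding(1252).GetString(data));                 // €

// ISO-8859-1 maps 0x80 to the invisible control character U+0080.
Console.WriteLine((int)Encoding.GetEncoding("ISO-8859-1").GetString(data)[0]); // 128
```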
In .NET Core, it looks like you are a bit limited as to the number of encodings available to you by default.
> The example produces the following output when run on .NET Core:
| Info.CodePage | Info.Name | Info.DisplayName |
|---|---|---|
| 1200 | utf-16 | Unicode |
| 1201 | utf-16BE | Unicode (Big-Endian) |
| 12000 | utf-32 | Unicode (UTF-32) |
| 12001 | utf-32BE | Unicode (UTF-32 Big-Endian) |
| 20127 | us-ascii | US-ASCII |
| 28591 | iso-8859-1 | Western European (ISO) |
| 65000 | utf-7 | Unicode (UTF-7) |
| 65001 | utf-8 | Unicode (UTF-8) |
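Such a listing can be produced with Encoding.GetEncodings(); a minimal sketch (not necessarily the exact example the quote refers to) could look like this:

```csharp
using System;
using System.Text;

// List the encodings that are registered by default, with their code page,
// name and display name (the columns shown in the table above).
foreach (EncodingInfo info in Encoding.GetEncodings())
{
    Console.WriteLine($"{info.CodePage} | {info.Name} | {info.DisplayName}");
}
```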
```csharp
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;
using CsvHelper;

void Main()
{
    // Read the file as Latin-1 (ISO-8859-1) instead of the default UTF-8.
    using var reader = new StreamReader(@"C:\Users\myName\Documents\TestUmlauts.csv",
        Encoding.Latin1);
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    // GetRecords is lazy, so enumerate it before the reader is disposed.
    var records = csv.GetRecords<Foo>().ToList();
}

public class Foo
{
    public int Id { get; set; }
    public string Name { get; set; }
}
```
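If Latin-1 is still not close enough and you need the exact Windows-1252 encoding on .NET Core, a possible variation (a sketch, assuming the System.Text.Encoding.CodePages NuGet package is installed and reusing the Foo class from the example above) is to register the code-page provider and request code page 1252 explicitly:

```csharp
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;
using CsvHelper;

void Main()
{
    // Make the Windows code-page encodings available on .NET Core,
    // then read the file with the exact Windows-1252 encoding.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    using var reader = new StreamReader(@"C:\Users\myName\Documents\TestUmlauts.csv",
        Encoding.GetEncoding(1252));
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    var records = csv.GetRecords<Foo>().ToList();
}
```

Also note that the Encoding.Latin1 property used above was only added in .NET 5; on earlier .NET Core versions you can use Encoding.GetEncoding("ISO-8859-1") (code page 28591 in the table above) instead.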