CSVHelper - Trouble with special Characters
Question
For our current project, I am using the CSVHelper NuGet package and everything works perfectly with it, with the only exception that fields containing special characters (ä, ü, ...) come out wrong. How can I change it so that it works and does not show ? as a replacement letter? (I tried the Current and Invariant cultures, but it didn't matter.)
I tried changing the Culture when reading the byte stream from the file and I tried using different Cultures when parsing the CSV.
Answer 1
Score: 1
I often have this issue when someone saves an Excel file as `CSV (Comma delimited) (*.csv)` rather than as `CSV UTF-8 (Comma delimited) (*.csv)`. Depending on the country it is saved in, this often means it was saved with the Windows-1252 encoding. In most cases, you can get away with using the `ISO-8859-1` encoding, also known as `Latin-1`, when reading the file with `StreamReader`. If some characters are still not read correctly, you may have to use the exact encoding that was used to save the file.
> ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also called CP1252) except for the code points 128-159 (0x80-0x9F). ISO-8859-1 assigns several control codes in this range. Windows-1252 has several characters, punctuation, arithmetic and business symbols assigned to these code points. https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
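The difference in that 0x80-0x9F range is easy to see directly. A small sketch (assuming .NET 5+ for `Encoding.Latin1`; Windows-1252 comes from `CodePagesEncodingProvider`, which older runtimes get from the `System.Text.Encoding.CodePages` package):

```csharp
using System;
using System.Text;

class Cp1252VsLatin1
{
    static void Main()
    {
        // Windows-1252 is not available on .NET Core until the code-page
        // provider is registered (System.Text.Encoding.CodePages).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = { 0x80 }; // Euro sign in Windows-1252, a C1 control code in ISO-8859-1

        string latin1 = Encoding.Latin1.GetString(raw);
        string cp1252 = Encoding.GetEncoding(1252).GetString(raw);

        Console.WriteLine((int)latin1[0]); // 128 (invisible control character)
        Console.WriteLine(cp1252);         // €
    }
}
```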
In .NET Core it looks like you are a bit limited as to the number of encodings available to you.
> The example produces the following output when run on .NET Core:
| Info.CodePage | Info.Name | Info.DisplayName |
| --- | --- | --- |
| 1200 | utf-16 | Unicode |
| 1201 | utf-16BE | Unicode (Big-Endian) |
| 12000 | utf-32 | Unicode (UTF-32) |
| 12001 | utf-32BE | Unicode (UTF-32 Big-Endian) |
| 20127 | us-ascii | US-ASCII |
| 28591 | iso-8859-1 | Western European (ISO) |
| 65000 | utf-7 | Unicode (UTF-7) |
| 65001 | utf-8 | Unicode (UTF-8) |
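The table above can be reproduced by enumerating `Encoding.GetEncodings()` (the class and variable names here are my own):

```csharp
using System;
using System.Text;

class ListEncodings
{
    static void Main()
    {
        // Without a registered CodePagesEncodingProvider, .NET Core only
        // knows the handful of Unicode/ASCII/Latin-1 encodings listed above.
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            Console.WriteLine($"{info.CodePage} | {info.Name} | {info.DisplayName}");
        }
    }
}
```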
```csharp
using System.Globalization;
using System.IO;
using System.Text;
using CsvHelper;

void Main()
{
    // Latin-1 decodes the single-byte umlauts that would otherwise
    // come through as '?' replacement characters.
    using var reader = new StreamReader(@"C:\Users\myName\Documents\TestUmlauts.csv",
        Encoding.Latin1);
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    var records = csv.GetRecords<Foo>();
}

public class Foo
{
    public int Id { get; set; }
    public string Name { get; set; }
}
```
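If you cannot tell up front whether a given file is UTF-8 or a single-byte encoding, one heuristic (my own sketch, not part of CsvHelper) is to try strict UTF-8 first and fall back to Latin-1, which maps every byte and therefore never throws:

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Decode with strict UTF-8 first; invalid byte sequences (such as a
    // lone 0xFC for 'ü' in Latin-1 data) make it throw, and we fall back.
    public static string Decode(byte[] raw)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            // Encoding.Latin1 is .NET 5+; use Encoding.GetEncoding(28591) otherwise.
            return Encoding.Latin1.GetString(raw);
        }
    }
}

class Demo
{
    static void Main()
    {
        byte[] latin1Bytes = Encoding.Latin1.GetBytes("Müller;42");
        Console.WriteLine(EncodingGuess.Decode(latin1Bytes)); // Müller;42
    }
}
```

The decoded string can then be handed to `CsvReader` through a `StringReader` instead of a `StreamReader`.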