Handle escape sequences in C#

Question

I have a C# endpoint that takes rawText as string input.
The input is sent after converting a file to a string using the third-party Aspose library. It arrives in the following format, e.g.:

{rawText = "\u0007\u0007\r\r\r\r\r\u0007Random Name\rRandom Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com"}

I know strings are UTF16 encoded in C#, so when it reaches the endpoint it is converted to -

requestobj.RawText = "\a\a\r\r\r\r\r\aRandom Name\r10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com"

Is my reasoning correct that this is due to C# strings being UTF-16 encoded? And what is the best way to remove the \a\a\r\r\r\r\r\a at the beginning of the string? I am passing this text to another third-party API, which does not return the correct result when this extra text is prepended.

I have tried to use below, but I want a more generic solution for handling all possibilities of \n\r\a etc.

var newText = Regex.Replace(inputValue, @"\\a", "");
inputValue = inputValue.Replace(@"\a", "").Replace(@"\r", "");

Answer 1

Score: 2

Those are escape sequences, not a matter of UTF-16 encoding. Encoding refers to how characters are converted to bytes. Escape sequences are used in source code to enter characters that are hard to type or invisible. Debuggers also use them to display such characters. Nothing got converted in the question's case: the same BELL character (0x07) can be written as either \a or \u0007, and the debugger chose the shorter form.
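
A small sketch of the point above, using only the standard library (the class name EscapeDemo and its members are made up for illustration). It shows that the two escapes denote the same one-character string, while a verbatim literal such as @"\a" is a backslash followed by the letter a. That is likely why the Replace(@"\a", "") attempt in the question finds nothing to replace.

```csharp
using System;

public static class EscapeDemo
{
    // "\a" and "\u0007" are two spellings of the same single character (U+0007, BELL).
    public static bool SameCharacter() => "\a" == "\u0007" && "\a".Length == 1;

    // A verbatim literal does not process escapes: @"\a" is a backslash plus the letter 'a'.
    public static int VerbatimLength() => @"\a".Length;

    // Searching for the two-character text \a leaves a real BELL character untouched.
    public static string VerbatimReplace(string input) => input.Replace(@"\a", "");
}
```

Calling EscapeDemo.VerbatimReplace on a string that starts with a real BELL character returns the string unchanged, because the search text never occurs in it.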

To strip just these three characters (\r, \n and \a) from the start of the string, you can use the regular expression @"^[\r\n\a]+". Writing the pattern as a verbatim string avoids having to double the backslashes: the C# compiler leaves \ alone, and the regex engine interprets \r, \n and \a itself.

var input = "\a\a\r\r\r\r\r\aRandom Name\r10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com";
var pattern = @"^[\r\n\a]+";
var newText = Regex.Replace(input, pattern, "");

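This produces the following (the \r after Random Name is not at the start of the string, so it is kept; it is written here as an escape):

Random Name\r10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com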

To remove characters at any position, remove the start anchor ^.

It's also possible to replace all control characters. There's a specific Unicode category for control characters, matched in a regular expression with \p{Cc}. Cc is the shorthand for the control character category.

var pattern = @"\p{Cc}+";
var newText = Regex.Replace(input, pattern, "");

As the docs explain, this category matches any

> Control code character, with a Unicode value of U+007F or in the range U+0000 through U+001F or U+0080 through U+009F. Signified by the Unicode designation "Cc" (other, control).
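
The two ideas can be combined. The following is a minimal sketch, not part of the original answer, that keeps the ^ anchor and uses \p{Cc} so that only leading control characters are stripped. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LeadingControlRegex
{
    // ^ anchors the match at the start of the string (no Multiline option is set);
    // \p{Cc}+ matches a run of one or more control characters.
    private static readonly Regex LeadingControls =
        new Regex(@"^\p{Cc}+", RegexOptions.Compiled);

    // Returns the input without its leading run of control characters.
    public static string Strip(string input) => LeadingControls.Replace(input, "");
}
```

With this pattern a control character that appears later in the text, such as the \r after Random Name, is left in place.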

Answer 2

Score: 0

As Panagiotis pointed out, the representation of escape codes in a string is purely visual and doesn't change the meaning or the encoding of the string. Yes, C# (and .NET in general) uses UTF-16 to encode strings in memory, but that's neither relevant to your question nor important in most cases.

That aside, your main question seems to be this:

> What is the best way to remove the \a\a\r\r\r\r\r\a at the beginning of the string?

As with most such questions there are a lot of ways to approach this. Regular expressions (as Panagiotis suggested) can certainly do the job, but they can be finicky and are often slower than more direct options. There are times when a regular expression is the best fit for a particular problem, but this isn't necessarily one of those times. I don't get the impression you're looking for the fastest possible solution... but it doesn't hurt to explore options.

So here are a couple of ideas.

If you're looking to remove a small number of known characters from the start of the string then there's a string method for that: TrimStart(). Specifically the version that accepts a set of characters to remove:

string cleanText = inputText.TrimStart('\a', '\r', '\n');
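
As a rough illustration of how TrimStart behaves here, the sketch below keeps the characters to strip in a reusable array. The names TrimStartDemo, Unwanted and Clean are invented for the example.

```csharp
public static class TrimStartDemo
{
    // The set of characters to strip from the start; extend it as needed.
    private static readonly char[] Unwanted = { '\a', '\r', '\n' };

    // Removes leading characters that are in the set and stops at the first one that is not.
    public static string Clean(string input) => input.TrimStart(Unwanted);
}
```

Trimming stops at the first character that is not in the set, so a \r in the middle of the text survives, and a leading character that was not listed (a tab, for instance) prevents any trimming.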

That's fine for a small number of known characters. But if you're looking to remove any control character from the start of the string you can count them and skip that many characters from the string:

// Count control characters at the start of the string:
int count = 0;
for (; count < inputText.Length && Char.IsControl(inputText, count); count++)
{ }

// This monster is safe:
string cleanText =
    count == 0 ? inputText :
    count >= inputText.Length ? string.Empty :
    inputText[count..];

This happens to be one of the fastest methods to do that particular job, but it's not the prettiest. And unless you're doing this frequently you're probably not going to miss a few extra milliseconds each time.
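
If this is needed in more than one place, the counting approach could be wrapped in a helper. This is a sketch under assumptions: the names StringCleaning and TrimLeadingControl are hypothetical, and returning null or empty input unchanged is a choice made here, not something the answer specifies.

```csharp
using System;

public static class StringCleaning
{
    // Hypothetical helper: returns the input without its leading control characters.
    public static string TrimLeadingControl(string input)
    {
        // Assumption: null or empty input is handed back as-is.
        if (string.IsNullOrEmpty(input)) return input;

        // Count the control characters at the start of the string.
        int count = 0;
        while (count < input.Length && char.IsControl(input[count])) count++;

        // Substring(count) yields an empty string when count equals the length.
        return count == 0 ? input : input.Substring(count);
    }
}
```

Returning the original instance when nothing needs trimming avoids allocating a new string in the common case.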

And since performance isn't a critical issue, let me introduce you to one of the slowest options: LINQ.

string cleanText = new string(inputText.SkipWhile(c =&gt; char.IsControl(c)).ToArray());

While the performance of this is frankly terrible, it's quite a bit more readable than the high-performance version. SkipWhile() skips items while the condition is met; the rest of the characters are collected into an array and used to create a new string. It's pretty but slow. Just like my cat.

Posted by huangapple on July 11, 2023, 14:36:53
Source: https://go.coder-hub.com/76659248.html