Regex to remove line breaks from lines starting with a specific character in PowerShell

Question

I have a huge CSV file, and some of its lines are incorrect and contain line breaks. When the file is imported into Excel, I have to correct hundreds of lines manually. I have a regex that works in Notepad++ and removes the line break before any line that does not start with a specific string, in this case ";". However, the same regex does not work in a PowerShell script.

Example input:

;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

How it should look:

;BP;7165378;XX_RAW;200SSS952;EU-PL;PL02;PL02;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;15:00:00;;;;Jhon Name;;;;;;;;9444253;;;;;;;;;;;;;"Jhon Name";;;;;;;;;;Jhon Name;;;;;;;;Final Check Approved;;;;;;;;;09.01.2023;;;;;Approve;;;;;;12077;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Code:

$content = Get-Content -path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv" 
$content -Replace '"\R(?!;)"', ' ' |  Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv" 

Answer 1

Score: 2

It has to do with the line continuation \ in your PowerShell script.

I would also suggest adding -Raw if you want to get the content of the file as a single string, rather than as an array of strings, which makes replacing easier.

I'm assuming it's a .csv file you are using.

$content = Get-Content -Path "C:\Users\TUF17\Desktop\File\Fix\xx_fix_temp.csv" -Raw
$content -Replace '(?m)(^[^;].*)\r?\n(?!;)', '$1 ' |  Out-File "C:\Users\TUF17\Desktop\File\Fix\xx_noenters.csv"
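
The difference that -Raw makes can be checked with a small sketch like the one below. The temporary file name and the three-line sample data are assumptions made up for illustration; they are not from the question.

```powershell
# Hypothetical sample file: the second line is a broken continuation of the first.
$path = Join-Path ([IO.Path]::GetTempPath()) 'raw_demo.csv'
[IO.File]::WriteAllText($path, ";A;1;`r`n15:00;x`r`n;B;2;`r`n")

# Without -Raw: an array of lines, each with its line terminator already stripped.
$lines = Get-Content -LiteralPath $path
$lineCount = $lines.Count                       # 3
$lineHasNewline = $lines[0] -match '\r|\n'      # False: nothing for a newline regex to match

# With -Raw: one string that still contains the CRLF sequences.
$text = Get-Content -LiteralPath $path -Raw
$rawHasNewline = $text -match '\r?\n(?!;)'      # True: the newline before "15:00" qualifies

Remove-Item -LiteralPath $path
```

Because each array element no longer contains a newline, a newline-based pattern applied to the output of Get-Content without -Raw has nothing to replace.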

Answer 2

Score: 2

Building on the helpful comments on the question:

  • In order to perform replacements across lines of a text file, you need to either read the file in full - with Get-Content -Raw - or perform stateful line-by-line processing, such as with the -File parameter of a switch statement.

    • Note: While you could also do stateful line-by-line processing by combining Get-Content (without -Raw) with a ForEach-Object call, such a solution would be much slower - see this answer.
  • Your regex, '"\R(?!;)"', has two problems:

    • It accidentally uses embedded " quoting. Use only '...' quoting. PowerShell has no special syntax for regex literals - it simply uses strings.
      To avoid confusion with PowerShell's own up-front string interpolation, it is better to use verbatim '...' strings rather than expandable (interpolating) "..." strings - see the conceptual about_Quoting_Rules help topic.

    • \R is not a supported escape sequence in the .NET regex engine that PowerShell uses; you presumably meant \r, i.e. a CR character (CARRIAGE RETURN, U+000D).

      • If you instead want to match CRLF, a Windows-format newline sequence, use \r\n

      • If you want to match LF (LINE FEED, U+000A) alone (a Unix-format newline), use \n

      • If you want to match both newline formats, use \r?\n

      • As an aside: While use of CR alone is rare in practice, PowerShell treats stand-alone CR characters as newlines as well. That is why Get-Content without -Raw, which reads line by line (as you've tried), wouldn't work: the line terminators are stripped before -replace ever sees the text.
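
The effect of the different newline patterns can be compared on an in-memory string. This is only a sketch: the sample string and variable names are made up for illustration, and it assumes CRLF line endings as found in a typical Windows file.

```powershell
# Hypothetical sample with CRLF newlines; the second line is a broken continuation.
$text = ";A;1;`r`n15:00;x`r`n;B;2;"

# '\r?\n' matches CRLF and LF alike, so the broken line is joined to the previous one.
$joined = $text -replace '\r?\n(?!;)'
# $joined is ";A;1;15:00;x`r`n;B;2;"

# '\r' alone only strips the CR characters: every CR is followed by LF, never by ';',
# so all of them are removed and the lines stay separate (now LF-terminated).
$crOnly = $text -replace '\r(?!;)'
# $crOnly is ";A;1;`n15:00;x`n;B;2;"

# The same pattern also works on an LF-only (Unix-format) string.
$lfJoined = ";A;1;`n15:00;x`n;B;2;" -replace '\r?\n(?!;)'
# $lfJoined is ";A;1;15:00;x`n;B;2;"
```

For a file of unknown origin, '\r?\n' is the safer choice, since it behaves the same for both newline formats.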


Get-Content -Raw solution (easier and faster than switch -File, but requires the whole file to fit into memory twice):

# Adjust the '\r' part as needed (see above).
(Get-Content -Raw -LiteralPath $inFile) -replace '\r(?!;)' |
  Set-Content -NoNewLine -Encoding utf8 -LiteralPath $outFile

Note:

  • By not specifying a substitution operand to -replace, the command removes all newlines not followed by a ; ((?!;)), effectively joining the line that follows the CR directly to the previous line, which is the desired behavior based on your sample output.

  • For saving text, Set-Content is a bit faster than Out-File (it'll make no appreciable difference here, given that only a single, large string is written).

    • -NoNewLine prevents a(n additional) trailing newline from getting appended to the file.
    • -Encoding utf8 specifies the output character encoding. Note that PowerShell never preserves the input character encoding, so unless you use -Encoding on output, you'll get the respective cmdlet's default character encoding, which in Windows PowerShell varies from cmdlet to cmdlet; in PowerShell (Core) 7+, the consistent default is now BOM-less UTF-8. Note that in Windows PowerShell -Encoding utf8 always creates a file with a BOM; see this answer for background information and workarounds.
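
For files too large to hold in memory twice, the stateful switch -File approach mentioned at the top of this answer could look roughly like the sketch below. It is not code from the original answer: the temporary file names, the sample data, and the $buffer and $joined variables are assumptions for illustration, and the rule "a line that does not start with ; continues the previous line" is taken from the question.

```powershell
# Hypothetical input and output files in the temp directory.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'switch_demo_in.csv'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'switch_demo_out.csv'
Set-Content -LiteralPath $inFile -Value ';A;1;', '15:00;x', ';B;2;'

$buffer = $null   # the logical line being assembled
$joined = @(
    switch -File $inFile {
        default {
            if ($null -eq $buffer)      { $buffer = $_ }            # very first line
            elseif ($_.StartsWith(';')) { $buffer; $buffer = $_ }   # new record: emit the previous one
            else                        { $buffer += $_ }           # continuation: append to current record
        }
    }
    if ($null -ne $buffer) { $buffer }   # emit the last record
)

$joined | Set-Content -Encoding utf8 -LiteralPath $outFile
Remove-Item -LiteralPath $inFile, $outFile
```

The sketch collects the result in $joined only to keep it easy to inspect; for a truly large file, the switch statement can be wrapped in & { ... } and piped straight to Set-Content so that only one record is held at a time.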

huangapple
  • Posted on February 18, 2023, 21:15:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75493578.html