Extract URL from text file

Question

I have a large text file that contains the text "View this email in your browser" followed by a URL. The format varies, and sometimes part of the URL wraps onto the next line.

Also, when the URL wraps onto the next line, the first line ends with an equals sign. That trailing equals sign needs to be removed, but any other equals signs in the URL must be kept.

A few examples:

```none
View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)

View this email in your browser <https://mail.com/?e=3D14=
60&u=3Df612577510b&id=3D2c8be>

View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)
```

I need to extract that URL using PowerShell, without the surrounding brackets (usually parentheses, but sometimes < >), so that I can download it as an HTML file.

    if ($str -match '(?<=\()https?://[^)]+') {
        # ... remove any line breaks from it, and output the result.
        $Matches.0 -replace '\r?\n'
    }

    if ($str -match '(?<=\<)https?://[^>]+') {
        # ... remove any line breaks from it, and output the result.
        $Matches.0 -replace '\r?\n'
    }

Answer 1

Score: 0

This solution works for the examples you provided:

    $text = @(
        'View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)',
        'View this email in your browser <https://mail.com/?e=3D14=
    60&u=3Df612577510b&id=3D2c8be>',
        'View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)'
    )

    $text = $text | ForEach-Object {
        $PSItem.Replace('<','(').Replace('>',')').Replace("=`n",'').Split('(')[1].Replace(')','')
    }

The output looks like this:

https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be
https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be

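I simply use string Replace() calls instead of a regex.
The part you struggled with, the URL split across two lines, is handled by:

    .Replace("=`n", '')

Note that this matches only LF line endings. If the text uses CRLF line endings, "=`r`n" has to be replaced as well.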
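A variant of the same string-based idea, as a minimal sketch only: it assumes each input string contains exactly one bracketed URL, and it uses -replace with '=\r?\n' so that both LF and CRLF line endings are covered. The function name Get-BracketedUrl is hypothetical and not part of the original answer.

```powershell
# Hypothetical helper; assumes one URL per input string, wrapped in ( ) or < >.
function Get-BracketedUrl {
    param([string] $Text)

    # Remove the soft line break: '=' directly followed by CRLF or LF.
    $joined = $Text -replace '=\r?\n', ''

    # Locate the first opening bracket and the next closing bracket after it.
    $openChars  = [char[]] '(<'
    $closeChars = [char[]] ')>'
    $start = $joined.IndexOfAny($openChars) + 1
    $end   = $joined.IndexOfAny($closeChars, $start)

    # Return the text between the brackets.
    $joined.Substring($start, $end - $start)
}

$result = Get-BracketedUrl "View this email in your browser <https://mail.com/?e=3D14=`r`n60&u=3Df612577510b&id=3D2c8be>"
$result   # https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
```

The sketch does no error handling: if a string has no closing bracket, Substring() throws.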

Answer 2

Score: 0

  • Since you're trying to match across lines, you need to make sure that your text file is read as a whole, i.e. as a single, multiline string, which you can do with the -Raw switch of the Get-Content cmdlet.

  • Apart from that, the only thing missing from your regex was to also match and remove a preceding = before newlines.
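To illustrate the first point, here is a small sketch showing why -Raw matters. It writes a throwaway temporary file (the file and its contents are made up for this demo) and compares the two ways of reading it:

```powershell
# Create a temporary sample file whose first line ends in '='.
$path = [System.IO.Path]::GetTempFileName()
Set-Content -Path $path -Value "first=`nsecond" -NoNewline

$lines = Get-Content -Path $path        # default: one string per line
$raw   = Get-Content -Path $path -Raw   # -Raw: one multiline string

$lineCount  = @($lines).Count                                       # 2
$rawMatches = $raw -match '=\n'                                     # True
$lineHits   = @($lines | Where-Object { $_ -match '=\n' }).Count    # 0

Remove-Item -Path $path
```

Without -Raw, no single line contains a newline, so a pattern that spans the line break never matches.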

The following extracts all URLs from input file file.txt, and outputs them - with the newline and line-ending = removed - as an array of strings:

    # Note the '=' before '\r?\n'
    [regex]::Matches(
      (Get-Content -Raw file.txt),
      '(?<=[<(])https://[^>)]+'
    ).Value -replace '=\r?\n'

  • Direct use of the [regex]::Matches() .NET API allows you to extract all matches at once, whereas PowerShell's -match operator only ever looks for one match.

  • -replace is then used to remove newlines (\r?\n) from the matches, along with a preceding =.
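The difference between -match and [regex]::Matches() can be seen in a short sketch. The sample string and its example hosts are invented for illustration; the pattern is the one from this answer:

```powershell
$sample  = 'a (https://one.example/x) b <https://two.example/y>'
$pattern = '(?<=[<(])https://[^>)]+'

# -match stops at the first match and stores it in the automatic $Matches variable.
$first = $null
if ($sample -match $pattern) { $first = $Matches[0] }

# [regex]::Matches() returns every match; .Value collects the matched strings.
$all = [regex]::Matches($sample, $pattern).Value

$first       # https://one.example/x
$all.Count   # 2
```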

For an explanation of the URL-matching regex and the ability to experiment with it, see this regex101.com page.


Example with a multiline string literal:

    [regex]::Matches('
    View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)

    View this email in your browser <https://mail.com/?e=3D14=
    60&u=3Df612577510b&id=3D2c8be>

    View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)
      ',
      '(?<=[<(])https://[^>)]+'
    ).Value -replace '=\r?\n'

Output:

https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be
https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be
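The extracted URLs still contain =3D sequences. Together with the trailing = at line breaks, this looks like quoted-printable encoding, in which =3D stands for a literal =. The question never names the encoding, so treat that as an assumption. If it holds, the URLs probably need decoding before they can be downloaded. A minimal sketch follows; the function name ConvertFrom-QuotedPrintableUrl and the output file name page.html are hypothetical, and only single-byte (ASCII) escapes are handled:

```powershell
# Hypothetical helper; assumes the text is quoted-printable encoded.
function ConvertFrom-QuotedPrintableUrl {
    param([string] $Url)

    # Replace each '=XX' hex escape with the character it encodes.
    [regex]::Replace($Url, '=([0-9A-Fa-f]{2})', {
        param($m)
        [string][char][Convert]::ToInt32($m.Groups[1].Value, 16)
    })
}

$url = ConvertFrom-QuotedPrintableUrl 'https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be'
$url   # https://mail.com/?e=1460&u=f612577510b&id=2c8be

# Download step, left commented out; 'page.html' is a placeholder file name.
# Invoke-WebRequest -Uri $url -OutFile 'page.html'
```

If the input is not quoted-printable, this step would corrupt any URL in which a literal = happens to be followed by two hex digits, so it should only be applied when the encoding is known.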

huangapple
  • Posted on May 11, 2023 at 18:48:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76226800.html