May 11, 2023 18:48:58

Extract URL from text file

Question

I have a large text file that contains the text "View this email in your browser" followed by a URL. The format varies, and sometimes part of the URL wraps onto the next line.

When the URL does wrap, the first line ends with an equals sign, which needs to be removed, but any other equals signs in the URL must be kept.

A few examples:

```none
View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)

View this email in your browser <https://mail.com/?e=3D14=
60&u=3Df612577510b&id=3D2c8be>

View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)
```

I need to extract that URL using PowerShell, without the surrounding brackets, which are sometimes parentheses and sometimes < >, so that I can download it as an HTML file.

My current attempt:

    if ($str -match '(?<=\()https?://[^)]+') {
      # ... remove any line breaks from it, and output the result.
      $Matches.0 -replace '\r?\n'
    }

    if ($str -match '(?<=\<)https?://[^>]+') {
      # ... remove any line breaks from it, and output the result.
      $Matches.0 -replace '\r?\n'
    }

Answer 1

Score: 0


This solution works for the examples you provided:

    $text = @(
        'View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)',
        'View this email in your browser <https://mail.com/?e=3D14=
    60&u=3Df612577510b&id=3D2c8be>',
        'View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)'
    )

    $text = $text | ForEach-Object {
        $PSItem.Replace('<','(').Replace('>',')').Replace("=`n",'').Split('(')[1].Replace(')','')
    }

The output looks like this:

    https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be
    https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
    https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be

I simply use Replace without regex.
The part you struggled with, the URL split across two lines, is handled by:

    .Replace("=`n", '')
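The literal "=`n" only matches when the trailing = is followed by a bare line feed. As a minimal sketch, assuming the input might instead use CRLF line endings, both variants can be removed with the same non-regex Replace method. The $sample string below is made up for illustration.

```powershell
# Hypothetical sample: the URL wraps after '=' and the line ends with CRLF (`r`n).
$sample = "View this email in your browser <https://mail.com/?e=3D14=`r`n60&u=3Df612577510b&id=3D2c8be>"

# Remove the CRLF variant first, then the LF-only variant.
$joined = $sample.Replace("=`r`n", '').Replace("=`n", '')

# Same bracket handling as in the answer: turn < > into ( ) and cut the URL out.
$url = $joined.Replace('<', '(').Replace('>', ')').Split('(')[1].Replace(')', '')
$url
```

The order matters only in that the CRLF form has to be tried before the LF form; otherwise a stray carriage return would be left inside the URL.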

Answer 2

Score: 0


Since you're trying to match across lines, you need to make sure that your text file is read as a whole, i.e. as a single, multiline string, which you can do with the -Raw switch of the Get-Content cmdlet.
Apart from that, the only thing missing from your regex was to also match and remove a preceding = before newlines.

The following extracts all URLs from input file file.txt, and outputs them - with the newline and line-ending = removed - as an array of strings:

    # Note the '=' before '\r?\n'
    [regex]::Matches(
      (Get-Content -Raw file.txt),
      '(?<=[<(])https://[^>)]+'
    ).Value -replace '=\r?\n'

Direct use of the [regex]::Matches() .NET API allows you to extract all matches at once, whereas PowerShell's -match operator only ever looks for one match.
- See GitHub issue #7867 for a proposal to introduce a -matchall operator in the future.
-replace is then used to remove newlines (\r?\n) from the matches, along with a preceding =.
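The difference between the two approaches can be seen in a small sketch. The input string and its URLs are placeholders chosen for illustration; only the regex is taken from the answer.

```powershell
# Hypothetical input containing two bracketed URLs.
$text = 'a (https://one.example/?x=1) b <https://two.example/?y=2>'
$pattern = '(?<=[<(])https://[^>)]+'

# -match returns a Boolean and stores only the first match in $Matches.
if ($text -match $pattern) {
    $first = $Matches[0]
}

# [regex]::Matches() returns every match; .Value collects the matched text of each.
$all = [regex]::Matches($text, $pattern).Value

$first
$all
```

Here $first holds only the first URL, while $all is an array holding both.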

For an explanation of the URL-matching regex and the ability to experiment with it, see this regex101.com page.

Example with a multiline string literal:

    [regex]::Matches('
    View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)

    View this email in your browser <https://mail.com/?e=3D14=
    60&u=3Df612577510b&id=3D2c8be>

    View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)
      ',
      '(?<=[<(])https://[^>)]+'
    ).Value -replace '=\r?\n'

Output:

    https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be
    https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
    https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be
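The extracted URLs still contain =3D, which, together with the trailing = at line ends, looks like quoted-printable encoding (where =3D stands for a literal =). Neither the question nor the answers say so explicitly, so treat that as an assumption. If it holds, a rough sketch of decoding one URL before downloading it could look like this; the file name page.html is a placeholder, and the decoding only handles single-byte ASCII escapes.

```powershell
# Assumption: the text is quoted-printable, so '=XX' is a hex-encoded character.
$extracted = 'https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be'

# Replace each '=XX' escape with the character it encodes (ASCII only in this sketch).
$decoded = [regex]::Replace($extracted, '=([0-9A-Fa-f]{2})', {
    param($m)
    [string][char][Convert]::ToInt32($m.Groups[1].Value, 16)
})

$decoded

# Hypothetical download step, left commented out; page.html is a placeholder name.
# Invoke-WebRequest -Uri $decoded -OutFile page.html
```

For this input the decoded URL is https://mail.com/?e=1460&u=f612577510b&id=2c8be.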
