February 8, 2023, 10:44:14


Partial extraction of URLs due to bad handling of accented and punctuation characters

Question


Inside film_disponibili.txt I have these lines:

La forza dell'amore
La forza della volontà
La fuga - Girl in Flight

Inside html_di_test.txt I have this HTML code:

<a href="https://cb01.clinic/la-forza-dellamore-b-n-1936/">La forza dell&#8217;amore [B/N] (1936)</a>
<a href="https://cb01.clinic/la-forza-della-volonta-hd-1988/">La forza della volontà [HD] (1988)</a>
<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga &#8211; Girl in Flight [HD] (2016)</a>
<a href="https://cb01.clinic/lo-studente-1982/">Lo studente (1982)</a>

I am trying to write only these URLs, extracted from html_di_test.txt, to a file called urls_estratti.txt:

https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-forza-della-volonta-hd-1988/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/

but I only get this URL:

https://cb01.clinic/la-forza-della-volonta-hd-1988/

These URLs are missing and I don't understand why:

https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/

This is the code that I'm trying to get working:

$films = Get-Content .\film_disponibili.txt
$html = Get-Content .\html_di_test.txt -Raw

$urls = foreach ($film in $films) {
    $film_decoded = [System.Web.HttpUtility]::HtmlDecode($film)
    $regex = '.*<a href="(https://cb01.clinic/.*)">' + [regex]::Escape($film_decoded) + '.*'
    $match = [regex]::Match($html, $regex)
    if ($match.Success) { $match.Groups[1].Value }
}

$urls | Set-Content .\urls_estratti.txt

The condition is this: if a title such as La fuga - Girl in Flight appears in an anchor like

<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga &#8211; Girl in Flight [HD] (2016)</a>

the script should write this URL: https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/

But this only partly happens. I can see that the problem involves the accented characters, the apostrophe, and the dash in the middle of the name, but I can't handle them all.

Answer 1

Score: 1


As stated in the comments, &#8217; is a smart quote (’) and &#8211; is an en dash (–). They don't match ' or - respectively, hence your problem.
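To make the mismatch concrete, here is a minimal sketch that decodes the two entities and prints their code points. It uses [System.Net.WebUtility]::HtmlDecode rather than the question's [System.Web.HttpUtility], on the assumption that avoiding the extra System.Web assembly is acceptable; the variable names are illustrative only.

```powershell
# Decode the two entities found in html_di_test.txt.
$smartQuote = [System.Net.WebUtility]::HtmlDecode('&#8217;')
$enDash     = [System.Net.WebUtility]::HtmlDecode('&#8211;')

# Print the Unicode code points of the decoded characters.
'U+{0:X4}' -f [int][char]$smartQuote   # U+2019 (right single quotation mark)
'U+{0:X4}' -f [int][char]$enDash       # U+2013 (en dash)

# The characters typed in film_disponibili.txt are different code points.
'U+{0:X4}' -f [int][char]"'"           # U+0027 (apostrophe)
'U+{0:X4}' -f [int][char]'-'           # U+002D (hyphen-minus)

# Ordinal comparison confirms they are not the same character.
$smartQuote.Equals("'")                # False
$enDash.Equals('-')                    # False
```

Note also that the question's script decodes each title but leaves the HTML undecoded, so the pattern is compared with the raw entity text.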

You could change your film_disponibili.txt file to the following:

La forza dell\p{P}amore
La forza della volontà
La fuga \p{P} Girl in Flight

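With \p{P} you can match any kind of punctuation character. Each line is then a regex pattern, so it has to be matched against the decoded link text and must not be passed through [regex]::Escape.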
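As a quick check of the \p{P} idea, the sketch below matches the two edited patterns against link texts as they would look after HTML decoding. The sample strings are copied from the question; the variable names are made up for illustration.

```powershell
# Link texts as they look after HTML decoding. The typographic characters are
# built from code points so the script does not depend on the file encoding.
$decodedTitles = @(
    "La forza dell$([char]0x2019)amore [B/N] (1936)"
    "La fuga $([char]0x2013) Girl in Flight [HD] (2016)"
)

# The edited lines from film_disponibili.txt, used directly as regex patterns.
$patterns = @(
    'La forza dell\p{P}amore'
    'La fuga \p{P} Girl in Flight'
)

$results = for ($i = 0; $i -lt $patterns.Count; $i++) {
    $decodedTitles[$i] -match $patterns[$i]
}
$results                                   # both typographic forms match

"La forza dell'amore" -match $patterns[0]  # the plain ASCII form still matches
```

Because \p{P} covers every Unicode punctuation category, it is deliberately loose: it would also accept, say, a comma in that position.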

Another way to solve this problem is to parse html_di_test.txt with the htmlfile COM object.

Code would look as follows:

$toMatch = Get-Content .\film_disponibili.txt
$content = Get-Content .\html_di_test.txt -Raw

# Parse the HTML with the htmlfile COM object (Windows only)
$html = New-Object -ComObject htmlfile
$html.write([System.Text.Encoding]::Unicode.GetBytes($content))

# True when the current anchor's text matches a line of $toMatch used as a regex
$predicate = [Func[object, bool]]{ $_.textContent -match $args[0] }
$html.getElementsByTagName('a') |
    Where-Object { [Linq.Enumerable]::Any($toMatch, $predicate) } |
    ForEach-Object href

The output would be:

https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-forza-della-volonta-hd-1988/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/
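If COM is not available (for example outside Windows), a regex-only variant is possible. The following is a sketch under stated assumptions, not the answer's method: it decodes each anchor's text with [System.Net.WebUtility]::HtmlDecode and builds the \p{P} patterns automatically from the unedited titles. ConvertTo-TitlePattern is a hypothetical helper, and the sample data is inlined (with à written as &#224; to stay encoding-independent) instead of being read from the two files.

```powershell
# Sample data mirroring the question; in practice these would come from
# Get-Content .\film_disponibili.txt and Get-Content .\html_di_test.txt -Raw.
$films = @(
    "La forza dell'amore"
    "La forza della volont$([char]0xE0)"
    'La fuga - Girl in Flight'
)
$html = @'
<a href="https://cb01.clinic/la-forza-dellamore-b-n-1936/">La forza dell&#8217;amore [B/N] (1936)</a>
<a href="https://cb01.clinic/la-forza-della-volonta-hd-1988/">La forza della volont&#224; [HD] (1988)</a>
<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga &#8211; Girl in Flight [HD] (2016)</a>
<a href="https://cb01.clinic/lo-studente-1982/">Lo studente (1982)</a>
'@

# Hypothetical helper: turn a plain title into a regex in which every
# punctuation character becomes \p{P} and everything else is escaped.
function ConvertTo-TitlePattern {
    param([string]$Title)
    $parts = foreach ($ch in $Title.ToCharArray()) {
        if ([char]::IsPunctuation($ch)) { '\p{P}' } else { [regex]::Escape([string]$ch) }
    }
    -join $parts
}

$patterns = $films | ForEach-Object { ConvertTo-TitlePattern $_ }

# Capture each anchor separately: group 1 is the href, group 2 the link text.
$anchorRegex = '<a\s+href="([^"]+)"[^>]*>(.*?)</a>'

$urls = foreach ($m in [regex]::Matches($html, $anchorRegex)) {
    $url  = $m.Groups[1].Value
    # Decode the link text so &#8217; and &#8211; become real characters.
    $text = [System.Net.WebUtility]::HtmlDecode($m.Groups[2].Value)
    foreach ($p in $patterns) {
        if ($text -match $p) { $url; break }   # emit the URL once per anchor
    }
}

$urls
```

The patterns are not anchored, so a title also matches any longer link text that contains it, which is the same behaviour as the scripts above. A regular expression is only adequate here because the anchors are simple and uniform; for arbitrary HTML a real parser is the safer choice.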
