Partial extraction of URLs due to bad handling of accented and punctuation characters


Question


Inside film_disponibili.txt I have these lines:

La forza dell'amore
La forza della volontà
La fuga - Girl in Flight

Inside html_di_test.txt I have this HTML code:

<a href="https://cb01.clinic/la-forza-dellamore-b-n-1936/">La forza dell’amore [B/N] (1936)</a>
<a href="https://cb01.clinic/la-forza-della-volonta-hd-1988/">La forza della volontà [HD] (1988)</a>
<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga – Girl in Flight [HD] (2016)</a>
<a href="https://cb01.clinic/lo-studente-1982/">Lo studente (1982)</a>

I am trying to write only these URLs, extracted from html_di_test.txt, into a file called urls_estratti.txt:

https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-forza-della-volonta-hd-1988/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/

But I only get this URL:

https://cb01.clinic/la-forza-della-volonta-hd-1988/

These URLs are missing, and I don't understand why:

https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/

This is the code I'm trying to get working:

$films = Get-Content .\film_disponibili.txt
$html = Get-Content .\html_di_test.txt -Raw

$urls = foreach ($film in $films) {
    $film_decoded = [System.Web.HttpUtility]::HtmlDecode($film)
    $regex = '.*<a href="(https://cb01.clinic/.*)">' + [regex]::Escape($film_decoded) + '.*'
    $match = [regex]::Match($html, $regex)
    if ($match.Success) { $match.Groups[1].Value }
}

$urls | Set-Content .\urls_estratti.txt

The condition is this: if a string like La fuga - Girl in Flight appears in

<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga – Girl in Flight [HD] (2016)</a>

the script should write this URL: https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/

But this only partly happens. I can see that the problem involves the accented characters, the apostrophe, and the dash in the middle of the name, but I can't handle them completely.

Answer 1

Score: 1


As stated in the comments, ’ is a smart quote and – is an en dash. They do not match ' or - respectively, hence your problem.
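To make the mismatch concrete, here is a minimal sketch that prints the code points involved and shows that an escaped plain title does not match its typographic counterpart. The variable names are illustrative only, and the characters are built from their code points so the result does not depend on how the script file is encoded.

```powershell
# Build each character from its code point (names are hypothetical).
$apostrophe = [char]0x0027   # apostrophe as typed on a keyboard
$smartQuote = [char]0x2019   # right single quotation mark ("smart quote")
$hyphen     = [char]0x002D   # hyphen-minus
$enDash     = [char]0x2013   # en dash

# Print the code point of each character; each pair differs.
foreach ($c in $apostrophe, $smartQuote, $hyphen, $enDash) {
    'U+{0:X4}' -f [int]$c
}

# The plain title, escaped for use as a literal regex, versus the text in the HTML.
$plain = "dell${apostrophe}amore"
$fancy = "dell${smartQuote}amore"
$literalMatch = [regex]::IsMatch($fancy, [regex]::Escape($plain))

# False: the regex compares characters exactly, and U+0027 is not U+2019.
$literalMatch
```

The same reasoning applies to the hyphen in La fuga - Girl in Flight, which is why only the title without punctuation was found.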

You could change your film_disponibili.txt file to the following:

La forza dell\p{P}amore
La forza della volontà
La fuga \p{P} Girl in Flight

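Using \p{P} you can match any kind of punctuation character. Because these lines are now regex patterns, they must be used as-is: passing them through [regex]::Escape, as the code in the question does, would escape the backslash and brace so that \p{P} is matched as literal text.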
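If you would rather leave film_disponibili.txt untouched, the same idea can be applied in code. The following is only a sketch under stated assumptions: ConvertTo-LoosePattern is a hypothetical helper (not part of PowerShell), the sample data is inlined instead of read from the two files, and the capture group uses [^"]* rather than the question's .* so that it stays inside one href attribute.

```powershell
# Hypothetical helper: replaces every punctuation character in a title with
# \p{P} and regex-escapes every other character.
function ConvertTo-LoosePattern {
    param([string] $Title)
    $parts = foreach ($c in $Title.ToCharArray()) {
        if ([char]::IsPunctuation($c)) { '\p{P}' } else { [regex]::Escape([string]$c) }
    }
    -join $parts
}

# Non-ASCII characters are built from code points to avoid file-encoding issues.
$rsq = [char]0x2019; $endash = [char]0x2013; $agrave = [char]0x00E0

# Inline stand-ins for film_disponibili.txt and html_di_test.txt.
$films = "La forza dell'amore", "La forza della volont${agrave}", 'La fuga - Girl in Flight'
$html = @"
<a href="https://cb01.clinic/la-forza-dellamore-b-n-1936/">La forza dell${rsq}amore [B/N] (1936)</a>
<a href="https://cb01.clinic/la-forza-della-volonta-hd-1988/">La forza della volont${agrave} [HD] (1988)</a>
<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga ${endash} Girl in Flight [HD] (2016)</a>
<a href="https://cb01.clinic/lo-studente-1982/">Lo studente (1982)</a>
"@

$urls = foreach ($film in $films) {
    # Match an anchor whose link text starts with the loosened title.
    $pattern = '<a href="([^"]*)">' + (ConvertTo-LoosePattern $film)
    $match = [regex]::Match($html, $pattern)
    if ($match.Success) { $match.Groups[1].Value }
}

$urls
```

With this sample data the loop yields the three expected URLs and skips lo-studente-1982. To use it on the real files, $films and $html would be loaded with Get-Content as in the question.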

Another way to solve this is to parse html_di_test.txt with the htmlfile COM object instead of matching the raw markup.

The code would look as follows:

$toMatch = Get-Content .\film_disponibili.txt
$content = Get-Content .\html_di_test.txt -Raw

$html = New-Object -ComObject htmlfile
$html.write([System.Text.Encoding]::Unicode.GetBytes($content))

$predicate = [Func[object, bool]]{ $_.textContent -match $args[0] }
$html.getElementsByTagName('a') |
    Where-Object { [Linq.Enumerable]::Any($toMatch, $predicate) } |
    ForEach-Object href

The output would be:

https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-forza-della-volonta-hd-1988/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/

Posted by huangapple on February 8, 2023 at 10:44:14
Link to this article: https://go.coder-hub.com/75380928.html