Partial extraction of URLs due to bad handling of accented and punctuation characters
Question
Inside film_disponibili.txt I have these lines:
La forza dell'amore
La forza della volontà
La fuga - Girl in Flight
Inside html_di_test.txt I have this HTML code:
<a href="https://cb01.clinic/la-forza-dellamore-b-n-1936/">La forza dell&#8217;amore [B/N] (1936)</a>
<a href="https://cb01.clinic/la-forza-della-volonta-hd-1988/">La forza della volontà [HD] (1988)</a>
<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga &#8211; Girl in Flight [HD] (2016)</a>
<a href="https://cb01.clinic/lo-studente-1982/">Lo studente (1982)</a>
I am trying to write only these URLs, extracted from html_di_test.txt, into a file called urls_estratti.txt:
https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-forza-della-volonta-hd-1988/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/
but I get only this URL:
https://cb01.clinic/la-forza-della-volonta-hd-1988/
These URLs are missing and I don't understand why:
https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/
This is the code I'm trying to get working:
$films = Get-Content .\film_disponibili.txt
$html = Get-Content .\html_di_test.txt -Raw
$urls = foreach ($film in $films) {
    $film_decoded = [System.Web.HttpUtility]::HtmlDecode($film)
    $regex = '.*<a href="(https://cb01.clinic/.*)">' + [regex]::Escape($film_decoded) + '.*'
    $match = [regex]::Match($html, $regex)
    if ($match.Success) { $match.Groups[1].Value }
}
$urls | Set-Content .\urls_estratti.txt
The condition is this: if a string like La fuga - Girl in Flight appears in
<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga &#8211; Girl in Flight [HD] (2016)</a>
the script should write this URL: https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/
But this only partly happens. I can see that the problem involves the accented characters, the apostrophe, and the dash in the middle of the name, but I can't handle them completely.
Answer 1
Score: 1
As stated in the comments, &#8217; is a smart quote and &#8211; is an en dash. They don't match ' or - respectively, hence your problem.
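To make the mismatch concrete, here is a minimal sketch that decodes one of the link texts and prints the code point of the character between "dell" and "amore". It uses [System.Net.WebUtility] rather than the [System.Web.HttpUtility] class from the question; that swap is my assumption, made only because WebUtility is loaded by default.

```powershell
# Link text copied from html_di_test.txt
$raw     = 'La forza dell&#8217;amore [B/N] (1936)'
$decoded = [System.Net.WebUtility]::HtmlDecode($raw)

# Index 13 is the character right after "La forza dell"
'U+{0:X4}' -f [int]$decoded[13]      # U+2019, right single quotation mark
'U+{0:X4}' -f [int][char]"'"         # U+0027, ASCII apostrophe

# The title from film_disponibili.txt uses U+0027, so a literal search fails
$decoded.Contains("La forza dell'amore")   # False
```

The same check on the third link shows U+2013 (en dash) where the title file has U+002D (hyphen-minus).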
You could change your film_disponibili.txt file to the following:
La forza dell\p{P}amore
La forza della volontà
La fuga \p{P} Girl in Flight
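Using \p{P} you can match any kind of punctuation character. Note that this only works if each line is used as a regex as-is (passing it through [regex]::Escape would turn \p{P} into literal text) and if it is matched against decoded text, where &#8217; and &#8211; have become single characters.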
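As a quick check of that claim, the sketch below matches the two patterns against already-decoded link texts. The variable names are made up for the example, and the typographic characters are built from their code points so the snippet does not depend on the script file's encoding.

```powershell
# Decoded link texts: U+2019 is the smart quote, U+2013 the en dash
$smart = "La forza dell$([char]0x2019)amore [B/N] (1936)"
$dash  = "La fuga $([char]0x2013) Girl in Flight [HD] (2016)"

$smart -match 'La forza dell\p{P}amore'          # True
$dash  -match 'La fuga \p{P} Girl in Flight'     # True

# The plain ASCII forms are punctuation too, so the same patterns still match them
"La forza dell'amore"      -match 'La forza dell\p{P}amore'        # True
'La fuga - Girl in Flight' -match 'La fuga \p{P} Girl in Flight'   # True
```

Because \p{P} covers every Unicode punctuation category, it is deliberately loose: it would also accept, say, a comma in that position.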
Another way to solve this is to parse html_di_test.txt with the htmlfile COM object (Windows only). The code would look as follows:
$toMatch = Get-Content .\film_disponibili.txt
$content = Get-Content .\html_di_test.txt -Raw
$html = New-Object -ComObject htmlfile
$html.write([System.Text.Encoding]::Unicode.GetBytes($content))
$predicate = [Func[object, bool]]{ $_.textContent -match $args[0] }
$html.getElementsByTagName('a') |
    Where-Object { [Linq.Enumerable]::Any($toMatch, $predicate) } |
    ForEach-Object href
The output would be:
https://cb01.clinic/la-forza-dellamore-b-n-1936/
https://cb01.clinic/la-forza-della-volonta-hd-1988/
https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/
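If COM is not available, or you would rather leave film_disponibili.txt untouched, the two ideas can be combined in a regex-only variant. This is a rough sketch, not part of the answer above: it decodes the whole HTML first, then builds each pattern by splitting the title on punctuation, escaping the pieces, and joining them with \p{P}. The file contents are inlined here so the example runs on its own; in practice you would read them with Get-Content as in the question.

```powershell
# Inlined stand-ins for film_disponibili.txt and html_di_test.txt
$films = "La forza dell'amore", 'La forza della volontà', 'La fuga - Girl in Flight'
$html = @'
<a href="https://cb01.clinic/la-forza-dellamore-b-n-1936/">La forza dell&#8217;amore [B/N] (1936)</a>
<a href="https://cb01.clinic/la-forza-della-volonta-hd-1988/">La forza della volontà [HD] (1988)</a>
<a href="https://cb01.clinic/la-fuga-girl-in-flight-hd-2016/">La fuga &#8211; Girl in Flight [HD] (2016)</a>
<a href="https://cb01.clinic/lo-studente-1982/">Lo studente (1982)</a>
'@

# Decode the HTML (not the titles) so entities become real characters
$decoded = [System.Net.WebUtility]::HtmlDecode($html)

$urls = foreach ($film in $films) {
    # Escape the text between punctuation marks, then let \p{P} stand in for each mark
    $title = ([regex]::Split($film, '\p{P}') |
        ForEach-Object { [regex]::Escape($_) }) -join '\p{P}'

    # [^"]* keeps the capture inside a single href attribute
    $match = [regex]::Match($decoded, '<a href="(https://cb01\.clinic/[^"]*)">' + $title)
    if ($match.Success) { $match.Groups[1].Value }
}
$urls
```

With the sample data this prints the three expected URLs and skips lo-studente-1982. Using [^"]* instead of .* in the capture group also prevents the greedy match from running across several links.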