xidel: wrong order of results on hacker news

Question

To scrape Hacker News, I use:

xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]' https://news.ycombinator.com/newest

But the output is not in the expected order: the URL comes after the text, which makes it very difficult to parse.

Am I missing something to get the right order?

I have:

There Is No Such Thing as a Microservice (youtube.com)
https://www.youtube.com/watch?v=FXCLLsCGY0s

I expect:

https://www.youtube.com/watch?v=FXCLLsCGY0s
There Is No Such Thing as a Microservice (youtube.com)

Or even better:

https://www.youtube.com/watch?v=FXCLLsCGY0s There Is No Such Thing as a Microservice (youtube.com)

Answer 1

Score: 2

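See "Using / on sequences rather than on sets" for why this happens. In short, the union operator | returns its nodes in document order, and each span comes before the a/@href nested inside it, so the title text is printed first. Use the XPath 3 mapping operator ! instead, which evaluates (@href,.) once per a element and keeps that order: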

$ xidel -s "https://news.ycombinator.com/newest" -e '
  //span[@class="titleline"]/a ! (@href,.)
'

(Also, please specify the input first.)

For simple string concatenation this isn't necessary, because the last step returns strings rather than nodes:

-e '//span[@class="titleline"]/a/join((@href,.))'
-e '//span[@class="titleline"]/a/concat(@href," ",.)'
-e '//span[@class="titleline"]/a/x"{@href} {.}"'

(Bonus) Output to JSON:

$ xidel -s "https://news.ycombinator.com/newest" -e '
  array{
    //span[@class="titleline"]/a/{
      "title":.,
      "url":@href
    }
  }
'

Answer 2

Score: 1

Nodes are returned in document order, not in the order the paths appear in the XPath expression, so additional parsing is needed. With xmllint and awk:

xmllint --html --recover --xpath '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' tmp.html 2>/dev/null|\
gawk 'BEGIN{RS="\n? href="; FS="\n"}{ print $1, $2}' | tr -d '"'

Result:

https://github.com/thesephist/ink Ink: Minimal, functional programming language inspired by modern JavaScript, Go
https://controlleddigitallending.org/whitepaper/ A White Paper on Controlled Digital Lending of Library Books
item?id=35471687 Ask HN: Connect Guitar to Tesla?

Note: the XPath in this answer does not need awk, since href comes before a/text() in document order, as the OP expected. The awk step is included for reference on how to change the output order.
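The gawk command above uses a multi-character record separator, which POSIX awk does not guarantee, and tr -d '"' also strips quotes that belong to titles. Below is a minimal sketch of the same pairing step in plain awk. It assumes xmllint prints each attribute on its own line in the form href="..." (optionally preceded by a space), followed by the title line; the actual layout can differ between libxml2 versions. The function name join_href_title and the sample input are illustrative only.

```shell
#!/bin/sh
# join_href_title (hypothetical helper): pair each href line with the
# title line that follows it and print "URL title" on one line.
join_href_title() {
  awk '
    /^ ?href="/ {
      url = $0
      sub(/^ ?href="/, "", url)   # drop the leading  href="
      sub(/"$/, "", url)          # drop only the closing quote
      next
    }
    url != "" { print url, $0; url = "" }
  '
}

# Sample input in the assumed xmllint layout (values from the Result above).
sample=' href="https://github.com/thesephist/ink"
Ink: Minimal, functional programming language inspired by modern JavaScript, Go
 href="item?id=35471687"
Ask HN: Connect Guitar to Tesla?'

result=$(printf '%s\n' "$sample" | join_href_title)
printf '%s\n' "$result"
```

Because only the quotes around the attribute value are removed, quotation marks inside a title are left intact.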

Answer 3

Score: 0

Found a better way:

$ xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' https://news.ycombinator.com/newest
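This prints the URL and the title on alternating lines. For the single-line form the question calls "even better", the lines can be paired afterwards. Below is a minimal sketch, assuming the output strictly alternates one URL line and one title line. The helper name pair_lines is hypothetical, and the sample input stands in for real xidel output.

```shell
#!/bin/sh
# pair_lines (hypothetical helper): join every two consecutive input
# lines with a single space. Assumes strict URL/title alternation.
pair_lines() {
  paste -d ' ' - -
}

# Sample input standing in for xidel output (based on the question's example).
sample='https://www.youtube.com/watch?v=FXCLLsCGY0s
There Is No Such Thing as a Microservice'

result=$(printf '%s\n' "$sample" | pair_lines)
printf '%s\n' "$result"
```

In practice the xidel command above would be piped straight into paste -d ' ' - -. If a title ever spans more than one text node the alternation breaks, in which case the concat() form from Answer 1 is the safer choice.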

Posted by huangapple on 2023-04-07 00:51:03