How to find Wikipedia links in infobox templates and other templates using SQL dumps
Question

I want to extract the pages linked from the infobox and other templates on a page.

For example, from this page:
https://en.wikipedia.org/wiki/DNA

I want to extract all of the links in the infobox, such as "Genetics", "Introduction to Genetics", etc.

I want to do this using the SQL dumps, ideally without parsing the XML of whole pages, and I don't want to use the API.

I could not find a way.

Although the pagelinks table also includes the links that come from infoboxes, I cannot find a way to separate them from the rest.
I thought the templatelinks table might hold that information, but it does not: I could not find the page IDs of the pages linked from infoboxes in it.

  • Where is this information stored?
  • Or which kind of tables should I look at?

I consulted a previous question:
https://stackoverflow.com/questions/10654303/where-can-i-find-the-infobox-templates-used-in-wiki
and the MediaWiki reference:
https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary

but could not find a solution.

Answer 1

Score: 2

That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar

I don't think there's a way of doing it other than parsing the content of the template to extract the links or using the API: e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0
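The parsing route can be sketched in a few lines of Python. This is only an illustration, assuming the template's wikitext has already been obtained somehow (for example from the pages-articles dump); the function name extract_article_links and the sample wikitext are made up for the example. It handles plain [[Target]] and [[Target|label]] links only, and it does not expand nested templates or parameters such as {{{1}}}.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_article_links(wikitext):
    """Return link targets without a namespace prefix, in order of first appearance."""
    seen = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if not target:
            continue
        # Simplification: any colon is treated as a namespace prefix
        # (File:, Category:, ...), so article titles containing ':' are skipped too.
        if ":" in target:
            continue
        # Normalize the way titles are usually displayed: first letter
        # upper-cased, underscores shown as spaces.
        target = (target[0].upper() + target[1:]).replace("_", " ")
        if target not in seen:
            seen.append(target)
    return seen


# Hypothetical sidebar wikitext, not the real content of Template:Genetics sidebar.
sample = """{{Sidebar
| title = [[Genetics]]
| content1 = [[Introduction to genetics]] · [[DNA#Properties|Properties]]
| image = [[File:DNA icon.svg|50px]]
}}"""

print(extract_article_links(sample))
```

Links that a template builds from its parameters only exist after expansion, so a regex over the raw template text will miss them.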

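Something like this should also work. My first attempt (https://quarry.wmcloud.org/query/71442) filtered on pl_title = 'Genetics_sidebar' and returned no results, because pl_title and pl_namespace describe the link target, not the page the link is on. The template has to be matched as the link source through pl_from instead:

SELECT pl.pl_namespace, pl.pl_title
FROM pagelinks AS pl
JOIN page AS p ON p.page_id = pl.pl_from  -- source page of each link
WHERE p.page_namespace = 10               -- Template namespace
  AND p.page_title = 'Genetics_sidebar'
  AND pl.pl_namespace = 0                 -- only links to articles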
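To experiment with this kind of lookup offline, the tables can be mimicked in SQLite. The sketch below is an illustration under stated assumptions, not a tested recipe for the real dumps: it assumes the column layout used in the query above (pl_from, pl_namespace, pl_title, plus tl_from, tl_namespace, tl_title for templatelinks), whereas newer dumps may route targets through a separate linktarget table. The rows are toy data and the helper name links_via_templates is hypothetical. In pagelinks, pl_from is the ID of the page that contains the link, so the query goes from the article to the templates it transcludes and then to the links recorded for those template pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_from_namespace INTEGER,
                        pl_namespace INTEGER, pl_title TEXT);
CREATE TABLE templatelinks (tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT);

-- Toy rows: article 'DNA' (id 1) transcludes Template:Genetics_sidebar (id 2).
INSERT INTO page VALUES (1, 0, 'DNA'), (2, 10, 'Genetics_sidebar'), (3, 0, 'Genetics');
INSERT INTO pagelinks VALUES
  (1, 0, 0, 'Genetics'), (1, 0, 0, 'Introduction_to_genetics'), (1, 0, 0, 'Nucleotide'),
  (2, 10, 0, 'Genetics'), (2, 10, 0, 'Introduction_to_genetics');
INSERT INTO templatelinks VALUES (1, 10, 'Genetics_sidebar');
""")


def links_via_templates(conn, article_title):
    """Article links recorded for the template pages that an article transcludes."""
    rows = conn.execute(
        """
        SELECT DISTINCT tpl_links.pl_title
        FROM page AS article
        JOIN templatelinks AS tl
          ON tl.tl_from = article.page_id AND tl.tl_namespace = 10
        JOIN page AS tpl
          ON tpl.page_namespace = 10 AND tpl.page_title = tl.tl_title
        JOIN pagelinks AS tpl_links
          ON tpl_links.pl_from = tpl.page_id AND tpl_links.pl_namespace = 0
        WHERE article.page_namespace = 0 AND article.page_title = ?
        ORDER BY tpl_links.pl_title
        """,
        (article_title,),
    ).fetchall()
    return [row[0] for row in rows]


print(links_via_templates(conn, "DNA"))
```

In the toy data, 'Nucleotide' is linked from the article body only, so it is left out. Links that an infobox builds from parameter values supplied in the article are recorded against the article rather than the template, so this approach would only recover links hardcoded in the template itself.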
