How to find the Wikipedia links in infobox templates and other templates using SQL dumps
Question
I want to extract the pages that are linked from the infobox and other templates of a page.
For example, from this page:
https://en.wikipedia.org/wiki/DNA
I want to extract all of the links in the infobox, such as "Genetics", "Introduction to Genetics", etc.
I want to do this using the SQL dumps, if possible without parsing the XML of whole pages, and I don't want to use the API.
I could not find a way.
While Pagelinks does include the links from infoboxes, I cannot find a way to separate them from the rest of the page's links.
I thought Templatelinks might have that information, but it does not: I could not find the page IDs of the pages linked from infoboxes.
- Where is this information stored?
- Or which tables should I look at?
I consulted a previous question:
https://stackoverflow.com/questions/10654303/where-can-i-find-the-infobox-templates-used-in-wiki
and the MediaWiki reference:
https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary
but could not find a solution.
Answer 1
Score: 2
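That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar
I don't think there is a way to do this other than parsing the content of the template to extract the links, or using the API, e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0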
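Parsing the template content for links can be sketched roughly as follows. This is a minimal illustration, assuming the template's wikitext has already been obtained (for example from the pages-articles XML dump); the `extract_links` helper and the sample wikitext are hypothetical and are not the real content of Template:Genetics sidebar. It only handles plain `[[Target]]` and `[[Target|label]]` links, and it skips every target containing a colon, which also drops the few article titles that legitimately contain one.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_links(wikitext):
    """Return link targets without a namespace prefix, in order of first appearance."""
    seen = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Crude filter: skip File:, Category:, interwiki and other prefixed targets.
        if not target or ":" in target:
            continue
        # English Wikipedia treats "_" like a space and capitalises the first letter.
        target = target.replace("_", " ")
        target = target[0].upper() + target[1:]
        if target not in seen:
            seen.append(target)
    return seen

# Hypothetical sidebar wikitext, made up for this example.
sample = """{{Sidebar
| title = [[Genetics]]
| content1 = [[Introduction to genetics]] · [[Heredity|inheritance]]
| content2 = [[File:DNA icon.svg|20px]] [[gene#History|genes]]
}}"""

links = extract_links(sample)
print(links)
```

Links that a template generates through parameters or nested templates will not show up this way, since they only exist after the wikitext is expanded.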
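Something like the query below should also work. The version I first tried filtered on pl_title = 'Genetics_sidebar' and returned no results (https://quarry.wmcloud.org/query/71442); pl_title is the title of the link target, not of the source page, so the template has to be matched as the source through pl_from:
SELECT pl.pl_namespace, pl.pl_title
FROM pagelinks AS pl
JOIN page AS p ON p.page_id = pl.pl_from
WHERE p.page_namespace = 10           -- source page is in the Template namespace
  AND p.page_title = 'Genetics_sidebar'
  AND pl.pl_namespace = 0             -- keep only links to articles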
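For the original goal (the links an article such as DNA picks up through its templates), one possible approach is to chain the two tables: look up the templates the article transcludes in templatelinks, then look up the links recorded for those template pages in pagelinks. The sketch below runs that join with Python's sqlite3 on a few made-up rows so it is runnable on its own. It assumes the older column layout (tl_namespace/tl_title, pl_namespace/pl_title); recent dumps may store link targets through a separate linktarget table instead, so check the schema of the dump you are using.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE templatelinks (tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_from_namespace INTEGER,
                        pl_namespace INTEGER, pl_title TEXT);

-- Toy rows (made up): article DNA transcludes Template:Genetics_sidebar.
INSERT INTO page VALUES (1, 0, 'DNA'), (2, 10, 'Genetics_sidebar'), (3, 0, 'Genetics');
INSERT INTO templatelinks VALUES (1, 10, 'Genetics_sidebar');
INSERT INTO pagelinks VALUES
  (2, 10, 0, 'Genetics'),
  (2, 10, 0, 'Introduction_to_genetics'),
  (1, 0, 0, 'Genetics'),
  (1, 0, 0, 'Introduction_to_genetics'),
  (1, 0, 0, 'Nucleotide');
""")

# article -> templates it transcludes -> links recorded for those template pages
QUERY = """
SELECT DISTINCT pl.pl_title
FROM page AS article
JOIN templatelinks AS tl ON tl.tl_from = article.page_id
JOIN page AS tpl ON tpl.page_namespace = tl.tl_namespace
                AND tpl.page_title = tl.tl_title
JOIN pagelinks AS pl ON pl.pl_from = tpl.page_id
WHERE article.page_namespace = 0
  AND article.page_title = ?
  AND tl.tl_namespace = 10
  AND pl.pl_namespace = 0
ORDER BY pl.pl_title
"""

template_links = [row[0] for row in conn.execute(QUERY, ("DNA",))]
print(template_links)
```

In the toy data, 'Nucleotide' is linked only from the article body, so it is left out. On a real dump the result would cover every transcluded template (citation templates, navboxes and so on), so a further filter on the template title would be needed to keep only infoboxes or sidebars.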