Parse Wikipedia Infobox with Go?

Question

I am trying to parse the infobox of some Wikipedia articles and cannot figure it out. I have downloaded the files, and for Albert Einstein my attempt to parse the infobox looks like this:

    package main

    import (
        "log"
        "regexp"
    )

    func main() {
        st := `{{redirect|Einstein|other uses|Albert Einstein (disambiguation)|and|Einstein (disambiguation)}}
    {{pp-semi-indef}}
    {{pp-move-indef}}
    {{Good article}}
    {{Infobox scientist
    | name = Albert Einstein
    | image = Einstein 1921 by F Schmutzer - restoration.jpg
    | caption = Albert Einstein in 1921
    | birth_date = {{Birth date|df=yes|1879|3|14}}
    | birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
    | death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
    | death_place = {{nowrap|[[Princeton, New Jersey]], U.S.}}
    | children = [[Lieserl Einstein|"Lieserl"]] (1902–1903?)<br />[[Hans Albert Einstein|Hans Albert]] (1904–1973)<br />[[Eduard Einstein|Eduard "Tete"]] (1910–1965)
    | spouse = [[Mileva Marić]] (1903–1919)<br />{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
    | residence = Germany, Italy, Switzerland, Austria (today: [[Czech Republic]]), Belgium, United States
    | citizenship = {{Plainlist|
    * [[Kingdom of Württemberg]] (1879–1896)
    * [[Statelessness|Stateless]] (1896–1901)
    * [[Switzerland]] (1901–1955)
    * Austria of the [[Austro-Hungarian Empire]] (1911–1912)
    * Germany (1914–1933)
    * United States (1940–1955)
    }}
    | ethnicity = Jewish
    | fields = [[Physics]], [[philosophy]]
    | workplaces = {{Plainlist|
    * [[Swiss Patent Office]] ([[Bern]]) (1902–1909)
    * [[University of Bern]] (1908–1909)
    * [[University of Zurich]] (1909–1911)
    * [[Karl-Ferdinands-Universität|Charles University in Prague]] (1911–1912)
    * [[ETH Zurich]] (1912–1914)
    * [[Prussian Academy of Sciences]] (1914–1933)
    * [[Humboldt University of Berlin]] (1914–1917)
    * [[Kaiser Wilhelm Institute]] (director, 1917–1933)
    * [[German Physical Society]] (president, 1916–1918)
    * [[Leiden University]] (visits, 1920–)
    * [[Institute for Advanced Study]] (1933–1955)
    * [[Caltech]] (visits, 1931–1933)
    }}
    | alma_mater = {{Plainlist|
    * [[ETH Zurich|Swiss Federal Polytechnic]] (1896–1900; B.A., 1900)
    * [[University of Zurich]] (Ph.D., 1905)
    }}
    | doctoral_advisor = [[Alfred Kleiner]]
    | thesis_title = Eine neue Bestimmung der Moleküldimensionen (A New Determination of Molecular Dimensions)
    | thesis_url = http://e-collection.library.ethz.ch/eserv/eth:30378/eth-30378-01.pdf
    | thesis_year = 1905
    | academic_advisors = [[Heinrich Friedrich Weber]]
    | influenced = {{Plainlist|
    * [[Ernst G. Straus]]
    * [[Nathan Rosen]]
    * [[Leó Szilárd]]
    }}
    | known_for = {{Plainlist|
    * [[General relativity]] and [[special relativity]]
    * [[Photoelectric effect]]
    * ''[[Mass–energy equivalence|E=mc<sup>2</sup>]]''
    * Theory of [[Brownian motion]]
    * [[Einstein field equations]]
    * [[Bose–Einstein statistics]]
    * [[Bose–Einstein condensate]]
    * [[Gravitational wave]]
    * [[Cosmological constant]]
    * [[Classical unified field theories|Unified field theory]]
    * [[EPR paradox]]
    }}
    | awards = {{Plainlist|
    * [[Barnard Medal for Meritorious Service to Science|Barnard Medal]] (1920)
    * [[Nobel Prize in Physics]] (1921)
    * [[Matteucci Medal]] (1921)
    * [[ForMemRS]] (1921)<ref name="frs" />
    * [[Copley Medal]] (1925)<ref name="frs" />
    * [[Max Planck Medal]] (1929)
    * [[Time 100: The Most Important People of the Century|''Time'' Person of the Century]] (1999)
    }}
    | signature = Albert Einstein signature 1934.svg
    }}
    '''Albert Einstein''' ({{IPAc-en|ˈ|aɪ|n|s|t|aɪ|n}};<ref>{{cite book|last=Wells|first=John|authorlink=John C. Wells|title=Longman Pronunciation Dictionary|publisher=Pearson Longman|edition=3rd|date=April 3, 2008|isbn=1-4058-8118-6}}</ref> {{IPA-de|ˈalbɛɐ̯t ˈaɪnʃtaɪn|lang|Albert Einstein german.ogg}}; 14 March 1879 – 18 April 1955) was a German-born<!-- Please do not change this—see talk page and its many archives.-->
    [[theoretical physicist]]. He developed the [[general theory of relativity]], one of the two pillars of [[modern physics]] (alongside [[quantum mechanics]]).<ref name=frs>{{cite journal | last1 = Whittaker | first1 = E. | authorlink = E. T. Whittaker| doi = 10.1098/rsbm.1955.0005 | title = Albert Einstein. 1879–1955 | journal = [[Biographical Memoirs of Fellows of the Royal Society]] | volume = 1 | pages = 37–67 | date = 1 November 1955| jstor = 769242}}</ref><ref name="YangHamilton2010">{{cite book|author1=Fujia Yang|author2=Joseph H. Hamilton|title=Modern Atomic and Nuclear Physics|date=2010|publisher=World Scientific|isbn=978-981-4277-16-7}}</ref>{{rp|274}} Einstein's work is also known for its influence on the [[philosophy of science]].<ref>{{Citation |title=Einstein's Philosophy of Science |url=http://plato.stanford.edu/entries/einstein-philscience/#IntWasEinEpiOpp |we......
    `
        re := regexp.MustCompile(`{{Infobox(?s:.*?)}}`)
        log.Println(re.FindAllStringSubmatch(st, -1))
    }

I am trying to put each of the items from the infobox into a struct or a map:

    m["name"] = "Albert Einstein"
    m["image"] = "Einstein...."
    ...
    ...
    m["death_date"] = "{{Death date and age|df=yes|1955|4|18|1879|3|14}}"
    ...
    ...

I can't even seem to isolate the infobox. I get:

    [[{{Infobox scientist
    | name = Albert Einstein
    | image = Einstein 1921 by F Schmutzer - restoration.jpg
    | caption = Albert Einstein in 1921
    | birth_date = {{Birth date|df=yes|1879|3|14}}]]

The Albert Einstein entry in the API can be found at:

    https://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content&format=json

EDIT:

Based on the accepted answer to this question, I tried the following regex:

    (?=\{Infobox)(\{([^{}]|(?1))*\})

but I get:

    panic: regexp: Compile(`(?=\{Infobox)(\{([^{}]|(?1))*\})`): error parsing regexp: invalid or unsupported Perl syntax: `(?=`
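
A note on the error above, with a possible workaround: Go's regexp package uses RE2 syntax, which supports neither lookahead such as (?= nor recursion such as (?1), so this pattern cannot compile. Balanced braces can instead be matched by hand. The following is only a minimal sketch; the function name extractInfobox and the shortened sample text are assumptions made for illustration, and it does not handle triple braces or braces inside comments or nowiki sections.

```go
package main

import (
	"fmt"
	"strings"
)

// extractInfobox (hypothetical helper) returns the first {{Infobox ...}}
// template in text, including any nested {{...}} templates. It returns ""
// if no infobox is found or the braces never balance.
func extractInfobox(text string) string {
	start := strings.Index(text, "{{Infobox")
	if start < 0 {
		return ""
	}
	depth := 0
	// Scan two bytes at a time. Braces are ASCII, so byte indexing is
	// safe even though the text contains multi-byte UTF-8 characters.
	for i := start; i < len(text)-1; i++ {
		switch text[i : i+2] {
		case "{{":
			depth++
			i++ // skip the second brace of the pair
		case "}}":
			depth--
			i++ // skip the second brace of the pair
			if depth == 0 {
				return text[start : i+1]
			}
		}
	}
	return ""
}

func main() {
	// Shortened sample; the real article text is much longer.
	st := "{{Good article}}\n" +
		"{{Infobox scientist\n" +
		"| name = Albert Einstein\n" +
		"| birth_date = {{Birth date|df=yes|1879|3|14}}\n" +
		"| signature = Albert Einstein signature 1934.svg\n" +
		"}}\n" +
		"'''Albert Einstein''' was a physicist."
	fmt.Println(extractInfobox(st))
}
```

Counting depth is what keeps the scan from stopping at the first inner }}, which is where the non-greedy pattern {{Infobox(?s:.*?)}} gives up.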

EDIT #2:
If there's a way to extract the information via their API, I'll take that. I've been reading through the docs and can't find it.

Answer 1

Score: 0

I made a regex that might work for you:

^\s*\|\s*([^\s]+)\s*=\s*(\{\{Plainlist\|(?:\n\s*\*.*)*|.*)

Explanation

  • This part: ^\s*\|\s*([^\s]+)\s*=\s* matches the start of lines like:

        | <the_label> =

  • Continuing on the same line, this part: (\{\{Plainlist\|(?:\n\s*\*.*)*|.*) will match lists:

        {{Plainlist|
        * [[Ernst G. Straus]]
        * [[Nathan Rosen]]
        * [[Leó Szilárd]]

(Note that it may omit the final }}. Oh well.)

  • If there is no list, it matches until the end of the line.

huangapple
  • Posted on April 20, 2016 at 09:20:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/36732212.html