Different HTML results for the same page (Web Scraping)

Question

I'm a beginner in Python and currently working on a web scraping project where I need to extract data from tables on web pages and save it into a CSV file. The good news is that I've managed to create an algorithm that accomplishes this task successfully for most pages. However, sometimes the process gets aborted because the HTML structure of the page is different from what I expected.

This is one of the webpages: https://www.ibm.com/docs/en/imdm/12.0?topic=t-accessdateval

Here's an example of the expected HTML structure I can work with:

    <!DOCTYPE html><html lang="en-us">
    <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="UTF-8">
    <meta name="dcterms.rights" content="© Copyright IBM Corporation 2021">
    <meta name="description" content="The ACCESSORENTITLE table provides the ability to associate many users and usergroups to an entitlement rule.">
    <meta name="geo.country" content="ZZ">
    <script>
    digitalData = {
      page: {
        pageInfo: {
          language: "en-us",
          version: "v18",
          ibm: {
            country: "ZZ",
            type: "CT701"
          }
        }
      }
    };
    </script><!-- Licensed Materials - Property of IBM -->
    <!-- US Government Users Restricted Rights -->
    <!-- Use, duplication or disclosure restricted by -->
    <!-- GSA ADP Schedule Contract with IBM Corp. -->
    <link rel="stylesheet" type="text/css" href="../com.ibm.mdshs.common.doc/css/swg_info_common.css/ibmdita.css">
    <link rel="stylesheet" type="text/css" href="../com.ibm.mdshs.common.doc/css/swg_info_common.css/../com.ibm.mdshs.common.doc/css/swg_info_common.css">
    <link rel="Start" href="r_Tables.html">
    <title>ACCESSORENTITLE</title>
    </head>
    <body id="r_accessorentitle_Table"><main role="main"><article role="article" aria-labelledby="d55790e10">
    <h1 class="topictitle1" id="d55790e10">ACCESSORENTITLE</h1>
    <div class="body refbody"><p class="shortdesc">The ACCESSORENTITLE table provides the ability to associate many users and usergroups to an entitlement rule.</p>
    <div class="section">
    <div class="p">This table is used by the following functional feature.<ul>
    <li>
    <a href="r_Rules_of_Visibility_SubjectArea.html">Rules of Visibility</a>
    </li>
    </ul>
    </div>
    <div class="tablenoborder"><table summary="" style="width: 100%" class="defaultstyle"><colgroup><col style="width:23.076923076923077%"><col style="width:34.61538461538461%"><col style="width:19.230769230769234%"><col style="width:15.384615384615385%"><col style="width:7.6923076923076925%"></colgroup><thead style="text-align:left;">
    <tr>
    <th id="d55790e52">Name</th>
    <th id="d55790e55">Comment</th>
    <th id="d55790e58">Datatype</th>
    <th id="d55790e61">Null Option</th>
    <th id="d55790e64">Is PK</th>
    </tr>
    </thead>
    <tbody>

And here's an example of the unexpected HTML structure that causes issues:

    <!DOCTYPE html>
    <html dir="ltr" lang="en-US">
    <head>
    <script>
    // fill in DDO
    digitalData = {
      page: {
        category: {
          primaryCategory: 'ELSKCS', // e.g. SB03
        },
        pageInfo: {
          effectiveDate: '', // e.g. 2014-11-19
          expiryDate: '', // e.g. 2017-11-19
          language: 'en-US', // e.g. en-US FIX
          publishDate: '', // e.g. 2014-11-19
          publisher: 'IBM Corporation', // e.g. IBM Corporation
          version: 'Carbon for IBM.com', // e.g. dds.v1.0.0. NOTE: This is dynamically set by the IBM.com Library
          ibm: {
            contentDelivery: 'IBM Documentation', // e.g. ECM/Filegen
            contentProducer: 'IBM Documentation 1.0', // e.g. ECM/IConS Adopter 34 - GS83J2343G3H3ERG - 11/19/2014 05:14:02 PM
            country: 'US', // e.g. FIX
            industry: 'ZZ', // e.g. B,U
            owner: 'IBM Documentation/Raleigh/IBM', // e.g. Some Person/City/IBM
            siteID: 'ESTKCS', // e.g. MySiteID
            subject: '', // e.g. SW492
            type: 'CT701', // e.g CT305
          },
        },
      },
    };
    </script>
    <meta content="width=device-width,initial-scale=1" name="viewport"/>
    <meta content="ie=edge" http-equiv="X-UA-Compatible"/>
    <title>
    IBM Documentation
    </title>
    <meta charset="utf-8"/>
    <link href="//www.ibm.com/favicon.ico" rel="icon"/>
    <meta content="IBM, documentation" name="keywords"/>
    <meta content="IBM Documentation." name="description"/>
    <meta content="" name="dcterms.date"/>
    <meta content="© Copyright IBM Corporation 2023" name="dcterms.rights"/>
    <meta content="US" name="geo.country"/>
    <meta content="index,follow" name="robots"/>
    <meta content="" name="canonical"/>
    <script src="//1.www.s81c.com/common/stats/ibm-common.js">
    </script>
    <link href="/docs/css/style.css" rel="stylesheet"/>
    <script>
    function convertUnicode(input) {
      return input.replace(/\\u(\w\w\w\w)/g,function(a,b) {
        var charcode = parseInt(b,16);
        return String.fromCharCode(charcode);
      });
    }
    var kcGlobals = {
      translation: {
        "common.error":"Sorry,%20we%20have%20an%20error",
        "common.externalLinkTooltipText":"(Opens%20in%20a%20new%20tab%20or%20window)",
        "common.yes":"Yes",
        "common.no":"No",
        "common.warning":"Warning",
        "common.notFound":"We%20didn't%20find%20a%20matching%20topic%20in%20the%20product%20version%20you%20requested.%20Would%20you%20like%20to%20go%20to%20the%20$PRODUCT$%20homepage?",
        "common.returnToDocs":"Open%20the%20Red%20Hat%20documentation%20in%20a%20new%20tab",
        "common.externalDocumentation":"Viewing%20external%20documentation",
        "common.externalDocumentation2":"Use%20this%20link%20to%20view%20OpenShift%20documentation%20on%20the%20Red%20Hat%20documentation%20site.",
        "common.previous":"Previous",
        "common.next":"Next",
        "common.backToTopButton":"Back%20to%20top%20button",
        "common.copyright":"%C2%A9%20Copyright%20IBM%20Corporation%202022,%202023",
        "error.unexpectedErrorHeading":"An%20unexpected%20error%20occurred",
        "error.sorryText1":"We're%20sorry!",
        "error.sorryText2":"The%20requested%20page%20does%20not%20exist%20or%20might%20have%20moved.",
        "error.sorryText3":"If%20you%20accessed%20this%20page%20by%20using%20a%20bookmark%20or%20external%20URL,%20the%20bookmark%20or%20links%20might%20need%20to%20be%20updated.%20Use%20IBM%20Documentation%20search%20to%20find%20the%20content's%20new%20location.",
        "error.sorryText4":"In%20this%20case,%20use%20the%20table%20of%20contents%20or%20the%20search%20to%20find%20the%20content.",
        "error.sorryText5":"If%20you%20accessed%20this%20page%20from%20the%20table%20of%20contents%20or%20a%20search,%20please%20report%20the%20broken%20link%20to%20%3Ca%20id=%22ibmdocs-mailto-link%22%20href=%22%22%3EIBM%20Documentation%20support%3C/a%3E%20who%20will%20alert%20the%20appropriate%20content%20group.",
        "error.tabError":"Resource%20not%20found",

I am especially confused by the entries near the end, such as "error.sorryText4":"In%20[.....]. I couldn't find any information about them, and they go on for about 500 lines.

I don't understand why the HTML sometimes looks like this. I get this result when I press CTRL+U on the page, and sometimes as the response in my code.
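
For reference, the `%20` sequences in those entries look like ordinary percent-encoding (URL encoding), and the `kcGlobals.translation` object that holds them appears to be a table of UI message strings shipped with every copy of this page variant, not a sign that this particular request failed. A minimal sketch, using only Python's standard library, that decodes two values copied from the listing above:

```python
from urllib.parse import unquote

# Values copied verbatim from the kcGlobals.translation object shown above.
encoded = (
    "In%20this%20case,%20use%20the%20table%20of%20contents"
    "%20or%20the%20search%20to%20find%20the%20content."
)
copyright_encoded = "%C2%A9%20Copyright%20IBM%20Corporation%202022,%202023"

# unquote() turns %XX escapes back into characters (UTF-8 by default).
decoded = unquote(encoded)
copyright_decoded = unquote(copyright_encoded)

print(decoded)
print(copyright_decoded)
```

The first value decodes to "In this case, use the table of contents or the search to find the content." and the second to "© Copyright IBM Corporation 2022, 2023".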

This is my Python code:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Read the CSV file containing the identifiers
    df_identifiers = pd.read_csv('identifiers.csv')

    # Create an empty DataFrame to store the combined results
    df_combined = pd.DataFrame()

    # Iterate over the identifiers and process each URL
    for index, row in df_identifiers.iterrows():
        # Construct the URL using the identifier from the CSV file
        identifier = row['Identifier']
        url = f"https://www.ibm.com/docs/en/imdm/12.0?topic=tables-{identifier}"

        # Send a GET request to the URL
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")

        # Extract the desired data from the HTML
        table_name = soup.find("h1", class_="topictitle1").get_text(strip=True)
        description = soup.find('p', class_='shortdesc').get_text(strip=True)
        div_element = soup.find('div', class_='p')
        a_elements = div_element.find_all('a')
        feature_list = [a.get_text(strip=True) for a in a_elements]
        table = soup.find("table")
        headers = [header.get_text(strip=True) for header in table.select("th")]
        data_rows = table.select("tbody tr")
        data = [[td.get_text(strip=True) for td in row.select("td")] for row in data_rows]

        # Create a DataFrame for the current URL's data
        df = pd.DataFrame(data, columns=headers)
        df["Description"] = description
        for i, feature in enumerate(feature_list):
            df[f"Feature_{i+1}"] = feature
        df.insert(0, "Table", table_name)

        # Append the current DataFrame to the combined DataFrame
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        df_combined = pd.concat([df_combined, df], ignore_index=True)

    # Save the combined DataFrame to a CSV file
    df_combined.to_csv('combined_table_data.csv', index=False)
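
One way to keep a loop like this from aborting when the second page variant comes back is to check for the expected elements before parsing. The sketch below is only an illustration: `has_topic_content` and `_ClassCollector` are hypothetical helper names, and it uses the standard library's `html.parser` so that it runs without extra packages. The two sample strings are shortened stand-ins for the two HTML variants shown above.

```python
from html.parser import HTMLParser


class _ClassCollector(HTMLParser):
    """Collects every (tag, class name) pair seen in the document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = dict(attrs).get("class") or ""
        for name in classes.split():
            self.seen.add((tag, name))


def has_topic_content(html: str) -> bool:
    """Return True if the HTML contains the elements the scraper relies on."""
    collector = _ClassCollector()
    collector.feed(html)
    return (
        ("h1", "topictitle1") in collector.seen
        and ("p", "shortdesc") in collector.seen
    )


# Shortened stand-ins for the expected and unexpected variants.
expected = (
    '<h1 class="topictitle1" id="d55790e10">ACCESSORENTITLE</h1>'
    '<p class="shortdesc">The ACCESSORENTITLE table ...</p>'
)
shell = "<html><head><title>IBM Documentation</title></head><body></body></html>"

print(has_topic_content(expected))  # True
print(has_topic_content(shell))     # False
```

Inside the loop, a check such as `if not has_topic_content(r.text): continue` (perhaps logging the identifier first) would skip the pages that cannot be parsed instead of raising an error. With BeautifulSoup already imported, an equivalent test is `soup.find("h1", class_="topictitle1") is None`.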

Answer 1

Score: 1

    import httpx
    import trio
    import re
    import pandas as pd

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0'
    }


    async def main():
        async with httpx.AsyncClient(headers=headers, base_url='https://www.ibm.com/docs') as client:
            # Request the regular topic page
            params = {
                'topic': 't-accessdateval'
            }
            r = await client.get('en/imdm/12.0', params=params)

            # Pull the "oldUrl" value out of the response and build the content API path
            nurl = "api/v1/content/" + \
                re.search('"oldUrl":"(.*?)"', r.text).group(1)
            params = {
                'parsebody': 'true',
                'lang': 'en'
            }
            r = await client.get(nurl, params=params)

            # Read the table with class "defaultstyle" into a DataFrame
            df = pd.read_html(r.content, attrs={'class': 'defaultstyle'})[0]
            print(df)


    if __name__ == "__main__":
        trio.run(main)

Output:

       Name               Comment                                            ...  Null Option  Is PK
    0  ACC_DATE_VAL_ID    A unique, system-generated key that identifies...  ...  Not Null     Yes
    1  INSTANCE_PK        The actual primary key of the row in the logic...  ...  Not Null     No
    2  ENTITY_NAME        The name of the business entity.                   ...  Not Null     No
    3  COL_NAME           The actual name of the column where the defaul...  ...  Null         No
    4  DESCRIPTION        A description of the record.                       ...  Null         No
    5  LAST_USED_DT       The date that this data was last used. There i...  ...  Null         No
    6  LAST_VERIFIED_DT   The date that this data was last verified. The...  ...  Null         No
    7  LAST_UPDATE_DT     When a record is added or updated, this field ...  ...  Not Null     No
    8  LAST_UPDATE_USER   The ID of the user who last updated the data.      ...  Null         No
    9  LAST_UPDATE_TX_ID  A unique, system-generated key that identifies...  ...  Null         No

    [10 rows x 5 columns]
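
The answer's code comes without commentary. As far as can be read from it, it works in two steps: it requests the normal topic URL, extracts an `"oldUrl"` value embedded in that response, and then requests `api/v1/content/<oldUrl>` under the same base URL to get HTML that contains the table. The sketch below isolates the extraction and URL-building steps so they can be tried offline. The helper names `extract_old_url` and `content_api_url` are hypothetical, and the sample fragment is made up for illustration; the real `oldUrl` value has to come from the live page.

```python
import re
from typing import Optional

BASE_URL = "https://www.ibm.com/docs"  # same base URL as in the answer


def extract_old_url(page_html: str) -> Optional[str]:
    """Return the value of the "oldUrl" key embedded in the page, or None."""
    match = re.search(r'"oldUrl":"(.*?)"', page_html)
    return match.group(1) if match else None


def content_api_url(old_url: str) -> str:
    """Build the URL that the answer requests in its second step."""
    return f"{BASE_URL}/api/v1/content/{old_url}"


# Made-up fragment for illustration only; not copied from the real page.
sample = '{"title":"x","oldUrl":"some/folder/example_topic.html","other":1}'

old_url = extract_old_url(sample)
print(old_url)
print(content_api_url(old_url))

# When the key is missing, the helper returns None instead of raising.
print(extract_old_url("<html>no such key</html>"))
```

Returning `None` when the key is absent lets a caller skip or log a page, whereas the answer's `re.search(...).group(1)` raises `AttributeError` in that case.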

Posted by huangapple on May 21, 2023, 19:07:45
Source: https://go.coder-hub.com/76299577.html