Different HTML results for the same page (Web Scraping)

Question

I'm a beginner in Python and currently working on a web scraping project where I need to extract data from tables on web pages and save it into a CSV file. The good news is that I've managed to create an algorithm that accomplishes this task successfully for most pages. However, sometimes the process gets aborted because the HTML structure of the page is different from what I expected.

This is one of the webpages: https://www.ibm.com/docs/en/imdm/12.0?topic=t-accessdateval

Here's an example of the expected HTML structure I can work with:

<!DOCTYPE html><html lang="en-us">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8">

<meta name="dcterms.rights" content="© Copyright IBM Corporation 2021">



<meta name="description" content="The ACCESSORENTITLE table provides the ability to associate many users and usergroups to an entitlement rule.">

<meta name="geo.country" content="ZZ">
<script>
    digitalData = {
      page: {
        pageInfo: {
  language: "en-us",

  version: "v18",
  ibm: {
  country: "ZZ",
  type: "CT701"
  
         }
       }
     }
   };
  </script><!-- Licensed Materials - Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="../com.ibm.mdshs.common.doc/css/swg_info_common.css/ibmdita.css">
<link rel="stylesheet" type="text/css" href="../com.ibm.mdshs.common.doc/css/swg_info_common.css/../com.ibm.mdshs.common.doc/css/swg_info_common.css">
<link rel="Start" href="r_Tables.html">
<title>ACCESSORENTITLE</title>
</head>
<body id="r_accessorentitle_Table"><main role="main"><article role="article" aria-labelledby="d55790e10">
    <h1 class="topictitle1" id="d55790e10">ACCESSORENTITLE</h1>

    
    <div class="body refbody"><p class="shortdesc">The ACCESSORENTITLE table provides the ability to associate many users and usergroups to an entitlement rule.</p>

        <div class="section">
            <div class="p">This table is used by the following functional feature.<ul>
                    <li>
                        <a href="r_Rules_of_Visibility_SubjectArea.html">Rules of Visibility</a>
                    </li>

                </ul>

            </div>

            
<div class="tablenoborder"><table summary="" style="width: 100%" class="defaultstyle"><colgroup><col style="width:23.076923076923077%"><col style="width:34.61538461538461%"><col style="width:19.230769230769234%"><col style="width:15.384615384615385%"><col style="width:7.6923076923076925%"></colgroup><thead style="text-align:left;">
                        <tr>
                            <th id="d55790e52">Name</th>

                            <th id="d55790e55">Comment</th>

                            <th id="d55790e58">Datatype</th>

                            <th id="d55790e61">Null Option</th>

                            <th id="d55790e64">Is PK</th>

                        </tr>

                    </thead>
<tbody>

And here's an example of the unexpected HTML structure that causes issues:

<!DOCTYPE html>
<html dir="ltr" lang="en-US">
 <head>
  <script>
   // fill in DDO
      digitalData = {
        page: {
          category: {
            primaryCategory: 'ELSKCS', // e.g. SB03
          },
          pageInfo: {
            effectiveDate: '', // e.g. 2014-11-19
            expiryDate: '', // e.g. 2017-11-19
            language: 'en-US', // e.g. en-US FIX
            publishDate: '', // e.g. 2014-11-19
            publisher: 'IBM Corporation', // e.g. IBM Corporation
            version: 'Carbon for IBM.com', // e.g. dds.v1.0.0. NOTE: This is dynamically set by the IBM.com Library
            ibm: {
              contentDelivery: 'IBM Documentation', // e.g. ECM/Filegen
              contentProducer: 'IBM Documentation 1.0', // e.g. ECM/IConS Adopter 34 - GS83J2343G3H3ERG - 11/19/2014 05:14:02 PM
              country: 'US', // e.g. FIX
              industry: 'ZZ', // e.g. B,U
              owner: 'IBM Documentation/Raleigh/IBM', // e.g. Some Person/City/IBM
              siteID: 'ESTKCS', // e.g. MySiteID
              subject: '', // e.g. SW492
              type: 'CT701', // e.g CT305
            },
          },
        },
      };
  </script>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="ie=edge" http-equiv="X-UA-Compatible"/>
  <title>
   IBM Documentation
  </title>
  <meta charset="utf-8"/>
  <link href="//www.ibm.com/favicon.ico" rel="icon"/>
  <meta content="IBM, documentation" name="keywords"/>
  <meta content="IBM Documentation." name="description"/>
  <meta content="" name="dcterms.date"/>
  <meta content="© Copyright IBM Corporation 2023" name="dcterms.rights"/>
  <meta content="US" name="geo.country"/>
  <meta content="index,follow" name="robots"/>
  <meta content="" name="canonical"/>
  <script src="//1.www.s81c.com/common/stats/ibm-common.js">
  </script>
  <link href="/docs/css/style.css" rel="stylesheet"/>
  <script>
   function convertUnicode(input) {
          return input.replace(/\\u(\w\w\w\w)/g,function(a,b) {
          var charcode = parseInt(b,16);
          return String.fromCharCode(charcode);
          });
        }
        var kcGlobals = {
          translation: {
			"common.error":"Sorry,%20we%20have%20an%20error",
			"common.externalLinkTooltipText":"(Opens%20in%20a%20new%20tab%20or%20window)",
			"common.yes":"Yes",
			"common.no":"No",
			"common.warning":"Warning",
			"common.notFound":"We%20didn't%20find%20a%20matching%20topic%20in%20the%20product%20version%20you%20requested.%20Would%20you%20like%20to%20go%20to%20the%20$PRODUCT$%20homepage?",
			"common.returnToDocs":"Open%20the%20Red%20Hat%20documentation%20in%20a%20new%20tab",
			"common.externalDocumentation":"Viewing%20external%20documentation",
			"common.externalDocumentation2":"Use%20this%20link%20to%20view%20OpenShift%20documentation%20on%20the%20Red%20Hat%20documentation%20site.",
			"common.previous":"Previous",
			"common.next":"Next",
			"common.backToTopButton":"Back%20to%20top%20button",
			"common.copyright":"%C2%A9%20Copyright%20IBM%20Corporation%202022,%202023",
			"error.unexpectedErrorHeading":"An%20unexpected%20error%20occurred",
			"error.sorryText1":"We're%20sorry!",
			"error.sorryText2":"The%20requested%20page%20does%20not%20exist%20or%20might%20have%20moved.",
			"error.sorryText3":"If%20you%20accessed%20this%20page%20by%20using%20a%20bookmark%20or%20external%20URL,%20the%20bookmark%20or%20links%20might%20need%20to%20be%20updated.%20Use%20IBM%20Documentation%20search%20to%20find%20the%20content's%20new%20location.",
			"error.sorryText4":"In%20this%20case,%20use%20the%20table%20of%20contents%20or%20the%20search%20to%20find%20the%20content.",
			"error.sorryText5":"If%20you%20accessed%20this%20page%20from%20the%20table%20of%20contents%20or%20a%20search,%20please%20report%20the%20broken%20link%20to%20%3Ca%20id=%22ibmdocs-mailto-link%22%20href=%22%22%3EIBM%20Documentation%20support%3C/a%3E%20who%20will%20alert%20the%20appropriate%20content%20group.",
			"error.tabError":"Resource%20not%20found",

I am especially confused by the entries near the end, such as "error.sorryText4":"In%20[.....]

I couldn't find any information on these, and they go on for about 500 lines.

I don't understand why the HTML sometimes looks like this. I get this result when I press CTRL+U on the page, and sometimes as the response in my code.
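
The `%20` sequences in those entries look like ordinary URL percent-encoding (`%20` is a space), and the key names (`common.*`, `error.*`) suggest they are user-interface messages rather than table data. A minimal sketch, using two values copied from the `kcGlobals.translation` snippet above, that decodes them with the standard library:

```python
from urllib.parse import unquote

# Two entries copied verbatim from the kcGlobals.translation object shown above.
translation = {
    "error.sorryText1": "We're%20sorry!",
    "error.sorryText4": "In%20this%20case,%20use%20the%20table%20of%20contents%20or%20the%20search%20to%20find%20the%20content.",
}

# unquote() turns each %XX escape back into the character it encodes.
decoded = {key: unquote(value) for key, value in translation.items()}

for key, text in decoded.items():
    print(f"{key}: {text}")
```

Decoded this way, the entries read as plain English messages, which is consistent with them being text for the page's own interface and not part of the documented table.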

This is my Python code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Read the CSV file containing the identifiers
df_identifiers = pd.read_csv('identifiers.csv')

# Create an empty DataFrame to store the combined results
df_combined = pd.DataFrame()

# Iterate over the identifiers and process each URL
for index, row in df_identifiers.iterrows():
    # Construct the URL using the identifier from the CSV file
    identifier = row['Identifier']
    url = f"https://www.ibm.com/docs/en/imdm/12.0?topic=tables-{identifier}"
    
    # Send a GET request to the URL
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    
    # Extract the desired data from the HTML
    Table = soup.find("h1", class_="topictitle1").get_text(strip=True).strip()
    description = soup.find('p', class_='shortdesc').get_text(strip=True)
    div_element = soup.find('div', class_='p')
    a_elements = div_element.find_all('a')
    feature_list = [a.get_text(strip=True) for a in a_elements]
    table = soup.find("table")
    headers = [header.get_text(strip=True) for header in table.select("th")]
    data_rows = table.select("tbody tr")
    data = [[td.get_text(strip=True) for td in row.select("td")] for row in data_rows]
    
    # Create a DataFrame for the current URL's data
    df = pd.DataFrame(data, columns=headers)
    df["Description"] = description
    for i, feature in enumerate(feature_list):
        df[f"Feature_{i+1}"] = feature
    df.insert(0, "Table", Table)
    
    # Append the current DataFrame to the combined DataFrame
    # (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
    df_combined = pd.concat([df_combined, df], ignore_index=True)

# Save the combined DataFrame to a CSV file
df_combined.to_csv('combined_table_data.csv', index=False)

Answer 1

Score: 1

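import io
import re

import httpx
import pandas as pd
import trio

# Browser-like User-Agent sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0'
}


async def main():
    async with httpx.AsyncClient(headers=headers, base_url='https://www.ibm.com/docs') as client:
        # 1. Request the regular documentation URL for the topic
        params = {
            'topic': 't-accessdateval'
        }
        r = await client.get('en/imdm/12.0', params=params)
        r.raise_for_status()

        # 2. The returned page embeds an "oldUrl" value; use it to build the content API path
        match = re.search(r'"oldUrl":"(.*?)"', r.text)
        if match is None:
            raise ValueError('"oldUrl" not found in the page source')
        nurl = "api/v1/content/" + match.group(1)

        # 3. Fetch the topic body from the content API
        params = {
            'parsebody': 'true',
            'lang': 'en'
        }
        r = await client.get(nurl, params=params)
        r.raise_for_status()

        # 4. Parse the table with class "defaultstyle" into a DataFrame
        #    (BytesIO hands pandas a file-like object instead of raw bytes)
        df = pd.read_html(io.BytesIO(r.content), attrs={'class': 'defaultstyle'})[0]
        print(df)


if __name__ == "__main__":
    trio.run(main)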

Output:

                Name                                            Comment  ... Null Option Is PK
0    ACC_DATE_VAL_ID  A unique, system-generated key that identifies...  ...    Not Null   Yes
1        INSTANCE_PK  The actual primary key of the row in the logic...  ...    Not Null    No
2        ENTITY_NAME                   The name of the business entity.  ...    Not Null    No
3           COL_NAME  The actual name of the column where the defaul...  ...        Null    No
4        DESCRIPTION                       A description of the record.  ...        Null    No
5       LAST_USED_DT  The date that this data was last used. There i...  ...        Null    No
6   LAST_VERIFIED_DT  The date that this data was last verified. The...  ...        Null    No
7     LAST_UPDATE_DT  When a record is added or updated, this field ...  ...    Not Null    No
8   LAST_UPDATE_USER      The ID of the user who last updated the data.  ...        Null    No
9  LAST_UPDATE_TX_ID  A unique, system-generated key that identifies...  ...        Null    No

[10 rows x 5 columns]
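
The two steps in the code above (spot the script-driven page, then derive the content API URL from it) can be separated from the network calls so they are easier to reuse for other topics. This is a minimal sketch, assuming, as the code above does, that the page embeds an `"oldUrl":"..."` entry and that the topic body is served under `api/v1/content/`. The helper names `is_shell_page` and `content_api_url` and the sample `oldUrl` value are hypothetical, and the `kcGlobals` check is only a heuristic based on the HTML quoted in the question.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://www.ibm.com/docs"  # same base URL as in the answer's client


def is_shell_page(html):
    """Heuristic: the script-driven page defines kcGlobals and has no topic heading."""
    return "kcGlobals" in html and 'class="topictitle1"' not in html


def content_api_url(shell_html):
    """Build the content API URL from the embedded "oldUrl" value, or return None."""
    match = re.search(r'"oldUrl":"(.*?)"', shell_html)
    if match is None:
        return None
    query = urlencode({"parsebody": "true", "lang": "en"})
    return f"{BASE_URL}/api/v1/content/{match.group(1)}?{query}"


# Hypothetical stand-in for the page source; the real oldUrl value will differ.
sample_shell = '<script>var kcGlobals = {"oldUrl":"SOME/PATH/topic.html"};</script>'
sample_static = '<h1 class="topictitle1">ACCESSORENTITLE</h1>'

print(is_shell_page(sample_shell))
print(is_shell_page(sample_static))
print(content_api_url(sample_shell))
```

In a loop over many identifiers, a check like this would let the scraper parse the response directly when the static HTML arrives and fall back to the content API only when it does not.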

Posted by huangapple on 2023-05-21 19:07:45
Link: https://go.coder-hub.com/76299577.html