Different HTML results for the same page (Web Scraping)
Question
I'm a beginner in Python and currently working on a web scraping project where I need to extract data from tables on web pages and save it into a CSV file. The good news is that I've managed to create an algorithm that accomplishes this task successfully for most pages. However, sometimes the process gets aborted because the HTML structure of the page is different from what I expected.
This is one of the webpages: https://www.ibm.com/docs/en/imdm/12.0?topic=t-accessdateval
Here's an example of the expected HTML structure I can work with:
<!DOCTYPE html><html lang="en-us">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8">
<meta name="dcterms.rights" content="© Copyright IBM Corporation 2021">
<meta name="description" content="The ACCESSORENTITLE table provides the ability to associate many users and usergroups to an entitlement rule.">
<meta name="geo.country" content="ZZ">
<script>
digitalData = {
page: {
pageInfo: {
language: "en-us",
version: "v18",
ibm: {
country: "ZZ",
type: "CT701"
}
}
}
};
</script><!-- Licensed Materials - Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="../com.ibm.mdshs.common.doc/css/swg_info_common.css/ibmdita.css">
<link rel="stylesheet" type="text/css" href="../com.ibm.mdshs.common.doc/css/swg_info_common.css/../com.ibm.mdshs.common.doc/css/swg_info_common.css">
<link rel="Start" href="r_Tables.html">
<title>ACCESSORENTITLE</title>
</head>
<body id="r_accessorentitle_Table"><main role="main"><article role="article" aria-labelledby="d55790e10">
<h1 class="topictitle1" id="d55790e10">ACCESSORENTITLE</h1>
<div class="body refbody"><p class="shortdesc">The ACCESSORENTITLE table provides the ability to associate many users and usergroups to an entitlement rule.</p>
<div class="section">
<div class="p">This table is used by the following functional feature.<ul>
<li>
<a href="r_Rules_of_Visibility_SubjectArea.html">Rules of Visibility</a>
</li>
</ul>
</div>
<div class="tablenoborder"><table summary="" style="width: 100%" class="defaultstyle"><colgroup><col style="width:23.076923076923077%"><col style="width:34.61538461538461%"><col style="width:19.230769230769234%"><col style="width:15.384615384615385%"><col style="width:7.6923076923076925%"></colgroup><thead style="text-align:left;">
<tr>
<th id="d55790e52">Name</th>
<th id="d55790e55">Comment</th>
<th id="d55790e58">Datatype</th>
<th id="d55790e61">Null Option</th>
<th id="d55790e64">Is PK</th>
</tr>
</thead>
<tbody>
And here's an example of the unexpected HTML structure that causes issues:
<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<script>
// fill in DDO
digitalData = {
page: {
category: {
primaryCategory: 'ELSKCS', // e.g. SB03
},
pageInfo: {
effectiveDate: '', // e.g. 2014-11-19
expiryDate: '', // e.g. 2017-11-19
language: 'en-US', // e.g. en-US FIX
publishDate: '', // e.g. 2014-11-19
publisher: 'IBM Corporation', // e.g. IBM Corporation
version: 'Carbon for IBM.com', // e.g. dds.v1.0.0. NOTE: This is dynamically set by the IBM.com Library
ibm: {
contentDelivery: 'IBM Documentation', // e.g. ECM/Filegen
contentProducer: 'IBM Documentation 1.0', // e.g. ECM/IConS Adopter 34 - GS83J2343G3H3ERG - 11/19/2014 05:14:02 PM
country: 'US', // e.g. FIX
industry: 'ZZ', // e.g. B,U
owner: 'IBM Documentation/Raleigh/IBM', // e.g. Some Person/City/IBM
siteID: 'ESTKCS', // e.g. MySiteID
subject: '', // e.g. SW492
type: 'CT701', // e.g CT305
},
},
},
};
</script>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="ie=edge" http-equiv="X-UA-Compatible"/>
<title>
IBM Documentation
</title>
<meta charset="utf-8"/>
<link href="//www.ibm.com/favicon.ico" rel="icon"/>
<meta content="IBM, documentation" name="keywords"/>
<meta content="IBM Documentation." name="description"/>
<meta content="" name="dcterms.date"/>
<meta content="© Copyright IBM Corporation 2023" name="dcterms.rights"/>
<meta content="US" name="geo.country"/>
<meta content="index,follow" name="robots"/>
<meta content="" name="canonical"/>
<script src="//1.www.s81c.com/common/stats/ibm-common.js">
</script>
<link href="/docs/css/style.css" rel="stylesheet"/>
<script>
function convertUnicode(input) {
return input.replace(/\\u(\w\w\w\w)/g,function(a,b) {
var charcode = parseInt(b,16);
return String.fromCharCode(charcode);
});
}
var kcGlobals = {
translation: {
"common.error":"Sorry,%20we%20have%20an%20error",
"common.externalLinkTooltipText":"(Opens%20in%20a%20new%20tab%20or%20window)",
"common.yes":"Yes",
"common.no":"No",
"common.warning":"Warning",
"common.notFound":"We%20didn't%20find%20a%20matching%20topic%20in%20the%20product%20version%20you%20requested.%20Would%20you%20like%20to%20go%20to%20the%20$PRODUCT$%20homepage?",
"common.returnToDocs":"Open%20the%20Red%20Hat%20documentation%20in%20a%20new%20tab",
"common.externalDocumentation":"Viewing%20external%20documentation",
"common.externalDocumentation2":"Use%20this%20link%20to%20view%20OpenShift%20documentation%20on%20the%20Red%20Hat%20documentation%20site.",
"common.previous":"Previous",
"common.next":"Next",
"common.backToTopButton":"Back%20to%20top%20button",
"common.copyright":"%C2%A9%20Copyright%20IBM%20Corporation%202022,%202023",
"error.unexpectedErrorHeading":"An%20unexpected%20error%20occurred",
"error.sorryText1":"We're%20sorry!",
"error.sorryText2":"The%20requested%20page%20does%20not%20exist%20or%20might%20have%20moved.",
"error.sorryText3":"If%20you%20accessed%20this%20page%20by%20using%20a%20bookmark%20or%20external%20URL,%20the%20bookmark%20or%20links%20might%20need%20to%20be%20updated.%20Use%20IBM%20Documentation%20search%20to%20find%20the%20content's%20new%20location.",
"error.sorryText4":"In%20this%20case,%20use%20the%20table%20of%20contents%20or%20the%20search%20to%20find%20the%20content.",
"error.sorryText5":"If%20you%20accessed%20this%20page%20from%20the%20table%20of%20contents%20or%20a%20search,%20please%20report%20the%20broken%20link%20to%20%3Ca%20id=%22ibmdocs-mailto-link%22%20href=%22%22%3EIBM%20Documentation%20support%3C/a%3E%20who%20will%20alert%20the%20appropriate%20content%20group.",
"error.tabError":"Resource%20not%20found",
I am especially confused by the entries at the end, such as "error.sorryText4":"In%20[.....]
Unfortunately, I couldn't find any information on them, and they go on for about 500 lines.
I don't understand why the HTML sometimes looks like this. I get this result when I press CTRL+U on the page, and sometimes as the result from my code.
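The `%20` sequences in values such as `error.sorryText4` look like ordinary URL percent-encoding (`%20` is a space). A minimal sketch, assuming these are just user-interface strings from the `kcGlobals.translation` object shown above, decodes two of the quoted values with the standard library:

```python
from urllib.parse import unquote

# Two values copied from the kcGlobals.translation object shown above.
encoded = {
    "error.sorryText1": "We're%20sorry!",
    "error.sorryText4": (
        "In%20this%20case,%20use%20the%20table%20of%20contents"
        "%20or%20the%20search%20to%20find%20the%20content."
    ),
}

# unquote() turns %XX escapes back into characters (%20 -> space).
decoded = {key: unquote(value) for key, value in encoded.items()}

for key, text in decoded.items():
    print(f"{key}: {text}")
```

Decoded, these are generic interface messages rather than table data, which would be consistent with the response being a page shell whose topic content is loaded separately.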
This is my Python code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Read the CSV file containing the identifiers
df_identifiers = pd.read_csv('identifiers.csv')

# Collect one DataFrame per page; they are combined after the loop
frames = []

# Iterate over the identifiers and process each URL
for index, row in df_identifiers.iterrows():
    # Construct the URL using the identifier from the CSV file
    identifier = row['Identifier']
    url = f"https://www.ibm.com/docs/en/imdm/12.0?topic=tables-{identifier}"

    # Send a GET request to the URL
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    # Extract the desired data from the HTML
    # (each find() returns None if the element is missing from the page)
    Table = soup.find("h1", class_="topictitle1").get_text(strip=True).strip()
    description = soup.find('p', class_='shortdesc').get_text(strip=True)
    div_element = soup.find('div', class_='p')
    a_elements = div_element.find_all('a')
    feature_list = [a.get_text(strip=True) for a in a_elements]
    table = soup.find("table")
    headers = [header.get_text(strip=True) for header in table.select("th")]
    data_rows = table.select("tbody tr")
    data = [[td.get_text(strip=True) for td in row.select("td")] for row in data_rows]

    # Create a DataFrame for the current URL's data
    df = pd.DataFrame(data, columns=headers)
    df["Description"] = description
    for i, feature in enumerate(feature_list):
        df[f"Feature_{i+1}"] = feature
    df.insert(0, "Table", Table)

    # Keep the current DataFrame for the final concatenation
    frames.append(df)

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df_combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Save the combined DataFrame to a CSV file
df_combined.to_csv('combined_table_data.csv', index=False)
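The `soup.find(...)` calls in the loop return `None` when an element is absent, so the following `.get_text()` raises `AttributeError` on the second kind of page. A minimal sketch of a guard, assuming the `h1` with class `topictitle1` is a reliable marker of real topic content. The helper name `has_topic_title` is hypothetical, and the sketch uses only the standard library so it runs without the scraper's dependencies:

```python
from html.parser import HTMLParser


class _TitleFinder(HTMLParser):
    """Records whether an <h1> tag carrying the class topictitle1 was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "topictitle1" in classes:
            self.found = True


def has_topic_title(html: str) -> bool:
    """Return True if the HTML contains the expected topic heading."""
    finder = _TitleFinder()
    finder.feed(html)
    return finder.found


# Shortened stand-ins for the two page variants shown in the question.
expected = '<body><h1 class="topictitle1" id="d55790e10">ACCESSORENTITLE</h1></body>'
shell = "<html><head><title>IBM Documentation</title></head></html>"

print(has_topic_title(expected))  # True
print(has_topic_title(shell))     # False
```

Inside the loop, a check like this (or simply testing whether `soup.find(...)` returned `None`) would let the script skip or log a page instead of aborting.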
Answer 1
Score: 1
import httpx
import trio
import re
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0'
}


async def main():
    async with httpx.AsyncClient(headers=headers, base_url='https://www.ibm.com/docs') as client:
        # Request the page shell for the topic
        params = {
            'topic': 't-accessdateval'
        }
        r = await client.get('en/imdm/12.0', params=params)
        # The shell embeds the path of the real content as "oldUrl"
        nurl = "api/v1/content/" + \
            re.search('"oldUrl":"(.*?)"', r.text).group(1)
        # Fetch the topic body from the content API
        params = {
            'parsebody': 'true',
            'lang': 'en'
        }
        r = await client.get(nurl, params=params)
        # Parse the table with class "defaultstyle" into a DataFrame
        df = pd.read_html(r.content, attrs={'class': 'defaultstyle'})[0]
        print(df)

if __name__ == "__main__":
    trio.run(main)
Output:
Name Comment ... Null Option Is PK
0 ACC_DATE_VAL_ID A unique, system-generated key that identifies... ... Not Null Yes
1 INSTANCE_PK The actual primary key of the row in the logic... ... Not Null No
2 ENTITY_NAME The name of the business entity. ... Not Null No
3 COL_NAME The actual name of the column where the defaul... ... Null No
4 DESCRIPTION A description of the record. ... Null No
5 LAST_USED_DT The date that this data was last used. There i... ... Null No
6 LAST_VERIFIED_DT The date that this data was last verified. The... ... Null No
7 LAST_UPDATE_DT When a record is added or updated, this field ... ... Not Null No
8 LAST_UPDATE_USER The ID of the user who last updated the data. ... Null No
9 LAST_UPDATE_TX_ID A unique, system-generated key that identifies... ... Null No
[10 rows x 5 columns]
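The first step of the code above pulls an `oldUrl` value out of the page shell with a regular expression and uses it to build the `api/v1/content/` path. A minimal offline sketch of just that step, assuming the shell embeds a JSON-like `"oldUrl":"..."` pair. The sample string and the helper name `content_api_path` are made up for illustration, and returning `None` instead of raising is a choice made here, not something from the original answer:

```python
import re
from typing import Optional

# Same pattern as in the answer: capture the value of the "oldUrl" key.
OLD_URL_RE = re.compile(r'"oldUrl":"(.*?)"')


def content_api_path(shell_html: str) -> Optional[str]:
    """Build the content API path from a page shell, or None if absent."""
    match = OLD_URL_RE.search(shell_html)
    if match is None:
        return None
    return "api/v1/content/" + match.group(1)


# Hypothetical shell fragment; the real value comes from the live page.
sample = '<script>var page = {"oldUrl":"SOME/PATH/topic.html","lang":"en"};</script>'

print(content_api_path(sample))
print(content_api_path("<html></html>"))
```

Checking for `None` before the second request avoids the `AttributeError` that `re.search(...).group(1)` raises when a response does not contain the key.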