Web scraping a table with Python just returns an empty list
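Question
I'm trying to use Python and BeautifulSoup to scrape all the data from the table on this website, across all of its pages, into a dictionary, as shown in the code below. However, it only returns an empty list.
I'm also trying to scrape the data for each company, each of which has its own separate page, into the same dictionary.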
from bs4 import BeautifulSoup
import requests
from pprint import pprint
case_data = []
case_url = 'https://www.dataquest.io'
case_page = requests.get(case_url)
soup_case = BeautifulSoup(case_page.content, 'html.parser')
case_table = soup_case.find('div',{'class':'slds-table slds-table--bordered slds-max-medium-table_stacked cCaseList'})
pprint(case_table)
Answer 1
Score: 0
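from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run Firefox without opening a window.
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://masked_per_user_request/")
    # Crude fixed wait for the JavaScript-rendered table; increase it if the table is missing.
    time.sleep(2)
    # read_html returns one DataFrame per <table>; take the first one.
    # Wrapping the HTML in StringIO avoids the literal-string deprecation in recent pandas.
    df = pd.read_html(StringIO(driver.page_source))[0]
    df.to_csv('result.csv', index=False)
finally:
    # Always close the browser, even if parsing fails.
    driver.quit()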
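A quick way to see why the plain requests + BeautifulSoup attempt finds nothing is to check whether the table's CSS class appears in the HTML at all before JavaScript runs. The sketch below uses only the standard library; the two HTML strings are hypothetical stand-ins (one for what the server might send, one for the DOM after rendering), not captures from the real site.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        for name, value in attrs:
            if name == "class" and value and self.wanted_class in value.split():
                self.found = True


def has_class(html, wanted_class):
    """Return True if any element in html has wanted_class."""
    finder = ClassFinder(wanted_class)
    finder.feed(html)
    return finder.found


# Hypothetical static response: an empty mount point, no table yet.
static_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# Hypothetical DOM after JavaScript has filled in the table.
rendered_html = '<html><body><div id="app"><table class="slds-table cCaseList"></table></div></body></html>'

print(has_class(static_html, "slds-table"))    # False
print(has_class(rendered_html, "slds-table"))  # True
```

Running the same check on `case_page.text` from the question and on `driver.page_source` from the Selenium session should show which of the two actually contains the table.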
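Note that the data is rendered from a JSON back end via an XHR request, so you may be able to fetch it directly with a POST request that includes the JSON body data and the cookies.
Something like the following: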
import requests
data = {
'message': '{"actions":[{"id":"108;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.richText.RichTextController/ACTION$getParsedRichTextValue","callingDescriptor":"UNKNOWN","params":{"html":"<p style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The RSPO aspires to ensure transparency throughout the complaints process and the reporting thereof. Decisions not to disclose information through the RSPO website require motivation on genuine grounds that disclosure will go against the interest of the complaints process and/or may jeopardize the well-being or safety of stakeholders involved, and that non-disclosure does not undermine adherence to the principles and objectives of RSPO:</span></p><p><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The non-disclosed information relates to a legitimate aim, i.e. peaceful and constructive resolution of complaints in accordance with RSPO objectives and P&amp;C;</span></li></ul><p><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The disclosure of said information threatens harm to that aim; and</span></li></ul><p style=\\"text-align: justify;\\"><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The harm to the aim is greater than the public interest in having the information disclosed.</span></li></ul><p>&nbsp;</p>"},"version":"47.0","storable":true},{"id":"88;a","descriptor":"apex://ComplaintsCaseController/ACTION$searchCaseList","callingDescriptor":"markup://c:CaseList","params":{"searchString":"","pageNumber":1,"defaultPageSize":"10"}},{"id":"111;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.controller.HeadlineController/ACTION$getInitData","callingDescriptor":"UNKNOWN","params":{"uniqueNameOrId":"","pageType":""},"version":"47.0","storable":true}]}',
'aura.context': '{"mode":"PROD","fwuid":"5fuxCiO1mNHGdvJphU5ELQ","app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"0luQG4JZE_TU28tAfQgGSA"},"dn":[],"globals":{},"uad":false}',
'aura.pageURI': '/Complaint/s/casetracker',
'aura.token': 'undefined'
}
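# The fields above look like form fields captured from the browser's XHR, so they are
# sent form-encoded (data=) instead of as a JSON body (json=); switch back if the endpoint differs.
r = requests.post("https://masked_per_user_request/", data=data).json()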
print(r)
You will need to work out which cookies the request requires.
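Since the question asks for every page, the captured request above can be reduced to the one action that carries a `pageNumber` parameter and rebuilt per page. This is only a sketch under several assumptions: that the server accepts the `searchCaseList` action on its own, that the fields are sent form-encoded, and that the `fwuid` and `loaded` values copied from the capture are still valid. `build_payload` and `fetch_pages` are hypothetical helper names, and the number of pages and the shape of the response are not given in the answer, so both are left to the caller.

```python
import json

# Copied from the captured request; these values are tied to one site release
# and will probably need refreshing from the browser's network tab.
AURA_CONTEXT = {
    "mode": "PROD",
    "fwuid": "5fuxCiO1mNHGdvJphU5ELQ",
    "app": "siteforce:communityApp",
    "loaded": {"APPLICATION@markup://siteforce:communityApp": "0luQG4JZE_TU28tAfQgGSA"},
    "dn": [],
    "globals": {},
    "uad": False,
}


def build_payload(page_number, page_size=10):
    """Return the form fields for one page of the searchCaseList action."""
    action = {
        "id": "88;a",
        "descriptor": "apex://ComplaintsCaseController/ACTION$searchCaseList",
        "callingDescriptor": "markup://c:CaseList",
        "params": {
            "searchString": "",
            "pageNumber": page_number,
            "defaultPageSize": str(page_size),
        },
    }
    return {
        "message": json.dumps({"actions": [action]}),
        "aura.context": json.dumps(AURA_CONTEXT),
        "aura.pageURI": "/Complaint/s/casetracker",
        "aura.token": "undefined",
    }


def fetch_pages(url, last_page, cookies=None):
    """Yield the decoded JSON response for pages 1..last_page (assumed known)."""
    import requests  # third-party; imported here so build_payload stays stdlib-only

    with requests.Session() as session:
        for page in range(1, last_page + 1):
            response = session.post(url, data=build_payload(page), cookies=cookies)
            response.raise_for_status()
            yield response.json()


payload = build_payload(3)
print(json.loads(payload["message"])["actions"][0]["params"]["pageNumber"])  # 3
```

Calling `fetch_pages` with the XHR URL from the browser's network tab and the cookies from a live session would then give one response per page to inspect and flatten into rows.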