Web scraping a table with Python just returns an empty list

Question


I'm trying to scrape all the data from the table on this website, across all of its pages, into a dictionary using Python and BeautifulSoup, as shown in the code below. However, this just returns an empty list.

I'm also trying to scrape the separate page that each company has into that same dictionary.

from bs4 import BeautifulSoup
import requests 
from pprint import pprint

case_data = []

case_url = 'https://www.dataquest.io'
case_page = requests.get(case_url) 
soup_case = BeautifulSoup(case_page.content, 'html.parser') 
case_table = soup_case.find('div',{'class':'slds-table slds-table--bordered slds-max-medium-table_stacked cCaseList'})

pprint(case_table)

Answer 1

Score: 0

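from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run Firefox without opening a browser window.
options = Options()
options.add_argument('--headless')

driver = webdriver.Firefox(options=options)
driver.get("https://masked_per_user_request/")
# Give the page's JavaScript time to fetch and render the table.
time.sleep(2)

# Parse the rendered HTML and take the first table on the page.
# StringIO avoids the deprecation warning that newer pandas versions
# raise when read_html is given a literal HTML string.
df = pd.read_html(StringIO(driver.page_source))[0]

df.to_csv('result.csv', index=False)

driver.quit()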

Output: click here
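Since the question asks for the data in a dictionary rather than a CSV file, one option is to read result.csv back with the standard library. This is only a minimal sketch: `load_rows` is a hypothetical helper name that is not part of the original answer, and it assumes the first row of the CSV holds the column headers, which is what `to_csv` writes by default.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling `load_rows('result.csv')` after the script above would give one dictionary per table row, which can then be extended with whatever is scraped from each company's page.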

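Note that the table is rendered from a JSON back end via an XHR request, which is why the HTML returned by requests.get does not contain it. You may be able to call that XHR URL directly with a POST request that includes the JSON body data and the cookies.

Something like the following: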

import requests

data = {
    'message': '{"actions":[{"id":"108;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.richText.RichTextController/ACTION$getParsedRichTextValue","callingDescriptor":"UNKNOWN","params":{"html":"<p style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The RSPO aspires to ensure transparency throughout the complaints process and the reporting thereof. Decisions not to disclose information through the RSPO website require motivation on genuine grounds that disclosure will go against the interest of the complaints process and/or may jeopardize the well-being or safety of stakeholders involved, and that non-disclosure does not undermine adherence to the principles and objectives of RSPO:</span></p><p><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The non-disclosed information relates to a legitimate aim, i.e. peaceful and constructive resolution of complaints in accordance with RSPO objectives and P&amp;amp;C;</span></li></ul><p><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The disclosure of said information threatens harm to that aim; and</span></li></ul><p style=\\"text-align: justify;\\"><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The harm to the aim is greater than the public interest in having the information disclosed.</span></li></ul><p>&amp;nbsp;</p>"},"version":"47.0","storable":true},{"id":"88;a","descriptor":"apex://ComplaintsCaseController/ACTION$searchCaseList","callingDescriptor":"markup://c:CaseList","params":{"searchString":"","pageNumber":1,"defaultPageSize":"10"}},{"id":"111;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.controller.HeadlineController/ACTION$getInitData","callingDescriptor":"UNKNOWN","params":{"uniqueNameOrId":"","pageType":""},"version":"47.0","storable":true}]}',
    'aura.context': '{"mode":"PROD","fwuid":"5fuxCiO1mNHGdvJphU5ELQ","app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"0luQG4JZE_TU28tAfQgGSA"},"dn":[],"globals":{},"uad":false}',
    'aura.pageURI': '/Complaint/s/casetracker',
    'aura.token': 'undefined'
}

r = requests.post("https://masked_per_user_request/", json=data).json()

print(r)

You will need to figure out the cookie parameters.
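To cover every page, as the question asks, the payload above could be generated per page instead of being hard-coded. The sketch below is untested against the site and rests on assumptions: it keeps only the searchCaseList action from the payload above and varies its pageNumber, and `build_form` is a hypothetical helper name. Whether the server accepts a request with just this one action, and whether the fwuid value is still valid, would need to be checked in the browser's network tab.

```python
import json

# Values copied from the payload in the answer above.
AURA_CONTEXT = (
    '{"mode":"PROD","fwuid":"5fuxCiO1mNHGdvJphU5ELQ",'
    '"app":"siteforce:communityApp",'
    '"loaded":{"APPLICATION@markup://siteforce:communityApp":"0luQG4JZE_TU28tAfQgGSA"},'
    '"dn":[],"globals":{},"uad":false}'
)


def build_form(page_number, page_size=10):
    """Build the request fields for one page of the case list (hypothetical helper)."""
    action = {
        "id": "88;a",
        "descriptor": "apex://ComplaintsCaseController/ACTION$searchCaseList",
        "callingDescriptor": "markup://c:CaseList",
        "params": {
            "searchString": "",
            "pageNumber": page_number,
            # The original payload sends the page size as a string.
            "defaultPageSize": str(page_size),
        },
    }
    return {
        "message": json.dumps({"actions": [action]}),
        "aura.context": AURA_CONTEXT,
        "aura.pageURI": "/Complaint/s/casetracker",
        "aura.token": "undefined",
    }
```

Each page's fields could then be posted in a loop, for example `requests.post(url, json=build_form(n), cookies=...)`. It is worth comparing with the browser's request to see whether the endpoint expects these fields form-encoded (`data=`) rather than as JSON (`json=`).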


huangapple
  • Posted on January 6, 2020 at 19:59:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59611727.html