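Question

I'm trying to use Python and BeautifulSoup to scrape all the data from the table on this website, across all of its pages, and store it in a dictionary, as shown in the code below. However, it only returns an empty list.

I am also trying to scrape the data for each company, each of which has its own separate page, into the same dictionary.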

from bs4 import BeautifulSoup
import requests 
from pprint import pprint

case_data = []

case_url = 'https://www.dataquest.io'
case_page = requests.get(case_url) 
soup_case = BeautifulSoup(case_page.content, 'html.parser') 
case_table = soup_case.find('div',{'class':'slds-table slds-table--bordered slds-max-medium-table_stacked cCaseList'})

pprint(case_table)

Answer 1

Score: 0

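from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run Firefox without opening a window.
options = Options()
options.add_argument('--headless')

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://masked_per_user_request/")
    # Give the page's JavaScript time to render the table.
    time.sleep(2)

    # Parse the first <table> in the rendered HTML into a DataFrame.
    # StringIO avoids the deprecation warning newer pandas versions emit for raw HTML strings.
    df = pd.read_html(StringIO(driver.page_source))[0]
finally:
    # Close the browser even if loading or parsing fails.
    driver.quit()

df.to_csv('result.csv', index=False)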

Output: click here
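Since the question asks for the rows in a dictionary rather than a CSV file, one possible follow-up is to read result.csv back with the standard library. This is only a sketch: `load_cases` is a hypothetical helper, and the sample file written here uses invented column names and an invented row, because the real table's columns are not shown in this thread.

```python
import csv
import os
import tempfile


def load_cases(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in for result.csv: the column names and the row are invented for illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "result.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Case Number", "Status"])
    writer.writerow(["0001", "Closed"])

case_data = load_cases(sample_path)
print(case_data)
```

With the real file, `load_cases('result.csv')` would give one dict per table row, which matches the `case_data = []` list the question starts from.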

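Note that the data is rendered through an XHR request to a JSON back-end, so you may be able to call that endpoint directly with a POST request that includes the JSON body data and cookies.

Something like the following: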

import requests

data = {
    'message': '{"actions":[{"id":"108;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.richText.RichTextController/ACTION$getParsedRichTextValue","callingDescriptor":"UNKNOWN","params":{"html":"<p style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The RSPO aspires to ensure transparency throughout the complaints process and the reporting thereof. Decisions not to disclose information through the RSPO website require motivation on genuine grounds that disclosure will go against the interest of the complaints process and/or may jeopardize the well-being or safety of stakeholders involved, and that non-disclosure does not undermine adherence to the principles and objectives of RSPO:</span></p><p><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The non-disclosed information relates to a legitimate aim, i.e. peaceful and constructive resolution of complaints in accordance with RSPO objectives and P&amp;amp;C;</span></li></ul><p><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The disclosure of said information threatens harm to that aim; and</span></li></ul><p style=\\"text-align: justify;\\"><br></p><ul><li style=\\"text-align: justify;\\"><span style=\\"font-size: 14px;\\">The harm to the aim is greater than the public interest in having the information disclosed.</span></li></ul><p>&amp;nbsp;</p>"},"version":"47.0","storable":true},{"id":"88;a","descriptor":"apex://ComplaintsCaseController/ACTION$searchCaseList","callingDescriptor":"markup://c:CaseList","params":{"searchString":"","pageNumber":1,"defaultPageSize":"10"}},{"id":"111;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.controller.HeadlineController/ACTION$getInitData","callingDescriptor":"UNKNOWN","params":{"uniqueNameOrId":"","pageType":""},"version":"47.0","storable":true}]}',
    'aura.context': '{"mode":"PROD","fwuid":"5fuxCiO1mNHGdvJphU5ELQ","app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"0luQG4JZE_TU28tAfQgGSA"},"dn":[],"globals":{},"uad":false}',
    'aura.pageURI': '/Complaint/s/casetracker',
    'aura.token': 'undefined'
}

r = requests.post("https://masked_per_user_request/", json=data).json()

print(r)

You will need to work out the cookie parameters.
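Because the captured request carries a `pageNumber` parameter, the "all pages" part of the question could in principle be handled by rebuilding the same fields once per page. The sketch below only constructs the payloads and sends nothing. `build_case_list_payload` is a hypothetical helper, the context values are copied from the request above and may be stale, and sending only the `searchCaseList` action (without the other two actions in the captured message) is an assumption that has not been tested against the site.

```python
import json

# Values copied from the captured request; the fwuid and "loaded" hashes are
# tied to a site release and may no longer be valid.
AURA_CONTEXT = {
    "mode": "PROD",
    "fwuid": "5fuxCiO1mNHGdvJphU5ELQ",
    "app": "siteforce:communityApp",
    "loaded": {
        "APPLICATION@markup://siteforce:communityApp": "0luQG4JZE_TU28tAfQgGSA"
    },
    "dn": [],
    "globals": {},
    "uad": False,
}


def build_case_list_payload(page_number, page_size=10):
    """Return the POST fields for one page of the case list (hypothetical helper)."""
    action = {
        "id": "88;a",
        "descriptor": "apex://ComplaintsCaseController/ACTION$searchCaseList",
        "callingDescriptor": "markup://c:CaseList",
        "params": {
            "searchString": "",
            "pageNumber": page_number,
            # The captured request sends the page size as a string.
            "defaultPageSize": str(page_size),
        },
    }
    return {
        "message": json.dumps({"actions": [action]}),
        "aura.context": json.dumps(AURA_CONTEXT),
        "aura.pageURI": "/Complaint/s/casetracker",
        "aura.token": "undefined",
    }


# Build payloads for the first three pages; the real page count is not known here.
payloads = [build_case_list_payload(n) for n in range(1, 4)]
print(json.loads(payloads[1]["message"])["actions"][0]["params"])
```

Each payload could then be sent with the same `requests.post` call shown above. A `requests.Session` that first loads the tracker page will carry any cookies the server sets into later requests, though whether that is enough for this site is not confirmed here.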

