How does one "click" a subsidiary URL in Python, scrape that URL, and append those scraped data to the output of parent file?

Question

Python: Python 3.11.2
Python Editor: PyCharm 2022.3.3 (Community Edition) - Build PC-223.8836.43
OS: Windows 11 Pro, 22H2, 22621.1413
Browser: Chrome 111.0.5563.65 (Official Build) (64-bit)


Still a Baby Pythoneer, I'm scraping the URL https://dockets.justia.com/search?parties=Novo+Nordisk, but I also want to scrape its 10 hyperlinked pages (e.g., https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963, https://dockets.justia.com/docket/california/cacdce/2:2023cv01929/878409, etc.).

How do I (1) "open" the 10 hyperlinked pages, (2) scrape the information in each subsidiary, hyperlinked document, e.g., inside the element with the classes table-responsive with-gaps table-padding--small table-bordered table-padding-sides--small table-full-width, and then (3) append the captured information to the print output files (one per index) of the parent URL?

I have looked into Selenium a bit to perhaps open and control the webpage that way, but it doesn't seem particularly applicable here. Do I really need Selenium for this or is there some nifty and simple way to do this?

This is what I have so far...

from bs4 import BeautifulSoup
import requests
import os

html_text = requests.get("https://dockets.justia.com/search?parties=Novo+Nordisk").text
soup = BeautifulSoup(html_text, "lxml")
cases = soup.find_all("div", class_="has-padding-content-block-30 -zb")

# Printing to individual files
os.makedirs("posts", exist_ok=True)  # make sure the output directory exists

for index, case in enumerate(cases):
    case_number = case.find("span", class_="citation").text.replace(" ", "")
    case_url = case.find("a", {"class": "case-name"})["href"]

    with open(f"posts/{index}.txt", "w") as f:
        f.write(f"Case No.: {case_number.strip()} \t")
        f.write(f"Case URL: {case_url} \n")
        print(f"File saved: {index}")

# If printing to the terminal instead
# for case in cases:
#    case_number = case.find("span", class_="citation").text.replace(" ", "")
#    case_url = case.find("a", {"class": "case-name"})["href"]
#    print(f"Case No.: {case_number.strip()}")  # strip removes leading/trailing whitespace
#    print(f"Case URL: {case_url}")

Answer 1

Score: 0

from aiohttp import ClientSession
from pyuseragents import random
from bs4 import BeautifulSoup
from asyncio import run


class DocketsJustia:

    def __init__(self):
        self.headers = {
            'authority': 'dockets.justia.com',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'accept-language': 'ro-RO,ro;q=0.9,en-US;q=0.8,en;q=0.7',
            'cache-control': 'max-age=0',
            'referer': 'https://dockets.justia.com/search?parties=Novo+Nordisk',
            'user-agent': random(),
        }

        self.PatchFile = "nametxt.txt"

    async def Parser(self, session):
        count = 1

        while True:
            # the search term and the page number go into the query string
            params = {
                'parties': 'Novo Nordisk',
                'page': f'{count}',
            }

            async with session.get('https://dockets.justia.com/search', params=params) as response:
                links = BeautifulSoup(await response.text(), "lxml").find_all("div", {"class": "has-padding-content-block-30 -zb"})

                # stop once a results page comes back empty
                if not links:
                    break

                for link in links:
                    try:
                        case_link = link.find("a", {"class": "case-name"}).get("href")
                        case_number = link.find("span", {"class": "citation"}).text
                        print(case_number + "\t" + case_link + "\n")

                        with open(self.PatchFile, "a", encoding='utf-8') as file:
                            file.write(case_number + "\t" + case_link + "\n")
                    except AttributeError:
                        # entry without a case-name link or citation; skip it
                        pass

            count += 1

    async def LoggerParser(self):
        async with ClientSession(headers=self.headers) as session:
            await self.Parser(session)


def StartDocketsJustia():
    run(DocketsJustia().LoggerParser())


if __name__ == '__main__':
    StartDocketsJustia()
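
This class only walks the paginated search results; it never opens the individual case_link pages the question asks about. One way to extend it is a second coroutine on the same ClientSession that fetches a docket page and flattens its table. The sketch below is an untested assumption about the docket-page markup: the class names are taken from the question, and parse_case is a made-up helper name.

from bs4 import BeautifulSoup


async def parse_case(session, case_link):
    # follow the hyperlink collected by Parser and fetch the docket page;
    # case_link is expected to be an absolute URL, as in the question's examples
    async with session.get(case_link) as response:
        case_soup = BeautifulSoup(await response.text(), "lxml")

    # locate the table by (a subset of) the classes quoted in the question;
    # adjust the selector if they sit on a wrapper element instead
    table = case_soup.select_one(".table-responsive.with-gaps.table-bordered")
    if table is None:
        return []

    # flatten every row of the table into a tab-separated line
    return [
        "\t".join(cell.get_text(strip=True) for cell in row.find_all(["th", "td"]))
        for row in table.find_all("tr")
    ]

Calling rows = await parse_case(session, case_link) inside the for link in links loop and writing the returned lines to self.PatchFile right after the case number and case link would give one combined record per case.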
