How does one "click" a subsidiary URL in Python, scrape that URL, and append the scraped data to the parent file's output?

Question

Python: Python 3.11.2
Python Editor: PyCharm 2022.3.3 (Community Edition) - Build PC-223.8836.43
OS: Windows 11 Pro, 22H2, 22621.1413
Browser: Chrome 111.0.5563.65 (Official Build) (64-bit)


Still a Baby Pythoneer, I'm scraping the URL https://dockets.justia.com/search?parties=Novo+Nordisk, but I also want to scrape its 10 hyperlinked docket pages (e.g., https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963, https://dockets.justia.com/docket/california/cacdce/2:2023cv01929/878409, etc.).

How do I (1) "open" the 10 hyperlinked pages, (2) scrape the information in each subsidiary, hyperlinked document (e.g., inside the table whose classes are table-responsive with-gaps table-padding--small table-bordered table-padding-sides--small table-full-width), and then (3) append the captured information to the per-index output files written for the parent URL?

I have looked into Selenium a bit to perhaps open and control the webpage that way, but it doesn't seem particularly applicable here. Do I really need Selenium for this or is there some nifty and simple way to do this?
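One quick way to settle the Selenium question is to fetch a single docket page with requests and check whether that table is already present in the raw HTML; if it is, the pages are server-rendered and requests plus BeautifulSoup are enough. This is only a sketch: the docket URL is one of the examples above, and the browser-like User-Agent header is an assumption.

    import requests
    from bs4 import BeautifulSoup

    # If the table shows up in the plain HTML response, no browser automation is needed.
    url = "https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963"
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
    soup = BeautifulSoup(html, "lxml")
    print(soup.find(class_="table-responsive") is not None)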

This is what I have so far...

    from bs4 import BeautifulSoup
    import requests

    html_text = requests.get("https://dockets.justia.com/search?parties=Novo+Nordisk").text
    soup = BeautifulSoup(html_text, "lxml")
    cases = soup.find_all("div", class_="has-padding-content-block-30 -zb")

    # Printing to individual files
    for index, case in enumerate(cases):
        case_number = case.find("span", class_="citation").text.replace(" ", "")
        case_url = case.find("a", {"class": "case-name"})["href"]
        with open(f"posts/{index}.txt", "w") as f:
            f.write(f"Case No.: {case_number.strip()} \t")
            f.write(f"Case URL: {case_url} \n")
        print(f"File saved: {index}")

    # If printing in the terminal instead
    # for case in cases:
    #     case_number = case.find("span", class_="citation").text.replace(" ", "")
    #     case_url = case.find("a", {"class": "case-name"})["href"]
    #     print(f"Case No.: {case_number.strip()}")  # strip() trims surrounding whitespace
    #     print(f"Case URL: {case_url}")

Answer 1

Score: 0

    from aiohttp import ClientSession
    from pyuseragents import random
    from bs4 import BeautifulSoup
    from asyncio import run


    class DocketsJustia:
        def __init__(self):
            # Browser-like headers; pyuseragents.random() returns a random user-agent string
            self.headers = {
                'authority': 'dockets.justia.com',
                'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
                'accept-language': 'ro-RO,ro;q=0.9,en-US;q=0.8,en;q=0.7',
                'cache-control': 'max-age=0',
                'referer': 'https://dockets.justia.com/search?parties=Novo+Nordisk',
                'user-agent': random(),
            }
            self.PatchFile = "nametxt.txt"

        async def Parser(self, session):
            count = 1
            while True:
                # Let aiohttp build the query string rather than also hard-coding it in the URL
                params = {
                    'parties': 'Novo Nordisk',
                    'page': f'{count}',
                }
                async with session.get('https://dockets.justia.com/search', params=params) as response:
                    links = BeautifulSoup(await response.text(), "lxml").find_all("div", {"class": "has-padding-content-block-30 -zb"})
                    if not links:
                        break  # stop once a results page comes back empty
                    for link in links:
                        try:
                            case_link = link.find("a", {"class": "case-name"}).get("href")
                            case_number = link.find("span", {"class": "citation"}).text
                            print(case_number + "\t" + case_link + "\n")
                            with open(self.PatchFile, "a", encoding='utf-8') as file:
                                file.write(case_number + "\t" + case_link + "\n")
                        except AttributeError:
                            continue  # skip result blocks without a case name or citation
                count += 1

        async def LoggerParser(self):
            async with ClientSession(headers=self.headers) as session:
                await self.Parser(session)


    def StartDocketsJustia():
        run(DocketsJustia().LoggerParser())


    if __name__ == '__main__':
        StartDocketsJustia()
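Note that this answer only walks the paginated search results; it never opens the individual docket pages or touches the table mentioned in the question. A possible follow-up step, sketched here under the same assumptions as above (static docket pages, and a single class from the quoted class list being enough to find the table), is an aiohttp helper that fetches one case_link and returns the table rows so they can be appended to the output file:

    from aiohttp import ClientSession
    from bs4 import BeautifulSoup
    from asyncio import run


    # Sketch: given a docket URL collected by Parser(), fetch the child page and pull
    # out the rows of the table from the question. The class used for the lookup is
    # an assumption about the page markup.
    async def fetch_docket_table(session: ClientSession, case_link: str) -> list[str]:
        async with session.get(case_link) as response:
            soup = BeautifulSoup(await response.text(), "lxml")
        table = soup.find(class_="table-responsive")
        if table is None:
            return []
        rows = []
        for tr in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            rows.append("\t".join(cells))
        return rows


    async def demo():
        # Hypothetical usage with one of the docket URLs from the question.
        url = "https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963"
        async with ClientSession(headers={"User-Agent": "Mozilla/5.0"}) as session:
            for row in await fetch_docket_table(session, url):
                print(row)


    if __name__ == "__main__":
        run(demo())

In the class above, the natural place to call such a helper would be inside the for loop in Parser(), reusing the session it already holds so each docket page shares the same connection pool and headers.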
