I can't find the correct tags to scrape the class name, code, and description (description is via link)

Question

I'm brand new to scraping. I'm trying to scrape the class code, name, and description from this website:

URL = https://catalog.registrar.ucla.edu/search?parentAcademicOrg=7e561ea0db6fa0107f1572f5f39619b1&ct=subject

No matter what I pass to soup.find_all(), nothing prints (eventually I will write all the data to a CSV file).

Here is what I have:

    import requests
    from bs4 import BeautifulSoup

    # Define the URL to scrape
    url = 'https://catalog.registrar.ucla.edu/search?parentAcademicOrg=7e561ea0db6fa0107f1572f5f39619b1&ct=subject'

    # Send a GET request to the URL and get the HTML response
    response = requests.get(url)
    html = response.content

    # Parse the HTML using BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')

    # Find all the div tags with class 'courseblock'
    divs = soup.find_all('div', {'class': 'courseblock'})

    # Loop through the divs and extract the course name from each div
    for div in divs:
        # Extract the course name from the first span tag within the div
        course_name = div.find('span', {'class': 'courseblocktitle'}).text.strip()
        # Print the course name
        print(course_name)

Instead of this line:

    divs = soup.find_all('div', {'class': 'courseblock'})

I've also tried:

    1. divs = soup.find_all('div', {'class': 'css-15y68hq-Box--Box-Box-Flex--Flex-Flex-results-styles--ResultItemContainer e1ecnqs53'})
    2. divs = soup.find_all('span', {'class': 'result-item-title'})

I haven't even attempted the description because I'm stuck on this. Any help would be great.

Answer 1

Score: 0

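The course list does not appear to be in the HTML that requests downloads, which would explain why none of your find_all() calls match anything; the page most likely fills it in with JavaScript after loading. The data comes from an API endpoint that you can query directly.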
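
A quick way to check whether a selector can ever match is to look for its class name in the raw response text. The following is only a minimal sketch: `find_markers` is a hypothetical helper name, and the sample HTML is made up for illustration (it is not the real page source).

```python
def find_markers(html, markers):
    """Return a dict mapping each marker string to whether it occurs in html."""
    return {marker: marker in html for marker in markers}


# Made-up stand-in for a page whose content is rendered later by JavaScript
sample_html = '<div id="app"></div><script src="/bundle.js"></script>'

# Class names taken from the question's find_all() attempts
print(find_markers(sample_html, ["courseblock", "result-item-title"]))
```

To run it against the live page, pass `requests.get(url).text` as the first argument. If every value comes back False, no BeautifulSoup selector built on those class names will find anything in that response.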

Try this:

    import requests

    # Send a browser-like User-Agent with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    api_endpoint = "https://api-us-west-1.prod.courseloop.com/publisher/search-academic-items?"

    # Search filters: the catalog year and the parentAcademicOrg ID from the page URL
    payload = {
        "siteId": "ucla-prod",
        "query": "",
        "contenttype": "subject",
        "searchFilters": [
            {
                "filterField": "implementationYear",
                "filterValue": [
                    "2022"
                ],
                "isExactMatch": False
            },
            {
                "filterField": "parentAcademicOrg",
                "filterValue": [
                    "7e561ea0db6fa0107f1572f5f39619b1"
                ],
                "isExactMatch": False
            }
        ],
        "from": 0,
        "size": 20
    }

    # POST the payload as JSON and decode the JSON response
    data = requests.post(api_endpoint, headers=headers, json=payload).json()

    # Print each course title followed by its catalog URL
    for item in data["data"]["results"]:
        print(item["title"])
        print(f"https://catalog.registrar.ucla.edu{item['uri']}")

Output:

    A&O SCI 1 Climate Change: From Puzzles to Policy
    https://catalog.registrar.ucla.edu/course/2022/AOSCI1
    A&O SCI 1L Climate Change: From Puzzles to PolicyLaboratory
    https://catalog.registrar.ucla.edu/course/2022/AOSCI1L
    A&O SCI 2 Air Pollution
    https://catalog.registrar.ucla.edu/course/2022/AOSCI2
    A&O SCI 2L Air Pollution Laboratory
    https://catalog.registrar.ucla.edu/course/2022/AOSCI2L
    A&O SCI 3 Meteorology and Extreme Weather
    https://catalog.registrar.ucla.edu/course/2022/AOSCI3
    A&O SCI 3L Meteorology and Extreme Weather Laboratory
    https://catalog.registrar.ucla.edu/course/2022/AOSCI3L
    A&O SCI 5 Climates of Other Worlds
    https://catalog.registrar.ucla.edu/course/2022/AOSCI5
    A&O SCI M7 Perils of Space: Introduction to Space Weather
    https://catalog.registrar.ucla.edu/course/2022/AOSCIM7
    A&O SCI 19 Fiat Lux Freshman Seminars
    https://catalog.registrar.ucla.edu/course/2022/AOSCI19
    A&O SCI 51 Fundamentals of Climate Science
    https://catalog.registrar.ucla.edu/course/2022/AOSCI51
    A&O SCI M71 Introduction to Computing for Geoscientists
    https://catalog.registrar.ucla.edu/course/2022/AOSCIM71
    A&O SCI 88 Lower-Division Seminar
    https://catalog.registrar.ucla.edu/course/2022/AOSCI88
    A&O SCI 89 Honors Seminars
    https://catalog.registrar.ucla.edu/course/2022/AOSCI89
    A&O SCI 89HC Honors Contracts
    https://catalog.registrar.ucla.edu/course/2022/AOSCI89HC
    A&O SCI 90 Introduction to Undergraduate Research in Atmospheric and Oceanic Sciences
    https://catalog.registrar.ucla.edu/course/2022/AOSCI90
    A&O SCI 99 Student Research Program
    https://catalog.registrar.ucla.edu/course/2022/AOSCI99
    A&O SCI M100 Earth and Its Environment
    https://catalog.registrar.ucla.edu/course/2022/AOSCIM100
    A&O SCI 101 Fundamentals of Atmospheric Dynamics and Thermodynamics
    https://catalog.registrar.ucla.edu/course/2022/AOSCI101
    A&O SCI 102 Climate Change and Climate Modeling
    https://catalog.registrar.ucla.edu/course/2022/AOSCI102
    A&O SCI 103 Physical Oceanography
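
The request above asks for only 20 results ("size": 20, starting at "from": 0). To collect more, one option is to repeat the request with an increasing offset. The sketch below rests on two assumptions that are not confirmed by the answer: that "from" is a zero-based result offset, and that the last page returns fewer than `size` items. `build_payload`, `fetch_page`, and `iter_courses` are hypothetical helper names.

```python
API_ENDPOINT = "https://api-us-west-1.prod.courseloop.com/publisher/search-academic-items?"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}


def build_payload(offset, size=20):
    """Build the same search payload as the answer, starting at result `offset`."""
    return {
        "siteId": "ucla-prod",
        "query": "",
        "contenttype": "subject",
        "searchFilters": [
            {"filterField": "implementationYear", "filterValue": ["2022"], "isExactMatch": False},
            {"filterField": "parentAcademicOrg",
             "filterValue": ["7e561ea0db6fa0107f1572f5f39619b1"],
             "isExactMatch": False},
        ],
        "from": offset,  # assumption: zero-based offset of the first result
        "size": size,
    }


def fetch_page(offset, size=20):
    """POST one search request and return its list of result items."""
    import requests  # imported here so the other helpers work without it

    response = requests.post(API_ENDPOINT, headers=HEADERS,
                             json=build_payload(offset, size), timeout=30)
    response.raise_for_status()
    return response.json()["data"]["results"]


def iter_courses(fetch=fetch_page, size=20, max_pages=100):
    """Yield result items page by page until a short (or empty) page arrives."""
    for page in range(max_pages):  # hard cap so a surprise never loops forever
        results = fetch(page * size, size)
        yield from results
        if len(results) < size:  # assumption: a short page means the last page
            break
```

Looping with `for item in iter_courses(): print(item["title"])` would then print every title rather than just the first 20, provided the assumptions hold. The `fetch` parameter exists so the paging logic can be tried with a fake function before hitting the real API.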

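Since the question mentions eventually saving everything to CSV, here is a minimal sketch of that step using the standard csv module. It assumes each result item has the "title" and "uri" keys used in the answer; the "code" column is simply the last path segment of the URI (for example AOSCI2), not the display form "A&O SCI 2". `to_row` and `write_courses` are hypothetical helper names.

```python
import csv
import io

BASE_URL = "https://catalog.registrar.ucla.edu"


def to_row(item):
    """Turn one API result item into a flat dict suitable for a CSV row."""
    uri = item["uri"]
    return {
        "code": uri.rstrip("/").rsplit("/", 1)[-1],  # last path segment of the URI
        "title": item["title"],
        "url": f"{BASE_URL}{uri}",
    }


def write_courses(items, fileobj):
    """Write a header line plus one CSV row per item to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=["code", "title", "url"])
    writer.writeheader()
    for item in items:
        writer.writerow(to_row(item))


# Demonstration with one item copied from the output above
sample = [{"title": "A&O SCI 2 Air Pollution", "uri": "/course/2022/AOSCI2"}]
buffer = io.StringIO()
write_courses(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with `open("courses.csv", "w", newline="", encoding="utf-8")` and pass that file object in; `newline=""` is what the csv module documentation recommends to avoid blank lines on Windows.
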
Posted by huangapple on April 4, 2023 at 03:59:55
Link: https://go.coder-hub.com/75923342.html