Python Web-Scraping Code only returning first iteration in my loop
Question
I'm new to web-scraping. I wrote code to return the header, paragraph, and YouTube link of each article on the webpage. My "for" loop returns the first iteration correctly, but it repeats it 10 times instead of pulling the other articles. There are 10 separate articles on the webpage, so I think it has something to do with the .select calls I'm writing. Code below:

import requests
import bs4

url = 'https://coreyms.com'
# Get the url in a response object and make sure it runs correctly
response = requests.get(url)
response.raise_for_status()
# Now I'm using bs4 to parse all the html of the webpage into a single string
schafer = bs4.BeautifulSoup(response.text, 'html.parser')
# Attempting to use a for loop
for article in schafer.find_all('article'):
    header = schafer.select('article a')
    header = header[0].getText()
    print(header)
    paragraph = schafer.select('article div > p')
    paragraph = paragraph[0].getText()
    print(paragraph)
    link = schafer.select('article iframe')
    # This is where you parse out the youtube link to just get the pure link to watch on Youtube
    link = link[0].get('src')
    vidID = link.split('/')[4]
    vidID = vidID.split('?')[0]
    ytLink = f'https://youtube.com/watch?v={vidID}'
    print(ytLink)
    print()
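The symptom can be reproduced without touching the network. In this minimal sketch (the two-article HTML string is made up for illustration; it is not the real page), selecting on the whole soup inside the loop keeps returning the first article's data on every iteration:

```python
import bs4

# Hypothetical two-article page standing in for https://coreyms.com
html = """
<article><a href="/one">First article</a><div><p>First paragraph</p></div></article>
<article><a href="/two">Second article</a><div><p>Second paragraph</p></div></article>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

headers = []
for article in soup.find_all('article'):
    # Bug: selecting on `soup` searches the whole page every time,
    # so index [0] is always the first article's link
    header = soup.select('article a')[0].getText()
    headers.append(header)

print(headers)  # ['First article', 'First article']
```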
Answer 1

Score: 1
You use this as your iterator:

for article in schafer.find_all('article'):

So the variable that changes every loop is `article`. However, you never use this variable, instead using `schafer`, which never changes as the loop goes on.

To fix your problem, replace `schafer` with `article` and change the `select` statements. For example:

header = schafer.select('article a')

becomes

header = article.select('a')

The line

paragraph = schafer.select('article div > p')

becomes

paragraph = article.select('div > p')

You should then get the results you expect.
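Applied to the same kind of markup, the fix can be sketched like this (the inline two-article HTML is a made-up stand-in for the real page):

```python
import bs4

# Hypothetical two-article page standing in for the real site
html = """
<article><a href="/one">First article</a><div><p>First paragraph</p></div></article>
<article><a href="/two">Second article</a><div><p>Second paragraph</p></div></article>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

results = []
for article in soup.find_all('article'):
    # Scoped to the current <article>, so each iteration sees its own tags
    header = article.select('a')[0].getText()
    paragraph = article.select('div > p')[0].getText()
    results.append((header, paragraph))

print(results)
```

Because each `select` call now starts from the current `article` tag, the `[0]` index refers to the first match within that article rather than the first match on the whole page.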
Answer 2

Score: 0
You can simply do it like below:

import requests
import bs4

response = requests.get(url='https://coreyms.com')
schafer = bs4.BeautifulSoup(response.text, 'html.parser')

for article in schafer.find_all('article'):
    header = article.find_next('h2').text
    print(f"Header: {header}")
    paragraph = article.find_next('div', class_="entry-content").text
    print(f"Paragraph: {paragraph}")
    yt_link = article.find_next('iframe', class_="youtube-player")['src'].split('?')[0].replace("embed/", "watch?v=")
    print(f"YouTube Link: {yt_link}")
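The chained rewrite of the iframe `src` at the end turns an embed URL into a watch URL: drop the query string, then swap the embed path for the watch form. On a hypothetical `src` value (the video ID below is made up) it works like this:

```python
# Hypothetical embed src of the kind a youtube-player iframe carries
src = 'https://www.youtube.com/embed/abcdefghijk?version=3&rel=1'

# Drop the query string, then swap the embed path for the watch form
yt_link = src.split('?')[0].replace('embed/', 'watch?v=')
print(yt_link)  # https://www.youtube.com/watch?v=abcdefghijk
```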
Comments