
Python Web-Scraping Code only returning first iteration in my loop

Question


I'm new to web scraping. I wrote some code to return the header, paragraph, and YouTube link of each article on a webpage. My for loop returns the first article correctly, but it repeats it 10 times instead of pulling the other articles. There are 10 separate articles on the page, so I think it has something to do with the .select calls I'm writing. Code below:

import requests
import bs4

url = 'https://coreyms.com'

# Get the url in a response object and make sure it runs correctly
response = requests.get(url)
response.raise_for_status()

# Now I'm using bs4 to parse all the HTML on the webpage into a single soup object
schafer = bs4.BeautifulSoup(response.text, 'html.parser')

# Attempting to use a for loop
for article in schafer.find_all('article'):
    header = schafer.select('article a')
    header = header[0].getText()
    print(header)

    paragraph = schafer.select('article div > p')
    paragraph = paragraph[0].getText()
    print(paragraph)

    link = schafer.select('article iframe')

    # This is where you parse out the youtube link to just get the pure link to watch on YouTube
    link = link[0].get('src')
    vidID = link.split('/')[4]
    vidID = vidID.split('?')[0]
    ytLink = f'https://youtube.com/watch?v={vidID}'
    print(ytLink)
    print()
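For reference, the split steps at the end turn an embed src into a plain watch URL like this (the embed URL below is just a made-up example of the format, not taken from the page):

src = 'https://www.youtube.com/embed/AbC123xYz?version=3&rel=1'
vidID = src.split('/')[4]       # 'AbC123xYz?version=3&rel=1'
vidID = vidID.split('?')[0]     # 'AbC123xYz'
print(f'https://youtube.com/watch?v={vidID}')  # https://youtube.com/watch?v=AbC123xYz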

Answer 1

Score: 1


You use this as your iterator:

for article in schafer.find_all('article'):

So the variable that changes on each pass is article. However, you never use that variable inside the loop; every query goes through schafer, the soup for the entire page, which never changes as the loop runs, so each iteration pulls the same first match from the whole document.

To fix your problem, replace schafer with article and drop the article prefix from the select statements, so each query is scoped to the current article. For example:

header = schafer.select('article a')

becomes

header = article.select('a')

The line

paragraph = schafer.select('article div > p')

becomes

paragraph = article.select('div > p')

You should then get the results you expect.
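Putting the two changes together (and scoping the iframe lookup to the current article in the same way), the loop from the question would look roughly like this sketch:

import requests
import bs4

response = requests.get('https://coreyms.com')
response.raise_for_status()
schafer = bs4.BeautifulSoup(response.text, 'html.parser')

for article in schafer.find_all('article'):
    # every lookup is now scoped to the current article, not the whole page
    header = article.select('a')[0].getText()
    print(header)

    paragraph = article.select('div > p')[0].getText()
    print(paragraph)

    link = article.select('iframe')[0].get('src')
    vidID = link.split('/')[4].split('?')[0]
    print(f'https://youtube.com/watch?v={vidID}')
    print()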

Answer 2

Score: 0


You can simply do it as below:

import requests
import bs4

response = requests.get(url='https://coreyms.com')

schafer = bs4.BeautifulSoup(response.text, 'html.parser')

for article in schafer.find_all('article'):
    header = article.findNext('h2').text
    print(f"Header: {header}")

    paragraph = article.findNext('div', class_="entry-content").text
    print(f"Paragraph: {paragraph}")

    # turn the embed src into a plain watch URL
    yt_link = article.findNext('iframe', class_="youtube-player")['src'].split('?')[0].replace("embed/", "watch?v=")
    print(f"YouTube Link: {yt_link}")
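One caveat worth noting (an addition, not part of the answer above): findNext returns None when an article has no embedded YouTube player, so the ['src'] lookup would raise a TypeError on such an article. A guarded variant might look like this sketch; article.find(...) would also work here and restricts the search to the article's own descendants:

import requests
import bs4

response = requests.get(url='https://coreyms.com')
schafer = bs4.BeautifulSoup(response.text, 'html.parser')

for article in schafer.find_all('article'):
    print(f"Header: {article.findNext('h2').text}")

    # findNext returns None if this article carries no embedded player,
    # so check before indexing into ['src']
    player = article.findNext('iframe', class_="youtube-player")
    if player is not None:
        yt_link = player['src'].split('?')[0].replace("embed/", "watch?v=")
        print(f"YouTube Link: {yt_link}")
    else:
        print("YouTube Link: none for this article")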
