
Python Web-Scraping Code only returning first iteration in my loop

Question


I'm new to web scraping. I wrote some code to return the header, paragraph, and YouTube link of each article on a webpage. My for loop returns the first article correctly, but it repeats it 10 times instead of pulling the other articles. There are 10 separate articles on the page, so I think it has something to do with the .select calls I'm writing. Code below:

import requests
import bs4

url = 'https://coreyms.com'

# Get the url in a response object and make sure it runs correctly
response = requests.get(url)
response.raise_for_status()

# Now I'm using bs4 to parse all the HTML on the webpage into a single soup object
schafer = bs4.BeautifulSoup(response.text, 'html.parser')

# Attempting to use a for loop
for article in schafer.find_all('article'):
    header = schafer.select('article a')
    header = header[0].getText()
    print(header)

    paragraph = schafer.select('article div > p')
    paragraph = paragraph[0].getText()
    print(paragraph)

    link = schafer.select('article iframe')

    # This is where you parse out the youtube link to just get the pure link to watch on YouTube
    link = link[0].get('src')
    vidID = link.split('/')[4]
    vidID = vidID.split('?')[0]
    ytLink = f'https://youtube.com/watch?v={vidID}'
    print(ytLink)
    print()
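For reference, the split steps at the end turn an embed src into a plain watch URL like this (the embed URL below is just a made-up example of the format, not taken from the page):

src = 'https://www.youtube.com/embed/AbC123xYz?version=3&rel=1'
vidID = src.split('/')[4]       # 'AbC123xYz?version=3&rel=1'
vidID = vidID.split('?')[0]     # 'AbC123xYz'
print(f'https://youtube.com/watch?v={vidID}')  # https://youtube.com/watch?v=AbC123xYz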

Answer 1

Score: 1


You use this as your iterator:

for article in schafer.find_all('article'):

So the variable that changes on each pass is article. However, you never use that variable inside the loop; every query goes through schafer, the soup for the entire page, which never changes as the loop runs, so each iteration pulls the same first match from the whole document.

To fix your problem, replace schafer with article and drop the article prefix from the select statements, so each query is scoped to the current article. For example:

header = schafer.select('article a')

becomes

header = article.select('a')

The line

paragraph = schafer.select('article div > p')

becomes

paragraph = article.select('div > p')

You should then get the results you expect.
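Putting the two changes together (and scoping the iframe lookup to the current article in the same way), the loop from the question would look roughly like this sketch:

import requests
import bs4

response = requests.get('https://coreyms.com')
response.raise_for_status()
schafer = bs4.BeautifulSoup(response.text, 'html.parser')

for article in schafer.find_all('article'):
    # every lookup is now scoped to the current article, not the whole page
    header = article.select('a')[0].getText()
    print(header)

    paragraph = article.select('div > p')[0].getText()
    print(paragraph)

    link = article.select('iframe')[0].get('src')
    vidID = link.split('/')[4].split('?')[0]
    print(f'https://youtube.com/watch?v={vidID}')
    print()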

Answer 2

Score: 0


You can simply do it as below:

import requests
import bs4

response = requests.get(url='https://coreyms.com')

schafer = bs4.BeautifulSoup(response.text, 'html.parser')

for article in schafer.find_all('article'):
    header = article.findNext('h2').text
    print(f"Header: {header}")

    paragraph = article.findNext('div', class_="entry-content").text
    print(f"Paragraph: {paragraph}")

    # turn the embed src into a plain watch URL
    yt_link = article.findNext('iframe', class_="youtube-player")['src'].split('?')[0].replace("embed/", "watch?v=")
    print(f"YouTube Link: {yt_link}")
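One caveat worth noting (an addition, not part of the answer above): findNext returns None when an article has no embedded YouTube player, so the ['src'] lookup would raise a TypeError on such an article. A guarded variant might look like this sketch; article.find(...) would also work here and restricts the search to the article's own descendants:

import requests
import bs4

response = requests.get(url='https://coreyms.com')
schafer = bs4.BeautifulSoup(response.text, 'html.parser')

for article in schafer.find_all('article'):
    print(f"Header: {article.findNext('h2').text}")

    # findNext returns None if this article carries no embedded player,
    # so check before indexing into ['src']
    player = article.findNext('iframe', class_="youtube-player")
    if player is not None:
        yt_link = player['src'].split('?')[0].replace("embed/", "watch?v=")
        print(f"YouTube Link: {yt_link}")
    else:
        print("YouTube Link: none for this article")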
