May 30, 2023, 09:16:49

Can someone help me properly scrape YouTube titles in Python using BS4?

Question


I want to collect YouTube titles using BS4 in Python. This is the code GPT recommended, but it doesn't work well. I'm hoping an experienced coder here can help. Thank you.

import requests
from bs4 import BeautifulSoup

def get_youtube_titles():
    url = 'https://www.youtube.com/'
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find YouTube title elements
        title_elements = soup.find_all('a', class_='yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media')

        # Extract and print the titles
        for title_element in title_elements:
            title = title_element.text.strip()
            print(title)

    except requests.exceptions.RequestException as e:
        print('Network connection error:', e)

# Get YouTube titles
get_youtube_titles()

Answer 1

Score: 1

Your code uses requests.get, so you only get the source HTML that the server sends, which is not the same as the fully rendered HTML you inspect in your browser. To get the rendered page, you need something that executes JavaScript, such as Selenium (and don't forget to add some wait time to allow the page to load).
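
To make the difference concrete, here is a minimal sketch that needs no network access. The HTML string is a made-up stand-in for response.text, not real YouTube output; it only assumes that the raw source carries its data in a script tag rather than in rendered anchor elements.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for what requests.get() returns: no rendered <a> title
# elements, only a script holding the data the browser would render from.
source_html = """
<html><body>
  <div id="content"></div>
  <script>var ytInitialData = {"contents": {}};</script>
</body></html>
"""

soup = BeautifulSoup(source_html, "html.parser")

# The class-based lookup from the question finds nothing in the raw source.
anchors = soup.find_all(
    "a",
    class_="yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media",
)
print(len(anchors))  # 0

# The data is still there, but inside a <script> tag.
script = soup.find("script", string=lambda s: s and "ytInitialData" in s)
print(script is not None)  # True
```

This is why the original loop prints nothing without raising an error: find_all simply returns an empty list.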

However, if all you want are some titles, you can try extracting them from the script tag that holds the page's initial data, using the following functions:

import json  # needed by get_jsScriptVal

## A general function for extracting a JavaScript variable from a bs4 object
def get_jsScriptVal(jSoup, valDecl, isJson=True):
    script_finder = lambda s: s and valDecl in s
    # find_all (rather than find) so that every matching <script> is checked
    for sc in jSoup.find_all('script', string=script_finder):
        for st in sc.string.split(';'):
            # NOTE: this line was truncated in the source page (only
            # "ls, rs, *_ =" and a trailing ")]" survive). The right-hand side
            # is a reconstruction: split each statement on its first '='.
            ls, rs, *_ = [s.strip() for s in (st.split('=', 1) + [''])]
            if ls == valDecl and rs: return json.loads(rs) if isJson else rs

## Specifically for your case
def get_ytInitialTitles(ySoup):
    contents = get_jsScriptVal(ySoup, 'var ytInitialData')['contents']
    tab1 = contents['twoColumnBrowseResultsRenderer']['tabs'][0]
    contents = tab1['tabRenderer']['content']['richGridRenderer']['contents']
    # Keep only grid items that actually wrap a video
    contents = [c['richItemRenderer']['content']['videoRenderer']
                for c in contents if 'richItemRenderer' in c and
                'videoRenderer' in c['richItemRenderer']['content']]
    titles = [c['title']['runs'][0]['text'] for c in contents]
    return titles
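
One caveat with get_jsScriptVal: splitting the script on ';' can cut the JSON short if any string value contains a semicolon, and json.loads would then fail. A possible alternative, sketched here under that assumption, is to locate the declaration and let json.JSONDecoder.raw_decode parse just the first JSON value after the '='. The helper name extract_js_json and the sample script are hypothetical and only serve as illustration.

```python
import json
import re

def extract_js_json(script_text, var_decl):
    """Return the JSON value assigned to `var_decl` in `script_text`, or None."""
    # Find "<var_decl> =" and note where the assigned value starts.
    match = re.search(re.escape(var_decl) + r"\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so a trailing ';' or further JavaScript does not matter.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value

# Made-up sample: note the ';' inside the title string.
sample_script = 'var ytInitialData = {"contents": {"title": "a; b"}};var other = 1;'
data = extract_js_json(sample_script, "var ytInitialData")
print(data["contents"]["title"])  # a; b
```

With bs4 you would pass sc.string for each matching script tag as script_text.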

Now, if you edit your code to use the functions above:

import requests
from bs4 import BeautifulSoup
import json
#### DON'T FORGET TO PASTE THE FUNCTION DEFINITIONS INTO YOUR CODE TOO ####
## def get_jsScriptVal....
## def get_ytInitialTitles....
##########################################################################
def get_youtube_titles():
    url = 'https://www.youtube.com/'
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
    
        # titles = get_ytInitialTitles(soup) # Find YouTube title elements
        # for title in titles: print(title) # Extract and print the titles
        # OR [in one line]
        for title in get_ytInitialTitles(soup): print(title)
    
    except Exception as e:
        print('Failed to scrape due to', type(e), ':', e)
get_youtube_titles()

Then it should print something like:

Survive 100 Days In Circle, Win $500,000
lofi hip hop radio 🎸 - beats to relax/study to
Spectaculair ingekleurde film over het begin van de Duitse bezetting van Nederland tijdens WOII
Omtzigt is WOEST & SLOOPT liegende Rutte! 'Kijk die ouders in hun ogen!'
Ineens vielen er bommen op zonnepanelen... Algemene beschouwingen Venlo 2023
Trump Opens Up on Secret White House Documents, Biden Family & Republican Opponents | Trump LIVE
An einem Tag nach Mallorca und zurück: Was verdient ein Flugbegleiter? | Lohnt sich das | BR
I BUILT A SHELTER IN THE FOREST!! AND LIVED THERE FOR 2 MONTHS!
De halvering van China
Ibiza Summer Mix 2023 🔊 Best Of Tropical Deep House Music Chill Out Mix 2023🔊 Chillout Lounge #153
Tibetaanse Genezende Fluit • Afgifte van melatonine en gifstoffen • Elimineer stress en kalmeer ...
Alle 200 POTLODEN GEBRUIKEN in 1 TEKENING - Tekenen Challenge
Top 10 BEST Auditions on BGT 2023!
Ontspannende muziek tot opluchting stress, angst en depressie 🎧 Verzachtende muziek voor zenuwen
6 juni 1944, D-Day, Operatie Overlord | Ingekleurd
Ed Sheeran, Martin Garrix, Kygo, Dua Lipa, Avicii, Robin Schulz, The Chainsmokers Style - Feeling Me
DIY with Mr Bean | Full Episodes | Classic Mr Bean
EEN WEDSTRIJD VOL AFSCHEID! 👏👁️ | Barcelona vs Mallorca | La Liga 2022/23 | Samenvatting
Deep Focus Music To Improve Concentration - 12 Hours of Ambient Study Music to Concentrate #506
The Inside Guys React To The Miami Heat's Blowout Game 7 Win In Boston | NBA on TNT
Muziek genezen om stress, vermoeidheid, depressie, negativiteit, detoxemoties te verlichten
How Rain Caused Havoc And Changed The Race | 2023 Monaco Grand Prix
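
The fixed key path in get_ytInitialTitles depends on the current shape of ytInitialData, which may change without notice. As a rough sketch of a less path-dependent approach, the hypothetical helper below walks any nested dict/list structure and collects the title of every 'videoRenderer' it meets. The sample data is invented and far smaller than the real structure.

```python
def iter_video_titles(node):
    """Yield the title text of every 'videoRenderer' dict found under `node`."""
    if isinstance(node, dict):
        renderer = node.get("videoRenderer")
        if isinstance(renderer, dict):
            title = renderer.get("title")
            runs = title.get("runs") if isinstance(title, dict) else None
            if runs and isinstance(runs[0], dict) and "text" in runs[0]:
                yield runs[0]["text"]
        # Keep descending, whatever the surrounding keys are called.
        for value in node.values():
            yield from iter_video_titles(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_titles(item)

# Invented sample mimicking the nesting used in get_ytInitialTitles.
sample = {
    "contents": {
        "tabs": [
            {"richItemRenderer": {"content": {"videoRenderer": {
                "title": {"runs": [{"text": "First video"}]}}}}},
            {"richItemRenderer": {"content": {"adSlotRenderer": {}}}},
            {"richItemRenderer": {"content": {"videoRenderer": {
                "title": {"runs": [{"text": "Second video"}]}}}}},
        ]
    }
}
titles = list(iter_video_titles(sample))
print(titles)  # ['First video', 'Second video']
```

The trade-off is that this picks up every video renderer in the data, not only those in the first tab's grid.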

