Can someone help me properly scrape YouTube titles in Python using BS4?

Question

I want to collect YouTube video titles using BS4 in Python. This is the code GPT recommended, but it doesn't work well. I'm hoping a knowledgeable coder here can help. Thank you.

import requests
from bs4 import BeautifulSoup

def get_youtube_titles():
    url = 'https://www.youtube.com/'

    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
    
        # Find YouTube title elements
        title_elements = soup.find_all('a', class_='yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media')
    
        # Extract and print the titles
        for title_element in title_elements:
            title = title_element.text.strip()
            print(title)
    
    except requests.exceptions.RequestException as e:
        print('Network connection error:', e)

# Get YouTube titles

get_youtube_titles()

I asked GPT, but its code doesn't work well.

Answer 1

Score: 1

Your code uses requests.get, so you only get the source HTML, which is not the same as the fully rendered HTML you see when you inspect the page in your browser. To get the rendered HTML, you would need something that runs JavaScript (like Selenium, and don't forget to add some wait time to let the page load).
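A rough sketch of the Selenium route, in case it helps. The function name get_rendered_titles, the default CSS selector (built from the class names in the question) and the 15-second timeout are my own assumptions, not something YouTube guarantees; a consent page or a layout change can mean the selector never matches, in which case the wait raises a timeout error.

```python
def get_rendered_titles(url="https://www.youtube.com/",
                        selector="a.yt-simple-endpoint.ytd-rich-grid-media",
                        timeout=15):
    """Open `url` in a real browser, wait for `selector`, return the link texts.

    `selector` is an assumption derived from the question's class names.
    """
    # Imported inside the function so this file still loads where
    # Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome is available locally
    try:
        driver.get(url)
        # Explicit wait: block until at least one matching element exists,
        # or raise TimeoutException after `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        # Keep only anchors that actually carry visible text.
        return [el.text.strip() for el in elements if el.text.strip()]
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling get_rendered_titles() opens a browser window, so it is much slower than requests.get; that is the trade-off for getting the rendered page.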

However, if all you want are some titles, you can try extracting from the script tags that contain the JavaScript with the following functions:

# import json

## a general function for extracting a JavaScript variable from a bs4 object
def get_jsScriptVal(jSoup, valDecl, isJson=True):
    script_finder = lambda s: s and valDecl in s
    # find_all returns every <script> tag whose text contains valDecl
    for sc in jSoup.find_all('script', string=script_finder):
        for st in sc.string.split(';'):
            # NOTE: the right-hand side of this assignment was cut off in the source
            # (only its closing ")]" survived); the expression below is a reconstruction
            # that splits each statement at its first "=" and pads missing parts with ''.
            ls, rs, *_ = [p.strip() for p in (st.split('=', 1) + ['', ''])]
            if ls == valDecl and rs:
                return json.loads(rs) if isJson else rs

## specifically for your case
def get_ytInitialTitles(ySoup):
    contents = get_jsScriptVal(ySoup, 'var ytInitialData')['contents']
    tab1 = contents['twoColumnBrowseResultsRenderer']['tabs'][0]
    contents = tab1['tabRenderer']['content']['richGridRenderer']['contents']
    # keep only the grid items that actually hold a video
    contents = [
        c['richItemRenderer']['content']['videoRenderer'] for c in contents
        if 'richItemRenderer' in c and 'videoRenderer' in c['richItemRenderer']['content']
    ]
    titles = [c['title']['runs'][0]['text'] for c in contents]
    return titles
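As a side note, splitting the script on ';' can cut a value in half if the JSON itself contains a semicolon. One possible alternative, sketched below with only the standard library, is to find the assignment and let json.JSONDecoder.raw_decode parse exactly one JSON value from that position. The name extract_js_object and the sample string are made up for illustration; this is not one of the two functions used in the rest of this answer.

```python
import json
import re

def extract_js_object(script_text, var_name):
    """Return the JSON value assigned to `var_name` in some JavaScript text, or None."""
    # Locate "<var_name> =" and skip any whitespace before the value.
    match = re.search(re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (such as the trailing ";").
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value

# Hand-made sample: note the ";" and "=" inside the string value.
sample = 'var ytInitialData = {"note": "a; b = c", "n": 1};var other = 2;'
data = extract_js_object(sample, "var ytInitialData")
print(data)
```

With a real page you would pass the text of the matching script tag as script_text.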

Now, if you edit your code to use the functions above:

import requests
from bs4 import BeautifulSoup
import json

#### DON'T FORGET TO PASTE THE FUNCTION DEFINITIONS INTO YOUR CODE TOO ####
## def get_jsScriptVal....
## def get_ytInitialTitles....
##########################################################################

def get_youtube_titles():
    url = 'https://www.youtube.com/'
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
    
        # titles = get_ytInitialTitles(soup) # Find YouTube title elements
        # for title in titles: print(title) # Extract and print the titles

        # OR [in one line]
        for title in get_ytInitialTitles(soup): print(title)
    
    except Exception as e:
        print('Failed to scrape due to', type(e), ':', e)

get_youtube_titles()

Then it should print something like:

Survive 100 Days In Circle, Win $500,000
lofi hip hop radio 📚 - beats to relax/study to
Spectaculair ingekleurde film over het begin van de Duitse bezetting van Nederland tijdens WOII
Omtzigt is WOEST & SLOOPT liegende Rutte! 'Kijk die ouders in hun ogen!'
Ineens vielen er bommen op zonnepanelen... Algemene beschouwingen Venlo 2023
Trump Opens Up on Secret White House Documents, Biden Family & Republican Opponents | Trump LIVE
An einem Tag nach Mallorca und zurück: Was verdient ein Flugbegleiter? | Lohnt sich das | BR
I BUILT A SHELTER IN THE FOREST!! AND LIVED THERE FOR 2 MONTHS!
De halvering van China
Ibiza Summer Mix 2023 🍓 Best Of Tropical Deep House Music Chill Out Mix 2023🍓 Chillout Lounge #153
Tibetaanse Genezende Fluit • Afgifte van melatonine en gifstoffen • Elimineer stress en kalmeer ...
Alle 200 POTLODEN GEBRUIKEN in 1 TEKENING - Tekenen Challenge
Top 10 BEST Auditions on BGT 2023!
Ontspannende muziek tot opluchting stress, angst en depressie 🐬 Verzachtende muziek voor zenuwen
6 juni 1944, D-Day, Operatie Overlord | Ingekleurd
Ed Sheeran, Martin Garrix, Kygo, Dua Lipa, Avicii, Robin Schulz, The Chainsmokers Style - Feeling Me
DIY with Mr Bean | Full Episodes | Classic Mr Bean
EEN WEDSTRIJD VOL AFSCHEID! 😭🫡 | Barcelona vs Mallorca | La Liga 2022/23 | Samenvatting
Deep Focus Music To Improve Concentration - 12 Hours of Ambient Study Music to Concentrate #506
The Inside Guys React To The Miami Heat's Blowout Game 7 Win In Boston | NBA on TNT
Muziek genezen om stress, vermoeidheid, depressie, negativiteit, detoxemoties te verlichten
How Rain Caused Havoc And Changed The Race | 2023 Monaco Grand Prix
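The fixed key path in get_ytInitialTitles will raise a KeyError as soon as YouTube rearranges that structure. If that happens, a more tolerant approach might be to walk the whole parsed object and pick up every 'videoRenderer' wherever it sits. The sketch below assumes titles still live under title -> runs -> [0] -> text, as in the code above; iter_video_titles and the sample dict are hypothetical, and the sample is a hand-made miniature, not real YouTube data.

```python
def iter_video_titles(node):
    """Yield the title text of every 'videoRenderer' dict found anywhere in `node`."""
    if isinstance(node, dict):
        renderer = node.get("videoRenderer")
        if isinstance(renderer, dict):
            title = renderer.get("title")
            runs = title.get("runs", []) if isinstance(title, dict) else []
            if runs and isinstance(runs[0], dict) and "text" in runs[0]:
                yield runs[0]["text"]
        # Keep descending: renderers can be nested at any depth.
        for value in node.values():
            yield from iter_video_titles(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_titles(item)

# Hand-made miniature of the nested layout, for illustration only.
sample = {
    "contents": {"richGridRenderer": {"contents": [
        {"richItemRenderer": {"content": {"videoRenderer": {
            "title": {"runs": [{"text": "First video"}]}}}}},
        {"richSectionRenderer": {}},  # non-video entries are skipped
        {"richItemRenderer": {"content": {"videoRenderer": {
            "title": {"runs": [{"text": "Second video"}]}}}}},
    ]}}
}
titles = list(iter_video_titles(sample))
print(titles)
```

In the script above you could then loop over iter_video_titles(get_jsScriptVal(soup, 'var ytInitialData')) instead of calling get_ytInitialTitles(soup).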

huangapple
  • Posted on May 30, 2023, 09:16:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76361069.html