Extract part of string (/soup element) within a list of lists

Question

I'm having some issues with scraping fish images off a website.

species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
 '/fangster/almindelig-tangnaal-syngnathus-typhle/155',
 '/fangster/ansjos-engraulis-encrasicholus/66',
 '/fangster/atlantisk-tun-blaafinnet-tun-thunnus-thynnus-/137']

titles = []
species = []
for x in species_with_foto:
    specie_page = 'https://www.fiskefoto.dk'+x
    driver.get(specie_page)
    content = driver.page_source
    soup = BeautifulSoup(content)
    brutto = soup.find_all('img', attrs={'class':'rapportBillede'})
    species.append(brutto)
    #print(brutto)
    titles.append(x)
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('Clicked next', x)
    except NoSuchElementException:
        print('Successfully finished - :', x)
    time.sleep(2)

This returns a list of lists, with each sublist looking like this:

[<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg" style="width:50%;"/>,
  <img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" style="width:calc(50% - 6px);margin-bottom:7px;"/>]

How can I clean up the list and keep only the src part, e.g. src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg"? I have tried other arguments in soup.find_all but can't get it to work.

(The Selenium part is also not functioning properly, but I'll get to that afterwards.)

EDIT:

This is my code now; I'm really getting close. One issue is that my photos are now saved in a single flat list instead of a list of lists, and I don't understand why this happens.

Help to fix and understand would be greatly appreciated!

titles = []
fish_photos = []

for x in species_with_foto_mini:
    site = "https://www.fiskefoto.dk/"+x
    html = urlopen(site)
    bs = BeautifulSoup(html, 'html.parser')
    titles.append(x)
    
    try: 
        images = bs.find_all('img', attrs={'class':'rapportBillede'})
        for img in images:
            if img.has_attr('src'):
                #print(img['src'])
                a = (img['src'])                     
                fish_photos.append(a)
    except KeyError:
        print('No src')
        
    #navigate pages
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('Clicked next', x)
    except NoSuchElementException:
        print('Successfully finished -', x)
    time.sleep(2)

EDIT:

I need the end result to be a list of lists looking something like this:

fish_photos = [
    ['/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg',
     '/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg'],
    ['/images/400/tungehvarre-arnoglossus-laterna-medefiskeri-6650-2523403.jpg',
     '/images/400/ulk-myoxocephalus-scorpius-medefiskeri-bundrig-koebenhavner-koebenhavner-torsk-mole-sild-boersteorm-pigge-351-18-48-9-680-2013-6-4.jpg'],
    ['/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-5.02kg-77cm-6436-7486.jpg',
     '/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-10.38kg-96cm-6337-4823146.jpg'],
]

EDIT:

My output is now a list of identical lists. I need each species in its own list, like this: fish_photo_list = [[trout1, trout2, trout3], [other fish1, other fish2, other fish3], [salmon1, salmon2]]

My initial code did this, but the current version does not.

Answer 1

Score: 2

Here is an example that you can adapt:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

try:
    images = bs.find_all('img')
    for img in images:
        # Only print tags that actually carry a src attribute.
        if img.has_attr('src'):
            print(img['src'])
except KeyError:
    print('No src')
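
Regarding the later edits in the question (a flat list, or a list of identical lists, instead of one list per species): a likely cause is where the inner list is created. Appending every src straight to the outer list gives a flat list, and creating the inner list once outside the loop makes every entry point to the same list object. Below is a minimal sketch of the grouping step only, assuming the scraping itself already works. The names collect_photos and get_srcs and the sample data are hypothetical and not taken from the question; in real use, get_srcs would wrap the find_all loop shown above.

```python
def collect_photos(pages, get_srcs):
    """Return one inner list of src strings per page, in page order."""
    fish_photos = []
    for page in pages:
        # Create a new inner list for every page. Creating it once outside
        # the loop would make every entry of fish_photos the same object.
        page_photos = []
        for src in get_srcs(page):
            page_photos.append(src)
        # Append the inner list as a whole, not its individual items.
        fish_photos.append(page_photos)
    return fish_photos


# Made-up sample data standing in for the real scraping step (assumption).
sample = {
    '/fangster/aborre-perca-fluviatilis/1': ['/images/400/a1.jpg', '/images/400/a2.jpg'],
    '/fangster/ansjos-engraulis-encrasicholus/66': ['/images/400/b1.jpg'],
}

fish_photos = collect_photos(list(sample), sample.get)
print(fish_photos)
```

Because the outer list gets exactly one entry per page, fish_photos[i] lines up with titles[i] if titles is filled in the same loop order.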

huangapple
  • Published on February 16, 2023 at 18:24:22
  • Link to this article: https://go.coder-hub.com/75470863.html