2023-02-16 18:24:22

Extract part of a string (/soup element) within a list of lists

Question

I'm having some issues with scraping fish images off a website.

species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
 '/fangster/almindelig-tangnaal-syngnathus-typhle/155',
 '/fangster/ansjos-engraulis-encrasicholus/66',
 '/fangster/atlantisk-tun-blaafinnet-tun-thunnus-thynnus-/137']
titles = []
species = []
for x in species_with_foto:
    specie_page = 'https://www.fiskefoto.dk'+x
    driver.get(specie_page)
    content = driver.page_source
    soup = BeautifulSoup(content)
    brutto = soup.find_all('img', attrs={'class':'rapportBillede'})
    species.append(brutto)
    #print(brutto)
    titles.append(x)
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('CLicked next', x)
    except NoSuchElementException:
        print('Succesfully finished - :', x)
    time.sleep(2)

This returns a list of lists with the sublist looking like this:

[<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg" style="width:50%;"/>,
  <img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" style="width:calc(50% - 6px);margin-bottom:7px;"/>]

How can I clean up the list and keep only the src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" part? I have tried passing other arguments to soup.find_all but can't get it to work.

(The Selenium part is also not working properly, but I'll get to that afterwards.)

EDIT:

This is my code now, and I'm getting really close. One issue is that my photos are now saved in a flat list instead of a list of lists, and I don't understand why this happens.

Help fixing and understanding this would be greatly appreciated!

titles = []
fish_photos = []
for x in species_with_foto_mini:
    site = "https://www.fiskefoto.dk/"+x
    html = urlopen(site)
    bs = BeautifulSoup(html, 'html.parser')
    titles.append(x)

    try:
        images = bs.find_all('img', attrs={'class':'rapportBillede'})
        for img in images:
            if img.has_attr('src'):
                #print(img['src'])
                a = (img['src'])
                fish_photos.append(a)
    except KeyError:
        print('No src')

    #navigate pages
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('CLicked next', x)
    except NoSuchElementException:
        print('Succesfully finished -', x)
    time.sleep(2)

EDIT:

I need the end result to be a list of lists looking something like this:

fish_photos =

[['/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg',
  '/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg'],
 ['/images/400/tungehvarre-arnoglossus-laterna-medefiskeri-6650-2523403.jpg',
  '/images/400/ulk-myoxocephalus-scorpius-medefiskeri-bundrig-koebenhavner-koebenhavner-torsk-mole-sild-boersteorm-pigge-351-18-48-9-680-2013-6-4.jpg'],
 ['/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-5.02kg-77cm-6436-7486.jpg',
  '/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-10.38kg-96cm-6337-4823146.jpg']]

EDIT:

My output is now a list of identical lists. I need every species to end up in its own list, like this: fish_photo_list = [[trout1, trout2, trout3], [other fish1, other fish 2, other], [salmon1, salmon2]]

My initial code did this, but the current version does not.

Answer 1

Score: 2

Here is an example that you can adapt:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Replace the placeholder with the URL of the page to scrape.
site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')
try:
    images = bs.find_all('img')
    for img in images:
        # Skip <img> tags that have no src attribute.
        if img.has_attr('src'):
            print(img['src'])
except KeyError:
    print('No src')
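
To get one inner list per species, as the question asks, the same idea can be applied one level up. The following is only a minimal sketch under stated assumptions: the helper names `srcs_from_tags` and `srcs_per_species` are made up for illustration, and plain dicts stand in for the BeautifulSoup tags so the snippet runs without bs4 or a network connection. It relies only on `.get("src")`, which both dicts and bs4 `Tag` objects provide.

```python
def srcs_from_tags(tags):
    """Return the src value of every tag that has a non-empty one."""
    return [tag.get("src") for tag in tags if tag.get("src")]


def srcs_per_species(species):
    """Map a list of tag lists to a list of src lists, one per species.

    `species` is assumed to be the list built in the question with
    species.append(brutto), i.e. one find_all() result per page.
    """
    return [srcs_from_tags(tags) for tags in species]


# Stand-in data (hypothetical): dicts instead of real <img> Tag objects.
demo = [
    [{"src": "/images/400/a.jpg"}, {"src": "/images/400/b.jpg"}],
    [{"alt": "no src here"}, {"src": "/images/400/c.jpg"}],
]
print(srcs_per_species(demo))
# [['/images/400/a.jpg', '/images/400/b.jpg'], ['/images/400/c.jpg']]
```

In the edited loop from the question, the list most likely ends up flat because `fish_photos.append(a)` runs once per image on the outer list. Building a fresh list inside `for x in ...` and appending that list to `fish_photos` once per page should keep the nesting.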
