
Extract part of a string (/soup element) within a list of lists

Question


I'm having some issues with scraping fish images off a website.

    # (Assumes driver is an existing Selenium WebDriver instance, and that BeautifulSoup,
    # By, NoSuchElementException and time are imported elsewhere.)
    species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
                         '/fangster/almindelig-tangnaal-syngnathus-typhle/155',
                         '/fangster/ansjos-engraulis-encrasicholus/66',
                         '/fangster/atlantisk-tun-blaafinnet-tun-thunnus-thynnus-/137']
    titles = []
    species = []
    for x in species_with_foto:
        specie_page = 'https://www.fiskefoto.dk' + x
        driver.get(specie_page)
        content = driver.page_source
        soup = BeautifulSoup(content)
        brutto = soup.find_all('img', attrs={'class':'rapportBillede'})
        species.append(brutto)
        #print(brutto)
        titles.append(x)
        try:
            driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
            print('Clicked next', x)
        except NoSuchElementException:
            print('Successfully finished - :', x)
            time.sleep(2)

This returns a list of lists, with each sublist looking like this:

    [<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg" style="width:50%;"/>,
     <img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" style="width:calc(50% - 6px);margin-bottom:7px;"/>]

How can I clean up the list and keep only the src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" part? I have tried passing other arguments to soup.find_all but can't get it to work.

(The Selenium part is also not functioning properly, but I'll get to that afterwards.)
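
One way to get there - a minimal sketch, assuming brutto is one of the sublists of bs4 Tag objects shown above - is to read the src attribute off each tag rather than trying to filter it out inside find_all:

    # Keep only the value of the src attribute of each <img> tag in one sublist.
    # Tag.get() returns None when the attribute is missing, so such tags are skipped.
    src_list = [img.get('src') for img in brutto if img.get('src')]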

EDIT:

This is my code now, and I'm getting really close. One issue is that my photos are now saved in a flat list rather than a list of lists - for the love of god, I don't understand why this happens.

Help with fixing and understanding this would be greatly appreciated!

    titles = []
    fish_photos = []
    for x in species_with_foto_mini:
        site = "https://www.fiskefoto.dk/" + x
        html = urlopen(site)
        bs = BeautifulSoup(html, 'html.parser')
        titles.append(x)
        try:
            images = bs.find_all('img', attrs={'class':'rapportBillede'})
            for img in images:
                if img.has_attr('src'):
                    #print(img['src'])
                    a = (img['src'])
                    fish_photos.append(a)
        except KeyError:
            print('No src')
        # navigate pages
        try:
            driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
            print('Clicked next', x)
        except NoSuchElementException:
            print('Successfully finished -', x)
            time.sleep(2)
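
The flat list most likely comes from fish_photos.append(a) running once per image, so every path lands in the same top-level list. A minimal sketch of the grouping (assuming species_with_foto_mini is the list of page paths used above, and the urlopen/BeautifulSoup imports from that code): collect each page's paths into their own list first, then append that list as a single element.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    titles = []
    fish_photos = []  # becomes a list of lists, one sublist per species

    for x in species_with_foto_mini:
        bs = BeautifulSoup(urlopen("https://www.fiskefoto.dk/" + x), 'html.parser')
        titles.append(x)

        # Gather this species' image paths in their own list ...
        page_photos = [img['src']
                       for img in bs.find_all('img', attrs={'class': 'rapportBillede'})
                       if img.has_attr('src')]

        # ... and append that whole list as one element of fish_photos.
        fish_photos.append(page_photos)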

EDIT:

I need the end result to be a list of lists looking something like this:

    fish_photos = [['/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg',
                    '/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg'],
                   ['/images/400/tungehvarre-arnoglossus-laterna-medefiskeri-6650-2523403.jpg',
                    '/images/400/ulk-myoxocephalus-scorpius-medefiskeri-bundrig-koebenhavner-koebenhavner-torsk-mole-sild-boersteorm-pigge-351-18-48-9-680-2013-6-4.jpg'],
                   ['/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-5.02kg-77cm-6436-7486.jpg',
                    '/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-10.38kg-96cm-6337-4823146.jpg']]

EDIT:

My output is now a list of identical sublists. I need every species in its own list, like this: fish_photo_list = [[trout1, trout2, trout3], [other fish1, other fish 2, other], [salmon1, salmon2]]

My initial code did this, but it no longer does.

Answer 1

Score: 2


Here is an example you can adapt:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    site = "[insert name of the site]"
    html = urlopen(site)
    bs = BeautifulSoup(html, 'html.parser')
    try:
        images = bs.find_all('img')
        for img in images:
            if img.has_attr('src'):
                print(img['src'])
    except KeyError:
        print('No src')
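
To limit this to the report images rather than every img on the page, the same idea can be narrowed with the rapportBillede class mentioned in the question - a sketch building on the example above:

    # Restrict find_all to the image class used on the species pages,
    # then keep just the src attribute of each match.
    srcs = [img['src']
            for img in bs.find_all('img', attrs={'class': 'rapportBillede'})
            if img.has_attr('src')]
    print(srcs)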
