I get the same output in a for loop

Question

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    import pandas as pd

    s = Service("C:\selenium driver\chromedriver.exe")
    driver = webdriver.Chrome(service=s)

    companies_names = []
    persons_names = []
    phones_numbers = []
    locations = []
    opening_hours = []
    descriptions = []
    websites_links = []
    all_profiles = []

    driver.get("https://www.saveface.co.uk/search/")
    driver.implicitly_wait(10)

    blocks = driver.find_elements(By.XPATH, "//div[@class='result clientresult']")

    for block in range(30):
        company_name = blocks[block].find_element(By.XPATH, "//h3[@class='resulttitle']").text.strip()
        companies_names.append(company_name)
        person_name = blocks[block].find_element(By.XPATH, "//p[@class='name_wrapper']").text.strip()
        persons_names.append(person_name)
        phone_number = blocks[block].find_element(By.XPATH, "//div[@class='searchContact phone']").text.strip()
        phones_numbers.append(phone_number)
        location = blocks[block].find_element(By.XPATH, "//li[@class='cls_loc']").text.strip()
        locations.append(location)
        opening_hour = blocks[block].find_element(By.XPATH, "//li[@class='opening-hours']").text.strip()
        opening_hours.append(opening_hour)
        profile = blocks[block].find_element(By.XPATH, "//a[@class='visitpage']").get_attribute("href")
        all_profiles.append(profile)
        print(company_name, person_name, phone_number, location, opening_hour, profile)
        if block == 29:
            two_page = driver.find_element(By.XPATH, "//a[@class='facetwp-page']")
            two_page.click()
            driver.implicitly_wait(10)
            blocks = driver.find_elements(By.XPATH, "//div[@class='result clientresult']")

    for i in range(len(all_profiles)):
        driver.get(all_profiles[i])
        description = driver.find_element(By.XPATH, "//div[@class='desc-text-left']").text.strip()
        descriptions.append(description)
        website_link = driver.find_element(By.XPATH, "//a[@class='visitwebsite website']").get_attribute("href")
        websites_links.append(website_link)
        driver.implicitly_wait(10)

    driver.close()

    df = pd.DataFrame(
        {
            "company_name": companies_names,
            "person_name": persons_names,
            "phone_number": phones_numbers,
            "location": locations,
            "opening_hour": opening_hours,
            "description": descriptions,
            "website_link": websites_links,
            "profile_on_saveface": all_profiles
        }
    )
    df.to_csv('saveface.csv', index=False)
    #print(df)

This is the result:

    The Hartley Clinic Clinic Contact: Ailing Jeavons 01256 856289 , , Fleet, RG27 8NZ Monday 8:30 17:00 Tuesday 8:30 19:00 Wednesday 8:30 17:00 Thursday 8:30 17:00 Friday 8:30 15:00 Saturday 9:00 17:00 Sunday Closed https://www.saveface.co.uk/clinic/the-hartley-clinic/
    The Hartley Clinic Clinic Contact: Ailing Jeavons 01256 856289 , , Fleet, RG27 8NZ Monday 8:30 17:00 Tuesday 8:30 19:00 Wednesday 8:30 17:00 Thursday 8:30 17:00 Friday 8:30 15:00 Saturday 9:00 17:00 Sunday Closed https://www.saveface.co.uk/clinic/the-hartley-clinic/
    The Hartley Clinic Clinic Contact: Ailing Jeavons 01256 856289 , , Fleet, RG27 8NZ Monday 8:30 17:00 Tuesday 8:30 19:00 Wednesday 8:30 17:00 Thursday 8:30 17:00 Friday 8:30 15:00 Saturday 9:00 17:00 Sunday Closed https://www.saveface.co.uk/clinic/the-hartley-clinic/
    The Hartley Clinic Clinic Contact: Ailing Jeavons 01256 856289 , , Fleet, RG27 8NZ Monday 8:30 17:00 Tuesday 8:30 19:00 Wednesday 8:30 17:00 Thursday 8:30 17:00 Friday 8:30 15:00 Saturday 9:00 17:00 Sunday Closed https://www.saveface.co.uk/clinic/the-hartley-clinic/
    The Hartley Clinic Clinic Contact: Ailing Jeavons 01256 856289 , , Fleet, RG27 8NZ Monday 8:30 17:00 Tuesday 8:30 19:00 Wednesday 8:30 17:00 Thursday 8:30 17:00 Friday 8:30 15:00 Saturday 9:00 17:00 Sunday Closed https://www.saveface.co.uk/clinic/the-hartley-clinic/
    The Hartley Clinic Clinic Contact: Ailing Jeavons 01256 856289 , , Fleet, RG27 8NZ Monday 8:30 17:00 Tuesday 8:30 19:00 Wednesday 8:30 17:00 Thursday 8:30 17:00 Friday 8:30 15:00 Saturday 9:00 17:00 Sunday Closed https://www.saveface.co.uk/clinic/the-hartley-clinic/

# Answer 1

**Score**: 1

To restrict the search to the subtree rooted at the context node, your expression should start with `.//`, so you have to replace `//` with `.//` in each of the commands

    ... = blocks[block].find_element(...)

The meaning of `//` is to search from the document's root, ignoring the context node `blocks[block]` altogether.
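For example, the first lookups inside the question's loop would become relative to each block (a minimal sketch; `driver` is assumed to be the Chrome driver already created in the question, and the class names are the ones from the original code):

    from selenium.webdriver.common.by import By

    blocks = driver.find_elements(By.XPATH, "//div[@class='result clientresult']")
    for block in blocks:
        # ".//" restricts the search to this block instead of the whole document
        company_name = block.find_element(By.XPATH, ".//h3[@class='resulttitle']").text.strip()
        profile = block.find_element(By.XPATH, ".//a[@class='visitpage']").get_attribute("href")
        print(company_name, profile)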

Moreover, notice that not all the blocks have a location, as you can see from this image:

[![enter image description here][1]][1]

In this case,

    location = blocks[block].find_element(By.XPATH, "//li[@class='cls_loc']")

will raise a `NoSuchElementException`. To avoid this, you have to put the command in a `try...except` block, as sketched below.
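A minimal sketch of that guard (the empty-string fallback is only an illustration; the original answer does not prescribe a particular placeholder):

    from selenium.common.exceptions import NoSuchElementException

    try:
        location = blocks[block].find_element(By.XPATH, ".//li[@class='cls_loc']").text.strip()
    except NoSuchElementException:
        location = ""  # this result block has no location element
    locations.append(location)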

# UPDATE

Scraping the 400 blocks with Selenium takes about 1 minute on my computer; I tried with BeautifulSoup and it takes less than 1 second! The slow part is scraping the profiles, because for each of them we have to download a new webpage, but even that is still much faster with BeautifulSoup.

So I wrote a script that doesn't use Selenium, only BeautifulSoup (you can install it by running `pip install beautifulsoup4` in the terminal):

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.saveface.co.uk/search/'
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # CSS selector for each field to extract from a result block
    css_selector = {
        'company name' : ".title",
        'person name'  : ".name_wrapper",
        'phone number' : ".phone",
        'location'     : ".cls_loc",
        'opening hours': ".opening-hours",
        'profile link' : ".visitpage",
    }
    data = {key: [] for key in list(css_selector) + ['description', 'website link']}

    # the number of result pages is embedded in the page source as 'total_pages":N'
    number_of_pages = int(str(soup).split('total_pages":')[1].split('}')[0])

    for page in range(2, number_of_pages + 2):
        blocks = soup.select('.clientresult')
        for idx, block in enumerate(blocks):
            print(f'blocks {idx+1}/{len(blocks)}', end='\r')
            for key in list(css_selector):
                try:
                    if 'link' in key:
                        data[key] += [block.select_one(css_selector[key])['href']]
                    else:
                        data[key] += [block.select_one(css_selector[key]).text.strip().replace('\r\n', ', ')]
                except AttributeError:
                    data[key] += ['*missing value*']
        if page <= number_of_pages:
            print('\nloading page', page)
            url_page = f'{url}?fwp_paged={page}'
            soup = BeautifulSoup(requests.get(url_page).text, "html.parser")

    print('\nno more pages to load, moving to scrape profile links...')

    for idx, url in enumerate(data['profile link']):
        print(f"profile link {idx+1}/{len(data['profile link'])} ", end='\r')
        soup_profile = BeautifulSoup(requests.get(url).text, "html.parser")
        try:
            data['description'] += [soup_profile.select_one('.clinicContent > .description').text.strip()]
        except AttributeError:
            data['description'] += ['*missing value*']
        try:
            data['website link'] += [soup_profile.select_one('.visitwebsite')['href']]
        except AttributeError:
            data['website link'] += ['*missing value*']

Output (it took about 8 minutes to complete the execution):

    blocks 400/400
    loading page 2
    blocks 109/109
    no more pages to load, moving to scrape profile links...
    profile link 509/509

Then you can easily create the dataframe by running `pd.DataFrame(data)`:

[![enter image description here][2]][2]
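If you also want the CSV file that the question's script produced, a minimal sketch (the filename `saveface.csv` is taken from the question, not from this answer):

    import pandas as pd

    df = pd.DataFrame(data)
    df.to_csv('saveface.csv', index=False)  # same output file as the original Selenium script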
[1]: https://i.stack.imgur.com/RLjpLm.png
[2]: https://i.stack.imgur.com/vsGn0.png
# Answer 2

**Score**: 0

This is the new code, but why does it return the same output on every page:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    import pandas as pd

    s = Service("C:\selenium driver\chromedriver.exe")
    driver = webdriver.Chrome(service=s)

    companies_names = []
    persons_names = []
    phones_numbers = []
    locations = []
    opening_hours = []
    descriptions = []
    websites_links = []
    all_profiles = []

    driver.get("https://www.saveface.co.uk/search/")

    driver.implicitly_wait(10)

    pages = driver.find_elements(By.XPATH, ".//a[@class='facetwp-page']")

    for page in range(len(pages) + 1):
        blocks = driver.find_elements(By.XPATH, ".//div[@class='result clientresult']")
        for block in range(10):
            try:
                company_name = blocks[block].find_element(By.XPATH, ".//h3[@class='resulttitle']").text.strip()
                companies_names.append(company_name)
            except:
                companies_names.append("Not found on the site")
            try:
                person_name = blocks[block].find_element(By.XPATH, ".//p[@class='name_wrapper']").text.strip()
                persons_names.append(person_name)
            except:
                persons_names.append("Not found on the site")
            try:
                phone_number = blocks[block].find_element(By.XPATH, ".//div[@class='searchContact phone']").text.strip()
                phones_numbers.append(phone_number)
            except:
                phones_numbers.append("Not found on the site")
            try:
                location = blocks[block].find_element(By.XPATH, ".//li[@class='cls_loc']").text.strip()
                locations.append(location)
            except:
                locations.append("Not found on the site")
            try:
                opening_hour = blocks[block].find_element(By.XPATH, ".//li[@class='opening-hours']").text.strip()
                opening_hours.append(opening_hour)
            except:
                opening_hours.append("Not found on the site")
            try:
                profile = blocks[block].find_element(By.XPATH, ".//a[@class='visitpage']").get_attribute("href")
                all_profiles.append(profile)
            except:
                all_profiles.append("Not found on the site")
        two_page = driver.find_element(By.XPATH, ".//a[@class='facetwp-page']")
        two_page.click()

    for i in range(len(all_profiles)):
        try:
            driver.get(all_profiles[i])
            driver.implicitly_wait(10)
            try:
                description = driver.find_element(By.XPATH, ".//div[@class='desc-text-left']").text.strip()
                descriptions.append(description)
            except:
                descriptions.append("Not found on the site")
            try:
                website_link = driver.find_element(By.XPATH, ".//a[@class='visitwebsite website']").get_attribute("href")
                websites_links.append(website_link)
            except:
                websites_links.append("Not found on the site")
        except:
            descriptions.append("Not found on the site")
            websites_links.append("Not found on the site")

    driver.implicitly_wait(10)
    driver.close()

    df = pd.DataFrame(
        {
            "company_name": companies_names,
            "person_name": persons_names,
            "phone_number": phones_numbers,
            "location": locations,
            "opening_hour": opening_hours,
            "description": descriptions,
            "website_link": websites_links,
            "profile_on_saveface": all_profiles
        }
    )

    df.to_csv('saveface.csv', index=False)
    print(df)

