Load More using Selenium on Web Scraping

Question

I was trying to do web scraping on Reuters for NLP analysis. Most of it works, but I am unable to get the code to click the "load more" button to fetch more news articles. Below is the code currently in use:

```python
import csv
import time
import pprint
from datetime import datetime, timedelta
import requests
import nltk
nltk.download('vader_lexicon')
from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4.element import Tag

comp_name = 'Apple'
url = 'https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all'
res = requests.get(url.format(1))
soup = BeautifulSoup(res.text, "lxml")
for item in soup.find_all("h3", {"class": "search-result-title"}):
    s = str(item)
    article_addr = s.partition('a href="')[2].partition('">')[0]
    headline = s.partition('a href="')[2].partition('">')[2].partition('</a></h3>')[0]
    article_link = 'https://www.reuters.com' + article_addr
    try:
        resp = requests.get(article_addr)
    except Exception as e:
        try:
            resp = requests.get(article_link)
        except Exception as e:
            continue
    sauce = BeautifulSoup(resp.text, "lxml")
    dateTag = sauce.find("div", {"class": "ArticleHeader_date"})
    contentTag = sauce.find("div", {"class": "StandardArticleBody_body"})
    date = None
    title = None
    content = None
    if isinstance(dateTag, Tag):
        date = dateTag.get_text().partition('/')[0]
    if isinstance(contentTag, Tag):
        content = contentTag.get_text().strip()
    time.sleep(3)
    link_soup = BeautifulSoup(content)
    sentences = link_soup.findAll("p")
    print(date, headline, article_link)
```
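As a side note, the string-`partition` parsing above is fragile; BeautifulSoup can read the `href` and headline directly off the anchor tag. A minimal sketch of that alternative, using hypothetical inline HTML that mimics the Reuters search-result markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the live search-result markup
html = '''
<h3 class="search-result-title"><a href="/article/apple-idUS1">Apple news one</a></h3>
<h3 class="search-result-title"><a href="/article/apple-idUS2">Apple news two</a></h3>
'''

soup = BeautifulSoup(html, "html.parser")
results = []
for item in soup.find_all("h3", {"class": "search-result-title"}):
    link = item.find("a")  # the anchor holds both the href and the headline text
    if link is None:
        continue
    article_link = "https://www.reuters.com" + link["href"]
    headline = link.get_text(strip=True)
    results.append((headline, article_link))

print(results)
```

This avoids any dependence on the exact serialized HTML string, so attribute order or extra whitespace in the markup no longer breaks the extraction.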
```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import time

browser = webdriver.Safari()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
try:
    element = WebDriverWait(browser, 3).until(EC.presence_of_element_located((By.ID, 'Id_Of_Element')))
except TimeoutException:
    print("Time out!")
```



Answer 1

Score: 3

To click the element with the text LOAD MORE RESULTS, you need to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategy:

  • Code Block:

```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
comp_name = 'Apple'
driver.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
while True:
    try:
        driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='search-result-more-txt']"))))
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt']"))).click()
        print("LOAD MORE RESULTS button clicked")
    except TimeoutException:
        print("No more LOAD MORE RESULTS button to be clicked")
        break
driver.quit()
```
  • Console Output:

```
LOAD MORE RESULTS button clicked
LOAD MORE RESULTS button clicked
LOAD MORE RESULTS button clicked
.
.
No more LOAD MORE RESULTS button to be clicked
```
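The loop above is a general "retry until exhausted" pattern: keep performing an action until a TimeoutException signals there is no more work. A browser-free sketch of the same control flow, with a hypothetical `fake_click` standing in for the Selenium click:

```python
# Stand-in for selenium.common.exceptions.TimeoutException
class TimeoutException(Exception):
    pass

def click_until_exhausted(action):
    """Invoke action() until it raises TimeoutException; return the click count."""
    clicks = 0
    while True:
        try:
            action()
            clicks += 1
        except TimeoutException:
            break
    return clicks

# Simulated LOAD MORE RESULTS button that vanishes after 3 clicks
remaining = [3]
def fake_click():
    if remaining[0] == 0:
        raise TimeoutException("no more button")
    remaining[0] -= 1

n = click_until_exhausted(fake_click)
print(n)  # prints 3
```

The key design point is that the break lives in the exception handler, so the loop terminates exactly when the wait times out rather than after a fixed number of iterations.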

Reference

You can find a relevant detailed discussion in:


Answer 2

Score: 0

To click on LOAD MORE RESULTS, induce WebDriverWait() and element_to_be_clickable().

Use a while loop and check the counter < 11 to click 10 times.

I have tested on Chrome since I don't have the Safari browser; however, it should work there too.
```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

comp_name = "Apple"
browser = webdriver.Chrome()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
# Accept the terms button
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#_evidon-banner-acceptbutton"))).click()
i = 1
while i < 11:
    try:
        element = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt' and text()='LOAD MORE RESULTS']")))
        element.location_once_scrolled_into_view
        browser.execute_script("arguments[0].click();", element)
        print(i)
        i = i + 1
    except TimeoutException:
        print("Time out!")
        break  # stop when the button is gone instead of retrying forever
```
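Once either answer's load-more loop has finished, the fully expanded DOM can be handed back to BeautifulSoup via `browser.page_source` and the original parsing code reused. A minimal sketch, using inline HTML as a stand-in for the live page source:

```python
from bs4 import BeautifulSoup

def count_results(page_source):
    """Count search-result headlines in the expanded page, as a sanity check
    that the LOAD MORE clicks actually added articles."""
    soup = BeautifulSoup(page_source, "html.parser")
    return len(soup.select("h3.search-result-title"))

# Inline HTML standing in for browser.page_source after the clicks
sample = '<h3 class="search-result-title"><a href="/a">x</a></h3>' * 25
print(count_results(sample))  # prints 25
```

In the real script this would be `count_results(browser.page_source)`, called after the while loop and before `browser.quit()`.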

huangapple
  • Published on January 7, 2020 at 01:17:33
  • When reposting, please keep this link: https://go.coder-hub.com/59616309.html