Python Selenium: getting URLs from Google search results

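Question

I am trying to get the first 10 URLs from Google search results with Selenium. I recall that there is an attribute other than innerHTML that gives the text inside the cite tags.

Here is the code: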

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service as ChromeService
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    # Configure Chrome (it runs with a visible window by default)
    chrome_options = Options()
    chrome_options.add_argument("start-maximized")
    # chrome_options.add_experimental_option("detach", True)
    chrome_options.add_argument("--no-sandbox")
    # Both switches go in one list: a second call with the same key replaces the first
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
    chrome_options.add_experimental_option("useAutomationExtension", False)
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")

    # Open Google
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)
    driver.get("https://www.google.com/")

    # Search term
    # var_inp = input('Write the name to search:')
    var_inp = "python google search"

    # Type the search term into the search box and press Enter
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys(var_inp + Keys.RETURN)

    # Find the first 10 companies: collect the <cite> elements of the results
    res_lst = []
    res = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "cite")))
    print(len(res))
    for r in res:
        print(r.get_attribute("innerHTML"))

    # TODO: take email addresses from company
    # TODO: send email

The result is below:

    https://github.com<span class="dyjrff qzEoUe" role="text"> › opsdisk</span>
    https://blog.apilayer.com<span class="dyjrff qzEoUe" role="text"> › h...</span>
    https://blog.apilayer.com<span class="dyjrff qzEoUe" role="text"> › h...</span>

I want to get rid of the <span ...> part, as I need only the URLs. I could strip it with a regular expression, but I would prefer get_attribute('TEXT') or some other call that returns the result directly.
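
For reference, Selenium's `WebElement.text` property returns an element's rendered text rather than its markup, so `r.text` should yield something like `https://github.com › opsdisk` with no `<span>` tags, although the breadcrumb is still attached (and elements that are not displayed come back as an empty string). The following is a minimal sketch of trimming such text down to the site URL. The helper name `cite_text_to_url` is hypothetical, and it assumes Google keeps separating the site from the breadcrumb with `›`.

```python
def cite_text_to_url(cite_text: str) -> str:
    """Return the part of a <cite> element's text that precedes the '›' breadcrumb."""
    # partition() returns the whole string as the head when '›' is absent
    head, _, _ = cite_text.partition("›")
    return head.strip()


# Strings shaped like what r.text would return for the results in the question
samples = ["https://github.com › opsdisk", "https://blog.apilayer.com › h...", ""]
urls = [cite_text_to_url(s) for s in samples]
print(urls)
```

In the loop from the question this would be used as `print(cite_text_to_url(r.text))`.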

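Answer 1

Score: 1

This works for this specific case:

    def remove_span(string):
        """Remove the first <span>...</span> element from an HTML string."""
        start = string.find("<span")
        close = string.find("</span>", max(start, 0))
        if start == -1 or close == -1:
            return string  # no complete span: leave the string unchanged
        return string[:start] + string[close + len("</span>"):]

The function manipulates the string and removes the span element from it.

    for r in res:
        print(remove_span(r.get_attribute('innerHTML')))  # e.g. https://github.com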
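
If a cite element ever contains more than one nested tag, string slicing gets fragile. A possible alternative, sketched here with the standard library's `html.parser`, keeps only the text that sits outside any nested tag. The names `TopLevelText` and `strip_nested_tags` are hypothetical, and the sketch assumes every nested tag is explicitly closed (an unclosed `<br>` would throw off the depth count).

```python
from html.parser import HTMLParser


class TopLevelText(HTMLParser):
    """Collect only the text that is not inside any nested tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of tags currently open
        self.parts = []   # text fragments seen at depth 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_nested_tags(inner_html: str) -> str:
    """Return the innerHTML string with all nested elements removed."""
    parser = TopLevelText()
    parser.feed(inner_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


sample = 'https://github.com<span class="dyjrff qzEoUe" role="text"> › opsdisk</span>'
print(strip_nested_tags(sample))
```

It would slot into the same loop as `print(strip_nested_tags(r.get_attribute('innerHTML')))`.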

Answer 2

Score: 1

The best way to get the value of the node is to use the JavaScript executor and read the text of the node's firstChild.

    # Uses the imports and chrome_options from the question
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)
    driver.get('https://www.google.com/')

    # Search term
    # var_inp = input('Write the name to search:')
    var_inp = 'python google search'

    # Type the search term into the search box and press Enter
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys(var_inp + Keys.RETURN)

    # Find the first 10 companies: collect the <cite> elements of the results
    res_lst = []
    res = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'cite')))
    print(len(res))
    for r in res:
        # firstChild is the text node that precedes the <span>, so only the site URL is returned
        print(driver.execute_script("return arguments[0].firstChild.textContent;", r))

Output:

    27
    https://pypi.org
    https://pypi.org
    https://www.geeksforgeeks.org
    https://www.geeksforgeeks.org
    https://stackoverflow.com
    https://stackoverflow.com
    https://www.geeksforgeeks.org
    https://www.geeksforgeeks.org
    https://www.geeksforgeeks.org
    https://www.geeksforgeeks.org
    https://www.jcchouinard.com
    https://www.jcchouinard.com
    https://www.educative.io
    https://www.educative.io
    https://python-googlesearch.readthedocs.io
    https://python-googlesearch.readthedocs.io
    https://medium.com
    https://medium.com
    https://medium.com
    https://medium.com
    https://github.com
    https://github.com
    https://github.com
    https://github.com
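
The output above lists most sites more than once and runs past the 10 entries the question asks for. One way to trim it, sketched here under the assumption that the scraped values have first been collected into a list instead of printed, is to de-duplicate while keeping the first-seen order. The helper name `first_unique` and the list name `scraped` are hypothetical.

```python
def first_unique(urls, limit=10):
    """Return up to `limit` distinct, non-empty URLs in first-seen order."""
    # Skip None and blank entries, which a cite without a text node could produce
    cleaned = (u.strip() for u in urls if u and u.strip())
    # dict keys keep insertion order, so fromkeys() de-duplicates without reordering
    return list(dict.fromkeys(cleaned))[:limit]


# The first few values from the output above
scraped = [
    "https://pypi.org",
    "https://pypi.org",
    "https://www.geeksforgeeks.org",
    "https://www.geeksforgeeks.org",
    "https://stackoverflow.com",
    "https://stackoverflow.com",
    "https://www.geeksforgeeks.org",
    "https://www.geeksforgeeks.org",
]
top = first_unique(scraped)
print(top)
```

Note that this collapses results by site, because the text node of each cite element holds only the site address and not the full page URL.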

Posted by huangapple on 2023-02-19 22:15:01
Source: https://go.coder-hub.com/75500738.html