2023-02-19 22:15:01

Python Selenium: getting URLs from Google search results

Question

I am trying to get the first 10 URLs from Google search results with Selenium. I know there is something other than innerHTML that would give me just the text inside the cite tags.

Here is the code:

# Open Google
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = Options()
chrome_options.headless = False
chrome_options.add_argument("start-maximized")
# chrome_options.add_experimental_option("detach", True)
chrome_options.add_argument("--no-sandbox")
# Both switches go in one call: a second "excludeSwitches" call would overwrite the first
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)
driver.get("https://www.google.com/")

# Type the search term into the search box and submit it
# var_inp = input("Write the name to search:")
var_inp = "python google search"
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys(var_inp + Keys.RETURN)

# Find the first 10 companies: collect the <cite> elements that display each result's URL
res_lst = []
res = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "cite")))
print(len(res))
for r in res:
    print(r.get_attribute("innerHTML"))
# TODO: take email addresses from each company
# TODO: send email

The result is below:

https://github.com<span class="dyjrff qzEoUe" role="text"> › opsdisk</span>
https://blog.apilayer.com<span class="dyjrff qzEoUe" role="text"> › h...</span>
https://blog.apilayer.com<span class="dyjrff qzEoUe" role="text"> › h...</span>

I want to get rid of the <span... part because I only need the URLs. I could strip it with a regex, but I would prefer get_attribute('TEXT') or something similar that gives the result directly.

Answer 1

Score: 1

This is for this specific case:

def remove_span(string):
    # Cut out the first <span ...>...</span> element; return the input unchanged if there is none
    start = string.find("<span")
    if start == -1:
        return string
    end = string.find("</span>", start) + len("</span>")
    return string[:start] + string[end:]

The function manipulates the string and removes the span from it.

for r in res:
    print(remove_span(r.get_attribute('innerHTML')))  # e.g. https://github.com
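A slightly more defensive variant of the same idea, offered only as a sketch. The helper name strip_spans is hypothetical, and it assumes the innerHTML contains only simple, non-nested <span> elements like the ones in the question's output. It uses the standard library re and html modules, removes every span rather than just the first, and leaves a string without spans unchanged.

```python
import html
import re

# Matches one non-nested <span ...>...</span> element (assumption: spans are not nested)
_SPAN_RE = re.compile(r"<span\b[^>]*>.*?</span>", re.IGNORECASE | re.DOTALL)


def strip_spans(inner_html):
    """Remove every <span> element from an innerHTML string and decode HTML entities."""
    without_spans = _SPAN_RE.sub("", inner_html)
    return html.unescape(without_spans).strip()


# Sample value copied from the output in the question
sample = 'https://github.com<span class="dyjrff qzEoUe" role="text"> › opsdisk</span>'
print(strip_spans(sample))  # prints https://github.com
```

In the loop from the question this would be called as strip_spans(r.get_attribute('innerHTML')).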

Answer 2

Score: 1

The best way to get the node's value is to use the JavaScript executor and read the textContent of the node's firstChild.

# Reuses the imports and chrome_options from the question
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)
driver.get('https://www.google.com/')

# Type the search term into the search box and submit it
# var_inp = input('Write the name to search:')
var_inp = 'python google search'
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys(var_inp + Keys.RETURN)

# Find the first 10 companies
res_lst = []
res = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'cite')))
print(len(res))
for r in res:
    # firstChild is the text node that precedes the <span>, so only the URL is returned
    print(driver.execute_script("return arguments[0].firstChild.textContent;", r))

Output:

27
https://pypi.org
https://pypi.org
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://stackoverflow.com
https://stackoverflow.com
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://www.jcchouinard.com
https://www.jcchouinard.com
https://www.educative.io
https://www.educative.io
https://python-googlesearch.readthedocs.io
https://python-googlesearch.readthedocs.io
https://medium.com
https://medium.com
https://medium.com
https://medium.com
https://github.com
https://github.com
https://github.com
https://github.com
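
Since the output above lists most hosts more than once and the question asks for the first 10 URLs, the values could be de-duplicated before use. A minimal sketch, assuming the strings printed in the loop are first collected into a list; the helper name first_unique and its limit parameter are hypothetical and not part of either answer.

```python
def first_unique(urls, limit=10):
    """Return up to `limit` distinct non-empty URLs, keeping first-seen order."""
    cleaned = (u.strip() for u in urls if u and u.strip())
    # dict keys keep insertion order, so fromkeys() de-duplicates without reordering
    return list(dict.fromkeys(cleaned))[:limit]


# Sample values taken from the output above, plus one empty entry
sample = [
    "https://pypi.org",
    "https://pypi.org",
    "https://www.geeksforgeeks.org",
    "https://www.geeksforgeeks.org",
    "https://stackoverflow.com",
    "",
    "https://medium.com",
]
print(first_unique(sample, limit=3))
# prints ['https://pypi.org', 'https://www.geeksforgeeks.org', 'https://stackoverflow.com']
```

With the loop from this answer, the values would be appended to res_lst instead of printed and then passed to first_unique(res_lst).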
