Python Selenium: getting URLs from Google search results

Question

I am trying to get the first 10 URLs from Google search results with Selenium. I know there is something other than innerHTML that will give me just the text inside the cite tags.

Here is the code:

# Open Google
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.keys import Keys

chrome_options = Options()
chrome_options.headless = False
chrome_options.add_argument("start-maximized")
# chrome_options.add_experimental_option("detach", True)
chrome_options.add_argument("--no-sandbox")
# Both switches go in one list: a second "excludeSwitches" call would overwrite the first
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)
driver.get("https://www.google.com/")

# Search term (could also come from input())
# var_inp = input("Write the name to search:")
var_inp = "python google search"
# Type the query into the search box and press Enter
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys(var_inp + Keys.RETURN)
# Collect the <cite> elements of the results (goal: first 10 companies)
res_lst = []
res = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "cite")))
print(len(res))
for r in res:
    print(r.get_attribute("innerHTML"))

# TODO: take email addresses from company
# TODO: send email

The result is below:

https://github.com<span class="dyjrff qzEoUe" role="text"> › opsdisk</span>
https://blog.apilayer.com<span class="dyjrff qzEoUe" role="text"> › h...</span>
https://blog.apilayer.com<span class="dyjrff qzEoUe" role="text"> › h...</span>

I want to get rid of the <span ...> part, as I only need the URLs. I could strip it with a regex, but I would prefer something like get_attribute('TEXT') that gives the result directly.

Answer 1

Score: 1

This works for this specific case:

def remove_span(string):
    # Cut out the first <span ...>...</span>; return the input unchanged if there is none
    start = string.find("<span")
    end = string.find("</span>", start)
    if start == -1 or end == -1:
        return string
    return string[:start] + string[end + len("</span>"):]

The function works on the innerHTML string and removes the span element from it.

for r in res:
    print(remove_span(r.get_attribute("innerHTML")))  # prints https://github.com for the first result above
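
If the markup inside cite ever contains more than one child element, string slicing gets brittle. As a rough alternative (not part of the original answer), the same idea can be sketched with the standard library's html.parser: keep only the text that sits directly inside cite and ignore anything nested in child tags. The names TopLevelText and top_level_text are made up for this illustration, and the sketch assumes every child tag is properly closed (a bare <br> would throw the depth counter off).

```python
from html.parser import HTMLParser


class TopLevelText(HTMLParser):
    """Collect only the text that is not nested inside any child tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of child tags we are currently inside
        self.parts = []  # text fragments seen at depth 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def top_level_text(inner_html):
    """Return the text of an innerHTML string with all child elements dropped."""
    parser = TopLevelText()
    parser.feed(inner_html)
    parser.close()
    return "".join(parser.parts).strip()


# Sample string taken from the output shown in the question
sample = 'https://github.com<span class="dyjrff qzEoUe" role="text"> › opsdisk</span>'
print(top_level_text(sample))  # https://github.com
```

In the loop from the question this would be called as top_level_text(r.get_attribute("innerHTML")).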

Answer 2

Score: 1

The best way to get the value of the node is to use the JavaScript executor and read the text of the node's first child (firstChild), which is the text node in front of the span.

# Imports and chrome_options are the same as in the question
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)
driver.get("https://www.google.com/")

# Search term (could also come from input())
# var_inp = input("Write the name to search:")
var_inp = "python google search"
# Type the query into the search box and press Enter
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys(var_inp + Keys.RETURN)
# Collect the <cite> elements of the results
res_lst = []
res = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "cite")))
print(len(res))
for r in res:
    # firstChild is the text node holding the URL; the <span> that follows it is skipped
    print(driver.execute_script("return arguments[0].firstChild.textContent;", r))

Output:

27
https://pypi.org
https://pypi.org
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://stackoverflow.com
https://stackoverflow.com
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://www.geeksforgeeks.org
https://www.jcchouinard.com
https://www.jcchouinard.com
https://www.educative.io
https://www.educative.io
https://python-googlesearch.readthedocs.io
https://python-googlesearch.readthedocs.io
https://medium.com
https://medium.com
https://medium.com
https://medium.com
https://github.com
https://github.com
https://github.com
https://github.com
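
The output above lists most URLs twice, while the question asks for the first 10 URLs. A minimal sketch of one way to tidy this up, assuming the printed values are collected into a list first: remove duplicates while keeping the original order, then cut the list at 10. The helper name first_unique and the sample list are made up for illustration; empty values are skipped as a precaution.

```python
def first_unique(urls, limit=10):
    """Return up to `limit` distinct URLs in first-seen order."""
    # dict.fromkeys keeps insertion order, so the first occurrence of each URL wins
    unique = list(dict.fromkeys(u for u in urls if u))
    return unique[:limit]


# A few values copied from the output above, plus one empty entry
scraped = [
    "https://pypi.org",
    "https://pypi.org",
    "https://www.geeksforgeeks.org",
    "https://www.geeksforgeeks.org",
    "https://stackoverflow.com",
    "https://stackoverflow.com",
    "https://www.geeksforgeeks.org",
    "",
]

for url in first_unique(scraped):
    print(url)
```

In the answer's loop, the value returned by driver.execute_script(...) would be appended to res_lst instead of printed, and first_unique(res_lst) called once the loop is done.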

Posted by huangapple on 2023-02-19 22:15:01
Source: https://go.coder-hub.com/75500738.html