January 7, 2020, 00:26:25

Filter Div, BeautifulSoup, With Empty Return

Question

Running the code below, I am trying to filter the links inside a div:

from bs4 import BeautifulSoup

for link in soup.select('div > a[href*="/tarefa"]'):
    ref = link.get('href')
    rt = ('https://brainly.com.br' + str(ref))
    p.append(rt)
print(p)

The div looks like this:

<div class="sg-content-box__content"><a href="/tarefa/2254726">

Another one:

<div class="sg-content-box"><a href="/tarefa/21670613">

But when I do this, the result is empty. What could be the mistake in this part?

Expected output, for example:

/tarefa/2254726

/tarefa/21670613

How should I approach this? The page content sometimes changes, and there can be a large number of href values, so I need something like 'div > a[href*="/tarefa"]' that searches for the keyword 'tarefa' instead of relying on a variable that already holds the content.

Complete Algorithm:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')

html = browser.execute_script("return document.documentElement.outerHTML")
p = []
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div > a[href*=""]'):
    ref = link.get('href')
    rt = ('https://brainly.com.br' + str(ref))
    p.append(rt)
print(p)

Answer 1

Score: 1

The browser may be taking longer to load the data, which is why you sometimes get an empty result.

Use WebDriverWait() with visibility_of_all_elements_located() to wait until the matching links are visible before reading the page source:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')
# Wait up to 10 seconds until all links whose href contains "/tarefa" are visible.
WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'a[href*="/tarefa"]')))
# Read the HTML only after the wait has succeeded.
html = browser.page_source
#html = browser.execute_script("return document.documentElement.outerHTML")
p = []
soup = BeautifulSoup(html, 'html.parser')
# Select only the links that are direct children of the result containers.
for link in soup.select('div.sg-actions-list__hole > a[href*="/tarefa"]'):
    ref = link.get('href')
    rt = ('https://brainly.com.br' + str(ref))
    p.append(rt)
print(p)

Output:

['https://brainly.com.br/tarefa/2254726', 'https://brainly.com.br/tarefa/21670613', 'https://brainly.com.br/tarefa/10188641', 'https://brainly.com.br/tarefa/22664332', 'https://brainly.com.br/tarefa/24152913', 'https://brainly.com.br/tarefa/11344228', 'https://brainly.com.br/tarefa/10888823', 'https://brainly.com.br/tarefa/23525186', 'https://brainly.com.br/tarefa/16838028', 'https://brainly.com.br/tarefa/24494056']
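
To check the selector separately from page-load timing, it can help to run it against a small static document first. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup modeled on the two snippets in the question (with closing tags added), not the real Brainly page. It compares the child combinator (div > a) with the descendant combinator (div a), since a link wrapped in another element is not a direct child of the div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the snippets in the question; not the real page.
SAMPLE_HTML = """
<div class="sg-content-box__content"><a href="/tarefa/2254726">Q1</a></div>
<div class="sg-content-box"><a href="/tarefa/21670613">Q2</a></div>
<div class="other"><a href="/app/profile/1">Profile</a></div>
<div class="nested"><span><a href="/tarefa/999">Nested</a></span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div > a" matches only <a> elements that are direct children of a <div>.
child_hrefs = [a["href"] for a in soup.select('div > a[href*="/tarefa"]')]

# "div a" also matches <a> elements nested deeper inside a <div>.
descendant_hrefs = [a["href"] for a in soup.select('div a[href*="/tarefa"]')]

print(child_hrefs)
print(descendant_hrefs)
```

If the selector returns the expected links on static markup but nothing on the live page, the cause is more likely timing or a different page structure than the selector itself.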

Answer 2

Score: 0

from bs4 import BeautifulSoup

# Test markup copied from the question (the tags are left unclosed there).
test_html = '''
         <div class="sg-content-box__content"><a href="/tarefa/2254726">
         <div class="sg-content-box"><a href="/tarefa/21670613">
         '''

soup = BeautifulSoup(test_html, 'lxml')
p = []
for div in soup.find_all('div'):
    # div.a is the first <a> inside the div, or None if there is none.
    if div.a is None:
        continue
    ref = div.a.get('href')
    rt = ('https://brainly.com.br' + str(ref))
    p.append(rt)
print(p)
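
Both answers build the absolute URL by string concatenation, which produces a value ending in "None" when a link has no href. As a hedged alternative, the sketch below uses urljoin from the standard library and skips missing or unrelated hrefs. The helper name to_absolute and its keyword parameter are assumptions made for this illustration; they do not appear in the original answers.

```python
from urllib.parse import urljoin

BASE_URL = "https://brainly.com.br"


def to_absolute(hrefs, base_url=BASE_URL, keyword="/tarefa"):
    """Return absolute URLs for hrefs that are present and contain keyword."""
    result = []
    for href in hrefs:
        # Skip links without an href and links that do not contain the keyword.
        if not href or keyword not in href:
            continue
        result.append(urljoin(base_url, href))
    return result


# Hypothetical input: values as they might come from link.get('href').
urls = to_absolute(["/tarefa/2254726", None, "/app/profile/1", "/tarefa/21670613"])
print(urls)
```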
