Filter Div, BeautifulSoup, With Empty Return

Question


Running the algorithm below, I am trying to filter a div:

from bs4 import BeautifulSoup

for link in soup.select('div > a[href*="/tarefa"]'):
    ref = link.get('href')
    rt = 'https://brainly.com.br' + str(ref)
    p.append(rt)
print(p)

The div below:

<div class="sg-content-box__content"><a href="/tarefa/2254726"> 

and also:

<div class="sg-content-box"><a href="/tarefa/21670613">

But in doing this the return is empty. What could be the mistake in this part?

Expected output (examples):

/tarefa/2254726 

/tarefa/21670613  

How should I check this? Sometimes the page content changes and the href attribute carries a large amount of data, so I need something like 'div > a[href*="/tarefa"]' that searches by the keyword 'tarefa' rather than a variable that already contains the full content.
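As a quick sanity check, the `[href*="/tarefa"]` substring selector does match the two sample divs when run against static HTML (a minimal, self-contained sketch built from the snippets above):

```python
from bs4 import BeautifulSoup

# The two sample divs from the question, as a self-contained snippet
html = '''
<div class="sg-content-box__content"><a href="/tarefa/2254726"></a></div>
<div class="sg-content-box"><a href="/tarefa/21670613"></a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# [href*="/tarefa"] matches any <a> whose href contains the substring "/tarefa"
links = [a.get('href') for a in soup.select('div > a[href*="/tarefa"]')]
print(links)  # ['/tarefa/2254726', '/tarefa/21670613']
```

So the selector itself is fine; an empty result must come from the HTML that is actually being parsed at that moment.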

Complete Algorithm:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


browser = webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')

html = browser.execute_script("return document.documentElement.outerHTML")
p = []
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div > a[href*=""]'):
    ref = link.get('href')
    rt = 'https://brainly.com.br' + str(ref)
    p.append(rt)
print(p)

Answer 1 (score: 1)


Possibly the browser is taking more time to load the data, which is why you sometimes get an empty result.

Use `WebDriverWait()` and wait for the elements with `visibility_of_all_elements_located()`:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox(executable_path=r'C:/path/geckodriver.exe')
browser.get('https://brainly.com.br/app/ask?entry=hero&q=jhyhv+vjh')
WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'a[href*="/tarefa"]')))
html = browser.page_source
#html = browser.execute_script("return document.documentElement.outerHTML")
p = []
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div.sg-actions-list__hole > a[href*="/tarefa"]'):
    ref = link.get('href')
    rt = 'https://brainly.com.br' + str(ref)
    p.append(rt)
print(p)

Output:

['https://brainly.com.br/tarefa/2254726', 'https://brainly.com.br/tarefa/21670613', 'https://brainly.com.br/tarefa/10188641', 'https://brainly.com.br/tarefa/22664332', 'https://brainly.com.br/tarefa/24152913', 'https://brainly.com.br/tarefa/11344228', 'https://brainly.com.br/tarefa/10888823', 'https://brainly.com.br/tarefa/23525186', 'https://brainly.com.br/tarefa/16838028', 'https://brainly.com.br/tarefa/24494056']

Answer 2 (score: 0)

from bs4 import BeautifulSoup

test_html = '''
         <div class="sg-content-box__content"><a href="/tarefa/2254726"> 
         <div class="sg-content-box"><a href="/tarefa/21670613">
         '''

soup = BeautifulSoup(test_html, 'lxml')
p = []
for link in soup.find_all('div'):
    ref = link.a.get('href')
    rt = 'https://brainly.com.br' + str(ref)
    p.append(rt)
print(p)
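One caveat with this approach: `link.a.get('href')` raises `AttributeError` for any `<div>` that contains no `<a>`, since `link.a` is then `None`. On a real page it is safer to select the anchors directly; a sketch against the same kind of test snippet, using the stdlib `html.parser` so no extra dependency is needed:

```python
from bs4 import BeautifulSoup

test_html = '''
<div class="sg-content-box__content"><a href="/tarefa/2254726"></a></div>
<div class="sg-content-box"><a href="/tarefa/21670613"></a></div>
<div class="no-links-here"></div>
'''

soup = BeautifulSoup(test_html, 'html.parser')
# Selecting the <a> tags directly skips divs that have no link at all
p = ['https://brainly.com.br' + a['href']
     for a in soup.select('div > a[href^="/tarefa/"]')]
print(p)  # ['https://brainly.com.br/tarefa/2254726', 'https://brainly.com.br/tarefa/21670613']
```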

huangapple
  • Posted by huangapple on 2020-01-07 00:26:25
  • When reprinting, please keep the link to this article: https://go.coder-hub.com/59615594.html