Can't find element using Selenium CSS Selector even though it works fine individually

Question


I am trying to scrape this page: "https://www.semi.org/en/resources/member-directory"

On its own, this line works fine:
`link = browser.find_element(By.CLASS_NAME, "member-company__title").find_element(By.TAG_NAME, 'a').get_attribute('href')`

This returns my link. However, when I nest the code in a for loop, I get an error saying the CSS selector was unable to find the element. I tried using an XPath instead, but that would only access the first container.

This is my code:

```python
# Imports implied by the snippet (pandas, Selenium, time):
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

results_df = pd.DataFrame({'Company Name': [], 'Join Date': [], 'Company ID': [],
                           'Company Description': [], 'Link': [], 'Primary Industry': [],
                           'Primary Product Category': [], 'Primary Sub Product Category': [],
                           'Keywords': [], 'Address': []})

browser = webdriver.Chrome()
# Load the desired URL
another_url = "https://www.semi.org/en/resources/member-directory"
browser.get(another_url)
time.sleep(3)

containers = browser.find_elements(By.TAG_NAME, 'tr')
for i in range(len(containers)):
    container = containers[i]
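    # NOTE: the first <tr> is the table header and contains no <a>, so this
    # lookup raises NoSuchElementException on the first iteration (see the
    # answer below).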
    link = container.find_element(By.TAG_NAME, 'a').get_attribute('href')
    browser.get(link)
    print("Page navigated after click" + browser.title)
    time.sleep(3)
    company_name =  browser.find_element(By.CLASS_NAME, "page-title").text
    try:
        join_date = browser.find_element(By.CLASS_NAME, "member-company__join-date").find_element(By.TAG_NAME, 'span').text
    except NoSuchElementException:
        join_date = "None"
    try:
        c_ID = browser.find_element(By.CLASS_NAME, "member-company__company-id").find_element(By.TAG_NAME, 'span').text
    except NoSuchElementException:
        c_ID = "None"
    try:
        company_description = browser.find_element(By.CLASS_NAME, "member-company__description").text
    except NoSuchElementException:
        company_description = "None" 
    try:
        company_link = browser.find_element(By.CLASS_NAME,"member-company__website").find_element(By.TAG_NAME, 'div').get_attribute('href')
    except NoSuchElementException:
        company_link = "None"
    try:
        primary_industry = browser.find_element(By.CLASS_NAME, "member-company__primary-industry").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        primary_industry = "None"
    try:
        primary_product_cat = browser.find_element(By.CLASS_NAME, "member-company__primary-product-category").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        primary_product_cat = "None"
    try:
        primary_sub_product_cat = browser.find_element(By.CLASS_NAME, "member-company__primary-product-subcategory").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        primary_sub_product_cat = "None"
    
    try:
        keywords = browser.find_element(By.CLASS_NAME, "member-company__keywords ").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        keywords = "None"
    try:
        address = browser.find_element(By.CLASS_NAME,"member-company__address").text.replace("Street Address","")
    except NoSuchElementException:
        address = "None"
    browser.get(another_url)
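    # Coming back to the directory page re-renders the table, so the elements
    # saved in `containers` are stale from here on (see the answer below).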

    time.sleep(5)

    result_df = pd.DataFrame({"Company Name": [company_name], 
        "Join Date": [join_date],
        "Company ID": [c_ID],
        "Company Description": [company_description],
        "Company Website": [company_link],
        "Primary Industry": [primary_industry],
        "Primary Product Category": [primary_product_cat],
        "Primary Sub Product Category": [primary_sub_product_cat],
        "Keywords": [keywords],
        "Address":[address]})
    results_df = pd.concat([results_df, result_df])
    results_df.reset_index(drop=True, inplace=True)
    results_df.to_csv('semi_test', index=False)

browser.close()
```

What's going on?


Answer 1

Score: 0


This is mainly due to the statement `containers = browser.find_elements(By.TAG_NAME, 'tr')`. If you print out the containers, you'll notice that the first row selected is the header row, which contains no links, so your script fails with the exception you're seeing.

You can fix that with `containers = containers[1:]`, but you'll then face a `StaleElementReferenceException`, because returning to the directory page after opening each link re-renders the table and invalidates every previously found `tr` element.

You should scrape all the links from the page at once, and then iterate over those to scrape each detail page, rather than coming back to the directory page over and over again.
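
A minimal sketch of that approach, assuming the class names from the question ("page-title", "member-company__join-date", and so on) are still correct; the remaining fields follow the same try/except pattern:

```python
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

another_url = "https://www.semi.org/en/resources/member-directory"
browser = webdriver.Chrome()
browser.get(another_url)
time.sleep(3)

# Collect every href up front. Selecting 'tr a' skips the header row (it has
# no link), and the hrefs are plain strings, so nothing can go stale when we
# navigate away. This assumes one link per row, as in the question's code.
links = [a.get_attribute('href')
         for a in browser.find_elements(By.CSS_SELECTOR, 'tr a')]

records = []
for link in links:
    browser.get(link)
    time.sleep(3)
    record = {'Company Name': browser.find_element(By.CLASS_NAME, 'page-title').text,
              'Link': link}
    try:
        record['Join Date'] = (browser
                               .find_element(By.CLASS_NAME, 'member-company__join-date')
                               .find_element(By.TAG_NAME, 'span').text)
    except NoSuchElementException:
        record['Join Date'] = "None"
    # ...repeat the same try/except pattern for the other fields...
    records.append(record)

results_df = pd.DataFrame(records)
results_df.to_csv('semi_test.csv', index=False)
browser.quit()
```

Because each URL is captured as a plain string before any navigation happens, there is no element reference left to go stale, and the directory page only has to be loaded once.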
