Requests-html: searching for the xpath of an element returns an empty list
Question
I'm trying to scrape data from this website: myworkdayjobs link
The data I want to collect are the job advertisements and their respective details.
Currently there are 7 active jobs.
In the browser inspector I can see that the 7 elements I want all share the same class:
li class="css-1q2dra3"
But page.html.xpath() always returns an empty list.
The steps I've taken are:
from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
# Render the JavaScript-heavy page before querying it
page.html.render(sleep=1, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
I've also tried multiple other XPaths, including the one I get when I right-click the element in the inspector and copy its XPath:
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]/div[1]
/html/body/div/div/div/div[3]/div/div/div[2]/section/ul/li[1]/div[1]
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]
Now, the only time I get results for an li element is with:
cards = page.html.xpath('//li')
which returns the li elements at the bottom of the page, but completely ignores the elements I want...
I'm not an expert on web scraping, but I have made this work with another careers page before. What am I missing? Why can't I access those elements?
=========================================================
Additional information:
The problem I'm experiencing seems to start below the section element.
When I run:
cards = page.html.xpath('//*[@id="mainContent"]/div/div[2]/section/*')
print(cards)
I get:
[<Element 'p' data-automation-id='jobFoundText' class=('css-12psxof',)>, <Element 'div' data-automation-id='jobJumpToDetailsContainer' class=('css-14l0ax5',)>, <Element 'div' class=('css-19kzrtu',)>]
Why is there no ul element in the list? It's clearly there in the inspector.
=========================================================
Answer
(Posting this here because the answer came from a comment on the accepted solution.)
The page apparently had not fully loaded by the time cards was assigned, so the ul element was not there yet.
Adding one more second to the render sleep did the trick (sleep=2).
from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
# sleep=2 gives the job list enough time to load before parsing
page.html.render(sleep=2, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
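A fixed sleep can still be flaky if the job list loads slowly. As a hypothetical variant (not part of the original answer), one could retry the render with a progressively longer sleep until the css-1q2dra3 cards mentioned in the question actually show up; the retry values below are illustrative:

from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)

cards = []
# Illustrative retry loop: each render() call re-fetches and re-renders the
# page, waiting a bit longer each time until the job cards appear.
for sleep_s in (1, 2, 4):
    page.html.render(sleep=sleep_s, scrolldown=1)
    cards = page.html.xpath('//li[@class="css-1q2dra3"]')
    if cards:
        break

print(len(cards), "job cards found")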
Answer 1
Score: 0
You can try to use their Ajax API to get the JSON data about the jobs. For example:
import requests

api_url = (
    "https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/NVIDIAExternalCareerSite/jobs"
)
# The facet IDs mirror the locations/jobFamilyGroup parameters in the page URL
payload = {
    "appliedFacets": {
        "jobFamilyGroup": ["0c40f6bd1d8f10ae43ffaefd46dc7e78"],
        "locations": ["91336993fab910af6d6f80c09504c167"],
    },
    "limit": 20,
    "offset": 0,
    "searchText": "",
}
data = requests.post(api_url, json=payload).json()
print(data)
Prints:
{
"total": 7,
"jobPostings": [
{
"title": "Senior CPU Compiler Engineer",
"externalPath": "/job/UK-Remote/Senior-CPU-Compiler-Engineer_JR1954638",
"locationsText": "7 Locations",
"postedOn": "Posted 18 Days Ago",
"bulletFields": ["JR1954638"],
},
{
"title": "CPU Compiler Engineer",
"externalPath": "/job/UK-Remote/CPU-Compiler-Engineer_JR1954640-1",
"locationsText": "7 Locations",
"postedOn": "Posted 26 Days Ago",
"bulletFields": ["JR1954640"],
},
...and so on.
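As an illustrative follow-up (not part of the original answer), the jobPostings list returned by this API can be walked to print each title and a link to the posting. The /en-US/... detail-URL prefix below is an assumption about how Workday builds job links, not something confirmed in the answer:

import requests

api_url = (
    "https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/NVIDIAExternalCareerSite/jobs"
)
payload = {
    "appliedFacets": {
        "jobFamilyGroup": ["0c40f6bd1d8f10ae43ffaefd46dc7e78"],
        "locations": ["91336993fab910af6d6f80c09504c167"],
    },
    "limit": 20,
    "offset": 0,
    "searchText": "",
}
data = requests.post(api_url, json=payload).json()

# Assumed base for job detail pages; verify against a real posting link.
base = "https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite"
for job in data["jobPostings"]:
    print(job["title"], "-", job.get("locationsText"))
    print("   " + base + job["externalPath"])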
Answer 2
Score: 0
You say you want li elements, but your three XPath variants point to a div or to a single li. Try the specific XPath you need: '//li[@class="css-1q2dra3"]'
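As a rough sketch of how this XPath would slot into the question's render-based code (using the sleep=2 fix from the self-answer; the .text printout is just illustrative):

from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
page.html.render(sleep=2, keep_page=True, scrolldown=1)

# Match the job cards by the exact class seen in the inspector.
cards = page.html.xpath('//li[@class="css-1q2dra3"]')
for card in cards:
    # Each card's visible text includes the job title and metadata.
    print(card.text)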