Requests-html: searching for the xpath of an element returns an empty list

Question

I'm trying to scrape data from this website: myworkdayjobs link

The data I want to collect are the job advertisements and their details.
Currently there are 7 jobs active.

In the browser's inspector I can see that the 7 elements I want all share the same tag and class:

    li class="css-1q2dra3"

But page.html.xpath() always returns an empty list.

The steps I've taken are:

    from requests_html import HTMLSession

    session = HTMLSession()
    url = (
        'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
        '?locations=91336993fab910af6d6f80c09504c167'
        '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
    )
    page = session.get(url)
    # Render the page's JavaScript; sleep=1 pauses one second after the initial render
    page.html.render(sleep=1, keep_page=True, scrolldown=1)
    cards = page.html.xpath("the_xpath_here")
    print(cards)

I've also tried several other XPaths, including the ones I get by right-clicking the element in the inspector and copying its XPath:

    //*[@id="mainContent"]/div/div[2]/section/ul/li[1]/div[1]
    /html/body/div/div/div/div[3]/div/div/div[2]/section/ul/li[1]/div[1]
    //*[@id="mainContent"]/div/div[2]/section/ul/li[1]

The only time I get any li elements back is when I run:

    cards = page.html.xpath('//li')

This returns the li elements at the bottom of the page, but it completely ignores the elements I want.

I'm not an expert on web scraping, but I have made this work with another careers page before. What am I missing? Why can't I access those elements?

=========================================================

Additional information:
The problem that I experience seems to happen after the section element.

When I run:

    cards = page.html.xpath('//*[@id="mainContent"]/div/div[2]/section/*')
    print(cards)

I get:

    [<Element 'p' data-automation-id='jobFoundText' class=('css-12psxof',)>, <Element 'div' data-automation-id='jobJumpToDetailsContainer' class=('css-14l0ax5',)>, <Element 'div' class=('css-19kzrtu',)>]

Why is there no ul element in the list? It's clearly there in the inspector window.

=========================================================

Answer
(Added here because the solution was given in a comment on the accepted answer)

The page had apparently not fully loaded by the time of the assignment of cards, and thus the ul element was not there yet.

Adding one more second to the renderer sleep did the trick (sleep=2).

    from requests_html import HTMLSession

    session = HTMLSession()
    url = (
        'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
        '?locations=91336993fab910af6d6f80c09504c167'
        '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
    )
    page = session.get(url)
    # sleep=2 gives the page one more second to finish loading the job list
    page.html.render(sleep=2, keep_page=True, scrolldown=1)
    cards = page.html.xpath("the_xpath_here")
    print(cards)
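
A fixed sleep fixes the timing problem only as long as the page loads within that time. One possible alternative, sketched here under assumptions, is to retry the query a few times until it returns something. The helper name poll_until_nonempty and its parameters are hypothetical and not part of requests-html; it only wraps whatever callable you give it.

```python
import time
from typing import Callable, Sequence


def poll_until_nonempty(
    fetch: Callable[[], Sequence],
    attempts: int = 5,
    delay: float = 1.0,
) -> Sequence:
    """Call fetch() until it returns a non-empty sequence or attempts run out."""
    result: Sequence = []
    for attempt in range(attempts):
        result = fetch()
        if result:
            return result
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return result
```

With requests-html, fetch could be a small function that calls page.html.render(...) again and then returns page.html.xpath(...) for the wanted elements. Whether re-rendering is fast enough for this page is not something tested here.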

Answer 1

Score: 0

You can try using their Ajax API to get the JSON data about the jobs. For example:

    import requests

    api_url = (
        "https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/NVIDIAExternalCareerSite/jobs"
    )
    # Same filters as the query string of the careers page URL
    payload = {
        "appliedFacets": {
            "jobFamilyGroup": ["0c40f6bd1d8f10ae43ffaefd46dc7e78"],
            "locations": ["91336993fab910af6d6f80c09504c167"],
        },
        "limit": 20,
        "offset": 0,
        "searchText": "",
    }
    data = requests.post(api_url, json=payload).json()
    print(data)

Prints:

    {
        "total": 7,
        "jobPostings": [
            {
                "title": "Senior CPU Compiler Engineer",
                "externalPath": "/job/UK-Remote/Senior-CPU-Compiler-Engineer_JR1954638",
                "locationsText": "7 Locations",
                "postedOn": "Posted 18 Days Ago",
                "bulletFields": ["JR1954638"],
            },
            {
                "title": "CPU Compiler Engineer",
                "externalPath": "/job/UK-Remote/CPU-Compiler-Engineer_JR1954640-1",
                "locationsText": "7 Locations",
                "postedOn": "Posted 26 Days Ago",
                "bulletFields": ["JR1954640"],
            },
    ...and so on.
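
The payload above has "limit" and "offset" fields and the response has a "total", which suggests the endpoint is paginated. The following is a minimal sketch of walking through all pages, assuming the response shape shown above. The function name iter_job_postings and the fetch_page callable are hypothetical; fetch_page stands in for whatever performs the actual POST.

```python
from typing import Callable, Dict, Iterator


def iter_job_postings(
    fetch_page: Callable[[int, int], Dict],
    limit: int = 20,
) -> Iterator[Dict]:
    """Yield every posting by advancing "offset" until "total" is reached."""
    offset = 0
    total = None
    while True:
        data = fetch_page(offset, limit)
        postings = data.get("jobPostings", [])
        # Read "total" from the first response only (assumption: later
        # pages might not repeat it reliably).
        if total is None:
            total = data.get("total", 0)
        yield from postings
        offset += len(postings)
        # Stop on an empty page or once everything has been seen.
        if not postings or offset >= total:
            break
```

In practice fetch_page could wrap requests.post(api_url, json={**payload, "offset": offset, "limit": limit}).json(), reusing api_url and payload from the answer. With only 7 jobs and a limit of 20, a single request already returns everything.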

Answer 2

Score: 0

You say you want li elements, but your three XPath variants point to a div or to a single li. Try the specific XPath you need:

    //li[@class="css-1q2dra3"]
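
To illustrate the difference between these expressions, here is a small sketch on hypothetical, simplified markup (not the real page). It uses the standard library's ElementTree, which supports only a subset of XPath and needs a leading "." for relative searches, so the expressions differ slightly from what you would pass to page.html.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two job cards plus an unrelated footer list.
SAMPLE = """
<html><body>
  <section>
    <ul>
      <li class="css-1q2dra3"><div>Senior CPU Compiler Engineer</div></li>
      <li class="css-1q2dra3"><div>CPU Compiler Engineer</div></li>
    </ul>
  </section>
  <footer><ul><li class="footer-link">Privacy</li></ul></footer>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Every li on the page, including the footer entry.
all_items = root.findall(".//li")
# Only the li elements carrying the job-card class.
job_cards = root.findall(".//li[@class='css-1q2dra3']")
# A positional path like .../ul/li[1] selects a single element.
first_only = root.findall(".//section/ul/li[1]")

card_titles = [li.find("div").text for li in job_cards]
print(len(all_items), len(job_cards), len(first_only), card_titles)
```

Note that the class predicate only helps once the ul has actually been rendered; on a page that has not finished loading, it returns an empty list just like the other expressions.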


Posted by huangapple on January 8, 2023, 23:56:08
Link to this post: https://go.coder-hub.com/75049214.html