Requests-html: searching for the xpath of an element returns an empty list
Question
I'm trying to scrape data from this website: myworkdayjobs link
The data I want to collect are the job advertisements and their respective details.
Currently there are 7 active jobs.
In the browser inspector I can see that the 7 elements I want all share the same class:
li class="css-1q2dra3"
But page.html.xpath() always returns an empty list.
The steps I've taken are:
from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
# Render the JavaScript-heavy page before querying it
page.html.render(sleep=1, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
I've also tried multiple other XPaths, including the one I get when I right-click the element in the inspector and copy its XPath:
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]/div[1]
/html/body/div/div/div/div[3]/div/div/div[2]/section/ul/li[1]/div[1]
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]
Now, the only time I get results for an li element is with:
cards = page.html.xpath('//li')
which returns the li elements at the bottom of the page, but completely ignores the elements I want...
I'm not an expert on web scraping, but I have made this work with another careers page before. What am I missing? Why can't I access those elements?
=========================================================
Additional information:
The problem I'm experiencing seems to start below the section element.
When I run:
cards = page.html.xpath('//*[@id="mainContent"]/div/div[2]/section/*')
print(cards)
I get:
[<Element 'p' data-automation-id='jobFoundText' class=('css-12psxof',)>, <Element 'div' data-automation-id='jobJumpToDetailsContainer' class=('css-14l0ax5',)>, <Element 'div' class=('css-19kzrtu',)>]
Why is there no ul element in the list? It's clearly there in the inspector.
=========================================================
Answer
(Posting this here because the answer came from a comment on the accepted solution.)
The page apparently had not fully loaded by the time cards was assigned, so the ul element was not there yet.
Adding one more second to the render sleep did the trick (sleep=2).
from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
# sleep=2 gives the job list enough time to load before parsing
page.html.render(sleep=2, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
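A fixed sleep can still be flaky if the job list loads slowly. As a hypothetical variant (not part of the original answer), one could retry the render with a progressively longer sleep until the css-1q2dra3 cards mentioned in the question actually show up; the retry values below are illustrative:

from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)

cards = []
# Illustrative retry loop: each render() call re-fetches and re-renders the
# page, waiting a bit longer each time until the job cards appear.
for sleep_s in (1, 2, 4):
    page.html.render(sleep=sleep_s, scrolldown=1)
    cards = page.html.xpath('//li[@class="css-1q2dra3"]')
    if cards:
        break

print(len(cards), "job cards found")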
Answer 1
Score: 0
You can try to use their Ajax API to get the JSON data about the jobs. For example:
import requests

api_url = (
    "https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/NVIDIAExternalCareerSite/jobs"
)
# The facet IDs mirror the locations/jobFamilyGroup parameters in the page URL
payload = {
    "appliedFacets": {
        "jobFamilyGroup": ["0c40f6bd1d8f10ae43ffaefd46dc7e78"],
        "locations": ["91336993fab910af6d6f80c09504c167"],
    },
    "limit": 20,
    "offset": 0,
    "searchText": "",
}
data = requests.post(api_url, json=payload).json()
print(data)
Prints:
{
"total": 7,
"jobPostings": [
{
"title": "Senior CPU Compiler Engineer",
"externalPath": "/job/UK-Remote/Senior-CPU-Compiler-Engineer_JR1954638",
"locationsText": "7 Locations",
"postedOn": "Posted 18 Days Ago",
"bulletFields": ["JR1954638"],
},
{
"title": "CPU Compiler Engineer",
"externalPath": "/job/UK-Remote/CPU-Compiler-Engineer_JR1954640-1",
"locationsText": "7 Locations",
"postedOn": "Posted 26 Days Ago",
"bulletFields": ["JR1954640"],
},
...and so on.
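As an illustrative follow-up (not part of the original answer), the jobPostings list returned by this API can be walked to print each title and a link to the posting. The /en-US/... detail-URL prefix below is an assumption about how Workday builds job links, not something confirmed in the answer:

import requests

api_url = (
    "https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/NVIDIAExternalCareerSite/jobs"
)
payload = {
    "appliedFacets": {
        "jobFamilyGroup": ["0c40f6bd1d8f10ae43ffaefd46dc7e78"],
        "locations": ["91336993fab910af6d6f80c09504c167"],
    },
    "limit": 20,
    "offset": 0,
    "searchText": "",
}
data = requests.post(api_url, json=payload).json()

# Assumed base for job detail pages; verify against a real posting link.
base = "https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite"
for job in data["jobPostings"]:
    print(job["title"], "-", job.get("locationsText"))
    print("   " + base + job["externalPath"])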
Answer 2
Score: 0
You say you want li elements, but your three XPath variants point to a div or to a single li. Try the specific XPath you need: '//li[@class="css-1q2dra3"]'
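As a rough sketch of how this XPath would slot into the question's render-based code (using the sleep=2 fix from the self-answer; the .text printout is just illustrative):

from requests_html import HTMLSession

session = HTMLSession()
url = (
    'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
    '?locations=91336993fab910af6d6f80c09504c167'
    '&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
page.html.render(sleep=2, keep_page=True, scrolldown=1)

# Match the job cards by the exact class seen in the inspector.
cards = page.html.xpath('//li[@class="css-1q2dra3"]')
for card in cards:
    # Each card's visible text includes the job title and metadata.
    print(card.text)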