Need some scraping guidance regarding missing pagination elements

Question

Noob web scraper here. I built a spider using Scrapy and Playwright to scrape car ads from the results of a parameterized search URL on autotrader.com, and it works great for grabbing data from the first page. I'm now trying to extend it to scrape the rest of the pages. I've identified the HTML element for the pagination at the bottom of the first page and have verified in DevTools that I have the correct XPath to select it, yet when I run my spider, response.text doesn't contain that HTML element or any of its child elements. It contains all the other HTML elements, just not those...

Since I'm using Playwright, any concerns about dynamic insertion via JavaScript should be minimal. I also added a "wait_for_selector" call on the pagination element in question with a 60-second timeout, and my script just ends up timing out. I'm also using "wait_until" with "networkidle" to ensure the full page has loaded before scraping.

Kinda puzzled about what is going on here. The start_url I am using is: here. I would appreciate any feedback y'all might have.

Answer 1

Score: 1

This is the XPath you need to move from page to page. Take the href of the element it matches and follow it, and that's it. I hope it works for you.

//*[@aria-label="Next Page"]
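As a rough illustration of "take the href and follow it", here is a minimal standard-library sketch. It assumes the markup is well-formed enough for `xml.etree.ElementTree` to parse, and the sample markup, the `next_page_url` helper, and the `example.com` URLs are all hypothetical stand-ins rather than autotrader.com's real structure. In a Scrapy callback the equivalent would presumably be `response.xpath('//*[@aria-label="Next Page"]/@href').get()` followed by `response.follow(...)`.

```python
from typing import Optional
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# ElementTree only supports a subset of XPath, so the expression is written
# relative to the root element (".//") instead of starting with "//".
NEXT_PAGE_XPATH = ".//*[@aria-label='Next Page']"


def next_page_url(markup: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the "Next Page" link, or None if absent."""
    root = ET.fromstring(markup.strip())
    node = root.find(NEXT_PAGE_XPATH)
    if node is None:
        return None  # no pagination element in this markup
    href = node.get("href")
    if not href:
        return None  # element found, but it carries no link
    # Resolve a relative href against the URL of the current page.
    return urljoin(base_url, href)


# Hypothetical pagination markup, for illustration only.
SAMPLE = """
<nav>
  <a aria-label="Previous Page" href="/results?page=1">Prev</a>
  <a aria-label="Next Page" href="/results?page=2">Next</a>
</nav>
"""

if __name__ == "__main__":
    print(next_page_url(SAMPLE, "https://example.com/results?page=1"))
```

Returning `None` when the element is missing gives the caller a natural stopping condition on the last page.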

huangapple
  • Posted on May 28, 2023, 19:50:21
  • When reposting, please be sure to keep the link to this article: https://go.coder-hub.com/76351334.html