Can't get a URL from an a href
Question

I want to scrape this website: https://www.sortlist.fr/search

There are rows of websites that can be clicked, each opening a page with more details about that website.
I want to get that URL, but I can't find it in the <a href> attribute.

I tried inspecting the element and searching the page's scripts for the URL, but couldn't find it.
I also tried looking at the Network tab in the dev tools, and couldn't find it there either.

Does anyone have any idea?

By the way, I want to use Selenium for this, but there is no login system. So is that a good idea, or is there a better way?

Answer 1

Score: 1

The agences trouvées elements on the webpage do have an href attribute, but its value is empty:

    <a href="" class="h5 bold text-secondary-900 text-truncate mb-8" data-testid="name-cell">Pursuit Digital</a>

So you won't be able to extract the target URLs from the href attributes on the main page directly.
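
As a quick check that does not need a browser, the anchor markup quoted above can be fed to Python's standard-library HTML parser. This is only a sketch: the class name `AnchorHrefCollector` is made up for illustration, and the snippet is the single anchor shown above rather than the live page.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag that is parsed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing
            self.hrefs.append(dict(attrs).get("href"))


# The anchor copied from the page source quoted above
snippet = (
    '<a href="" class="h5 bold text-secondary-900 text-truncate mb-8" '
    'data-testid="name-cell">Pursuit Digital</a>'
)

collector = AnchorHrefCollector()
collector.feed(snippet)
print(collector.hrefs)
```

The collected value is an empty string rather than `None`, which is why reading the attribute gives nothing useful: the attribute exists, but the site fills in the destination only when the link is clicked.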


Solution

Instead, you can click each agences trouvées entry so that it opens in an adjacent tab, then print the current URL of that tab. To locate the entries, use WebDriverWait with visibility_of_all_elements_located() and the following locator strategy:

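  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()  # assumption: Chrome; the original does not show how the driver is created
    driver.get("https://www.sortlist.fr/search")
    parent_window = driver.current_window_handle
    # Wait until every agency name link is visible
    elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-testid='name-cell']")))
    hrefs = []
    for elem in elements:
        elem.click()  # opens the agency page in a new tab
        # Added safeguard: wait for the new tab before looking up its handle
        WebDriverWait(driver, 20).until(EC.number_of_windows_to_be(2))
        all_windows = driver.window_handles
        new_window = [window for window in all_windows if window != parent_window][0]
        driver.switch_to.window(new_window)
        print(new_window)
        print(driver.current_url)
        hrefs.append(driver.current_url)
        driver.close()  # close the new tab and return to the search page
        driver.switch_to.window(parent_window)
    print(hrefs)
    driver.quit()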
  • Console Output:

    85F8A3B48F9DF45BEB28D7A530E6979E
    https://www.sortlist.fr/agency/pursuit-digital
    BA4F926FAD46A5EA5F5FC4406861D20D
    https://www.sortlist.fr/agency/rozee-digital
    84E3A361C4202C594893546BEF39CD47
    https://www.sortlist.fr/agency/trends-tokyo
    FC27FFCB9CBE26CD908B8865B8C5CEA5
    https://www.sortlist.fr/agency/cortlex
    64E50C5041A98BECCB17475A80477D60
    https://www.sortlist.fr/agency/steinpilz-gmbh
    36FF3D6D3C803BF05EEBB676D58E2DE7
    https://www.sortlist.fr/agency/everrank-salesdesk24-gmbh
    A13B789C8A618AAD5C372219FC5E3E7E
    https://www.sortlist.fr/agency/cc-systems
    C39AB3659EE6A627044A2A29CC439AFD
    https://www.sortlist.fr/agency/snapp-x
    2979C1A6C0FEF21B3499B2184907F28B
    https://www.sortlist.fr/agency/scrumble
    452F8D30237A146724055715E9690288
    https://www.sortlist.fr/agency/gaofeng-creative
    F05A9B4963C54306ABBB74420481989E
    https://www.sortlist.fr/agency/dashdot
    FE2B66F925ACCA122B86E597D28B5403
    https://www.sortlist.fr/agency/therocketsoft
    FBBE3D1535D35C230A5C7496632435DC
    https://www.sortlist.fr/agency/run-gun-films
    D4C5C162F3C422FB44862563D8AB73DD
    https://www.sortlist.fr/agency/studio-unbound
    329DA752A15041450FF5DDAA7850C332
    https://www.sortlist.fr/agency/contentgo
    B35A03AA6947A1EE043E3EE915E219BE
    https://www.sortlist.fr/agency/tabua-digital-unipessoal-ldaa
    F77913A1097ACD4DB2B78F4E997B4A0E
    https://www.sortlist.fr/agency/yarandin-llc
    7A3C75AFF9ED31E5C5E5915A7E9A84EB
    https://www.sortlist.fr/agency/fortis-media
    C86FCE23AF84B72CFF793A349C005BDD
    https://www.sortlist.fr/agency/osenorth
    A266A09B3AEDD65E8A43E26DEAECBF22
    https://www.sortlist.fr/agency/apps-square
    ['https://www.sortlist.fr/agency/pursuit-digital', 'https://www.sortlist.fr/agency/rozee-digital', 'https://www.sortlist.fr/agency/trends-tokyo', 'https://www.sortlist.fr/agency/cortlex', 'https://www.sortlist.fr/agency/steinpilz-gmbh', 'https://www.sortlist.fr/agency/everrank-salesdesk24-gmbh', 'https://www.sortlist.fr/agency/cc-systems', 'https://www.sortlist.fr/agency/snapp-x', 'https://www.sortlist.fr/agency/scrumble', 'https://www.sortlist.fr/agency/gaofeng-creative', 'https://www.sortlist.fr/agency/dashdot', 'https://www.sortlist.fr/agency/therocketsoft', 'https://www.sortlist.fr/agency/run-gun-films', 'https://www.sortlist.fr/agency/studio-unbound', 'https://www.sortlist.fr/agency/contentgo', 'https://www.sortlist.fr/agency/tabua-digital-unipessoal-ldaa', 'https://www.sortlist.fr/agency/yarandin-llc', 'https://www.sortlist.fr/agency/fortis-media', 'https://www.sortlist.fr/agency/osenorth', 'https://www.sortlist.fr/agency/apps-square']
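
The step in the loop that picks out the newly opened tab is plain list filtering on window handles, so it can be tried without a browser. The sketch below is only an illustration: `find_new_handle` is a hypothetical helper name, and the handle strings stand in for whatever `driver.window_handles` would return.

```python
def find_new_handle(parent_handle, all_handles):
    """Return the first handle in all_handles that is not parent_handle."""
    others = [handle for handle in all_handles if handle != parent_handle]
    if not others:
        # Nothing besides the parent: the click has not opened a tab (yet)
        raise LookupError("no new window has opened yet")
    return others[0]


# Hypothetical handle strings, standing in for driver.window_handles
parent = "PARENT-HANDLE"
print(find_new_handle(parent, [parent, "NEW-TAB-HANDLE"]))
```

Raising an explicit error when no second handle exists makes a missing tab easier to diagnose than the bare IndexError that indexing an empty list with `[0]` would produce.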

huangapple
  • Posted on June 26, 2023 at 15:42:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76554541.html