How to extract data inside an SVG viewBox using Python Selenium

Question

I am using this URL https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html, where the page has an SVG view box that shows the working hours when I click it. I need to extract those hours using Python Selenium. Could anyone please help? I am new to web scraping.

Answer 1

Score: 1

The opening hours are stored inside the HTML page as JSON, so you can get them with a plain HTTP request instead of clicking the SVG. For example:

    import json
    import re

    import requests

    # Send a browser-like User-Agent with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
    }
    url = "https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html"

    html_text = requests.get(url, headers=headers).text

    # Capture the object literal assigned to window.__WEB_CONTEXT__.
    data = re.search(r"window\.__WEB_CONTEXT__=(\{.*\});\(", html_text).group(1)
    # The pageManifest key is unquoted, so quote it to make the text valid JSON.
    data = data.replace("pageManifest", '"pageManifest"')
    data = json.loads(data)

    # Print the first stored response whose key contains "/hours".
    data = data["pageManifest"]["redux"]["api"]["responses"]
    for k, v in data.items():
        if "/hours" in k:
            print(v)
            break

Prints:

    {
        "data": {
            "openStatus": "CLOSED",
            "openStatusText": "Closed Now",
            "hoursTodayText": "Hours Today: 4:00 pm - 11:59 pm",
            "currentHoursText": "",
            "allOpenHours": [
                {"days": "Tue - Fri", "times": ["4:00 pm - 11:59 pm"]},
                {"days": "Sat - Sun", "times": ["11:00 am - 11:59 pm"]},
            ],
            "addHoursLink": {
                "url": "/UpdateListing-d7222445#Hours-only",
                "text": "+ Add hours",
            },
        },
        "error": None,
    }
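
If you want the schedule as plain lines rather than the raw dictionary, the same steps could be wrapped in two small helpers. This is only a minimal sketch: the names `WEB_CONTEXT_RE`, `find_hours_payload` and `summarize_hours` are made up for illustration, and it assumes the payload has the shape printed above. Unlike the snippet above, it quotes only the first `pageManifest` occurrence, on the assumption that this is the top-level key.

```python
import json
import re

# Same pattern as in the answer: capture the object literal assigned to
# window.__WEB_CONTEXT__ (assumes the assignment sits on a single line).
WEB_CONTEXT_RE = re.compile(r"window\.__WEB_CONTEXT__=(\{.*\});\(")


def find_hours_payload(html_text):
    """Return the stored "/hours" response from the page HTML, or None."""
    match = WEB_CONTEXT_RE.search(html_text)
    if match is None:
        return None  # the expected script assignment was not found
    # Assumption: the first "pageManifest" is the unquoted top-level key.
    raw = match.group(1).replace("pageManifest", '"pageManifest"', 1)
    responses = json.loads(raw)["pageManifest"]["redux"]["api"]["responses"]
    for key, value in responses.items():
        if "/hours" in key:
            return value
    return None


def summarize_hours(payload):
    """Format allOpenHours entries as 'days: time range(s)' strings."""
    entries = (payload.get("data") or {}).get("allOpenHours") or []
    return [f"{entry['days']}: {', '.join(entry['times'])}" for entry in entries]
```

You would call `find_hours_payload(html_text)` on the text returned by `requests.get(...)` and pass the result to `summarize_hours`. The helper returns `None` when the pattern is not found, but it can still raise `json.JSONDecodeError` or `KeyError` if the page structure changes.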

Answer 2

Score: 0

To click the SVG element, use WebDriverWait with element_to_be_clickable() so that the click happens only once the element is ready. You can use the following locator strategies:

  • Code block:

    driver.get("https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html")
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[data-automation='top-info-hours'] > div svg[width='18px'] path:nth-child(2)"))).click()
    print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Hours']//following::div[1]"))).text)
  • Console output:

    Sun
    11:00 AM - 11:59 PM
    Tue
    4:00 PM - 11:59 PM
    Wed
    4:00 PM - 11:59 PM
    Thu
    4:00 PM - 11:59 PM
    Fri
    4:00 PM - 11:59 PM
    Sat
    11:00 AM - 11:59 PM
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
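
The pieces above could be put together roughly as follows. This is a sketch under assumptions, not a tested scraper: the function names `read_hours_text` and `parse_hours_text` are hypothetical, `webdriver.Chrome()` assumes Chrome is available locally, and the parser assumes the popup text alternates day and time lines exactly as in the console output shown. The locators are the ones from the answer and may stop matching if the page markup changes.

```python
def read_hours_text(url, timeout=10):
    """Open the page, click the hours icon, and return the popup text."""
    # Imported inside the function so parse_hours_text works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Locator from the answer: the second <path> of the 18px SVG icon.
        icon = (
            By.CSS_SELECTOR,
            "span[data-automation='top-info-hours'] > div "
            "svg[width='18px'] path:nth-child(2)",
        )
        wait.until(EC.element_to_be_clickable(icon)).click()
        # Locator from the answer: the first div following the "Hours" label.
        popup = (By.XPATH, "//span[text()='Hours']//following::div[1]")
        return wait.until(EC.visibility_of_element_located(popup)).text
    finally:
        driver.quit()  # close the browser even if a wait times out


def parse_hours_text(text):
    """Pair alternating day / time lines into a {day: hours} dictionary."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return dict(zip(lines[0::2], lines[1::2]))
```

Calling `parse_hours_text(read_hours_text(url))` would then give one entry per day. If a day ever lists more than one time range, the strict alternation no longer holds and the pairing would need to be adjusted.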

huangapple
  • Posted on July 28, 2023, 00:55:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76781940.html