How to extract data inside an SVG viewBox using Python Selenium
Question
I am working with the page https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html, which contains an SVG
view box. Clicking it shows the restaurant's working hours. I need to extract those hours using Python and Selenium. Could anyone please help? I am new to web scraping.
Answer 1
Score: 1
The opening hours are stored as JSON inside the HTML page itself, so you can read them without a browser. For example:
import re
import json
import requests

# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
}
url = "https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html"
html_text = requests.get(url, headers=headers).text

# Capture the object assigned to window.__WEB_CONTEXT__ in the page source.
data = re.search(r"window\.__WEB_CONTEXT__=(\{.*\});\(", html_text).group(1)
# Quote the bare pageManifest key so the text parses as JSON.
data = data.replace("pageManifest", '"pageManifest"')
data = json.loads(data)
data = data["pageManifest"]["redux"]["api"]["responses"]

# The responses are keyed by request path; print the first one for hours.
for k, v in data.items():
    if "/hours" in k:
        print(v)
        break
Prints:
{
    "data": {
        "openStatus": "CLOSED",
        "openStatusText": "Closed Now",
        "hoursTodayText": "Hours Today: 4:00 pm - 11:59 pm",
        "currentHoursText": "",
        "allOpenHours": [
            {"days": "Tue - Fri", "times": ["4:00 pm - 11:59 pm"]},
            {"days": "Sat - Sun", "times": ["11:00 am - 11:59 pm"]},
        ],
        "addHoursLink": {
            "url": "/UpdateListing-d7222445#Hours-only",
            "text": "+ Add hours",
        },
    },
    "error": None,
}
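If you want one entry per weekday rather than ranges such as "Tue - Fri", the allOpenHours list can be expanded. The following is only a minimal sketch: it assumes the response keeps the shape printed above and that each "days" value is either a single three-letter day or a range of two such days. The names DAYS and expand_hours are hypothetical helpers, not part of any TripAdvisor API.

```python
# Hypothetical helper: expand "Tue - Fri" style ranges into single days.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def expand_hours(all_open_hours):
    """Map each weekday abbreviation to its list of opening-time strings."""
    schedule = {}
    for entry in all_open_hours:
        # "Tue - Fri" -> ["Tue", "Fri"]; a single day such as "Mon" -> ["Mon"]
        parts = [p.strip() for p in entry["days"].split("-")]
        start = DAYS.index(parts[0])
        end = DAYS.index(parts[-1])
        if start <= end:
            span = DAYS[start:end + 1]
        else:
            # The range wraps around the week, e.g. "Sat - Mon"
            span = DAYS[start:] + DAYS[:end + 1]
        for day in span:
            schedule.setdefault(day, []).extend(entry["times"])
    return schedule


# Sample copied from the allOpenHours value printed above.
sample = [
    {"days": "Tue - Fri", "times": ["4:00 pm - 11:59 pm"]},
    {"days": "Sat - Sun", "times": ["11:00 am - 11:59 pm"]},
]

for day, times in expand_hours(sample).items():
    print(day, ", ".join(times))
```

In the real script you would pass v["data"]["allOpenHours"] instead of the sample. Days that do not appear in the result (Monday in this sample) simply have no hours listed in the response.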
Answer 2
Score: 0
To click the SVG element, wait for it with WebDriverWait and the element_to_be_clickable() expected condition. You can use the following locator strategies:
Code block:
# Assumes driver is an existing WebDriver instance, e.g. webdriver.Chrome()
driver.get("https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html")
# Click the SVG path inside the hours summary once it is clickable
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[data-automation='top-info-hours'] > div svg[width='18px'] path:nth-child(2)"))).click()
# Print the text of the hours panel once it is visible
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Hours']//following::div[1]"))).text)
Console output:
Sun 11:00 AM - 11:59 PM
Tue 4:00 PM - 11:59 PM
Wed 4:00 PM - 11:59 PM
Thu 4:00 PM - 11:59 PM
Fri 4:00 PM - 11:59 PM
Sat 11:00 AM - 11:59 PM
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
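The text printed by the Selenium approach is a single string, so it may help to turn it into a dictionary before using it. Below is a rough sketch that assumes each weekday abbreviation is followed directly by one time range, as in the console output above. HOURS_RE and parse_hours_text are hypothetical names, and a day with several time ranges would need an extended pattern.

```python
import re

# Hypothetical pattern: a weekday abbreviation followed by one time range.
HOURS_RE = re.compile(
    r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+"
    r"(\d{1,2}:\d{2}\s*[AP]M\s*-\s*\d{1,2}:\d{2}\s*[AP]M)"
)


def parse_hours_text(text):
    """Return {day: [time range, ...]} from the text of the hours panel."""
    hours = {}
    for day, time_range in HOURS_RE.findall(text):
        hours.setdefault(day, []).append(time_range)
    return hours


# Text taken from the console output above; in the real script this would be
# the .text value returned by the second WebDriverWait call.
panel_text = (
    "Sun 11:00 AM - 11:59 PM Tue 4:00 PM - 11:59 PM Wed 4:00 PM - 11:59 PM "
    "Thu 4:00 PM - 11:59 PM Fri 4:00 PM - 11:59 PM Sat 11:00 AM - 11:59 PM"
)

for day, ranges in parse_hours_text(panel_text).items():
    print(day, ranges)
```

Because the pattern allows any whitespace between the day and the time range, it works whether the element text arrives on one line or with line breaks between entries.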