July 28, 2023, 00:55:06

How to extract data inside an SVG viewBox using Python Selenium

Question

I am working with the URL https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html. The page has an SVG view box that shows the working hours when I click it. I need to extract those hours using Python Selenium. Could anyone please help? I am new to web scraping.

Answer 1

Score: 1

The opening hours are stored inside the HTML page as JSON, so you can get them without driving a browser. For example:

import re
import json
import requests

# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
}
url = "https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html"
html_text = requests.get(url, headers=headers).text

# Capture the object literal assigned to window.__WEB_CONTEXT__ in the page source.
data = re.search(r"window\.__WEB_CONTEXT__=(\{.*\});\(", html_text).group(1)
# The pageManifest key is unquoted in the page; quote it so the text parses as JSON.
data = data.replace("pageManifest", '"pageManifest"')
data = json.loads(data)
data = data["pageManifest"]["redux"]["api"]["responses"]

# Print the first cached API response whose key contains "/hours".
for k, v in data.items():
    if "/hours" in k:
        print(v)
        break

Prints:

{
    "data": {
        "openStatus": "CLOSED",
        "openStatusText": "Closed Now",
        "hoursTodayText": "Hours Today: 4:00 pm - 11:59 pm",
        "currentHoursText": "",
        "allOpenHours": [
            {"days": "Tue - Fri", "times": ["4:00 pm - 11:59 pm"]},
            {"days": "Sat - Sun", "times": ["11:00 am - 11:59 pm"]},
        ],
        "addHoursLink": {
            "url": "/UpdateListing-d7222445#Hours-only",
            "text": "+ Add hours",
        },
    },
    "error": None,
}
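
The same steps can be sketched offline and wrapped in functions, so that a missing match returns None instead of raising an AttributeError. This is only an illustration under stated assumptions: SAMPLE_HTML is a made-up stand-in for the page source, the response key "/data/1.0/restaurant/7222445/hours" is hypothetical, and the helper names extract_hours and format_hours are not part of any library. The real page layout may differ or change.

```python
import json
import re

# Made-up stand-in for the page source (assumption): one unquoted top-level
# key, followed by ";(" as the regex from the answer expects.
SAMPLE_HTML = (
    '<script>window.__WEB_CONTEXT__={pageManifest:{"redux":{"api":{"responses":'
    '{"/data/1.0/restaurant/7222445/hours":{"data":{"allOpenHours":['
    '{"days":"Tue - Fri","times":["4:00 pm - 11:59 pm"]},'
    '{"days":"Sat - Sun","times":["11:00 am - 11:59 pm"]}]},"error":null}'
    '}}}}};(function(){})();</script>'
)


def extract_hours(html_text):
    """Return the first response whose key contains '/hours', or None."""
    match = re.search(r"window\.__WEB_CONTEXT__=(\{.*\});\(", html_text)
    if match is None:
        return None
    # Quote only the first occurrence, which is the unquoted top-level key.
    raw = match.group(1).replace("pageManifest", '"pageManifest"', 1)
    context = json.loads(raw)
    responses = context["pageManifest"]["redux"]["api"]["responses"]
    for key, value in responses.items():
        if "/hours" in key:
            return value
    return None


def format_hours(response):
    """Turn the allOpenHours entries into 'days: times' strings."""
    lines = []
    for entry in response["data"]["allOpenHours"]:
        lines.append(f'{entry["days"]}: {", ".join(entry["times"])}')
    return lines


hours = extract_hours(SAMPLE_HTML)
for line in format_hours(hours):
    print(line)
```

Returning None for a page without the embedded object lets the caller decide whether to retry or fall back to the browser-based approach.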

Answer 2

Score: 0

To click the SVG element, use WebDriverWait with element_to_be_clickable(). You can use the following locator strategies:

Code block:

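# Assumes `driver` is an already created Selenium WebDriver instance.
driver.get("https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html")
# Wait until the hours icon is clickable, then click it to open the hours panel.
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[data-automation='top-info-hours'] > div svg[width='18px'] path:nth-child(2)"))).click()
# Wait for the panel that follows the "Hours" label to become visible and print its text.
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Hours']//following::div[1]"))).text)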

Console output:

Sun
11:00 AM - 11:59 PM
Tue
4:00 PM - 11:59 PM
Wed
4:00 PM - 11:59 PM
Thu
4:00 PM - 11:59 PM
Fri
4:00 PM - 11:59 PM
Sat
11:00 AM - 11:59 PM

Note: You need to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
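
The text printed by the Selenium snippet alternates between a day line and a time line, so it can be paired up into a dictionary. The sketch below works on a copy of the console output above rather than a live page; the helper name parse_hours_text is hypothetical, and it assumes every day is followed by exactly one time line.

```python
# Copy of the console output shown above, used here instead of a live page.
SAMPLE_TEXT = """Sun
11:00 AM - 11:59 PM
Tue
4:00 PM - 11:59 PM
Wed
4:00 PM - 11:59 PM
Thu
4:00 PM - 11:59 PM
Fri
4:00 PM - 11:59 PM
Sat
11:00 AM - 11:59 PM"""


def parse_hours_text(text):
    """Pair alternating day and time lines into a dict keyed by day."""
    # Drop blank lines and surrounding whitespace before pairing.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Even-indexed lines are days, odd-indexed lines are the matching times.
    return dict(zip(lines[0::2], lines[1::2]))


hours_by_day = parse_hours_text(SAMPLE_TEXT)
for day, times in hours_by_day.items():
    print(f"{day}: {times}")
```

If a day ever lists several time ranges on separate lines, this simple pairing would misalign, so the assumption is worth checking against the actual element text.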
