2023-06-26 22:57:24

Unable to extract bubble rating and date of stay in TripAdvisor hotel reviews using Selenium

Question

I'm currently trying to extract hotel reviews (body and title) along with the rating and date of stay. I came across @Driftr95's brilliant solution for scraping data about TripAdvisor attractions, and it ran with no issues.

Original URL (Attraction): https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html

However, when I replace the URL with a hotel rather than an attraction, the code fails to return the right data.

Desired URL (Hotel): https://www.tripadvisor.com/Hotel_Review-g186405-d215170-Reviews-Portreeves-Arundel_Arun_District_West_Sussex_England.html

As such, I've modified the original code to reflect the new nxt_pg_sel and review_sel containers as follows. However, the bubble rating and review date in the resulting data frame are blank.

nxt_pg_sel = 'a.next'
review_sel = 'div[data-test-target="HR_CC_CARD"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="/Profile/"]',
    'profile_link': ('span>a[href^="/Profile/"]', 'href'),
    'about_reviewer': 'span:has(>a[href^="/Profile/"])+div',
    'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'review_link': ('a[href^="/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="/ShowUserReviews-"]',
    'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
}

I also attempted to capture the stay_date which broke the code entirely.

    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',

I've tried looking at the dynamic tags within the new URL but cannot figure out why these elements are not being scraped. I would appreciate any suggestions. Many thanks!

PS: the original code, which works flawlessly (but only for attractions), is attached below.

##imports needed by the code below
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

##selectforlist
## NOTE: selectGet (called below) is not defined in this post
def selectForList(tagSoup, selectors, printList=False):
    if isinstance(selectors, dict):
        return dict(zip(selectors.keys(), selectForList(
            tagSoup, selectors.values(), printList)))

    selGen = (( list(sel if isinstance(sel, (tuple, list)) ## generate params
                else [sel])+[None]*2 )[:3] for sel in selectors)
    returnList = [  sel[0] if sel[1] == '"staticVal"' ## [allows placeholders]
                    else selectGet(tagSoup, *sel) for sel in selGen   ]

    if printList and not isinstance(printList,str): print(returnList)
    if isinstance(printList,str): print(*returnList, sep=printList)
    return returnList

##original selectors
nxt_pg_sel = 'a[href][data-smoke-attr="pagination-next-arrow"]'
review_sel = 'div[data-automation="reviewCard"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="/Profile/"]',
    'profile_link': ('span>a[href^="/Profile/"]', 'href'),
    'about_reviewer': 'span:has(>a[href^="/Profile/"])+div',
    'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'review_link': ('a[href^="/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="/ShowUserReviews-"]',
    'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
}

##set variables
csv_fn_revs = 'Scudamore_s_Punting_Company-tripadvisor_reviews.csv'
csv_fn_pgs = 'Scudamore_s_Punting_Company-tripadvisor_review_pages.csv'
pgNum, maxPages = 0, None
pageUrl = 'https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html'

##scrape data using web driver
browser = webdriver.Chrome()
browser.maximize_window() # maximize window

reveiws_list, pgList = [], []
while pageUrl and (maxPages is None or pgNum < maxPages):
    pgNum += 1
    pgList.append({'page': pgNum, 'URL': pageUrl})
    try:
        browser.get(pageUrl)
        rev_dets_sel['from_page'] = (pgNum, '"staticVal"')
        pgSoup = BeautifulSoup(browser.page_source, 'html.parser')

        rev_cards = pgSoup.select(review_sel)
        reveiws_list += [selectForList(r, rev_dets_sel) for r in rev_cards]
        pgList[-1]['reviews'] = len(rev_cards)

        next_page = pgSoup.select_one(nxt_pg_sel)
        if next_page:
            pageUrl = 'https://www.tripadvisor.co.uk' + next_page.get('href')
            pgList[-1]['next_page'] = pageUrl
            print('going to', pageUrl)
        else:
            pageUrl = None  # stop condition
    except Exception as e:
        print(f'Stopping on pg{pgNum} due to {type(e)}:\n{e}')
        break

browser.quit() # Close the browser

# Save as csv
pd.DataFrame(reveiws_list).to_csv(csv_fn_revs, index=False)
pd.DataFrame(pgList).to_csv(csv_fn_pgs, index=False)
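
The code above calls a helper named selectGet that is not included in this post, so it will not run as pasted. The following is only a minimal sketch of what such a helper might look like, not the original author's implementation. The signature (selector, attribute name, default value) is an assumption inferred from how selectForList pads each selector to three items.

```python
from bs4 import BeautifulSoup


def selectGet(tagSoup, selector='', attr=None, default=None):
    """Hypothetical stand-in for the helper used by selectForList.

    Returns the text of the first element matching `selector`, or the value
    of `attr` on that element when `attr` is given. Returns `default` when
    nothing matches.
    """
    el = tagSoup.select_one(selector) if selector else tagSoup
    if el is None:
        return default
    if attr:
        val = el.get(attr, default)
        # Multi-valued attributes such as class come back as a list.
        return ' '.join(val) if isinstance(val, list) else val
    return el.get_text(' ', strip=True)


# Tiny made-up fragment, just to exercise the helper.
card = BeautifulSoup('<div><a href="/Profile/abc">Abc</a></div>', 'html.parser')
name = selectGet(card, 'a[href^="/Profile/"]')          # element text
link = selectGet(card, 'a[href^="/Profile/"]', 'href')  # attribute value
missing = selectGet(card, 'svg', 'aria-label')          # no match
```

With a helper like this, a selector that matches nothing yields None rather than raising, which would explain blank columns in the resulting data frame.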

Answer 1

Score: 0

> I also attempted to capture the stay_date which broke the code entirely.

Did it raise an error? Could you elaborate on the error? [If it just generates irrelevant data, then that's within normal behavior since that selector now leads to something else.]

(Also, I'm actually pleasantly surprised by how many of the old selectors still work - you often have to figure out an entirely new set of selectors for a new type of page...)

For the bubbles, I suggest now using

    'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),

and for the stay_date, try

    'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',

Overall, the new set of selectors I'd suggest using for hotels:

nxt_pg_sel = 'a.next[href]'  # '[data-smoke-attr="pagination-next-arrow"]'
# review_sel = 'div[data-automation="reviewCard"]'
review_sel = 'div[data-test-target="HR_CC_CARD"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="/Profile/"]',
    'profile_link': ('span>a[href^="/Profile/"]', 'href'),
    # 'about_reviewer': 'span:has(>a[href^="/Profile/"])+div',
    'about_reviewer': 'a.ui_social_avatar+div>div+div+div',
    # 'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    # 'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),
    'review_link': ('a[href^="/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="/ShowUserReviews-"]',
    # 'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
    # 'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
    'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',
}

(I've uploaded my results to the same spreadsheet as before.)
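
Since the suggested 'bubbles' selector reads the class attribute rather than a label, the scraped value will be a class string, not a number. Here is a rough post-processing sketch. It assumes the rating is encoded in a class like bubble_45 (meaning 4.5) and that the stay date text starts with a "Date of stay:" label. Both are assumptions about the page markup, and the function names are made up for illustration.

```python
import re


def bubbles_to_rating(class_attr):
    """Turn a class value such as 'ui_bubble_rating bubble_45' into 4.5.

    Accepts a string or a list of class names. Assumes the rating is encoded
    as 'bubble_NN' (an assumption about the markup, not a documented format).
    """
    if isinstance(class_attr, str):
        classes = class_attr
    else:
        classes = ' '.join(class_attr or [])
    match = re.search(r'\bbubble_(\d+)\b', classes)
    return int(match.group(1)) / 10 if match else None


def clean_stay_date(text):
    """Strip an assumed leading 'Date of stay:' label from the scraped text."""
    if not text:
        return None
    return re.sub(r'^\s*Date of stay:\s*', '', text).strip() or None


rating = bubbles_to_rating(['ui_bubble_rating', 'bubble_45'])
stay = clean_stay_date('Date of stay: June 2023')
```

These could be applied to the 'bubbles' and 'stay_date' columns after building the data frame, for example with pandas' Series.map.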
