Unable to extract bubble rating and date of stay in TripAdvisor hotel reviews using Selenium

Question


I'm currently trying to extract hotel reviews (body and title) along with the rating and date of stay. I came across @Driftr95's brilliant solution for scraping data about TripAdvisor attractions, and it ran with no issues.

Original URL (Attraction): https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html

However, when I replace the URL with a hotel rather than an attraction, the code fails to return the right data.

Desired URL (Hotel): https://www.tripadvisor.com/Hotel_Review-g186405-d215170-Reviews-Portreeves-Arundel_Arun_District_West_Sussex_England.html

As such, I've modified the original code to reflect the new nxt_pg_sel and review_sel containers as follows. However, the bubble rating and review date in the resulting data frame are blank.

nxt_pg_sel = 'a.next'
review_sel = 'div[data-test-target="HR_CC_CARD"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="\/Profile\/"]',
    'profile_link': ('span>a[href^="\/Profile\/"]', 'href'),
    'about_reviewer': 'span:has(>a[href^="\/Profile\/"])+div',
    'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'review_link': ('a[href^="\/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="\/ShowUserReviews-"]',
    'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
}

I also attempted to capture the stay_date which broke the code entirely.

>
> 'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
>

I've tried looking at the dynamic tags within the new URL but cannot figure out why these elements are not being scraped. I would appreciate any suggestions. Many thanks!

PS: The original code, which works flawlessly for attractions, is attached below.

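##imports needed by the code below
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

##selectforlist (calls a selectGet helper that is not defined in this excerpt)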
def selectForList(tagSoup, selectors, printList=False):
    if isinstance(selectors, dict):
        return dict(zip(selectors.keys(), selectForList(
            tagSoup, selectors.values(), printList)))
    
    selGen = (( list(sel if isinstance(sel, (tuple, list)) ## generate params
                else [sel])+[None]*2 )[:3] for sel in selectors)
    returnList = [  sel[0] if sel[1] == '"staticVal"' ## [allows placeholders]
                    else selectGet(tagSoup, *sel) for sel in selGen   ]
    
    if printList and not isinstance(printList,str): print(returnList)
    if isinstance(printList,str): print(*returnList, sep=printList)
    return returnList

##original selectors
nxt_pg_sel = 'a[href][data-smoke-attr="pagination-next-arrow"]'
review_sel = 'div[data-automation="reviewCard"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="\/Profile\/"]',
    'profile_link': ('span>a[href^="\/Profile\/"]', 'href'),
    'about_reviewer': 'span:has(>a[href^="\/Profile\/"])+div',
    'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'review_link': ('a[href^="\/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="\/ShowUserReviews-"]',
    'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
}

##set variables
csv_fn_revs = 'Scudamore_s_Punting_Company-tripadvisor_reviews.csv'
csv_fn_pgs = 'Scudamore_s_Punting_Company-tripadvisor_review_pages.csv'
pgNum, maxPages = 0, None
pageUrl = 'https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html'

##scrape data using web driver
browser = webdriver.Chrome()
browser.maximize_window() # maximize window

reviews_list, pgList = [], []
while pageUrl and (maxPages is None or pgNum < maxPages):
    pgNum += 1
    pgList.append({'page': pgNum, 'URL': pageUrl})
    try:
        browser.get(pageUrl)
        rev_dets_sel['from_page'] = (pgNum, '"staticVal"')
        pgSoup = BeautifulSoup(browser.page_source, 'html.parser')

        # one dict of extracted details per review card on this page
        rev_cards = pgSoup.select(review_sel)
        reviews_list += [selectForList(r, rev_dets_sel) for r in rev_cards]
        pgList[-1]['reviews'] = len(rev_cards)

        next_page = pgSoup.select_one(nxt_pg_sel)
        if next_page:
            pageUrl = 'https://www.tripadvisor.co.uk' + next_page.get('href')
            pgList[-1]['next_page'] = pageUrl
            print('going to', pageUrl)
        else:
            pageUrl = None  # stop condition
    except Exception as e:
        print(f'Stopping on pg{pgNum} due to {type(e)}:\n{e}')
        break

browser.quit() # Close the browser

# Save as csv
pd.DataFrame(reviews_list).to_csv(csv_fn_revs, index=False)
pd.DataFrame(pgList).to_csv(csv_fn_pgs, index=False)
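
One detail worth noting when switching to the hotel URL: the loop above builds the next-page URL by prepending a hard-coded `https://www.tripadvisor.co.uk`, while the hotel page is on `tripadvisor.com`. A minimal sketch of one way to avoid the hard-coded host, assuming the `href` on the next-page link is a relative path (the helper name `next_page_url` and the sample `href` are hypothetical, not from the original code):

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a next-page href against the URL of the page it was found on."""
    if not href:
        return None  # no next link: the caller can treat this as the stop condition
    # urljoin keeps the scheme and host of current_url for relative hrefs
    return urljoin(current_url, href)


# Hypothetical href, for illustration only
hotel_url = ('https://www.tripadvisor.com/Hotel_Review-g186405-d215170-Reviews-'
             'Portreeves-Arundel_Arun_District_West_Sussex_England.html')
print(next_page_url(hotel_url, '/Hotel_Review-g186405-d215170-Reviews-or5-Portreeves.html'))
```

Inside the loop, this would stand in for the string concatenation, e.g. `pageUrl = next_page_url(pageUrl, next_page.get('href'))`.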

Answer 1

Score: 0

> I also attempted to capture the stay_date which broke the code entirely.

Did it raise an error? Could you elaborate on the error? [If it just generates irrelevant data, then that's within normal behavior since that selector now leads to something else.]

(Also, I'm actually pleasantly surprised by how many of the old selectors still work - you often have to figure out an entirely new set of selectors for a new type of page...)


For the bubbles, I suggest now using

    'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),

and for the stay_date, try

    'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',

Overall, this is the new set of selectors I'd suggest using for hotels:

nxt_pg_sel = 'a.next[href]'  # '[data-smoke-attr="pagination-next-arrow"]'
# review_sel = 'div[data-automation="reviewCard"]'
review_sel = 'div[data-test-target="HR_CC_CARD"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="\/Profile\/"]',
    'profile_link': ('span>a[href^="\/Profile\/"]', 'href'),
    # 'about_reviewer': 'span:has(>a[href^="\/Profile\/"])+div',
    'about_reviewer': 'a.ui_social_avatar+div>div+div+div',
    # 'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    # 'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),
    'review_link': ('a[href^="\/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="\/ShowUserReviews-"]',
    # 'about_review': 'div:has(>a[href^="\/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="\/ShowUserReviews-"])~div>div',
    # 'review_date': 'div:has(>a[href^="\/ShowUserReviews-"])~div:last-child>div',
    'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',
}
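
Since the `bubbles` entry now reads the `class` attribute rather than an `aria-label`, the scraped value will be class names instead of text like "5.0 of 5 bubbles". Here is a rough sketch of turning that into a number, assuming the span carries a class of the form `bubble_50` for 5.0 or `bubble_35` for 3.5 (that naming is an assumption about the page markup, and `bubbles_to_rating` is a hypothetical helper name):

```python
import re


def bubbles_to_rating(class_value):
    """Convert a scraped class value into a numeric rating, or None if not found.

    Accepts a list of class names (what BeautifulSoup returns for the
    multi-valued class attribute) or a single space-separated string.
    """
    if class_value is None:
        return None
    if isinstance(class_value, (list, tuple)):
        class_value = ' '.join(class_value)
    # Assumed naming: bubble_NN, where NN is the rating multiplied by 10
    match = re.search(r'bubble_(\d+)', str(class_value))
    return int(match.group(1)) / 10 if match else None


print(bubbles_to_rating(['ui_bubble_rating', 'bubble_50']))
print(bubbles_to_rating('ui_bubble_rating bubble_35'))
```

If the class names turn out to follow a different pattern, only the regular expression should need adjusting.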

(I've uploaded my results to the same spreadsheet as before.)
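
The `stay_date` column will probably still need some cleanup afterwards. A small sketch, assuming the selected span's text looks like `Date of stay: June 2023` (both the label and the month-year format are assumptions about the page, and `parse_stay_date` is a hypothetical helper):

```python
from datetime import datetime


def parse_stay_date(text, label='Date of stay:'):
    """Strip the assumed label and return 'YYYY-MM'; fall back to the cleaned text."""
    if not text:
        return None
    cleaned = text.strip()
    if cleaned.startswith(label):
        cleaned = cleaned[len(label):].strip()
    try:
        # '%B %Y' matches English month names such as 'June 2023' in the default locale
        return datetime.strptime(cleaned, '%B %Y').strftime('%Y-%m')
    except ValueError:
        return cleaned  # unexpected format: keep the text rather than lose it


print(parse_stay_date('Date of stay: June 2023'))
```

With pandas, this could be applied to the scraped column with something like `df['stay_date'].map(parse_stay_date)`.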

  • Posted by huangapple on June 26, 2023, 22:57:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76557862.html