Unable to extract bubble rating and date of stay in TripAdvisor hotel reviews using Selenium
Question
I'm currently trying to extract hotel reviews (body and title) along with the rating and date of stay. I came across @Driftr95's brilliant solution for scraping data about TripAdvisor attractions, and it ran with no issues.
Original URL (Attraction): https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html
However, when I replace the URL with a hotel rather than an attraction, the code fails to return the right data.
Desired URL (Hotel): https://www.tripadvisor.com/Hotel_Review-g186405-d215170-Reviews-Portreeves-Arundel_Arun_District_West_Sussex_England.html
As such, I've modified the original code to reflect the new nxt_pg_sel and review_sel containers as follows. However, the bubble rating and review date in the resulting data frame are blank.

    nxt_pg_sel = 'a.next'
    review_sel = 'div[data-test-target="HR_CC_CARD"]'
    rev_dets_sel = {
        'from_page': ('', '"staticVal"'),
        'profile_name': 'span>a[href^="/Profile/"]',
        'profile_link': ('span>a[href^="/Profile/"]', 'href'),
        'about_reviewer': 'span:has(>a[href^="/Profile/"])+div',
        'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
        'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
        'review_link': ('a[href^="/ShowUserReviews-"]', 'href'),
        'review_title': 'a[href^="/ShowUserReviews-"]',
        'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
        'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
        'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
    }

I also attempted to capture the stay_date, which broke the code entirely:

    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',

I've tried looking at the dynamic tags within the new URL but cannot figure out why these elements are not being scraped. I would appreciate any suggestions. Many thanks!
PS: the original code, which works flawlessly (but only for attractions), is attached below.
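Note that the listing that follows calls a helper named `selectGet`, which is not defined anywhere in this post. The sketch below is only a stand-in so the listing can be run. The parameter names `targetAttr` and `default`, and the behavior for an empty selector, are assumptions rather than the original author's implementation. It relies only on the BeautifulSoup `Tag` methods `select_one`, `get`, and `get_text`.

```python
def selectGet(tagSoup, selector='', targetAttr=None, default=None):
    """Hypothetical stand-in for the helper used by selectForList.

    Returns the text of the first element matching `selector`, or the
    value of `targetAttr` on that element when an attribute is requested.
    """
    # Assumption: an empty selector means "use the tag itself".
    el = tagSoup.select_one(selector) if selector else tagSoup
    if el is None:
        return default  # nothing matched, so the column stays blank
    if targetAttr:
        # Multi-valued attributes such as `class` come back as a list.
        return el.get(targetAttr, default)
    return el.get_text(' ', strip=True)
```

With a stand-in like this, a selector that matches nothing yields `None` instead of raising, which would show up as a blank cell in the resulting data frame.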

    ## imports (not shown in the original post)
    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver

    ## selectForList
    ## NOTE: selectGet is a helper from the original solution; it is not reproduced in this post
    def selectForList(tagSoup, selectors, printList=False):
        if isinstance(selectors, dict):
            return dict(zip(selectors.keys(), selectForList(
                tagSoup, selectors.values(), printList)))
        selGen = (( list(sel if isinstance(sel, (tuple, list)) ## generate params
                    else [sel])+[None]*2 )[:3] for sel in selectors)
        returnList = [ sel[0] if sel[1] == '"staticVal"' ## [allows placeholders]
                       else selectGet(tagSoup, *sel) for sel in selGen ]
        if printList and not isinstance(printList,str): print(returnList)
        if isinstance(printList,str): print(*returnList, sep=printList)
        return returnList

    ## original selectors
    nxt_pg_sel = 'a[href][data-smoke-attr="pagination-next-arrow"]'
    review_sel = 'div[data-automation="reviewCard"]'
    rev_dets_sel = {
        'from_page': ('', '"staticVal"'),
        'profile_name': 'span>a[href^="/Profile/"]',
        'profile_link': ('span>a[href^="/Profile/"]', 'href'),
        'about_reviewer': 'span:has(>a[href^="/Profile/"])+div',
        'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
        'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
        'review_link': ('a[href^="/ShowUserReviews-"]', 'href'),
        'review_title': 'a[href^="/ShowUserReviews-"]',
        'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
        'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
        'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
    }

    ## set variables
    csv_fn_revs = 'Scudamore_s_Punting_Company-tripadvisor_reviews.csv'
    csv_fn_pgs = 'Scudamore_s_Punting_Company-tripadvisor_review_pages.csv'
    pgNum, maxPages = 0, None
    pageUrl = 'https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html'

    ## scrape data using web driver
    browser = webdriver.Chrome()
    browser.maximize_window() # maximize window
    reveiws_list, pgList = [], []
    while pageUrl and (maxPages is None or pgNum < maxPages):
        pgNum += 1
        pgList.append({'page': pgNum, 'URL': pageUrl})
        try:
            browser.get(pageUrl)
            rev_dets_sel['from_page'] = (pgNum, '"staticVal"')
            pgSoup = BeautifulSoup(browser.page_source, 'html.parser')

            # one dict of extracted fields per review card on this page
            rev_cards = pgSoup.select(review_sel)
            reveiws_list += [selectForList(r, rev_dets_sel) for r in rev_cards]
            pgList[-1]['reviews'] = len(rev_cards)

            # follow the "next" link until there is none
            next_page = pgSoup.select_one(nxt_pg_sel)
            if next_page:
                pageUrl = 'https://www.tripadvisor.co.uk' + next_page.get('href')
                pgList[-1]['next_page'] = pageUrl
                print('going to', pageUrl)
            else:
                pageUrl = None # stop condition
        except Exception as e:
            print(f'Stopping on pg{pgNum} due to {type(e)}:\n{e}')
            break
    browser.quit() # Close the browser

    # Save as csv
    pd.DataFrame(reveiws_list).to_csv(csv_fn_revs, index=False)
    pd.DataFrame(pgList).to_csv(csv_fn_pgs, index=False)

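One detail worth checking when switching to the hotel URL: the loop above prepends 'https://www.tripadvisor.co.uk' to every next-page href, while the hotel page lives on tripadvisor.com. A minimal sketch of resolving the href against the current page instead, using the standard library's `urljoin`. The helper name `next_page_url` and the example href are hypothetical.

```python
from urllib.parse import urljoin

def next_page_url(current_url, href):
    """Resolve a pagination href against the page it was found on."""
    if not href:
        return None  # no next link: the caller should stop paging
    # urljoin keeps the scheme and host of current_url for relative hrefs
    return urljoin(current_url, href)

# Hypothetical href, for illustration only
hotel_url = ('https://www.tripadvisor.com/Hotel_Review-g186405-d215170-Reviews-'
             'Portreeves-Arundel_Arun_District_West_Sussex_England.html')
print(next_page_url(hotel_url, '/Hotel_Review-g186405-d215170-Reviews-or5-Portreeves.html'))
```

Inside the loop this could replace the string concatenation, for example `pageUrl = next_page_url(pageUrl, next_page.get('href'))`, so the same code works for either domain.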
Answer 1
Score: 0
> I also attempted to capture the stay_date which broke the code entirely.
Did it raise an error? Could you elaborate on the error? [If it just generates irrelevant data, then that's within normal behavior, since that selector now leads to something else.]
(Also, I'm actually pleasantly surprised by how many of the old selectors still work; you often have to figure out an entirely new set of selectors for a new type of page...)
For the bubbles, I suggest now using

    'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),

and for the stay_date, try

    'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',

Overall, the new set of selectors I'd suggest using for hotels:

    nxt_pg_sel = 'a.next[href]' # '[data-smoke-attr="pagination-next-arrow"]'
    # review_sel = 'div[data-automation="reviewCard"]'
    review_sel = 'div[data-test-target="HR_CC_CARD"]'
    rev_dets_sel = {
        'from_page': ('', '"staticVal"'),
        'profile_name': 'span>a[href^="/Profile/"]',
        'profile_link': ('span>a[href^="/Profile/"]', 'href'),
        # 'about_reviewer': 'span:has(>a[href^="/Profile/"])+div',
        'about_reviewer': 'a.ui_social_avatar+div>div+div+div',
        # 'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
        # 'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
        'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),
        'review_link': ('a[href^="/ShowUserReviews-"]', 'href'),
        'review_title': 'a[href^="/ShowUserReviews-"]',
        # 'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
        'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
        # 'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
        'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',
    }

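With these selectors, `bubbles` holds the span's class attribute rather than a readable rating, and `stay_date` holds the raw label text. Below is a minimal post-processing sketch under two assumptions that are not confirmed in this post: that the rating is encoded in a class name such as `bubble_45`, and that the date text looks like `Date of stay: March 2023`. The helper names `bubbles_to_rating` and `parse_stay_date` are hypothetical.

```python
import re
from datetime import datetime

def bubbles_to_rating(class_attr):
    """Turn a class value such as ['ui_bubble_rating', 'bubble_45'] into 4.5.

    Accepts a list, a plain string, or the stringified list found in a CSV.
    Assumes the rating is encoded as bubble_<rating * 10>.
    """
    match = re.search(r'bubble_(\d+)', str(class_attr))
    return int(match.group(1)) / 10 if match else None

def parse_stay_date(text):
    """Turn 'Date of stay: March 2023' into '2023-03', or None if unparseable.

    Assumes English month names (%B depends on the active locale).
    """
    if not text:
        return None
    month_year = text.split(':', 1)[-1].strip()
    try:
        return datetime.strptime(month_year, '%B %Y').strftime('%Y-%m')
    except ValueError:
        return None

print(bubbles_to_rating(['ui_bubble_rating', 'bubble_45']))
print(parse_stay_date('Date of stay: March 2023'))
```

Both helpers return `None` instead of raising when the input does not match the assumed pattern, so they can be applied to a whole data frame column (for example with `df['bubbles'].map(bubbles_to_rating)`) without stopping on odd rows.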
(I've uploaded my results to the same spreadsheet as before.)