Using Python Selenium to export Facebook posts - can't separate by post
Question
For a research project, I am trying to scrape posts from a Facebook group. I want to get the name, date and content of each post.
I first tried the code below, but it does not capture the data post by post: it returns all the names together, and I cannot break them down by post.
In the code below, my intent was to find all the posts with:
posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")
then iterate over each post element with:
for index, post in enumerate(posts):
and finally collect all the a elements inside the current post with:
name_spans = post.find_elements(By.XPATH,"//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")
But when I run the code below, it returns all the a elements from every post together, so I cannot tell which post each one belongs to.
Is there a workaround?
import random
from time import sleep

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Load cookies (load_cookies and cookies are defined elsewhere in my project)
browser = load_cookies.adding_cookies_browser(cookies)

# Get page
print("getting to fb")
browser.get("https://www.facebook.com/groups/1494117617557521?sorting_setting=CHRONOLOGICAL")
sleep(random.randint(4, 6))

# Scroll until this member's post (so it only gets new posts)
last_member = ["member name here", "link here"]
new_members = []

# Scroll down until finding the last contact scraped
while True:
    # Scroll down
    browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
    print("scrolling down")
    try:
        # Look for the last name scraped on the previous run. If found, stop scrolling
        text = WebDriverWait(browser, random.randint(3, 5)).until(EC.presence_of_element_located((By.XPATH, f"//*[text()='{last_member[0]}']")))
        print(f"found {last_member[0]}")
        break
    except TimeoutException:
        # Not on the page yet: keep scrolling
        pass

group_posts = []
posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")
print(posts)
print(len(posts))

# Get each post's info
for index, post in enumerate(posts):
    print(f"post{index}")
    print(post)
    # Find people's names by looking up the a elements by class
    name_spans = post.find_elements(By.XPATH,"//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")
    for i, name in enumerate(name_spans):
        print(f"post n{i}")
        print(name.text)
Answer 1
Score: 0
In the definition of name_spans you call post.find_elements because you want to restrict the search to the contents of post. That alone is not enough: an XPath that starts with // is evaluated from the document root no matter which element you call it on, so you also have to put a dot (.) in front of the XPath:
.//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]
Remember:
//div finds divs anywhere in the HTML document
.//div finds divs that are descendants of the current node
Moreover:
.// finds the descendants of the current node
./ finds the direct children of the current node
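To see the difference between these prefixes without opening a browser, here is a minimal sketch using Python's standard xml.etree.ElementTree, which supports a subset of XPath. The markup is a hypothetical stand-in for a page with two posts (the real Facebook DOM is different), and the helper names links_in and direct_links_in are made up for illustration. ElementTree does not accept a path starting with / on an element, so the page-wide search is written from the root here.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-post page; not the real Facebook markup.
PAGE = """
<html><body>
  <div class="post"><h3><a>Alice</a></h3><a>permalink-1</a></div>
  <div class="post"><h3><a>Bob</a></h3><a>permalink-2</a></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
posts = root.findall(".//div[@class='post']")


def links_in(post):
    """'.//a' is relative: it only matches descendants of this post."""
    return [a.text for a in post.findall(".//a")]


def direct_links_in(post):
    """'./a' is narrower: it only matches direct children of this post."""
    return [a.text for a in post.findall("./a")]


per_post = [links_in(p) for p in posts]
direct = [direct_links_in(p) for p in posts]
# Searching from the root mixes the links of every post together,
# which is the symptom described in the question.
whole_page = [a.text for a in root.findall(".//a")]

for i, names in enumerate(per_post):
    print(f"post {i}: {names}")
print("direct children only:", direct)
print("whole page:", whole_page)
```

In the question's loop, the corresponding change is to pass the XPath with the leading dot to post.find_elements(By.XPATH, ...), so that each post only yields its own a elements.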