
Using Python Selenium to export Facebook posts - can't separate by post

Question

For a research project, I am trying to scrape posts from a Facebook group. I want to get the name, the date and the content of each post.

So I first tried the following code, but it does not seem to capture the data post by post: it returns all the names together, and I cannot break them down by post.

In the following code, my intent was to find all the posts with:

    posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")

then iterate over each post element with:

    for index, post in enumerate(posts):

and finally collect all the matching a elements from the current post with:

    name_spans = post.find_elements(By.XPATH, "//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")

But when I run the full code below, it returns all of those a elements together, so I cannot tell which post each one belongs to.

Is there a workaround?

    # Load cookies into the browser session
    browser = load_cookies.adding_cookies_browser(cookies)

    # Get the group page
    print("getting to fb")
    browser.get("https://www.facebook.com/groups/1494117617557521?sorting_setting=CHRONOLOGICAL")
    sleep(random.randint(4, 6))

    # Scroll until this member's post (so it only gets new posts)
    last_member = ["member name here", "link here"]
    new_members = []

    # Scroll down until the last contact scraped previously is found
    while True:
        # Scroll down
        browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
        print("scrolling down")
        try:
            # Look for the last name scraped on the previous run; if it is found, stop scrolling
            text = WebDriverWait(browser, random.randint(3, 5)).until(
                EC.presence_of_element_located((By.XPATH, f"//*[text()='{last_member[0]}']"))
            )
            print(f"found {last_member[0]}")
            break
        except:
            pass

    group_posts = []
    posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")
    print(posts)
    print(len(posts))

    # Get each post's info
    for index, post in enumerate(posts):
        print(f"post{index}")
        print(post)
        # Find people's names by locating the a elements by their class
        name_spans = post.find_elements(By.XPATH, "//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")
        for i, name in enumerate(name_spans):
            print(f"post n{i}")
            print(name.text)

Answer 1

Score: 0


In the definition of name_spans you are using post.find_elements, since you want to restrict the search to the inside of post. But this is not enough: you also have to add a dot (.) at the front of the XPath:

    .//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]

Remember:

  • //div finds div elements anywhere in the HTML document
  • .//div finds div elements that are descendants of the current node

Moreover:

  • .// searches the descendants of the current node
  • ./ searches only the direct children of the current node
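
Applied to the loop from the question, the fix would look roughly like this (a minimal sketch: it assumes the posts list from the question's code is already populated, and keep in mind that these auto-generated Facebook class names are fragile and can change at any time):

    from selenium.webdriver.common.by import By

    # posts comes from the question's code: one WebElement per feed unit
    for index, post in enumerate(posts):
        # The leading dot scopes the XPath to the current post element;
        # with a plain "//a[...]" the search runs over the whole document every time.
        name_spans = post.find_elements(By.XPATH, ".//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")
        for i, name in enumerate(name_spans):
            print(f"post {index}, name {i}: {name.text}")

Scoping the search this way keeps each name tied to the post element it was found in, so the names can be grouped per post instead of being returned all together.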
