Using Python Selenium to export Facebook posts - can't separate by post

Question

For a research project, I am trying to scrape posts from a Facebook group. For each post I want the author's name, the date, and the post content.

My first attempt is the code below, but it does not capture the data post by post: it returns all the names together, so I cannot break them down by post.

In the following code, my intent was to find all the posts with:

posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")

then iterate over each post element with:

for index, post in enumerate(posts):

and finally collect all the a elements with the given classes from the current post with:

name_spans = post.find_elements(By.XPATH,"//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")

But when I run the following code, every iteration returns all the matching a elements on the page, so I cannot tell which one belongs to which post.

Is there a workaround?

from time import sleep
import random

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Load cookies (load_cookies is a helper module not shown here)
browser = load_cookies.adding_cookies_browser(cookies)

# Get page
print("getting to fb")
browser.get("https://www.facebook.com/groups/1494117617557521?sorting_setting=CHRONOLOGICAL")
sleep(random.randint(4,6))

# Scroll until this member's post (so it only gets new posts)
last_member = ["member name here", "link here"]
new_members = [ ]

# Scroll down until finding the last contact scraped
while True:
    # Scroll down
    browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
    print("scrolling down")
    try:
        # Look for the last name scraped in the previous run; if found, stop scrolling
        text = WebDriverWait(browser, random.randint(3,5)).until(EC.presence_of_element_located((By.XPATH, f"//*[text()='{last_member[0]}']")))
        print(f"found {last_member[0]}")
        break
    except TimeoutException:
        # Not visible yet: keep scrolling
        pass

group_posts = []
posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")
print(posts)
print(len(posts))

# Getting each post info:
for index, post in enumerate(posts):
    print(f"post{index}")
    print(post)

    # Find people's names via the <a> elements that carry this class list
    name_spans = post.find_elements(By.XPATH,"//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")

    for i, name in enumerate(name_spans):
        print(f"post n{i}")
        print(name.text)

Answer 1

Score: 0

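In the definition of name_spans you call post.find_elements because you want to restrict the search to the inside of post. That alone is not enough: an XPath that starts with // is evaluated from the document root even when it is called on an element, so you also have to add a dot . in front of the XPath: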

.//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]

Remember:

  • //div finds div elements anywhere in the HTML document
  • .//div finds div elements that are descendants of the current node
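Applied to the loop in the question, the fix could look roughly like the sketch below. The helpers to_relative and names_by_post are hypothetical names introduced here for illustration, and the class list is shortened to a single token ('x1i10hfl') only to keep the example readable. The posts argument is assumed to be the list returned by browser.find_elements in the question.

```python
try:
    from selenium.webdriver.common.by import By
    XPATH = By.XPATH
except ImportError:
    # Fallback so the sketch can be read without Selenium installed;
    # By.XPATH is the plain string "xpath".
    XPATH = "xpath"


def to_relative(xpath):
    """Prefix an absolute XPath with '.' so it is evaluated from the context element."""
    return "." + xpath if xpath.startswith("/") else xpath


def names_by_post(posts, name_xpath):
    """Return one list of non-empty link texts per post element (hypothetical helper)."""
    relative = to_relative(name_xpath)
    result = []
    for post in posts:
        # The leading dot keeps the search inside this post only
        links = post.find_elements(XPATH, relative)
        result.append([link.text for link in links if link.text])
    return result
```

With the question's variables, names_by_post(posts, "//a[contains(@class,'x1i10hfl')]") would give one list of names per post instead of the same page-wide list on every iteration.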

Moreover:

  • .// selects descendants of the current node
  • ./ selects direct children of the current node
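The difference between ./ and .// can be tried without a browser. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a small XPath subset (no contains(), for example), so it illustrates the path semantics and is not a replacement for Selenium. The markup and the function name children_and_descendants are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up markup: two "posts", the first with a nested link
SAMPLE = """
<div id="feed">
  <div class="post">
    <a>Alice</a>
    <span><a>Alice's link</a></span>
  </div>
  <div class="post">
    <a>Bob</a>
  </div>
</div>
"""


def children_and_descendants(markup):
    """For each post, return (direct-child link texts, all descendant link texts)."""
    root = ET.fromstring(markup)
    rows = []
    for post in root.findall("./div"):
        direct = [a.text for a in post.findall("./a")]    # direct children only
        nested = [a.text for a in post.findall(".//a")]   # any depth below post
        rows.append((direct, nested))
    return rows


if __name__ == "__main__":
    for direct, nested in children_and_descendants(SAMPLE):
        print(direct, nested)
```

For the first post, ./a returns only the link that is a direct child, while .//a also returns the link nested inside the span.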

huangapple
  • Published on March 3, 2023, 22:18:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75628235.html