March 3, 2023, 22:18:32

Using Python Selenium to export Facebook posts - can't separate by post

Question

For a research project, I am trying to scrape posts from a Facebook group. For each post I want the author's name, the date and the post content.

I started with the code below, but it does not capture the data post by post: it returns all the names together, and I cannot break them down by post.

In the code below, my intent was to find all the posts with:

posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")

then iterate over each post element with:

for index, post in enumerate(posts):

and finally collect all the matching a elements inside that post with:

name_spans = post.find_elements(By.XPATH, "//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")

But when I run the full code below, every iteration returns all the a elements on the page together, so I cannot tell which one belongs to which post.

Is there a workaround?

import random
from time import sleep

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Load cookies
browser = load_cookies.adding_cookies_browser(cookies)

# Open the group page
print("getting to fb")
browser.get("https://www.facebook.com/groups/1494117617557521?sorting_setting=CHRONOLOGICAL")
sleep(random.randint(4, 6))

# Scroll until this member's post (so only new posts are collected)
last_member = ["member name here", "link here"]
new_members = []

# Scroll down until the last contact scraped in the previous run is found
while True:
    # Scroll down
    browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
    print("scrolling down")
    try:
        # Look for the last name scraped in the previous run; stop scrolling once it appears
        text = WebDriverWait(browser, random.randint(3, 5)).until(EC.presence_of_element_located((By.XPATH, f"//*[text()='{last_member[0]}']")))
        print(f"found {last_member[0]}")
        break
    except TimeoutException:
        # Not on the page yet: keep scrolling
        pass

group_posts = []
posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")
print(posts)
print(len(posts))

# Get the info of each post
for index, post in enumerate(posts):
    print(f"post{index}")
    print(post)

    # Find people's names through the a elements matching this class list
    name_spans = post.find_elements(By.XPATH, "//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")

    for i, name in enumerate(name_spans):
        print(f"post n{i}")
        print(name.text)

Answer 1

Score: 0

In the definition of name_spans you call post.find_elements because you want to restrict the search to the inside of post. But this is not enough: an XPath that starts with // is evaluated from the document root even when it is called on an element, so you also have to add a dot . in front of the XPath:

.//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]

Remember:

//div finds div elements anywhere in the HTML document
.//div finds div elements that are descendants of the current node

Moreover:

.// finds the descendants of the current node
./ finds the direct children of the current node
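
To see the scoping effect without opening a browser, here is a minimal sketch that uses Python's standard xml.etree.ElementTree in place of Selenium. The markup, the post and author class names, and the names Alice and Bob are made up for illustration; they are not Facebook's real structure. ElementTree only accepts relative paths on an element, so the page-wide search is written as a search from the root, which stands in for what a leading // does in Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two posts, each with one author link.
PAGE = """
<html><body>
  <div class="post"><a class="author">Alice</a><p>First post</p></div>
  <div class="post"><a class="author">Bob</a><p>Second post</p></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
posts = root.findall(".//div[@class='post']")

# Searching from the document root returns every author link on the page.
# This mirrors what an XPath starting with // does in Selenium,
# no matter which element find_elements is called on.
all_names = [a.text for a in root.findall(".//a[@class='author']")]

# Searching from each post with a leading dot stays inside that post.
names_by_post = [
    [a.text for a in post.findall(".//a[@class='author']")]
    for post in posts
]

print(all_names)      # ['Alice', 'Bob']
print(names_by_post)  # [['Alice'], ['Bob']]
```

In the Selenium loop from the question, the equivalent change is to pass the XPath with the leading dot to post.find_elements, so that each iteration only returns the links inside its own post.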
