How to scrape Twitter users based on a certain query using snscrape

Question

I am using snscrape to scrape users who have a certain keyword in their bio.
The algorithm I am currently using is the following:

  1. Search for tweets that contain certain words.
  2. Extract the user who posted each tweet.
  3. Check the extracted user's bio against the wanted keywords (if the bio includes the keywords, append the user to a data frame; if not, discard the user).
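
The three steps above can be sketched roughly as follows. This is only an illustration: `bio_matches` and `collect_matching_users` are hypothetical helper names, the tweets are stand-in objects rather than real snscrape results, and the code assumes the bio is exposed as `rawDescription` or `description` on the tweet's `user` object.

```python
from types import SimpleNamespace


def bio_matches(bio, keywords):
    """Return True if any keyword occurs in the bio, ignoring case."""
    bio_lower = (bio or "").lower()
    return any(keyword.lower() in bio_lower for keyword in keywords)


def collect_matching_users(tweets, keywords):
    """Steps 2 and 3: take each tweet's author and keep those whose bio matches."""
    matched = {}
    for tweet in tweets:
        user = tweet.user
        # Assumption: the bio is stored in `rawDescription` or `description`.
        bio = getattr(user, "rawDescription", None) or getattr(user, "description", None)
        # Skip authors already collected so each user appears only once.
        if user.username not in matched and bio_matches(bio, keywords):
            matched[user.username] = bio
    return matched


# Stand-in objects that mimic the shape of scraped tweets (not real data).
fake_tweets = [
    SimpleNamespace(user=SimpleNamespace(username="alice", rawDescription="Co-founder and CEO at Example")),
    SimpleNamespace(user=SimpleNamespace(username="bob", rawDescription="I like hiking")),
    SimpleNamespace(user=SimpleNamespace(username="alice", rawDescription="Co-founder and CEO at Example")),
]

result = collect_matching_users(fake_tweets, ["CEO", "CTO"])
print(result)
```

Plain substring matching is crude (for example, "cto" also matches "doctor"), so a word-boundary regular expression may be a better fit for real bios.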

What I want to know is whether there is a method that can search for users directly by their bio, i.e. one that simulates the advanced search feature of the Twitter web page, instead of what I am doing now.
I looked at the snscrape docs, but all the classes that deal with users appear to handle only a specific user, not a search for users based on a query.

Here is the code I am currently running:

import snscrape.modules.twitter as sntwt

query = "co founder (CEO OR Congress OR CTO) lang:en"
tweets = []
limit = 5000  # maximum number of tweets to inspect

# instead of searching for tweets I want to search for users
for i, tweet in enumerate(sntwt.TwitterSearchScraper(query).get_items()):
    if i >= limit:
        break  # stop once the limit is reached
    print(vars(tweet))
    print('\n\n\n\n')
    # some code that filters the users

Finally, here is a screenshot of Twitter's advanced search that shows the behavior I want.


Answer 1

Score: 1

See https://github.com/JustAnotherArchivist/snscrape/issues/263. As of this writing the issue is still open, but JustAnotherArchivist (the repository owner) appears to have committed an update a few weeks ago that enables this functionality (it may not be documented yet, or may not be reliable).

I think this requires the development version of snscrape, so install or upgrade to it if you don't have it yet (command from a Medium article):

$ pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

This should then allow the "--user" flag to work (I'm using snscrape from the command line; I'm not sure about the Python wrapper). For example:

$ snscrape --jsonl --max-results 10 twitter-search --user "go bananas since:2022-12-31" > out_file.json

This seems to search for the query string "go bananas" anywhere in the user object. For example, it returns a user object with 'username': 'gobananagoband' and 'displayname': 'Go Banana Go!'. It also returns a user object with 'description': "When games go BANANAS, we've got you covered in bunches. Tips? @ reply or bananasalert at gmail." (As far as I can tell, 'description', 'rawDescription', and 'renderedDescription' all hold the user bio.)

I am not sure whether you can restrict the match to "description" only. I have not experimented much yet.
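
If the search itself cannot be restricted to the bio, one workaround is to post-filter the JSONL output. The sketch below is a hedged illustration, not part of snscrape: `filter_users_by_bio` is a hypothetical helper, the sample lines are made-up stand-ins shaped like the user objects described above (`example_user` is an invented username), and it assumes each line is a JSON object with a 'rawDescription' or 'description' field.

```python
import json


def filter_users_by_bio(jsonl_lines, keyword):
    """Keep only the user objects whose bio contains the keyword (case-insensitive)."""
    keyword_lower = keyword.lower()
    matches = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        user = json.loads(line)
        # Assumption: the bio is stored under 'rawDescription' or 'description'.
        bio = user.get("rawDescription") or user.get("description") or ""
        if keyword_lower in bio.lower():
            matches.append(user)
    return matches


# Made-up sample lines standing in for the contents of out_file.json.
sample_lines = [
    json.dumps({"username": "gobananagoband", "displayname": "Go Banana Go!",
                "rawDescription": "A band."}),
    json.dumps({"username": "example_user", "displayname": "Example",
                "rawDescription": "When games go BANANAS, we've got you covered in bunches."}),
]

bio_hits = filter_users_by_bio(sample_lines, "go bananas")
print([user["username"] for user in bio_hits])
```

With a real output file, the open file handle can be passed in place of `sample_lines`, since the function only iterates over lines.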

This does support some of the other operators and qualifiers. For example, geolocation (from list; within 100 km of Twitter HQ):

$ snscrape --jsonl --max-results 10 twitter-search --user "elephant geocode:37.7,-122.4,100km lang:en since:2022-12-31" > out_file.json
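
Since the commands above use the command line rather than the Python wrapper, one possible way to drive them from a Python script is through `subprocess`. This is a minimal sketch that assumes the `snscrape` executable is on the PATH and accepts the flags exactly as shown above; `build_user_search_command` and `run_user_search` are hypothetical helper names.

```python
import subprocess


def build_user_search_command(query, max_results=10):
    """Return the argument list for the user-search CLI call shown above."""
    return [
        "snscrape", "--jsonl", "--max-results", str(max_results),
        "twitter-search", "--user", query,
    ]


def run_user_search(query, out_path, max_results=10):
    """Run the CLI and write its JSONL output to out_path (needs snscrape installed)."""
    command = build_user_search_command(query, max_results)
    with open(out_path, "w", encoding="utf-8") as out_file:
        # check=True raises CalledProcessError if snscrape exits with an error.
        subprocess.run(command, stdout=out_file, check=True)


command = build_user_search_command("go bananas since:2022-12-31")
print(command)
```

Passing the arguments as a list avoids shell quoting problems with queries that contain spaces or colons.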

huangapple
  • Posted on March 7, 2023 at 06:11:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75656309.html