How to scrape Twitter users based on a certain query using snscrape
Question
I am using snscrape to scrape users who have a certain keyword in their bio.
The algorithm I am using right now is the following:
- Search for tweets that contain certain words.
- Extract the user who posted each tweet.
- Filter the extracted user's bio against the wanted keywords (if the bio includes the keywords, append that user to a data frame; if not, discard the user).
What I want to know is whether there is a method that can search for users directly based on their bio, i.e. one that simulates the advanced search feature of the Twitter web page, instead of what I am doing right now.
I looked at the snscrape docs, but all the classes that deal with users appear to handle only one specific user, not a search for users based on some query.
Here is the code I am currently running:
import snscrape.modules.twitter as sntwt
query = "co founder (CEO OR Congress OR CTO) lang:en"
tweets = []
limit = 5000
# instead of searching for tweets I want to search for users
for tweet in sntwt.TwitterSearchScraper(query).get_items():
    if len(tweets) >= limit:  # stop once the limit is reached
        break
    tweets.append(tweet)
    print(vars(tweet))
    print('\n\n\n\n')
    # some code that filters the users
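The filtering step that the last comment leaves out could be sketched roughly as follows. This is only an illustration under assumptions: `bio_matches` and `collect_users` are hypothetical helper names, and the sketch assumes each tweet object exposes a `user` attribute with `username` and `rawDescription` (the bio) fields, which may differ between snscrape versions. Stand-in objects replace real tweets so the sketch runs without network access.

```python
from types import SimpleNamespace


def bio_matches(bio, keywords):
    """Return True if the bio contains any keyword (case-insensitive substring match)."""
    text = (bio or "").lower()
    return any(kw.lower() in text for kw in keywords)


def collect_users(tweets, keywords, limit=5000):
    """Scan at most `limit` tweets and keep each author whose bio matches, once."""
    matched = {}
    for i, tweet in enumerate(tweets):
        if i >= limit:
            break
        user = tweet.user
        # Assumption: the bio lives in `rawDescription`; adjust for your snscrape version.
        bio = getattr(user, "rawDescription", None)
        if user.username not in matched and bio_matches(bio, keywords):
            matched[user.username] = bio
    return matched


# Stand-in tweet objects (hypothetical data) so the sketch is self-contained
sample = [
    SimpleNamespace(user=SimpleNamespace(username="alice", rawDescription="Co-founder and CTO")),
    SimpleNamespace(user=SimpleNamespace(username="bob", rawDescription="Coffee lover")),
    SimpleNamespace(user=SimpleNamespace(username="alice", rawDescription="Co-founder and CTO")),
]
result = collect_users(sample, ["CEO", "CTO"])
print(result)
```

With real data, `sample` would be replaced by the iterator returned by `get_items()`, and the resulting dictionary could then be turned into a data frame. Note that plain substring matching can produce false positives for short keywords.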
Finally, here is a screenshot of Twitter's advanced search that simulates the behavior I want.
Answer 1
Score: 1
Check out https://github.com/JustAnotherArchivist/snscrape/issues/263. As of this writing, this is still an open issue, but JustAnotherArchivist (the repository owner) appears to have committed an update a few weeks ago that allows this functionality (it might not be documented yet, or might not be reliable).
I think this requires the development version of snscrape, so install or upgrade to that if you don't have it yet (command taken from a Medium article):
$ pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
This should then allow the "--user" flag to work (I'm using snscrape from the command line; I'm not sure about the Python wrapper). For example:
$ snscrape --jsonl --max-results 10 twitter-search --user "go bananas since:2022-12-31" > out_file.json
This seems to search for the query string "go bananas" anywhere in the user object. For example, it returns a user object with 'username': 'gobananagoband' and 'displayname': 'Go Banana Go!'. It also returns a user object with 'description': "When games go BANANAS, we've got you covered in bunches. Tips? @ reply or bananasalert at gmail." (As far as I can tell, 'description', 'rawDescription', and 'renderedDescription' all hold the user bio.)
I am not sure whether you can restrict the search to "description" only. I have not experimented much yet.
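If the search cannot be restricted to the bio, one workaround is to filter the JSONL output afterwards. Here is a minimal sketch, assuming each line of the output file is one JSON user object carrying the 'description', 'rawDescription', and 'renderedDescription' keys mentioned above. `filter_by_bio` is a hypothetical helper, not part of snscrape, and the records below are made-up stand-ins for the real file.

```python
import io
import json

# Keys that appear to hold the user bio, per the output described above
BIO_KEYS = ("description", "rawDescription", "renderedDescription")


def filter_by_bio(lines, keyword):
    """Yield user dicts whose bio fields contain `keyword` (case-insensitive)."""
    needle = keyword.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user = json.loads(line)
        bios = [user.get(key) or "" for key in BIO_KEYS]
        if any(needle in bio.lower() for bio in bios):
            yield user


# In-memory stand-in for out_file.json (hypothetical records)
fake_file = io.StringIO(
    '{"username": "gobananagoband", "displayname": "Go Banana Go!", "rawDescription": "A band"}\n'
    '{"username": "alerts", "rawDescription": "When games go BANANAS, we have got you covered."}\n'
)
matches = list(filter_by_bio(fake_file, "go bananas"))
print([user["username"] for user in matches])
```

For the real output, `fake_file` would be replaced by `open("out_file.json", encoding="utf-8")`. The first record is dropped because the query matches only its username and display name, not its bio.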
This does support some of the other operators and qualifiers. For example, geolocation (taken from a list; this searches within 100 km of Twitter HQ):
$ snscrape --jsonl --max-results 10 twitter-search --user "elephant geocode:37.7,-122.4,100km lang:en since:2022-12-31" > out_file.json