Python and Reddit APIs: my code doesn't give back all results from the huge reddit database. Why?

Question

I am practicing extracting data from Reddit. I tried to get the 20 most relevant communities that contain the word "sport". There are hundreds of them, but my API request returned fewer than 20. Do you know why?

Here is the code:

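import requests

# `headers` is assumed to hold the OAuth bearer token and User-Agent set up earlier
parameters = {"query": "sport", "limit": 20, "sort": "relevance"}

res = requests.get("https://oauth.reddit.com/api/search_reddit_names", headers=headers, params=parameters)
res.json()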

Output:

{'names': ['sports',
  'sportsbook',
  'sportsarefun',
  'sportsbetting',
  'SportsFR',
  'sportscards',
  'SportsPorn',
  'SportingCP',
  'Sports_Women']}

Answer 1

Score: 1

As seen in the Reddit API Docs,

GET /api/search_reddit_names
- List subreddit names that begin with a query string.

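This difference between "begins with" and "contains" is probably the cause of your problem: the endpoint only matches subreddits whose names start with the query, which is why every name in your output starts with "sport".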
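
To make the distinction concrete, here is a small offline sketch. The sample names are hand-picked from the outputs in this thread, and `begins_with` and `contains` are hypothetical helpers for illustration only, not Reddit's actual matching logic.

```python
def begins_with(names, query):
    """Keep names that start with the query (case-insensitive)."""
    q = query.lower()
    return [n for n in names if n.lower().startswith(q)]


def contains(names, query):
    """Keep names that include the query anywhere (case-insensitive)."""
    q = query.lower()
    return [n for n in names if q in n.lower()]


# Hand-picked sample, not real API output
sample = ["sports", "SportsFR", "BroncoSport", "Dualsport", "nba"]

prefix_matches = begins_with(sample, "sport")
substring_matches = contains(sample, "sport")

print(prefix_matches)     # ['sports', 'SportsFR']
print(substring_matches)  # ['sports', 'SportsFR', 'BroncoSport', 'Dualsport']
```

A prefix match drops names such as BroncoSport and Dualsport, which would explain a shorter list than expected.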

From a quick glance at the API docs, I haven't found an endpoint that fits your needs.

EDIT:

You could use a different endpoint, one meant specifically for search.

import requests

# Set the URL endpoint for the API request
url = "https://www.reddit.com/subreddits/search.json"

# Reddit asks API clients to send a descriptive User-Agent; this value is a placeholder
headers = {"User-Agent": "subreddit-search-example/0.1"}

# Set the parameters for the API request
search_param = "sport"

params = {
    "q": search_param,  # Search query
    "limit": 100,  # Maximum number of results to retrieve
    "type": "sr"   # Limit search to subreddits
}

# Send the API request (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, params=params, headers=headers, timeout=10)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Extract the JSON data from the response
    data = response.json()

    # Extract the list of subreddit names from the JSON data
    subreddits = [item["data"]["display_name"] for item in data["data"]["children"]]

    # Print the list of subreddits
    print(f"Subreddits containing '{search_param}':")
    for subreddit in subreddits:
        print(subreddit)
else:
    print("An error occurred while retrieving the data. Status code:", response.status_code)

That outputs:

Subreddits containing 'sport':
sport
soccer
sports
BroncoSport
AskReddit
nba
sportsbetting
formula1
baseball
leagueoflegends
sportsbook
teenagers
Dualsport
sportvids
SportWagon
nfl
CFB
OriginSport
sportsarefun
hockey
granturismo
weightlifting
football
unpopularopinion
MMA
GranTurismoSport
...

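As you can see, many of the results don't contain 'sport' in their name, so you'll have to filter them further.

E.g., inside the for subreddit in subreddits loop, add a case-insensitive check (a plain search_param in subreddit would drop names like BroncoSport):

if search_param.lower() in subreddit.lower():

or apply the same condition in the list comprehension beforehand.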
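
As a rough sketch of the list-comprehension variant, the filter can be applied directly to the parsed JSON. The function name `matching_names` and the hand-built `fake_listing` are assumptions made for illustration; only the `data` / `children` / `display_name` structure comes from the code above.

```python
def matching_names(listing, search_param):
    """Return display names from a parsed listing whose name contains search_param."""
    needle = search_param.lower()
    return [
        item["data"]["display_name"]
        for item in listing["data"]["children"]
        if needle in item["data"]["display_name"].lower()
    ]


# Hand-built stand-in for response.json(), using names from the output above
fake_listing = {
    "data": {
        "children": [
            {"data": {"display_name": name}}
            for name in ["sport", "soccer", "BroncoSport", "AskReddit", "Dualsport"]
        ]
    }
}

filtered = matching_names(fake_listing, "sport")
print(filtered)  # ['sport', 'BroncoSport', 'Dualsport']
```

With a real response you would pass `response.json()` in place of `fake_listing`.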

huangapple
  • Posted on May 24, 2023 at 20:02:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76323310.html