February 10, 2023, 04:52:31

How do I scrape all Spotify playlists ever?

Question

I am trying to analyze all user-curated Spotify playlists and the tracks in them, especially in the hip-hop genre. The result I want is a list of user-curated playlist IDs (preferably 50,000 of them).

I have tried the Search API and the Get Category’s Playlists API.
The problems are:

The Search API has a limit of 1,000 results.
The Get Category’s Playlists API only returns Spotify-curated playlists for each genre.

I also tried to work around the Search API limit by issuing different queries (i.e. searching for 'a', 'b', 'c', 'd', ...). However, I still have no idea which queries would best represent Spotify playlists as a whole (searching for 'a', 'b', ... seems too random). I would appreciate any help or ideas!
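One way to approach the query idea is to run a set of genre-related search terms, page through each one with offset up to the cap, and de-duplicate the results by playlist ID. The sketch below is only an illustration under stated assumptions: `collect_playlist_ids`, `spotipy_fetcher`, the example query terms, and the filter that drops playlists whose owner ID is `spotify` are all hypothetical choices for this example, and the 1,000-per-query cap is taken from the question. It does not settle which queries are representative.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A page fetcher takes (query, limit, offset) and returns a list of playlist
# dicts (entries may be None), or an empty list when the query is exhausted.
PageFetcher = Callable[[str, int, int], List[Optional[dict]]]


def collect_playlist_ids(
    search_page: PageFetcher,
    queries: Iterable[str],
    target: int = 50_000,
    page_size: int = 50,
    per_query_cap: int = 1_000,  # assumed cap, as reported in the question
    skip_owner_ids: frozenset = frozenset({"spotify"}),  # assumption
) -> Dict[str, str]:
    """Return {playlist_id: name} gathered across many search queries."""
    found: Dict[str, str] = {}
    for query in queries:
        for offset in range(0, per_query_cap, page_size):
            items = search_page(query, page_size, offset)
            if not items:
                break  # this query has no more results
            for item in items:
                if not item:
                    continue  # skip empty entries
                owner_id = (item.get("owner") or {}).get("id")
                if owner_id in skip_owner_ids:
                    continue  # drop playlists from the excluded owners
                # setdefault keeps the first name seen for a repeated ID
                found.setdefault(item["id"], item.get("name", ""))
            if len(found) >= target:
                return found
    return found


def spotipy_fetcher(sp) -> PageFetcher:
    """Adapt a spotipy.Spotify client to the PageFetcher shape."""
    def fetch(query: str, limit: int, offset: int):
        result = sp.search(q=query, type="playlist", limit=limit, offset=offset)
        return result["playlists"]["items"]
    return fetch
```

With an authenticated client `sp`, a call could look like `collect_playlist_ids(spotipy_fetcher(sp), ["hip hop", "rap", "trap"])`. Passing the fetcher in as a callable keeps the paging logic testable without network access.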

This is what I have tried with the Get Category’s Playlists API, using the Spotipy library in Google Colab:

import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Replace Auth details with your Client ID, Secret
spotify_details = {
    'client_id': 'Client ID',
    'client_secret': 'Client Secret',
    'redirect_uri': 'Redirect_uri'}

scope = "user-library-read user-follow-read user-top-read playlist-read-private playlist-read-collaborative playlist-modify-public playlist-modify-private"

sp = spotipy.Spotify(
        auth_manager=SpotifyOAuth(
          client_id=spotify_details['client_id'],
          client_secret=spotify_details['client_secret'],
          redirect_uri=spotify_details['redirect_uri'],
          scope=scope, open_browser=False))

# The first call is only used to read the total number of playlists in the category
results = sp.category_playlists(category_id="hiphop", limit=5, country="US", offset=0)
total = results["playlists"]["total"]
df = pd.DataFrame([], columns=['id', 'name', 'external_urls.spotify'])
# Page through the category, 50 playlists per request
for offset in range(0, total, 50):
  results = sp.category_playlists(category_id="hiphop", limit=50, country="US", offset=offset)
  playlists = pd.json_normalize(results['playlists']['items'])
  #print(playlists.keys)
  df = pd.concat([df, playlists])
df

I can only get around 104 playlists when I run:

print(len(df))
>>104

P.S. This number varies around 80-100+ depending on the location of your account.

Answer 1

Score: 2

The main idea is the same as @Nima Akbarzadeh's: page through the results with offset.

I call the Spotify API with axios on Node.js.

First get the playlists, then loop over them and get the tracks of each playlist.

This code gets the songs from the playlists in Spotify's hiphop category.

const axios = require('axios')

const API_KEY='<your client ID>'
const API_KEY_SECRET='<your client Secret>'

// Request an access token with the client credentials grant
const getToken = async () => {
    try {
        const resp = await axios.post(
            'https://accounts.spotify.com/api/token',
            '',
            {
                params: {
                    'grant_type': 'client_credentials'
                },
                auth: {
                    username: API_KEY,
                    password: API_KEY_SECRET
                }
            }
        );
        return Promise.resolve(resp.data.access_token);
    } catch (err) {
        console.error(err)
        return Promise.reject(err)
    }
};
// Collect every playlist of a category, 20 per request, until `next` is null
const getCategories = async (category_id, token) => {
    try {
        let offset = 0
        let next = 1
        const playlists = [];
        while (next != null) {
            const resp = await axios.get(
                `https://api.spotify.com/v1/browse/categories/${category_id}/playlists?country=US&offset=${offset}&limit=20`,
                {
                    headers: {
                        'Accept-Encoding': 'application/json',
                        'Authorization': `Bearer ${token}`,
                    }
                }
            );
            
            for(const item of resp.data.playlists.items) {
                if(item?.name != null) {
                    playlists.push({
                        name: item.name,
                        external_urls: item.external_urls.spotify,
                        type: item.type,
                        id : item.id
                    })
                }
            }

            offset = offset + 20

            next = resp.data.playlists.next
        }
        return Promise.resolve(playlists)
    } catch (err) {
        console.error(err)
        return Promise.reject(err)
    }
}

// Fetch each playlist and collect the name and URL of its tracks
const getTracks = async (playlists, token) => {
    try {
        const tracks = [];
        for(const playlist of playlists) {
            const resp = await axios.get(
                `https://api.spotify.com/v1/playlists/${playlist.id}`,
                {
                    headers: {
                        'Accept-Encoding': 'application/json',
                        'Authorization': `Bearer ${token}`,
                    }
                }
            );
            for(const item of resp.data.tracks.items) {
                if(item.track?.name != null) {
                    tracks.push({
                        name: item.track.name,
                        external_urls: item.track.external_urls.spotify
                    })
                }
            }
        }
        return Promise.resolve(tracks)
    } catch (err) {
        console.error(err)
        return Promise.reject(err)
    }
};

getToken()
    .then(token => {
        getCategories('hiphop', token)
            .then(playlists => {
                getTracks(playlists, token)
                    .then(tracks => {
                        for(const track of tracks) {
                            console.log(track)
                        }
                    })
                    .catch(error => {
                        console.log(error.message);
                    });  
            })
            .catch(error => {
                console.log(error.message);
            });
      
    })
    .catch(error => {
        console.log(error.message);
    });

I got 6435 songs

$ node get-data.js
[
{
name: 'RapCaviar',
external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DX0XUsuxWHRQd',
type: 'playlist',
id: '37i9dQZF1DX0XUsuxWHRQd'
},
{
name: "Feelin' Myself",
external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DX6GwdWRQMQpq',
type: 'playlist',
id: '37i9dQZF1DX6GwdWRQMQpq'
},
{
name: 'Most Necessary',
external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DX2RxBh64BHjQ',
type: 'playlist',
id: '37i9dQZF1DX2RxBh64BHjQ'
},
{
name: 'Gold School',
external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWVA1Gq4XHa6U',
type: 'playlist',
id: '37i9dQZF1DWVA1Gq4XHa6U'
},
{
name: 'Locked In',
external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWTl4y3vgJOXW',
type: 'playlist',
id: '37i9dQZF1DWTl4y3vgJOXW'
},
{
name: 'Taste',
external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWSUur0QPPsOn',
type: 'playlist',
id: '37i9dQZF1DWSUur0QPPsOn'
},
{
name: 'Get Turnt',
external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWY4xHQp97fN6',
type: 'playlist',
id: '37i9dQZF1DWY4xHQp97fN6'
},
...
{
name: 'BILLS PAID (feat. Latto & City Girls)',
external_urls: 'https://open.spotify.com/track/0JiLQRLOeWQdPC9rVpOqqo'
},
{
name: 'Persuasive (with SZA)',
external_urls: 'https://open.spotify.com/track/67v2UHujFruxWrDmjPYxD6'
},
{
name: 'Shirt',
external_urls: 'https://open.spotify.com/track/34ZAzO78a5DAVNrYIGWcPm'
},
{
name: 'Back 2 the Streets',
external_urls: 'https://open.spotify.com/track/3Z9aukqdW2HuzFF1x9lKUm'
},
{
name: 'FTCU (feat. GloRilla & Gangsta Boo)',
external_urls: 'https://open.spotify.com/track/4lxTmHPgoRWwM9QisWobJL'
},
{
name: 'My Way',
external_urls: 'https://open.spotify.com/track/5BcIBbBdkjSYnf5jNlLG7j'
},
{
name: 'Donk',
external_urls: 'https://open.spotify.com/track/58lmOL5ql1YIXrpRpoYi3i'
},
... 6335 more items
]

node get-data.js > result.json

Update with Python version

import spotipy
from spotipy.oauth2 import SpotifyOAuth
import json
import re

SCOPE = ['user-library-read',
    'user-follow-read',
    'user-top-read',
    'playlist-read-private',
    'playlist-read-collaborative',
    'playlist-modify-public',
    'playlist-modify-private']
USER_ID = '<your user id>'
REDIRECT_URI = '<your redirect uri>'
CLIENT_ID = '<your client id>'
CLIENT_SECRET = '<your client secret>'
auth_manager = SpotifyOAuth(
    scope=SCOPE,
    username=USER_ID,
    redirect_uri=REDIRECT_URI,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET)

def get_categories():
    # Collect every playlist in the hiphop category, one page at a time
    try:
        sp = spotipy.Spotify(auth_manager=auth_manager)
        query_limit = 50
        categories=[]
        new_offset = 0
        while True:
            results=sp.category_playlists(category_id='hiphop', limit = query_limit, country='US', offset=new_offset)
            for item in results['playlists']['items']:
                if (item is not None and item['name'] is not None):
                    # ['https:', '', 'api.spotify.com', 'v1', 'playlists', '37i9dQZF1DX0XUsuxWHRQd', 'tracks']
                    tokens = re.split(r"[\/]", item['tracks']['href'])
                    categories.append({
                        'id' : item['id'],
                        'name': item['name'],
                        'url': item['external_urls']['spotify'],
                        'tracks': item['tracks']['href'],
                        'playlist_id': tokens[5],
                        'type': item['type']
                    })
            new_offset = new_offset + query_limit
            # 'next' is None on the last page
            next_url = results['playlists']['next']
            if next_url is None:
                break
        return categories
    except Exception as e:
        print('Failed to call get_categories: '+ str(e))

def get_songs(categories):
    # Fetch each playlist and collect the id, name and URL of its tracks
    try:
        sp = spotipy.Spotify(auth_manager=auth_manager)
        songs=[]
        for category in categories:
            if category is None:
                break
            playlist_id = category['playlist_id']
            results=sp.playlist(playlist_id=playlist_id)
            for item in results['tracks']['items']:
                if (item is not None and item['track'] is not None and item['track']['id'] is not None and item['track']['name'] is not None and item['track']['external_urls']['spotify'] is not None):
                    songs.append({
                        'id' : item['track']['id'],
                        'name': item['track']['name'],
                        'url': item['track']['external_urls']['spotify']
                    })
                else:
                    # Stops reading this playlist at the first incomplete item
                    break
        return songs
    except Exception as e:
        print('Failed to call get_songs: '+ str(e))

categories = get_categories()
songs = get_songs(categories)
print(json.dumps(songs))
# print(len(songs)) -> 6021

Run it with:

$ python get-songs.py > all-songs.json
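A playlist object carries its tracks as a paged list, so a single `sp.playlist(...)` call may return only the first page of a long playlist. The following is a hedged sketch of reading every page and de-duplicating tracks across playlists. `iter_playlist_tracks`, `unique_songs`, and the injected `fetch_items` callable are hypothetical names for this example; with Spotipy, `fetch_items` could be `lambda pid, limit, offset: sp.playlist_items(pid, limit=limit, offset=offset)`.

```python
from typing import Callable, Dict, Iterable, Iterator

# fetch_items(playlist_id, limit, offset) returns one page shaped like
# {"items": [{"track": {...}}, ...], "next": <url or None>}.
ItemFetcher = Callable[[str, int, int], dict]


def iter_playlist_tracks(fetch_items: ItemFetcher,
                         playlist_id: str,
                         page_size: int = 100) -> Iterator[dict]:
    """Yield every usable track of one playlist, page by page."""
    offset = 0
    while True:
        page = fetch_items(playlist_id, page_size, offset)
        for item in page.get("items") or []:
            track = (item or {}).get("track")
            # Skip empty slots rather than abandoning the rest of the playlist.
            if track and track.get("id"):
                yield track
        if not page.get("next"):
            break  # no further pages
        offset += page_size


def unique_songs(fetch_items: ItemFetcher,
                 playlist_ids: Iterable[str]) -> Dict[str, str]:
    """Map track ID -> track name across all playlists, without duplicates."""
    songs: Dict[str, str] = {}
    for playlist_id in playlist_ids:
        for track in iter_playlist_tracks(fetch_items, playlist_id):
            songs.setdefault(track["id"], track.get("name", ""))
    return songs
```

Keying the result by track ID means a song that appears in several playlists is counted once, which matters when the goal is a count of distinct songs.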

Answer 2

Score: 0

Currently, Spotify will not let you fetch more than 1,000 items, since even their own application shows at most 1,000 songs (based on this answer).

Also, if there is an offset option, you can set it to 1,000 so that it skips the first 1,000 items and you get the second chunk.
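The offset idea amounts to walking a result set in fixed-size windows. A small sketch of generating those windows follows; `page_windows` and its `cap` parameter are hypothetical names for this example, and whether a given endpoint accepts offsets at or beyond 1,000 is something to check against the API rather than assume.

```python
from typing import Iterator, Optional, Tuple


def page_windows(total: int, page_size: int = 50,
                 cap: Optional[int] = None) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs covering `total` items.

    If `cap` is given, no window reaches past that many items.
    """
    end = total if cap is None else min(total, cap)
    for offset in range(0, end, page_size):
        # The last window may be shorter than page_size.
        yield offset, min(page_size, end - offset)
```

Each pair can be passed straight to a paged call, for example `sp.search(q=query, type="playlist", limit=limit, offset=offset)` inside `for offset, limit in page_windows(total, 50, cap=1000)`.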
