How to scrape data from new (2023) PGA Tour website in Python

Question

<details>
<summary>English:</summary>

The PGA Tour updated their website (as of Feb 7, 2023), which completely broke the way I was scraping it for data. It used to have a "hidden" URL that you could uncover by looking at the Network tab in Developer Tools. I could then use that "hidden" URL with Requests in Python to pull the data tables.

For background on how it used to work, see the response to this previous post of mine: https://stackoverflow.com/questions/70141129/what-to-do-when-python-requests-get-gets-a-browser-error-from-the-website.

Now it seems like all the data is obscured and can no longer be accessed via a URL like before. I'm hoping someone more fluent in web-scraping tricks can point me in the right direction to do what that previous link did:

1. For any tournament, be able to pull tournament history from any year/season. (Example from the new site: https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results)
2. For any statistic, be able to pull stats from any year/season. (Example from the new site: https://www.pgatour.com/stats/detail/02674)

My initial try can pull the table off the current page (but not previous years), and some of the data it pulls is not text but rather formatting code.

    import requests
    import pandas as pd

    tournament_url = 'https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results'
    headers = {'User-Agent': 'Mozilla/5.0'}
    t = pd.read_html(requests.get(tournament_url, headers=headers).text)[0]
    t

**EDIT**: I see from a response below that this is using GraphQL. I discovered that if you click on the `graphql` line in the Network tab and then look at the Payload tab, you'll see these variables: `{"tournamentPastResultsId": "R2023464", "year": 2022}`.

These seem to give the tournament ID and year in question, so in theory you can simply update these values in a query and pick any tournament, any year. Integrating these into the scraping would mimic how it was done before. I'm not sure how to do that, though. I'll do some more research on Selenium. Hopefully it is able to pass through these variables somehow.

**EDIT 2**:
The answer was given below for how to do this for the tournament data. For Stats data (e.g. https://www.pgatour.com/stats/detail/02567), I was able to modify the code to get the appropriate table (see below).

Posted below for reference (thanks to @Jurakin!):

    import pandas as pd
    from numpy import NaN
    import requests

    # the request headers seem to need a constant token ('x-api-key')
    X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"

    YEAR = 2022  # Stats Season
    STAT_ID = "02567"  # Stat ID

    # prepare the payload
    payload = {
        "operationName": "StatDetails",
        "variables": {
            "tourCode": "R",
            "statId": STAT_ID,
            "year": YEAR,
            "eventQuery": None
        },
        "query": "query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {\n  statDetails(\n    tourCode: $tourCode\n    statId: $statId\n    year: $year\n    eventQuery: $eventQuery\n  ) {\n    tourCode\n    year\n    displaySeason\n    statId\n    statType\n    tournamentPills {\n      tournamentId\n      displayName\n    }\n    yearPills {\n      year\n      displaySeason\n    }\n    statTitle\n    statDescription\n    tourAvg\n    lastProcessed\n    statHeaders\n    statCategories {\n      category\n      displayName\n      subCategories {\n        displayName\n        stats {\n          statId\n          statTitle\n        }\n      }\n    }\n    rows {\n      ... on StatDetailsPlayer {\n        __typename\n        playerId\n        playerName\n        country\n        countryFlag\n        rank\n        rankDiff\n        rankChangeTendency\n        stats {\n          statName\n          statValue\n          color\n        }\n      }\n      ... on StatDetailTourAvg {\n        __typename\n        displayName\n        value\n      }\n    }\n  }\n}"
    }

    # post the request
    page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})

    # check for status code
    page.raise_for_status()

    # get the data
    data = page.json()["data"]["statDetails"]["rows"]

    # print(data)

    # format to a table that is in the webpage
    table = map(lambda item: {
        "rank": item["rank"],
        "player": item["playerName"],
        "average": item["stats"][0]["statValue"],
    }, data)

    # convert to a dataframe
    s = pd.DataFrame(table)

    s

**EDIT 3 - FOLLOW UP QUESTION**:
The code above works for stats with 5 characters in the Stat ID. But for others with 3 characters (e.g. https://www.pgatour.com/stats/detail/156), the request grabs the data correctly but the table-mapping step fails, even though the response formats look identical as far as I can tell, so I am at a loss as to why one works and the other does not.

    import pandas as pd
    from numpy import NaN
    import requests

    # the request headers seem to need a constant token ('x-api-key')
    X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"

    YEAR = 2022  # Stats Season
    # STAT_ID = "02567"  # Stat ID SGOTT
    STAT_ID = "156"  # Birdie Average - doesn't work for stats that only have three numbers and I can't figure out why

    # prepare the payload
    payload = {
        "operationName": "StatDetails",
        "variables": {
            "tourCode": "R",
            "statId": STAT_ID,
            "year": YEAR,
            "eventQuery": None
        },
        "query": "query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {\n  statDetails(\n    tourCode: $tourCode\n    statId: $statId\n    year: $year\n    eventQuery: $eventQuery\n  ) {\n    tourCode\n    year\n    displaySeason\n    statId\n    statType\n    tournamentPills {\n      tournamentId\n      displayName\n    }\n    yearPills {\n      year\n      displaySeason\n    }\n    statTitle\n    statDescription\n    tourAvg\n    lastProcessed\n    statHeaders\n    statCategories {\n      category\n      displayName\n      subCategories {\n        displayName\n        stats {\n          statId\n          statTitle\n        }\n      }\n    }\n    rows {\n      ... on StatDetailsPlayer {\n        __typename\n        playerId\n        playerName\n        country\n        countryFlag\n        rank\n        rankDiff\n        rankChangeTendency\n        stats {\n          statName\n          statValue\n          color\n        }\n      }\n      ... on StatDetailTourAvg {\n        __typename\n        displayName\n        value\n      }\n    }\n  }\n}"
    }

    # post the request
    page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})

    # check for status code
    page.raise_for_status()

    # get the data
    data = page.json()["data"]["statDetails"]["rows"]

    print(data)

    # format to a table that is in the webpage
    table = map(lambda item: {
        "RANK": item["rank"],
        "PLAYER": item["playerName"],
        "AVERAGE": item["stats"][0]["statValue"],
    }, data)

    # convert to a dataframe
    s = pd.DataFrame(table)

    s



</details>


# Answer 1
**Score**: 2


As you can see in DevTools, the page uses GraphQL. GraphQL is a bit complicated for me, and it would take a long time to deobfuscate and understand the code, so I used Selenium 4 to run the JavaScript and build the table.

    from io import StringIO

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import pandas as pd

    driver = webdriver.Chrome()

    # load page
    driver.get("https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results")

    # get table
    table = driver.find_element(By.CSS_SELECTOR, "table.chakra-table")
    assert table, "table not found"

    # remove empty rows
    driver.execute_script("""arguments[0].querySelectorAll("td.css-1au52ex").forEach((e) => e.parentElement.remove())""", table)

    # get html of the table
    table_html = table.get_attribute("outerHTML")

    # quit selenium
    driver.quit()

    # wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
    df = pd.read_html(StringIO(table_html))[0]

    print(df)

Outputs:

         Pos             Player  R1   R2   R3   R4 To Par  FedExCup Pts Official Money
    0      1           Max Homa  -7   -5    E   -4    -16         500.0     $1,440,000
    1      2      Danny Willett  -4   -8    E   -3    -15         300.0       $872,000
    2      3  Taylor Montgomery  -4   -1    E   -8    -13         190.0       $552,000
    3     T4       Justin Lower  -9   -1   -3   +1    -12         122.5       $360,000
    4     T4      Byeong Hun An  -6   -4   -1   -1    -12         122.5       $360,000
    ..   ...                ...  ..  ...  ...  ...    ...           ...            ...
    151  CUT         Doc Redman  +2   +6  NaN  NaN     +8           0.0             $0
    152  CUT       Kyle Stanley  +6   +2  NaN  NaN     +8           0.0             $0
    153  CUT         Jim Herman  -1  +10  NaN  NaN     +9           0.0             $0
    154  CUT        Taylor Lowe  +9   +8  NaN  NaN    +17           0.0             $0
    155  W/D   Brandon Matthews   -  NaN  NaN  NaN      E           0.0             $0

    [156 rows x 9 columns]
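The score columns in the output above are strings such as `-7`, `E`, `+10` and `-`, so they cannot be sorted or summed directly. As a minimal sketch, a small converter could turn them into integers. The helper name `par_to_int` is hypothetical, and treating `-` (and anything else unparseable) as missing is an assumption about what you want, not something the site defines.

```python
def par_to_int(value):
    """Convert a par-relative score string to an int, or None if it is not a score."""
    if value is None:
        return None
    text = str(value).strip()
    if text.upper() == "E":  # "E" means even par
        return 0
    try:
        return int(text)  # int() accepts a leading "+" or "-"
    except ValueError:
        return None  # e.g. "-" for a withdrawn player, or "nan"


scores = ["-7", "E", "+10", "-"]
converted = [par_to_int(s) for s in scores]
print(converted)  # [-7, 0, 10, None]
```

With the DataFrame above, something like `df["To Par"].map(par_to_int)` would apply the same conversion to a whole column.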

EDIT:

I created a script that uses the GraphQL API to fetch the data, as you suggested in the comments.

    import pandas as pd
    from numpy import nan  # the `NaN` alias was removed in NumPy 2.0
    import requests

    # the request headers seem to need a constant token
    X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"

    YEAR = 2023
    PAST_RESULTS_ID = "R2023464"

    # prepare the payload
    payload = {
        "operationName": "TournamentPastResults",
        "variables": {
            "tournamentPastResultsId": PAST_RESULTS_ID,
            "year": YEAR
        },
        "query": "query TournamentPastResults($tournamentPastResultsId: ID!, $year: Int) {\n  tournamentPastResults(id: $tournamentPastResultsId, year: $year) {\n    id\n    players {\n      id\n      position\n      player {\n        id\n        firstName\n        lastName\n        shortName\n        displayName\n        abbreviations\n        abbreviationsAccessibilityText\n        amateur\n        country\n        countryFlag\n        lineColor\n      }\n      rounds {\n        score\n        parRelativeScore\n      }\n      additionalData\n      total\n      parRelativeScore\n    }\n    rounds\n    additionalDataHeaders\n    availableSeasons {\n      year\n      displaySeason\n    }\n    winner {\n      id\n      firstName\n      lastName\n      totalStrokes\n      totalScore\n      countryFlag\n      countryName\n      purse\n      points\n    }\n  }\n}"
    }

    # post the request
    page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})

    # check for status code
    page.raise_for_status()

    # get the data
    data = page.json()["data"]["tournamentPastResults"]["players"]

    # format to a table that is in the webpage
    table = map(lambda item: {
        "pos": item["position"],
        "player": item["player"]["displayName"],
        "r1": item["rounds"][0]["parRelativeScore"] if len(item["rounds"]) > 0 else nan,
        "r2": item["rounds"][1]["parRelativeScore"] if len(item["rounds"]) > 1 else nan,
        "r3": item["rounds"][2]["parRelativeScore"] if len(item["rounds"]) > 2 else nan,
        "r4": item["rounds"][3]["parRelativeScore"] if len(item["rounds"]) > 3 else nan,
        "to par": item["parRelativeScore"],
        "fedexcup pts": item["additionalData"][0],
        "official money": item["additionalData"][1],
    }, data)

    # convert to a dataframe
    df = pd.DataFrame(table)

    print(df)
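To pull any tournament for any year, the payload construction and the row flattening can be separated from the network call. The sketch below is one possible way to do that. The helper names `past_results_payload` and `players_to_records` are hypothetical, the sample data is hand-written to mimic the response shape used above (it is not real API output), and whether a given tournament ID and year combination returns data is up to the API.

```python
import math


def past_results_payload(query: str, tournament_id: str, year: int) -> dict:
    """Build the GraphQL payload for one tournament and one season."""
    return {
        "operationName": "TournamentPastResults",
        "variables": {"tournamentPastResultsId": tournament_id, "year": year},
        "query": query,  # the same query string as in the script above
    }


def players_to_records(players: list, n_rounds: int = 4) -> list:
    """Flatten the `players` list into one dict per table row."""
    records = []
    for item in players:
        rounds = item.get("rounds") or []
        extra = item.get("additionalData") or []
        record = {"pos": item.get("position"), "player": item["player"]["displayName"]}
        for i in range(n_rounds):
            # players who missed the cut have fewer rounds, so pad with NaN
            record[f"r{i + 1}"] = rounds[i]["parRelativeScore"] if i < len(rounds) else math.nan
        record["to par"] = item.get("parRelativeScore")
        record["fedexcup pts"] = extra[0] if len(extra) > 0 else math.nan
        record["official money"] = extra[1] if len(extra) > 1 else math.nan
        records.append(record)
    return records


# hand-written sample in the assumed response shape, for illustration only
sample_players = [
    {"position": "1", "player": {"displayName": "Max Homa"},
     "rounds": [{"parRelativeScore": s} for s in ("-7", "-5", "E", "-4")],
     "additionalData": ["500.00", "$1,440,000"], "parRelativeScore": "-16"},
    {"position": "CUT", "player": {"displayName": "Doc Redman"},
     "rounds": [{"parRelativeScore": s} for s in ("+2", "+6")],
     "additionalData": ["0.00", "$0"], "parRelativeScore": "+8"},
]
records = players_to_records(sample_players)
print(records[1])
```

In a real run you would POST `past_results_payload(...)` with `requests.post` and the `x-api-key` header exactly as in the script above, pass `page.json()["data"]["tournamentPastResults"]["players"]` to `players_to_records`, and wrap the result in `pd.DataFrame(records)`. Looping over several years then only means calling the payload builder with a different `year`.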

EDIT 3:

The code raises `KeyError: 'rank'` because one of the items does not have a `rank` key. I used the following code to find the invalid item:

    # get the data
    data = page.json()["data"]["statDetails"]["rows"]

    for item in data:
        if "rank" not in item:
            print(item)

    # Outputs:
    # {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"}

As you can see, its `__typename` is different from all the others. I found two solutions:
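A quick way to check which row types a response contains, before mapping it to a table, is to count the `__typename` values. This is only a sketch: `typename_counts` is a hypothetical helper, and the sample rows are hand-written to mirror the two shapes shown above, with made-up player names and values.

```python
from collections import Counter


def typename_counts(rows: list) -> Counter:
    """Count how many rows of each GraphQL type the response contains."""
    return Counter(row.get("__typename", "<missing>") for row in rows)


# hand-written sample rows in the two shapes seen in the response
sample_rows = [
    {"__typename": "StatDetailsPlayer", "rank": 1, "playerName": "Player A",
     "stats": [{"statName": "Avg", "statValue": "4.50"}]},
    {"__typename": "StatDetailsPlayer", "rank": 2, "playerName": "Player B",
     "stats": [{"statName": "Avg", "statValue": "4.40"}]},
    {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"},
]
counts = typename_counts(sample_rows)
print(counts)
```

Any type other than `StatDetailsPlayer` in the result is a row that the `rank` / `playerName` mapping will not be able to handle.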

Solution A

Filter out items whose `__typename` is not `StatDetailsPlayer`:

    ...

    # get the data
    data = page.json()["data"]["statDetails"]["rows"]

    # print(data)

    # keep only items whose __typename is "StatDetailsPlayer"; this drops rows like
    # {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"}
    data = filter(lambda item: item.get("__typename") == "StatDetailsPlayer", data)

    # format to a table that is in the webpage
    table = map(lambda item: {
        "RANK": item["rank"],
        "PLAYER": item["playerName"],
        "AVERAGE": item["stats"][0]["statValue"],
    }, data)

    # convert to a dataframe
    s = pd.DataFrame(table)

    print(s)

Solution B

Retrieve each attribute from the item if possible; otherwise return NaN.

    ...

    NaN = float("nan")  # avoids `from numpy import NaN`, which NumPy 2.0 removed

    def get(obj: object, keys: list, default=NaN):
        """
        Walk nested dicts/lists along `keys`; return `default` if any step fails.

        obj = {"a": {"b": {"c": [0, 1, 2, 3]}}}
        get(obj, ["a", "b", "c", 0])  # returns 0
        get(obj, ["a", "c"])          # returns NaN
        """
        for key in keys:
            try:
                obj = obj[key]
            except (KeyError, IndexError, TypeError):
                return default
        return obj

    # format to a table that is in the webpage
    table = map(lambda item: {
        "RANK": item.get("rank", NaN),  # NaN is the default (using the built-in dict.get)
        "PLAYER": item.get("playerName", NaN),
        "AVERAGE": get(item, ["stats", 0, "statValue"], default=NaN),  # my function (dict.get does not support multiple keys)
    }, data)

    # convert to a dataframe
    s = pd.DataFrame(table)

    print(s)

Difference

Solution A drops the invalid (Tour Average) row; Solution B keeps it as a row of NaN values.
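A third option, sketched here rather than taken from the answer, is to keep the player rows and the tour average separately, and to emit every entry in `stats` as its own column instead of only `stats[0]`. The helper name `split_stat_rows` is hypothetical, and the sample rows (including the `statName` values) are made up to match the response shape.

```python
def split_stat_rows(rows: list) -> tuple:
    """Return (player_records, tour_average) from the statDetails rows."""
    players, tour_avg = [], None
    for row in rows:
        kind = row.get("__typename")
        if kind == "StatDetailsPlayer":
            record = {"rank": row.get("rank"), "player": row.get("playerName")}
            # one column per stat, keyed by its display name
            for stat in row.get("stats") or []:
                record[stat["statName"]] = stat["statValue"]
            players.append(record)
        elif kind == "StatDetailTourAvg":
            tour_avg = row.get("value")
    return players, tour_avg


# hand-written sample rows, for illustration only
sample_rows = [
    {"__typename": "StatDetailsPlayer", "rank": 1, "playerName": "Player A",
     "stats": [{"statName": "Avg", "statValue": "4.50"},
               {"statName": "Total Birdies", "statValue": "400"}]},
    {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"},
]
player_records, tour_average = split_stat_rows(sample_rows)
print(player_records, tour_average)
```

`pd.DataFrame(player_records)` then gives a table with no NaN row, while the tour average is still available as a separate value.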

huangapple
  • Posted on February 8, 2023 at 08:30:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75380332.html