Download a zipped CSV from a URL and convert it to a DataFrame

Question

I want to download and read a file from this site: http://data.gdeltproject.org/events/index.html

Each file is named YYYYMMDD.export.CSV.zip.

I am stuck at this point in my code:

import pandas as pd
import zipfile
import requests
from datetime import date, timedelta  
url = 'http://data.gdeltproject.org/events/index.html'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
file_from_url = yesterday + '.export.CSV.zip'
with open(file_from_url, "wb") as f: 
    f.write(resp.content) 

Now I am stuck trying to read the contents.

I tried readlines, but this did not work.

Any suggestions on how I can read my zipped file?

Answer 1

Score: 2

The first issue is that the url variable is defined but never used:

url = 'http://data.gdeltproject.org/events/index.html'

The index.html part of the URL is of no use when downloading the zipped CSV files. You need to construct a URL string like http://data.gdeltproject.org/events/20230412.export.CSV.zip. Something like this is what you need:

from datetime import date, timedelta  

base_url = 'http://data.gdeltproject.org/events/'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
filename = yesterday + '.export.CSV'
url = base_url + filename + ".zip"
print(f'URL is "{url}"')

That outputs this for me:

URL is "http://data.gdeltproject.org/events/20230412.export.CSV.zip"
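
The same naming pattern can be wrapped in a small helper so that any date, not only yesterday, maps to its file. This is a minimal sketch: gdelt_export_url and recent_export_urls are hypothetical names chosen for illustration, and it assumes every daily file follows the YYYYMMDD.export.CSV.zip pattern described in the question.

```python
from datetime import date, timedelta

BASE_URL = "http://data.gdeltproject.org/events/"


def gdelt_export_url(day):
    """Return the download URL for the daily export of the given date."""
    # The date format spec is passed through to strftime.
    return f"{BASE_URL}{day:%Y%m%d}.export.CSV.zip"


def recent_export_urls(n, today):
    """Return the URLs for the n days before 'today', newest first."""
    return [gdelt_export_url(today - timedelta(days=i)) for i in range(1, n + 1)]


print(gdelt_export_url(date(2023, 4, 12)))
# http://data.gdeltproject.org/events/20230412.export.CSV.zip
```

Passing the reference date in explicitly, instead of calling date.today() inside the helper, keeps the function easy to test.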

Next, you are not downloading anything from data.gdeltproject.org. The code below only opens a local file for writing, and resp is never assigned because nothing calls requests.get:

with open(file_from_url, "wb") as f: 
    f.write(resp.content) 
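
As an aside on why readlines did not help: a .zip file holds compressed binary data, so its raw bytes are not CSV text. The sketch below illustrates this with the standard zipfile module. It builds a tiny made-up archive in memory so it runs without network access; read_zip_lines is a hypothetical helper name, and UTF-8 is an assumed encoding, not something confirmed for the GDELT files.

```python
import io
import zipfile


def read_zip_lines(zip_bytes):
    """Return the decoded text lines of the first file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        member = zf.namelist()[0]  # assume the archive holds a single file
        with zf.open(member) as raw:
            # Assumption: the member is UTF-8 text.
            text = io.TextIOWrapper(raw, encoding="utf-8")
            return [line.rstrip("\n") for line in text]


# Stand-in for the downloaded content (made-up rows, not GDELT data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.export.CSV", "1\tCOP\n2\tGOV\n")
zip_bytes = buf.getvalue()

print(zip_bytes[:2])              # b'PK': the zip signature, not CSV text
print(read_zip_lines(zip_bytes))  # ['1\tCOP', '2\tGOV']
```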

You need to download the file and then read it. pandas can read the zipped bytes directly from memory, so no local file is required. Something like this should do the trick:

import pandas as pd
import requests
from io import BytesIO
from datetime import date, timedelta  

base_url = 'http://data.gdeltproject.org/events/'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
filename = yesterday + '.export.CSV'
url = base_url + filename + ".zip"

r = requests.get(url)
r.raise_for_status()  # fail early if the file is missing (for example, not published yet)

# Create a dataframe from the CSV data
# CSV is tab-separated and doesn't have a header row 
df = pd.read_csv(BytesIO(r.content), compression='zip', delimiter='\t', header=None)

print(df.head())

That gives:

sys:1: DtypeWarning: Columns (24,26,27,28) have mixed types.Specify dtype option on import or set low_memory=False.
           0         1       2     3          4    5       6    7    8    9    10  ...        47      48 49                                50  51    52       53        54      55        56                                                 57
0  1095081749  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ... -119.7460      CA  2         California, United States  US  USCA  36.1700 -119.7460      CA  20230412  https://www.sandiegouniontribune.com/news/cali...
1  1095081750  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ...  -81.9059  294668  3  Bay Lake, Florida, United States  US  USFL  28.4775  -81.9059  294668  20230412  https://www.streetinsider.com/Reuters/New+Flor...
2  1095081751  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ...  -81.9059  294668  2            Florida, United States  US  USFL  27.8333  -81.7170      FL  20230412  https://www.streetinsider.com/Reuters/New+Flor...
3  1095081752  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ...  -81.7170      FL  2            Florida, United States  US  USFL  27.8333  -81.7170      FL  20230412  https://financialpost.com/pmn/business-pmn/new...
4  1095081753  20220412  202204  2022  2022.2795  COP  POLICE  NaN  NaN  NaN  NaN  ...  -89.0022      IL  2           Illinois, United States  US  USIL  40.3363  -89.0022      IL  20230412  https://chicago.suntimes.com/crime/2023/4/11/2...

[5 rows x 58 columns]

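Note the DtypeWarning in the output: pandas found mixed types in columns 24, 26, 27 and 28. As the message suggests, passing dtype or low_memory=False to read_csv avoids it.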
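
The warning message names two remedies, a dtype option or low_memory=False. Here is a minimal sketch of both, using a tiny made-up archive so it runs without network access. read_zipped_tsv and make_sample_zip are hypothetical helper names, the sample rows are not GDELT data, and reading column 1 as text is only an example of a dtype choice.

```python
import io
import zipfile

import pandas as pd


def read_zipped_tsv(zip_bytes, **read_csv_kwargs):
    """Parse a zip archive holding one tab-separated file with no header row."""
    return pd.read_csv(
        io.BytesIO(zip_bytes),
        compression="zip",
        sep="\t",
        header=None,
        **read_csv_kwargs,
    )


def make_sample_zip():
    """Build a small stand-in archive (made-up rows, not GDELT data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("sample.export.CSV", "1\tCOP\t36.17\n2\t\t28.4775\n")
    return buf.getvalue()


sample = make_sample_zip()

# Option 1: let pandas look at each whole column before choosing a type.
df_whole = read_zipped_tsv(sample, low_memory=False)

# Option 2: state the type up front (here column 1 is read as text).
df_typed = read_zipped_tsv(sample, dtype={1: str})

print(df_typed.dtypes)
```

With the real download, r.content would take the place of sample. Which columns to type explicitly depends on the data, so the dtype mapping here should be treated as a placeholder.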

huangapple
  • Published on April 13, 2023 at 19:40:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76004994.html