Download zipped CSV from URL and convert to DataFrame
Question
I want to download and read a file from this site: http://data.gdeltproject.org/events/index.html
Each of the files is named YYYYMMDD.export.CSV.zip
I am stuck at this point in my code:
import pandas as pd
import zipfile
import requests
from datetime import date, timedelta
url = 'http://data.gdeltproject.org/events/index.html'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
file_from_url = yesterday + '.export.CSV.zip'
with open(file_from_url, "wb") as f:
    f.write(resp.content)
Now I am stuck trying to read the contents.
I tried readlines, but this did not work.
Any suggestions on how I can read my zipped file?
Answer 1
Score: 2
The first issue is that the url variable is defined but never used:
url = 'http://data.gdeltproject.org/events/index.html'
The index.html part of the url is of no use when downloading the zipped CSV files. You need to construct a URL string like this: http://data.gdeltproject.org/events/20230412.export.CSV.zip. Something like this is what you need:
from datetime import date, timedelta
base_url = 'http://data.gdeltproject.org/events/'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
filename = yesterday + '.export.CSV'
url = base_url + filename + ".zip"
print(f'URL is "{url}"')
That outputs this for me:
URL is "http://data.gdeltproject.org/events/20230412.export.CSV.zip"
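If you need files for more than one day, the same string construction can be wrapped in a small helper. This is only a sketch: the names BASE_URL and gdelt_export_url are made up for illustration, and it assumes every daily file follows the YYYYMMDD.export.CSV.zip pattern described in the question.

```python
from datetime import date, timedelta

# Directory that holds the daily archives (taken from the URL above)
BASE_URL = 'http://data.gdeltproject.org/events/'

def gdelt_export_url(day: date) -> str:
    """Return the URL of the daily export archive for the given date."""
    return f"{BASE_URL}{day.strftime('%Y%m%d')}.export.CSV.zip"

# Yesterday's file, as in the snippet above
yesterday = date.today() - timedelta(days=1)
print(gdelt_export_url(yesterday))
```

Passing a date object rather than a preformatted string keeps the formatting in one place, so a loop over several days only has to change the date.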
Next, you are not downloading anything from data.gdeltproject.org: resp is never defined, because the code never calls requests.get. Your code, below, only attempts to open a local file for writing:
with open(file_from_url, "wb") as f:
    f.write(resp.content)
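As a side note on why readlines did not help: a .zip file is a binary archive, so its raw bytes are not lines of CSV text. If you do want to read an archive by hand, the standard-library zipfile module can open it. The sketch below makes some assumptions: read_zipped_tsv is a hypothetical helper name, the archive is assumed to hold a single UTF-8, tab-separated file, and a tiny archive is built in memory so the example runs without network access.

```python
import csv
import io
import zipfile

def read_zipped_tsv(zip_bytes: bytes) -> list[list[str]]:
    """Return the rows of the first file in a zip archive of tab-separated text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        first_name = archive.namelist()[0]  # assumes one file per archive
        with archive.open(first_name) as raw:
            # Decode the binary stream as text (UTF-8 is an assumption)
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            return list(csv.reader(text, delimiter='\t'))

# Build a tiny archive in memory to stand in for a downloaded file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.export.CSV', '1\tCOP\tPOLICE\n2\tUSA\tUNITED STATES\n')

rows = read_zipped_tsv(buffer.getvalue())
print(rows)
```

For an archive that has already been saved to disk, zipfile.ZipFile also accepts a file path in place of the BytesIO object.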
You need to download the file and open that for reading. Something like this should do the trick:
import pandas as pd
import requests
from io import BytesIO
from datetime import date, timedelta
base_url = 'http://data.gdeltproject.org/events/'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
filename = yesterday + '.export.CSV'
url = base_url + filename + ".zip"
r = requests.get(url)
r.raise_for_status()  # stop here if the file is not available
# Create a dataframe from the CSV data
# CSV is tab-separated and doesn't have a header row
df = pd.read_csv(BytesIO(r.content), compression='zip', delimiter='\t', header=None)
print(df.head())
That gives:
sys:1: DtypeWarning: Columns (24,26,27,28) have mixed types.Specify dtype option on import or set low_memory=False.
0 1 2 3 4 5 6 7 8 9 10 ... 47 48 49 50 51 52 53 54 55 56 57
0 1095081749 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -119.7460 CA 2 California, United States US USCA 36.1700 -119.7460 CA 20230412 https://www.sandiegouniontribune.com/news/cali...
1 1095081750 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -81.9059 294668 3 Bay Lake, Florida, United States US USFL 28.4775 -81.9059 294668 20230412 https://www.streetinsider.com/Reuters/New+Flor...
2 1095081751 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -81.9059 294668 2 Florida, United States US USFL 27.8333 -81.7170 FL 20230412 https://www.streetinsider.com/Reuters/New+Flor...
3 1095081752 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -81.7170 FL 2 Florida, United States US USFL 27.8333 -81.7170 FL 20230412 https://financialpost.com/pmn/business-pmn/new...
4 1095081753 20220412 202204 2022 2022.2795 COP POLICE NaN NaN NaN NaN ... -89.0022 IL 2 Illinois, United States US USIL 40.3363 -89.0022 IL 20230412 https://chicago.suntimes.com/crime/2023/4/11/2...
[5 rows x 58 columns]
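Note the DtypeWarning reported in the output: columns 24, 26, 27, and 28 contain mixed types. As the message says, you can specify the dtype option on import or set low_memory=False.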
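One way to avoid the mixed-type warning is to tell read_csv which type to use instead of letting it infer one. The sketch below reads every column as text with dtype=str, using a tiny made-up sample in place of the real download; whether all-text columns suit your analysis is an assumption you would need to check, and passing low_memory=False is the other option the warning suggests.

```python
import io
import pandas as pd

# Tiny tab-separated sample standing in for the real file (illustrative data only)
sample = "1\t20230412\tCA\n2\t20230412\t294668\n"

# dtype=str reads every column as text, so pandas does not infer types
df = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None, dtype=str)
print(df.dtypes)
```

Columns that should be numeric can be converted afterwards, for example with pd.to_numeric on the individual column.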