April 13, 2023, 19:40:59

Download zipped CSV from URL and convert to DataFrame

Question

I want to download and read a file from this site: http://data.gdeltproject.org/events/index.html

Each of the files is named YYYYMMDD.export.CSV.zip.

I am stuck at this point in my code:

import pandas as pd
import zipfile
import requests
from datetime import date, timedelta  
url = 'http://data.gdeltproject.org/events/index.html'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
file_from_url = yesterday + '.export.CSV.zip'
with open(file_from_url, "wb") as f: 
    f.write(resp.content)

Now I am stuck trying to read the contents.

I tried readlines, but this did not work.

Any suggestions on how I can read my zipped file?

Answer 1

Score: 2

The first issue is that the url variable is defined but never used:

url = 'http://data.gdeltproject.org/events/index.html'

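The index.html part of the URL is of no use when downloading the zipped CSV files. You need to construct a URL string like this: http://data.gdeltproject.org/events/20230412.export.CSV.zip. Something like this is what you need: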

from datetime import date, timedelta  
base_url = 'http://data.gdeltproject.org/events/'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
filename = yesterday + '.export.CSV'
url = base_url + filename + ".zip"
print(f'URL is "{url}"')

That outputs this for me:

URL is "http://data.gdeltproject.org/events/20230412.export.CSV.zip"
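If you need files for days other than yesterday, the same string construction can be wrapped in a small helper. This is only a sketch: the function name gdelt_export_url and the BASE_URL constant are made up for illustration, and it assumes every daily file follows the YYYYMMDD.export.CSV.zip pattern described in the question.

```python
from datetime import date, timedelta

# Directory that holds the daily archives (taken from the answer above)
BASE_URL = 'http://data.gdeltproject.org/events/'


def gdelt_export_url(day):
    """Return the archive URL for the given date (hypothetical helper)."""
    return f"{BASE_URL}{day.strftime('%Y%m%d')}.export.CSV.zip"


# Yesterday's file, as in the snippet above
print(gdelt_export_url(date.today() - timedelta(days=1)))

# A fixed date, which is handy for reproducible runs
print(gdelt_export_url(date(2023, 4, 12)))
```

The second call prints the same URL shown in the output above, since it uses the same date.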

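Next, you are not downloading anything from data.gdeltproject.org. Your code, below, only opens a local file for writing, and resp is never assigned, so there is no response content to write: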

with open(file_from_url, "wb") as f: 
    f.write(resp.content)

You need to download the file and open that for reading. Something like this should do the trick:

import pandas as pd
import requests
from io import BytesIO
from datetime import date, timedelta  
base_url = 'http://data.gdeltproject.org/events/'
yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
filename = yesterday + '.export.CSV'
url = base_url + filename + ".zip"
r = requests.get(url)
# Create a dataframe from the CSV data
# The CSV is tab-separated and doesn't have a header row
df = pd.read_csv(BytesIO(r.content), compression='zip', delimiter='\t', header=None)
print(df.head())

That gives:

sys:1: DtypeWarning: Columns (24,26,27,28) have mixed types.Specify dtype option on import or set low_memory=False.
           0         1       2     3          4    5       6    7    8    9    10  ...        47      48 49                                50  51    52       53        54      55        56                                                 57
0  1095081749  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ... -119.7460      CA  2         California, United States  US  USCA  36.1700 -119.7460      CA  20230412  https://www.sandiegouniontribune.com/news/cali...
1  1095081750  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ...  -81.9059  294668  3  Bay Lake, Florida, United States  US  USFL  28.4775  -81.9059  294668  20230412  https://www.streetinsider.com/Reuters/New+Flor...
2  1095081751  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ...  -81.9059  294668  2            Florida, United States  US  USFL  27.8333  -81.7170      FL  20230412  https://www.streetinsider.com/Reuters/New+Flor...
3  1095081752  20220412  202204  2022  2022.2795  NaN     NaN  NaN  NaN  NaN  NaN  ...  -81.7170      FL  2            Florida, United States  US  USFL  27.8333  -81.7170      FL  20230412  https://financialpost.com/pmn/business-pmn/new...
4  1095081753  20220412  202204  2022  2022.2795  COP  POLICE  NaN  NaN  NaN  NaN  ...  -89.0022      IL  2           Illinois, United States  US  USIL  40.3363  -89.0022      IL  20230412  https://chicago.suntimes.com/crime/2023/4/11/2...
[5 rows x 58 columns]

Note the DtypeWarning reported in the output: pandas found mixed types in columns 24, 26, 27 and 28.
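The warning message itself suggests two ways out: pass low_memory=False, or specify dtype. The sketch below shows the second option on a tiny archive built in memory, so it runs without a network call. The sample rows and the member name sample.export.CSV are invented for illustration; zipped_bytes stands in for r.content from the code above.

```python
import zipfile
from io import BytesIO

import pandas as pd

# Build a small zipped, tab-separated, header-less file in memory.
# Column 1 mixes text and digits, like the columns named in the warning.
buf = BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('sample.export.CSV', '1\tCA\t36.17\n2\t294668\t28.4775\n')
zipped_bytes = buf.getvalue()  # stand-in for r.content

# dtype=str reads every column as text, so pandas does not have to guess
df = pd.read_csv(BytesIO(zipped_bytes), compression='zip',
                 delimiter='\t', header=None, dtype=str)

# Convert only the columns you actually need as numbers
df[2] = pd.to_numeric(df[2])
print(df.dtypes)
```

Reading everything as text and converting selectively keeps mixed columns intact. If you would rather let pandas infer the types, low_memory=False is the other option the warning mentions.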
