Download a zipped CSV from a URL and convert it to a DataFrame

Question

I want to download and read a file from this site: http://data.gdeltproject.org/events/index.html

Each file is named YYYYMMDD.export.CSV.zip.

I am stuck at this point in my code:

    import pandas as pd
    import zipfile
    import requests
    from datetime import date, timedelta
    url = 'http://data.gdeltproject.org/events/index.html'
    yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
    file_from_url = yesterday + '.export.CSV.zip'
    with open(file_from_url, "wb") as f:
        f.write(resp.content)

Now I am stuck trying to read the contents.

I tried readlines, but that did not work.

Any suggestions on how I can read my zipped file?

Answer 1

Score: 2

The first issue is that the url variable is defined but never used:

    url = 'http://data.gdeltproject.org/events/index.html'

The index.html part of the URL is of no use when downloading the zipped CSV files. You need to construct a URL string like http://data.gdeltproject.org/events/20230412.export.CSV.zip. Something like this is what you need:

    from datetime import date, timedelta
    base_url = 'http://data.gdeltproject.org/events/'
    yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
    filename = yesterday + '.export.CSV'
    url = base_url + filename + ".zip"
    print(f'URL is "{url}"')

That outputs this for me:

    URL is "http://data.gdeltproject.org/events/20230412.export.CSV.zip"
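
If you need files for days other than yesterday, the same URL construction can be wrapped in a small helper. This is only a sketch: the name gdelt_export_url and the BASE_URL constant are my own, not part of any GDELT library, and it assumes every daily file follows the YYYYMMDD.export.CSV.zip pattern described in the question.

```python
from datetime import date, timedelta

# Directory that holds the daily export archives (taken from the question).
BASE_URL = "http://data.gdeltproject.org/events/"


def gdelt_export_url(day: date) -> str:
    """Return the URL of the zipped daily export for the given day.

    Hypothetical helper; it only formats a string and does not check
    that the file actually exists on the server.
    """
    return f"{BASE_URL}{day:%Y%m%d}.export.CSV.zip"


yesterday = date.today() - timedelta(days=1)
print(gdelt_export_url(yesterday))
print(gdelt_export_url(date(2023, 4, 12)))
```

The second call prints the same URL as the example above, http://data.gdeltproject.org/events/20230412.export.CSV.zip.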

Next, you are not downloading anything from data.gdeltproject.org. Your code below only opens a local file for writing, and resp is never defined because no request is ever made:

    with open(file_from_url, "wb") as f:
        f.write(resp.content)

You need to download the file and open that for reading. Something like this should do the trick:

    import pandas as pd
    import requests
    from io import BytesIO
    from datetime import date, timedelta
    base_url = 'http://data.gdeltproject.org/events/'
    yesterday = (date.today() - timedelta(days=1)).strftime('%Y%m%d')
    filename = yesterday + '.export.CSV'
    url = base_url + filename + ".zip"
    r = requests.get(url)
    # Fail early with an HTTP error if the file could not be downloaded
    r.raise_for_status()
    # Create a dataframe from the CSV data
    # CSV is tab-separated and doesn't have a header row
    df = pd.read_csv(BytesIO(r.content), compression='zip', delimiter='\t', header=None)
    print(df.head())

That gives:

    sys:1: DtypeWarning: Columns (24,26,27,28) have mixed types.Specify dtype option on import or set low_memory=False.
    0 1 2 3 4 5 6 7 8 9 10 ... 47 48 49 50 51 52 53 54 55 56 57
    0 1095081749 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -119.7460 CA 2 California, United States US USCA 36.1700 -119.7460 CA 20230412 https://www.sandiegouniontribune.com/news/cali...
    1 1095081750 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -81.9059 294668 3 Bay Lake, Florida, United States US USFL 28.4775 -81.9059 294668 20230412 https://www.streetinsider.com/Reuters/New+Flor...
    2 1095081751 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -81.9059 294668 2 Florida, United States US USFL 27.8333 -81.7170 FL 20230412 https://www.streetinsider.com/Reuters/New+Flor...
    3 1095081752 20220412 202204 2022 2022.2795 NaN NaN NaN NaN NaN NaN ... -81.7170 FL 2 Florida, United States US USFL 27.8333 -81.7170 FL 20230412 https://financialpost.com/pmn/business-pmn/new...
    4 1095081753 20220412 202204 2022 2022.2795 COP POLICE NaN NaN NaN NaN ... -89.0022 IL 2 Illinois, United States US USIL 40.3363 -89.0022 IL 20230412 https://chicago.suntimes.com/crime/2023/4/11/2...
    [5 rows x 58 columns]

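Note the DtypeWarning reported in the output: pandas found mixed types in columns 24, 26, 27 and 28. As the message suggests, you can pass the dtype option to read_csv or set low_memory=False.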
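
One way to deal with the mixed-type columns is to read everything as text and convert individual columns afterwards. The sketch below assumes that approach is acceptable for your use case. To keep it runnable without network access, it builds a tiny zipped tab-separated file in memory; the sample rows and the helper names make_sample_zip and read_zipped_tsv are made up for illustration. With the real download you would pass r.content instead.

```python
import io
import zipfile

import pandas as pd


def make_sample_zip() -> bytes:
    """Build a small zipped, tab-separated file in memory (made-up data)."""
    rows = "1\t20230412\tCA\n2\t20230412\t294668\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("sample.export.CSV", rows)
    return buffer.getvalue()


def read_zipped_tsv(content: bytes) -> pd.DataFrame:
    """Read a zipped, headerless, tab-separated file from raw bytes."""
    return pd.read_csv(
        io.BytesIO(content),
        compression="zip",  # the archive must contain exactly one file
        sep="\t",
        header=None,
        dtype=str,  # read every column as text, so no type inference is needed
    )


df = read_zipped_tsv(make_sample_zip())
print(df.shape)
print(df.head())
```

The third column of the sample mixes text ("CA") and digits ("294668"), similar to what the output above shows in some columns. Reading with dtype=str keeps both as strings; numeric columns can then be converted explicitly, for example with pd.to_numeric.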

huangapple
  • Posted on April 13, 2023 at 19:40:59
  • If you repost this article, please keep this link: https://go.coder-hub.com/76004994.html