如何使用Python从API下载的CSV文件中删除空单元格?

huangapple go评论95阅读模式
英文:

How can I get rid of empty cells in csv files that I am downloading from an API using Python?

问题

我正在为ArcGIS Pro和ArcGIS GeoEvent Server进行一个复杂的项目,需要不断从API下载CSV文件(每10分钟生成一个CSV文件)。问题是,虽然我可以完美地下载它,但有些单元格是空的,因此我无法将CSV文件与GeoEvent Server一起使用。

所以基本上,我需要能够选择为空的字段(单元格),并在下载CSV文件时自动丢弃它们,因为我不想手动处理它们。

然后,我需要将CSV文件转换为JSON并进行处理,但我遇到了另一个问题,那就是文件的名称,因为我需要与CSV文件相同,但我不知道如何自动执行它。

这是我用于下载CSV文件的代码:

# 这里放你的Python代码,不包括注释部分

这是我用于转换为JSON的代码:

# 这里放你的Python代码,不包括注释部分

如果你有关于这些代码的具体问题或需要进一步的帮助,请告诉我。

英文:

I am working in a complex project for ArcGIS pro and ArcGIS GeoEvent Server that requires downloading CSV files constantly from an API (every 10 minutes I generate a CSV file). The problem is that I can download it perfectly, but some cells are empty, because of this, I cannot use the CSV files with GeoEvent Server.

So, basically, I need to be able to select the fields (cells) that are empty, and discard them automatically when downloading the CSV file, because I do not want to do it manually

Then, the CSV files I need to convert it to JSON and work with it, but with this I have another problem, which is the name of the file, because I need to have the same as the CSV file, and I do not know how to do it automatically.
I tried to read all the contents of the folder (filtering by .csv extension) using os module in order to the file names, and then copy it to the JSON, but I cannot.

This is the code I use to download the CSV file.

import os
import yaml
import logging
from datetime import datetime, timedelta
import sys
import copy
import json
import csv
import requests
import time

from csvToJSON_01 import *
from exceptions import MaxRetries, ConnectionLost
from requests.adapters import HTTPAdapter
from requests.exceptions import RetryError
from requests.packages.urllib3.util.retry import Retry
from apscheduler.schedulers.background import BackgroundScheduler

log = logging.getLogger(__name__)

target_updates = []
time_from = None


def reset_bucket():
    global target_updates

    target_updates = []  # Initially is an empty array.


def export_to_csv_job():
    global time_from
    global target_updates

    to_proccess = copy.deepcopy(target_updates)
    old_time_from = copy.deepcopy(time_from)
    time_from = datetime.now()  # To get the current local time and date.

    reset_bucket()  # Reset target_updates.

    if len(to_proccess) > 0:  # If you have one or more thing to process.
        print(to_proccess[0])
        data_file = open(
            f"csvFiles/data_{old_time_from.strftime('%m_%d_%Y_%H_%M_%S')}_{datetime.now().strftime('%m_%d_%Y_%H_%M_%S')}.csv",
            "w",
        )  # Creates a file named "data", strftime() transforms date, time and datetime to string.

        csv_writer = csv.writer(
            data_file
        )  # Write the infomation to the CSV created before.

        most_keys = max(
            to_proccess, key=lambda item: len(item.keys())
        )  # Define the maximum amount of CSV files to process.
        csv_writer.writerow(most_keys.keys())

        for elem in to_proccess:
            csv_writer.writerow(
                map(lambda key: elem.get(key, ""), most_keys.keys())
            )  # Insert data to the csv file, row by row.
        data_file.close()


def listen_to_stream(timeout=None):
    global time_from
    reset_bucket()
    if timeout is not None:
        timeout = datetime.now() + timedelta(0, timeout)

    scheduler = BackgroundScheduler()
    retry_strategy = Retry(
        # 10 retries before throwing exception.
        total=10,
        backoff_factor=3,
        status_forcelist=[429, 500, 502, 503, 504, 422],
        allowed_methods=["HEAD", "GET", "OPTIONS"],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    http = requests.Session()
    http.mount("https://", adapter)
    http.mount("http://", adapter)

    try:
        response = http.get(
            "https://api.airsafe.spire.com/v2/targets/stream?compression=none",
            headers={"Authorization": f"Bearer {os.environ['AVIATION_TOKEN']}"},
            stream=True,
        )
    except RetryError:
        log.warn(RetryError)
        raise MaxRetries()

    if response.status_code == 401:  # If it's unauthorized.
        print("Unauthorized, token might be invalid")
        sys.exit()  # Break.

    try:
        scheduler.add_job(
            export_to_csv_job,
            "cron",
            minute="*/10",
            id="airsafe_stream_csv",
        )  # I execute this with the Task Manager every 10 minutes.
        time_from = datetime.now()  # I start counting.
        scheduler.start()
    except Exception as e:
        log.warn(e)
        print("failed to start scheduler")
        raise ConnectionLost()

    try:
        for line in response.iter_lines(decode_unicode=True):
            if timeout is not None and datetime.now() >= timeout:
                scheduler.remove_job("airsafe_stream_csv")
                scheduler.shutdown()
                export_to_csv_job()
                response.close()
                sys.exit()
            if line and '"target":{' in line:
                target = json.loads(line)["target"]
                target_updates.append(target)
    except Exception as e:
        log.warn(e)
        scheduler.remove_job("airsafe_stream_csv")
        scheduler.shutdown()
        export_to_csv_job()
        raise ConnectionLost()


def connection_manager():
    try:
        # If you wish to listen for a specific time:
        # Wlisten_to_stream(70) will listen for 70 seconds
        listen_to_stream()
    except MaxRetries:
        print("stream failed to connect multiple times, will retry in 30mn")
        time.sleep(60 * 30)
        connection_manager()
    except ConnectionLost:
        print("Connection was lost retrying now ...")
        connection_manager()


csvJSON = csv_to_json(
    csvFilePath, jsonFilePath
)  # Now, I call this function from csvToJSON.py


if __name__ == "__main__":  # If the file is the main one.
    config = yaml.load(
        open("env.yaml"), Loader=yaml.FullLoader
    )  # Open env.yaml and pass the object, AVIATION_TOKEN in this case.
    os.environ.update(config)

    connection_manager()

And this other, is the one I'm using to convert to JSON.

import csv
import json
import time
import os


directory = "../csvFiles"


def csv_to_json(csvFilePath, jsonFilePath):
    jsonArray = []

    # Read csv file.
    with open(csvFilePath, encoding="utf-8") as csvf:
        # Load csv file data using csv library's dictionary reader.
        csvReader = csv.DictReader(csvf)

        # Convert each csv row into python dict.
        for row in csvReader:
            # Add this python dict to json array.
            jsonArray.append(row)

    # Convert python jsonArray to JSON String and write to file.
    with open(jsonFilePath, "w", encoding="utf-8") as jsonf:
        jsonString = json.dumps(jsonArray, indent=4)
        jsonf.write(jsonString)


for file in os.listdir(directory):
    if file.endswith(".csv"):
        # Prints only CSV files present in csvFiles folder.
        # csvFilePath = os.path.splitext(file)[0]
        csvFilePath = file
        # print(csvFilePath)
        jsonFilePath = r"../csvFiles/data.json"


start = time.perf_counter()
csv_to_json(csvFilePath, jsonFilePath)
csv_to_json()
finish = time.perf_counter()

print(f"Conversion completed successfully in {finish - start:0.4f} seconds")

答案1

得分: 1

df = df.dropna()
将删除所有包含NaN值的行。
英文:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

df = df.dropna()

Will remove all rows than contain nan values.

huangapple
  • 本文由 发表于 2023年7月10日 23:12:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/76655107.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定