Split Huge CSV file into multiple files using Python

Question


I have this huge CSV file (70 GB approx.) that I need to read using PySpark in Databricks, to create a Pyspark Dataframe. The file is saved in a Storage Account mounted to Databricks.

Now, I am not able to do so as it gives me an error:

Driver is up but is not responsive, likely due to GC.

As an alternative I thought about splitting the file into multiple CSV. Is there a way to do so easily?

Answer 1

Score: 1

"I'd say splitting a large CSV is fairly easy with Python.

I also recommend that for most tasks with CSV files (like, 99.999% of them) you use a CSV-aware tool. Suggestions like "use this POSIX tool" will always struggle with headers, and with rows that span multiple lines because of quoted newlines. So whether you use the following, or a tool like GoCSV's split command, use a tool that conforms to the CSV spec. But if you know 100% your CSV doesn't have a header and doesn't have multi-line rows, then you might be able to get away with a regular text processing tool.
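As a quick illustration (not from the original answer) of why quoted newlines defeat line-based tools, compare naive line splitting with Python's own csv module on a row whose second field contains an embedded newline:

```python
import csv
import io

# One header row plus ONE logical data row whose second field
# contains a quoted newline -- so it spans two physical lines.
data = 'name,notes\r\n"Ada","line one\nline two"\r\n'

# Naive line splitting sees three "rows" (header + 2 fragments)...
assert len(data.strip().splitlines()) == 3

# ...but a CSV-aware reader correctly sees header + 1 data row.
rows = list(csv.reader(io.StringIO(data)))
assert len(rows) == 2
assert rows[1] == ["Ada", "line one\nline two"]
```

A line-oriented splitter could cut the file between those two physical lines, leaving an unterminated quote in one chunk and a stray fragment in the next.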

To roll your own splitter in Python, you'll need some mechanism to create a new file and csv.writer after so many rows have been written to the previous file/writer.

I don't know how Pythonic the following is, but: I think it's fairly legible; and it works!

  • The next_writer(header) function looks in the global space for the already-established csv.writer and its underlying output file.
  • Every call to next_writer(...) closes the currently-open file (which flushes any buffered rows from the writer), creates the next file and its accompanying writer, and writes a header (if you need one).
  • The main read-loop enumerates the reader so it knows which row (not line) it's on, and when the row number is a multiple of MAX_ROWS gets the next writer.
import csv
import io

MAX_ROWS = 10

# "Initialize" f_out and writer as their respective types
f_out = io.StringIO("")
writer = csv.writer(f_out)
out_num = 0

def next_writer(header: list[str]):
    """Gets the next-numbered CSV writer; closes the previous file and flushes its writer."""
    global out_num
    global f_out
    global writer

    f_out.close()

    out_num += 1
    f_out = open(f"output_{out_num:03}.csv", "w", newline="")
    writer = csv.writer(f_out)
    writer.writerow(header)

with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    header = next(reader)

    for i, row in enumerate(reader):
        if i % MAX_ROWS == 0:
            next_writer(header)

        writer.writerow(row)

f_out.close()  # close underlying file; flush writer

"Initializing" the out-file and writer seems a bit clunky, but, for my example that ensures we have the globally scoped writer (which next_writer(...) and the main read-loop need) and its underlying, numbered output file (for next_writer(...)).

I generated a CSV with 100 rows:

H____1,H____2
r001c1,r001c2
r002c1,r002c2
r003c1,r003c2
......,......
r098c1,r098c2
r099c1,r099c2
r100c1,r100c2

Running the above (with MAX_ROWS = 10) gives me 10 files, output_001.csv to output_010.csv (the format string f"{out_num:03}" pads the number with leading zeroes up to three places, to allow for 999 files).

Each file looks like:

output_001.csv
--------------
H____1,H____2
r001c1,r001c2
r002c1,r002c2
r003c1,r003c2
......,......

output_010.csv
--------------
H____1,H____2
......,......
r098c1,r098c2
r099c1,r099c2
r100c1,r100c2

huangapple
  • Published on May 29, 2023 at 22:00:42
  • Please keep this link when reposting: https://go.coder-hub.com/76357993.html