Split Huge CSV file into multiple files using Python

Question


I have this huge CSV file (70 GB approx.) that I need to read using PySpark in Databricks, to create a Pyspark Dataframe. The file is saved in a Storage Account mounted to Databricks.

Now, I am not able to do so as it gives me an error:

Driver is up but is not responsive, likely due to GC.

As an alternative I thought about splitting the file into multiple CSV. Is there a way to do so easily?

Answer 1

Score: 1

I'd say splitting a large CSV is fairly easy with Python.

I also recommend that for most tasks with CSV files (like, 99.999% of them) you use a CSV-aware tool. Suggestions like "use this POSIX tool" will always struggle with headers, and with rows that span multiple lines because of quoted newlines. So whether you use the following, or a tool like GoCSV's split command, use a tool that conforms to the CSV spec. But if you know 100% your CSV doesn't have a header, and doesn't have multi-line rows, then you might be able to get away with a regular text processing tool.

To roll your own splitter in Python, you'll need some mechanism to create a new file and csv.writer after so many rows have been written to the previous file/writer.

I don't know how Pythonic the following is, but: I think it's fairly legible; and it works!

  • The next_writer(header) function looks in the global space for the already-established csv.writer and its underlying output file.
  • Every call to next_writer(...) closes the currently-open file (which flushes any buffered rows from the writer), creates the next file and its accompanying writer, and writes a header (if you need one).
  • The main read-loop enumerates the reader so it knows which row (not line) it's on, and when the row number is a multiple of MAX_ROWS gets the next writer.
    import csv
    import io

    MAX_ROWS = 10

    # "Initialize" f_out and writer as their respective types
    f_out = io.StringIO("")
    writer = csv.writer(f_out)
    out_num = 0

    def next_writer(header: list[str]):
        """Gets the next-numbered CSV writer; closes the previous file and flushes its writer."""
        global out_num
        global f_out
        global writer
        f_out.close()
        out_num += 1
        f_out = open(f"output_{out_num:03}.csv", "w", newline="")
        writer = csv.writer(f_out)
        writer.writerow(header)

    with open("input.csv", newline="") as f_in:
        reader = csv.reader(f_in)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % MAX_ROWS == 0:
                next_writer(header)
            writer.writerow(row)

    f_out.close()  # close underlying file; flush writer

"Initializing" the out-file and writer seems a bit clunky, but, for my example that ensures we have the globally scoped writer (which next_writer(...) and the main read-loop need) and its underlying, numbered output file (for next_writer(...)).

I generated a CSV with 100 rows:

    H____1,H____2
    r001c1,r001c2
    r002c1,r002c2
    r003c1,r003c2
    ......,......
    r098c1,r098c2
    r099c1,r099c2
    r100c1,r100c2

Running the above (with MAX_ROWS = 10) gives me 10 files, output_001.csv to output_010.csv (the format string f"{out_num:03}" pads the number with leading zeroes up to three places, to allow for 999 files).
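As a quick aside, the behavior of that format spec can be checked directly:

```python
# The :03 format spec left-pads the integer with zeros to a minimum width of 3
print(f"{7:03}")     # 007
print(f"{42:03}")    # 042
print(f"{1000:03}")  # 1000 -- wider numbers are not truncated
```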

Each file looks like:

    output_001.csv
    --------------
    H____1,H____2
    r001c1,r001c2
    r002c1,r002c2
    r003c1,r003c2
    ......,......

    output_010.csv
    --------------
    H____1,H____2
    ......,......
    r098c1,r098c2
    r099c1,r099c2
    r100c1,r100c2
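The answer mentions generating a 100-row test CSV but doesn't show how; a minimal script (my own sketch) that produces matching input might look like:

```python
import csv

# Write a header plus 100 data rows in the r001c1,r001c2 ... r100c1,r100c2 shape
with open("input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["H____1", "H____2"])
    for i in range(1, 101):
        writer.writerow([f"r{i:03}c1", f"r{i:03}c2"])
```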

huangapple
  • Published 2023-05-29 22:00:42
  • Please retain this link when reposting: https://go.coder-hub.com/76357993.html