Split Huge CSV file into multiple files using Python
Question
I have this huge CSV file (70 GB approx.) that I need to read using PySpark in Databricks, to create a PySpark DataFrame. The file is saved in a Storage Account mounted to Databricks.
Now, I am not able to do so as it gives me an error:
Driver is up but is not responsive, likely due to GC.
As an alternative I thought about splitting the file into multiple CSV. Is there a way to do so easily?
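For reference, the read I'm attempting looks roughly like this (the mount path and options are just illustrative; spark is the session the Databricks notebook provides):

    # Roughly what I'm running in the Databricks notebook; the path is illustrative.
    df = (
        spark.read
             .option("header", "true")
             .csv("/mnt/mycontainer/huge_file.csv")
    )
    display(df)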
Answer 1
Score: 1
"I'd say splitting a large CSV is fairly easy with Python.
I also recommend that for most tasks with CSV files (like, 99.999%) use a CSV-aware tool. Suggestions like, 'use this POSIX tool', will always struggle with headers, and rows that span multiple lines because of quoted newlines. So whether you use the following, or a tool like GoCSV's split command, use a tool that conforms to the CSV spec. But if you know 100% your CSV doesn't have a header, and doesn't have multi-line rows, then you might be able to get away with a regular text processing tool.
To roll your own splitter in Python, you'll need some mechanism to create a new file and csv.writer after so many rows have been written to the previous file/writer.
I don't know how Pythonic the following is, but I think it's fairly legible, and it works!
- The next_writer(header) function looks in the global space for the already-established csv.writer and its underlying output file.
- Every call to next_writer(...) closes the currently-open file (which flushes any buffered rows from the writer), creates the next file and its accompanying writer, and writes a header (if you need one).
- The main read-loop enumerates the reader so it knows which row (not line) it's on, and when the row number is a multiple of MAX_ROWS gets the next writer.
import csv
import io

MAX_ROWS = 10

# "Initialize" f_out and writer as their respective types
f_out = io.StringIO("")
writer = csv.writer(f_out)
out_num = 0


def next_writer(header: list[str]):
    """Gets the next-numbered CSV writer; closes the previous file and flushes its writer."""
    global out_num
    global f_out
    global writer

    f_out.close()
    out_num += 1
    f_out = open(f"output_{out_num:03}.csv", "w", newline="")
    writer = csv.writer(f_out)
    writer.writerow(header)


with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    header = next(reader)
    for i, row in enumerate(reader):
        if i % MAX_ROWS == 0:
            next_writer(header)
        writer.writerow(row)

f_out.close()  # close underlying file; flush writer
"Initializing" the out-file and writer seems a bit clunky, but, for my example that ensures we have the globally scoped writer (which next_writer(...) and the main read-loop need) and its underlying, numbered output file (for next_writer(...)).
I generated a CSV with 100 rows:
H____1,H____2
r001c1,r001c2
r002c1,r002c2
r003c1,r003c2
......,......
r098c1,r098c2
r099c1,r099c2
r100c1,r100c2
Running the above (with MAX_ROWS = 10) gives me 10 files, output_001.csv to output_010.csv (the format string f"{out_num:03}" pads the number with leading zeroes up to three places, to allow for 999 files).
Each file looks like:
output_001.csv
--------------
H____1,H____2
r001c1,r001c2
r002c1,r002c2
r003c1,r003c2
......,......
output_010.csv
--------------
H____1,H____2
......,......
r098c1,r098c2
r099c1,r099c2
r100c1,r100c2
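To tie this back to your original goal: once the pieces exist in the mounted storage, PySpark can read them together with a glob pattern. A rough sketch, with an illustrative path and assuming the splits were placed back on the mount (spark is the session Databricks provides):

    # Rough sketch: read all the split pieces back as one PySpark DataFrame.
    df = (
        spark.read
             .option("header", "true")   # each piece repeats the header row
             .csv("/mnt/mycontainer/splits/output_*.csv")
    )
    print(df.count())  # sanity check against the original row count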