Split Huge CSV file into multiple files using Python
Question
I have this huge CSV file (70 GB approx.) that I need to read using PySpark in Databricks, to create a PySpark DataFrame. The file is saved in a Storage Account mounted to Databricks.
Now, I am not able to do so as it gives me an error:
Driver is up but is not responsive, likely due to GC.
As an alternative I thought about splitting the file into multiple CSV. Is there a way to do so easily?
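For reference, the read I'm attempting looks roughly like this (the mount path and options are just illustrative; spark is the session the Databricks notebook provides):

    # Roughly what I'm running in the Databricks notebook; the path is illustrative.
    df = (
        spark.read
             .option("header", "true")
             .csv("/mnt/mycontainer/huge_file.csv")
    )
    display(df)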
Answer 1
Score: 1
"I'd say splitting a large CSV is fairly easy with Python.
I also recommend that for most tasks with CSV files (like, 99.999%) use a CSV-aware tool. Suggestions like, 'use this POSIX tool', will always struggle with headers, and rows that span multiple lines because of quoted newlines. So whether you use the following, or a tool like GoCSV's split command, use a tool that conforms to the CSV spec. But if you know 100% your CSV doesn't have a header, and doesn't have multi-line rows, then you might be able to get away with a regular text processing tool.
To roll your own splitter in Python, you'll need some mechanism to create a new file and csv.writer after so many rows have been written to the previous file/writer.
I don't know how Pythonic the following is, but I think it's fairly legible, and it works!
- The next_writer(header) function looks in the global space for the already-established csv.writer and its underlying output file.
- Every call to next_writer(...) closes the currently-open file (which flushes any buffered rows from the writer), creates the next file and its accompanying writer, and writes a header (if you need one).
- The main read-loop enumerates the reader so it knows which row (not line) it's on, and when the row number is a multiple of MAX_ROWS gets the next writer.
import csv
import io

MAX_ROWS = 10

# "Initialize" f_out and writer as their respective types
f_out = io.StringIO("")
writer = csv.writer(f_out)
out_num = 0


def next_writer(header: list[str]):
    """Gets the next-numbered CSV writer; closes the previous file and flushes its writer."""
    global out_num
    global f_out
    global writer

    f_out.close()
    out_num += 1
    f_out = open(f"output_{out_num:03}.csv", "w", newline="")
    writer = csv.writer(f_out)
    writer.writerow(header)


with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    header = next(reader)
    for i, row in enumerate(reader):
        if i % MAX_ROWS == 0:
            next_writer(header)
        writer.writerow(row)

f_out.close()  # close underlying file; flush writer
"Initializing" the out-file and writer seems a bit clunky, but, for my example that ensures we have the globally scoped writer (which next_writer(...) and the main read-loop need) and its underlying, numbered output file (for next_writer(...)).
I generated a CSV with 100 rows:
H____1,H____2
r001c1,r001c2
r002c1,r002c2
r003c1,r003c2
......,......
r098c1,r098c2
r099c1,r099c2
r100c1,r100c2
Running the above (with MAX_ROWS = 10) gives me 10 files, output_001.csv to output_010.csv (the format string f"{out_num:03}" pads the number with leading zeroes up to three places, to allow for 999 files).
Each file looks like:
output_001.csv
--------------
H____1,H____2
r001c1,r001c2
r002c1,r002c2
r003c1,r003c2
......,......
output_010.csv
--------------
H____1,H____2
......,......
r098c1,r098c2
r099c1,r099c2
r100c1,r100c2
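To tie this back to your original goal: once the pieces exist in the mounted storage, PySpark can read them together with a glob pattern. A rough sketch, with an illustrative path and assuming the splits were placed back on the mount (spark is the session Databricks provides):

    # Rough sketch: read all the split pieces back as one PySpark DataFrame.
    df = (
        spark.read
             .option("header", "true")   # each piece repeats the header row
             .csv("/mnt/mycontainer/splits/output_*.csv")
    )
    print(df.count())  # sanity check against the original row count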