
How to split a large CSV file conditionally in Windows or Linux?

Question

I have a CSV file that's way too large for Excel and I'd like to split it into smaller files. I already found a way to do this here, and it even preserves the header row which is great.

However, I want to add some conditions to this process. For example, there's a "Confirmed" column in the file and possible row values are "0", "1", or "" (empty).

Is it possible to only keep rows where this column value is "1" in the smaller files? I have some other conditions too, but I want to know at least where to start with this.

Answer 1

Score: 2


I like the command-line utility GoCSV, by aotimme. It follows the Unix philosophy of having a number of small tools, each of which does one thing very well, and the tools can be pipelined. It also has pre-built binaries for Linux and Windows.

I mocked up this sample input based on the info in your question:

_ID__,Confirmed
00001,1
00002,0
00003,0
00004,1
00005,1
...
09996,1
09997,0
09998,0
09999,0
10000,1
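If you'd like to reproduce this locally, a mock input.csv in this shape can be generated with a short Python script. The _ID__/Confirmed layout matches the sample above; the alternating 0/1 pattern is just an arbitrary stand-in, not the answer's actual data:

```python
import csv

# Generate a 10,000-row mock CSV with _ID__ and Confirmed columns.
# The 0/1 pattern is arbitrary -- just something to filter on.
with open("input.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["_ID__", "Confirmed"])
    for i in range(1, 10001):
        # Mark roughly half the rows as confirmed
        writer.writerow([f"{i:05d}", "1" if i % 2 == 1 else "0"])
```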

GoCSV's filter and split subcommands can be piped together to first filter out any "non 1" row; then break up the remaining "1" rows into files of 1000 rows each:

gocsv filter -c Confirmed -eq 1 input.csv | gocsv split --max-rows 999

The filter subcommand specifies which column to consider (-c Confirmed, the Confirmed column) and -eq 1 to specify that only rows with a 1 in the Confirmed column should be output.

GoCSV always treats the first row as the header (a number of its subcommands only make sense if they interpret the first row as a header), so I subtracted 1 for --max-rows.

For my mock input.csv, that yielded 5 output CSVs:

ls out*.csv | while read CSV; do
    echo "--$CSV--"
    gocsv dims $CSV
done
--out-1.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-2.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-3.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-4.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-5.csv--
Dimensions:
  Rows: 979
  Columns: 2

Again, GoCSV doesn't count the header as a row, so the Rows count is only 999 for the complete files.
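As a sanity check, the per-file counts above are plain arithmetic (nothing GoCSV-specific): the filter must have matched 4 × 999 + 979 rows, and splitting that many rows into chunks of 999 does give 5 files:

```python
import math

# Data rows reported per output file (GoCSV does not count the header)
per_file = [999, 999, 999, 999, 979]
total_confirmed = sum(per_file)

print(total_confirmed)                   # total rows with Confirmed == 1
print(math.ceil(total_confirmed / 999))  # number of output files
```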

Answer 2

Score: 2

You could use Miller (https://miller.readthedocs.io/en/latest/), available for several OSs at https://github.com/johnkerl/miller/releases, for this task:

mlr --csv filter '$Confirmed == 1' then split -n 999 --prefix 'file' file.csv

The filter verb has an awk-like syntax where you can use whatever conditions you might need.

Answer 3

Score: 1

With Python's standard CSV module you could iterate over the input CSV rows, filling up a buffer of rows (where Confirmed == 1), and flush that buffer to numbered output files when the buffer gets big enough (999):

import csv
import sys

BATCH_SIZE = 999
OUTNAME_TMPL = "output_XX.csv"

# Replace XX in the template above with this incrementing idx
out_idx = 1

rows_buf: list[dict[str, str]]
rows_buf = []

f_in = open("input.csv", encoding="utf-8", newline="")
reader = csv.DictReader(f_in)

# .fieldnames can return None, so check below
header = reader.fieldnames
if header is None:
    print("error: expected a header row; got None", file=sys.stderr)
    sys.exit(1)


def flush_rows():
    """Write rows_buf to a CSV numbered with out_idx, and with header."""
    fname = OUTNAME_TMPL.replace("XX", str(out_idx))

    with open(fname, "w", encoding="utf-8", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows_buf)


for row in reader:
    if len(rows_buf) == BATCH_SIZE:
        flush_rows()
        out_idx += 1
        rows_buf = []

    if row["Confirmed"] == "1":
        rows_buf.append(row)

# Flush any remaining rows, then close the input file
if rows_buf:
    flush_rows()

f_in.close()

If you haven't used DictReader/DictWriter before: the reader yields one dict[str, str] per row, where each key is a column name and each value is that row's field for the column. The writer needs to be initialized with the fieldnames (the global var header in the script), and then its write methods accept those same dicts from the reader and output them properly as CSV.
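A minimal, self-contained round trip shows that behavior, with io.StringIO standing in for the real input and output files:

```python
import csv
import io

# In-memory stand-in for input.csv
src = io.StringIO("_ID__,Confirmed\n00001,1\n00002,0\n")
reader = csv.DictReader(src)

rows = list(reader)
# Each row is a dict keyed by column name
assert rows[0] == {"_ID__": "00001", "Confirmed": "1"}

dst = io.StringIO()
writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
writer.writeheader()
# The writer accepts the reader's dicts directly; filter as in the script
writer.writerows(r for r in rows if r["Confirmed"] == "1")

print(dst.getvalue())
```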

huangapple
  • Published 2023-04-17 21:32:43
  • Please retain this link when reposting: https://go.coder-hub.com/76035739.html