更快的方法将大型CSV文件均匀地按组拆分成较小的CSV文件?

huangapple go评论74阅读模式
英文:

Faster way to split a large CSV file evenly by Groups into smaller CSV files?

问题

我确信有更好的方法来处理这个问题,但我现在不太清楚。我有一个以这种格式的CSV文件。ID列已排序,因此至少将一切分组在一起:

    Text                 ID
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text2, BBBB
    this is sample text2, BBBB
    this is sample text2, BBBB
    this is sample text3, CCCC
    this is sample text4, DDDD
    this is sample text4, DDDD
    this is sample text5, EEEE
    this is sample text5, EEEE
    this is sample text6, FFFF
    this is sample text6, FFFF

我想要做的是快速将CSV文件拆分为X个较小的CSV文件。所以如果X == 3,那么AAAA将进入"1.csv",BBBB将进入"2.csv",CCCC将进入"3.csv",然后下一个组将循环回到"1.csv"。

这些组的大小各不相同,所以硬编码的分割方法在这里不起作用。

是否有一种更快速、可靠的方式来拆分这些文件,而不像我当前的方法那样只是使用Python中的Pandas groupby来编写它们?

file_ = 0
num_files = 3

for name, group in df.groupby(by=['ID'], sort=False):

    file_+=1
    group['File Num'] = file_

    group.to_csv(file_+'.csv',index=False, header=False, mode='a')

    if file_ == num_files:

        file_ = 0

这是一个基于Python的解决方案,但我也可以考虑使用awk或bash来完成任务。

编辑:

为了澄清,我想要将这些组均匀地分割到一个可以设置的固定数量的文件中。在这种情况下,是3(即X = 3)。第一组(AAAA)将进入1.csv,第二组将进入2.csv,第三组将进入3.csv,然后对于第四组,它将循环回到1.csv,以此类推。

示例输出 1.csv:

    Text                 ID
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text, AAAA
    this is sample text4, DDDD
    this is sample text4, DDDD

示例输出 2.csv:

    Text                 ID
    this is sample text2, BBBB
    this is sample text2, BBBB
    this is sample text2, BBBB
    this is sample text5, EEEE
    this is sample text5, EEEE

示例输出 3.csv:

    Text                 ID
    this is sample text3, CCCC
    this is sample text6, FFFF
    this is sample text6, FFFF
英文:

I'm sure there is a better way for this but I am drawing a blank. I have a CSV file in this format. The ID column is sorted so everything is grouped together at least:

Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text3, CCCC
this is sample text4, DDDD
this is sample text4, DDDD
this is sample text5, EEEE
this is sample text5, EEEE
this is sample text6, FFFF
this is sample text6, FFFF

What I want to do is split the CSV fast across X amount of smaller CSV files fast. So if X==3, then AAAA would go into "1.csv", BBBB would go into "2.csv", CCCC would go into "3.csv" and the next group would loop back around and go into "1.csv".

The groups vary in size so a hardcoded split by numbers won't work here.

Is there a faster way to split these reliably then my current method which just uses Pandas groupby in Python to write them?

    file_ = 0
    num_files = 3

    for name, group in df.groupby(by=['ID'], sort=False):

        file_+=1
        group['File Num'] = file_

        group.to_csv(file_+'.csv',index=False, header=False, mode='a')

        if file_ == num_files:

            file_ = 0

This is a python based solution but I am open to stuff using awk or bash if it gets the job done.

EDIT:

For clarification, I want the groups split across a fixed amount of files I can set.

In this case, 3. (So x = 3). The first group (AAAA) would go into 1.csv, the 2nd into 2.csv, the third into 3.csv and then for the fourth group, it would loop back and insert it into 1.csv. etc.

Example output 1.csv:

Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD

Example output 2.csv:

Text                 ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE

Example output 3.csv:

Text                 ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF

答案1

得分: 4

以下是您要翻译的内容:

Using any awk in any shell on every Unix box:

$ cat tst.awk
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print > out }

<p>

$ awk -v X=3 -f tst.awk input.csv

<p>

$ head [0-9]*.csv
==&gt; 1.csv &lt;==
Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD

==&gt; 2.csv &lt;==
Text                 ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE

==&gt; 3.csv &lt;==
Text                 ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF

If `X` was some large enough number that you exceed your system limit for concurrently open files and you start getting a &quot;too many open files&quot; error then you&#39;d need to use GNU awk as it handles that internally or change the code to only have 1 file open at a time:

```shell
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    close(out)
    out = (((blockCnt++) % X) + 1) &quot;.csv&quot;
    if ( blockCnt &lt;= X ) {
        print hdr &gt; out
    }
    prev = $NF
}
{ print &gt;&gt; out }

or implement your own way of managing how many files are open concurrently.

EDIT: here's what the suggestion by @PaulHodges in the comments would result in a script like:

NR == 1 {
    for ( i=1; i &lt;= X; i++ ) {
        print &gt; (i &quot;.csv&quot;)
    }
    next
}
$NF != prev {
    out = (((NR-1) % X) + 1) &quot;.csv&quot;
    prev = $NF
}
{ print &gt; out }
英文:

Using any awk in any shell on every Unix box:

$ cat tst.awk
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) &quot;.csv&quot;
    if ( blockCnt &lt;= X ) {
        print hdr &gt; out
    }
    prev = $NF
}
{ print &gt; out }

<p>

$ awk -v X=3 -f tst.awk input.csv

<p>

$ head [0-9]*.csv
==&gt; 1.csv &lt;==
Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD

==&gt; 2.csv &lt;==
Text                 ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE

==&gt; 3.csv &lt;==
Text                 ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF

If X was some large enough number that you exceed your system limit for concurrently open files and you start getting a "too many open files" error then you'd need to use GNU awk as it handles that internally or change the code to only have 1 file open at a time:

NR==1 {
    hdr = $0
    next
}
$NF != prev {
    close(out)
    out = (((blockCnt++) % X) + 1) &quot;.csv&quot;
    if ( blockCnt &lt;= X ) {
        print hdr &gt; out
    }
    prev = $NF
}
{ print &gt;&gt; out }

or implement your own way of managing how many files are open concurrently.


EDIT: here's what the suggestion by @PaulHodges in the comments would result in a script like:

NR == 1 {
    for ( i=1; i &lt;= X; i++ ) {
        print &gt; (i &quot;.csv&quot;)
    }
    next
}
$NF != prev {
    out = (((NR-1) % X) + 1) &quot;.csv&quot;
    prev = $NF
}
{ print &gt; out }

答案2

得分: 3

你可以使用这个 awk 解决方案:

awk -v X=3 '
FNR == 1 {   # 将第一行记录保存为标题
   hdr = $0
   next
}
p != $NF {   # ID 字段变化,移至新的输出 CSV 文件
   close(fn)
   fn = ((n++ % X) + 1)".csv" # 构建新文件名
}
!seen[fn]++ {                 # 是否需要打印标题
   print hdr > fn 
}
{
   print >> fn                # 将每条记录追加到输出中
   p = $NF                    # 将最后一个字段保存在变量 p 中
}' file
英文:

You may use this awk solution:

awk -v X=3 &#39;
FNR == 1 {   # save 1st record as header 
   hdr = $0
   next
}
p != $NF {   # ID field changes, move to new output csv file 
   close(fn)
   fn = ((n++ % X) + 1)&quot;.csv&quot; # construct new file name
}
!seen[fn]++ {                 # do we need to print header
   print hdr &gt; fn 
}
{
   print &gt;&gt; fn                # append each record to output
   p = $NF                    # save last field in variable p
}&#39; file

答案3

得分: 2

使用您提供的示例,请尝试以下代码。如示例所示,考虑最后一列已按顺序排序。

awk -v x="3" '
BEGIN{
  count=1
  outFile=count".csv"
}
FNR==1{
  print
  next
}
prev!=$NF && prev{
  close(outFile)
  count++
  outFile=count".csv"
}
{
  print >> (outFile)
  prev=$NF
}
x==count{ count=1 }
' Input_file

请注意,我只为您提供了代码的翻译部分。如果您有任何其他问题,请随时提出。

英文:

With your shown samples please try following code. As mentioned considering last column is sorted as per shown samples.

awk -v x=&quot;3&quot; &#39;
BEGIN{
  count=1
  outFile=count&quot;.csv&quot;
}
FNR==1{
  print
  next
}
prev!=$NF &amp;&amp; prev{
  close(outFile)
  count++
  outFile=count&quot;.csv&quot;
}
{
  print &gt;&gt; (outFile)
  prev=$NF
}
x==count{ count=1 }
&#39; Input_file

答案4

得分: 0

使用groupbyfactorize,按所需的组数(N)进行模除:

N = 3

for i, g in df.groupby(pd.factorize(df['ID'])[0] % N):
    g.to_csv(f'chunk{i+1}.csv', index=False)

输出文件:

# chunk1.csv
Text,ID
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text4,DDDD
this is sample text4,DDDD

# chunk2.csv
Text,ID
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text5,EEEE
this is sample text5,EEEE

# chunk3.csv
Text,ID
this is sample text3,CCCC
this is sample text6,FFFF
this is sample text6,FFFF

时间

在1400万行上进行测试:

每次循环耗时15.8秒 ± 687毫秒(平均值 ± 7次运行的标准差,1次循环)

其中大约14秒用于I/O操作。

与其他答案进行比较(在shell中使用time):

# @mozway 作为包括导入和读取文件的Python脚本
实际耗时0分20.834秒

# @RavinderSingh13
实际耗时1分22.952秒

# @anubhava
实际耗时1分23.790秒

# @Ed Morton(更新后的代码,原始解决方案需要2分58.171秒)
实际耗时0分8.599秒

作为一个函数:

import pandas as pd

def split_csv(filename, N=3, id_col='ID', out_basename='chunk'):
    df = pd.read_csv(filename)
    for i, g in df.groupby(pd.factorize(df[id_col])[0] % N):
        g.to_csv(f'{out_basename}{i+1}.csv', index=False)

split_csv('my_file.csv', N=3)
英文:

Using groupby and factorize modulo the number of desired groups (N):

N = 3

for i, g in df.groupby(pd.factorize(df[&#39;ID&#39;])[0]%N):
    g.to_csv(f&#39;chunk{i+1}.csv&#39;, index=False)

Output files:

# chunk1.csv
Text,ID
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text4,DDDD
this is sample text4,DDDD

# chunk2.csv
Text,ID
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text5,EEEE
this is sample text5,EEEE

# chunk3.csv
Text,ID
this is sample text3,CCCC
this is sample text6,FFFF
this is sample text6,FFFF

timings

Tested on 14 million rows:

15.8 s &#177; 687 ms per loop (mean &#177; std. dev. of 7 runs, 1 loop each)

~14s of which is due to I/O

Comparison with other answers (using time in the shell):

# @mozway as a python script including imports and reading the file
real	0m20,834s

# @RavinderSingh13
real	1m22,952s

# @anubhava
real	1m23,790s

# @Ed Morton (updated code, original solution was 2m58,171s)
real	0m8,599s

As a function:

import pandas as pd

def split_csv(filename, N=3, id_col=&#39;ID&#39;, out_basename=&#39;chunk&#39;):
    df = pd.read_csv(filename)
    for i, g in df.groupby(pd.factorize(df[id_col])[0]%N):
        g.to_csv(f&#39;{out_basename}{i+1}.csv&#39;, index=False)

split_csv(&#39;my_file.csv&#39;, N=3)

答案5

得分: 0

你提供的代码中,第一个参数是字符串,但to_csv方法允许你提供文件样式的对象作为第一个参数,这样你可以避免多次进行与文件打开相关的操作。考虑以下简单的比较:

import os
import time
import pandas as pd
REPEAT = 1000
df = pd.DataFrame({'col1': range(100)})
t1 = time.time()
for _ in range(REPEAT):
    df.to_csv('file.csv', index=False, header=False, mode='a')
t2 = time.time()
os.remove('file.csv')
t3 = time.time()
with open('file.csv', 'a') as f:
    for _ in range(REPEAT):
        df.to_csv(f, index=False, header=False)
t4 = time.time()
print('使用文件名', t2 - t1)
print('使用文件句柄', t4 - t3)

输出结果为:

使用文件名 0.35850977897644043
使用文件句柄 0.2669696807861328

请注意,第二种方法的执行时间大约为第一种方法的75%,因此虽然速度更快,但仍然属于相同数量级。

英文:

Here

group.to_csv(file_+&#39;.csv&#39;,index=False, header=False, mode=&#39;a&#39;)

you are providing string as 1st argument, however to_csv method allow you to provide file-like object as 1st argument, in which case you might avoid doing file-opening-related stuff multiple times, consider following simple simple comparison

import os
import time
import pandas as pd
REPEAT = 1000
df = pd.DataFrame({&#39;col1&#39;:range(100)})
t1 = time.time()
for _ in range(REPEAT):
    df.to_csv(&#39;file.csv&#39;,index=False,header=False,mode=&#39;a&#39;)
t2 = time.time()
os.remove(&#39;file.csv&#39;)
t3 = time.time()
with open(&#39;file.csv&#39;,&#39;a&#39;) as f:
    for _ in range(REPEAT):
        df.to_csv(f,index=False,header=False)
t4 = time.time()
print(&#39;Using filename&#39;,t2-t1)
print(&#39;Using filehandle&#39;,t4-t3)

gives output

Using filename 0.35850977897644043
Using filehandle 0.2669696807861328

Observe that 2nd way take around 75% time of 1st way, so whilst it is faster it is still same order of magnitude.

huangapple
  • 本文由 发表于 2023年7月27日 18:38:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76778911.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定