Faster way to split a large CSV file evenly by Groups into smaller CSV files?
Question
I'm sure there is a better way to do this, but I'm drawing a blank. I have a CSV file in this format. The ID column is sorted, so at least everything is grouped together:
Text ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text3, CCCC
this is sample text4, DDDD
this is sample text4, DDDD
this is sample text5, EEEE
this is sample text5, EEEE
this is sample text6, FFFF
this is sample text6, FFFF
What I want to do is quickly split the CSV across X smaller CSV files. So if X == 3, then AAAA would go into "1.csv", BBBB would go into "2.csv", CCCC would go into "3.csv", and the next group would loop back around and go into "1.csv".
The groups vary in size, so a hardcoded split by row counts won't work here.
Is there a faster, reliable way to split these than my current method, which just uses Pandas groupby in Python to write them?
file_ = 0
num_files = 3
for name, group in df.groupby(by=['ID'], sort=False):
    file_ += 1
    group['File Num'] = file_
    group.to_csv(str(file_) + '.csv', index=False, header=False, mode='a')
    if file_ == num_files:
        file_ = 0
This is a Python-based solution, but I am open to solutions using `awk` or bash if they get the job done.
EDIT:
For clarification, I want the groups split across a fixed number of files that I can set.
In this case, 3 (so X = 3). The first group (AAAA) would go into 1.csv, the 2nd into 2.csv, the third into 3.csv, and then the fourth group would loop back and go into 1.csv, and so on.
Example output 1.csv:
Text ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD
Example output 2.csv:
Text ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE
Example output 3.csv:
Text ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF
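As a point of reference for the desired behaviour, here is a minimal sketch using only the standard csv module (assuming the input file is named input.csv and the ID is the last column, as in the samples above):

```python
import csv

X = 3  # number of output files

with open('input.csv', newline='') as src:
    reader = csv.reader(src)
    header = next(reader)
    outs = [open(f'{i + 1}.csv', 'w', newline='') for i in range(X)]
    writers = [csv.writer(f) for f in outs]
    for w in writers:
        w.writerow(header)              # every output file gets the header
    prev_id, n = None, -1
    for row in reader:
        if row[-1] != prev_id:          # a new ID group starts
            prev_id = row[-1]
            n += 1                      # advance the round-robin counter
        writers[n % X].writerow(row)
    for f in outs:
        f.close()
```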
Answer 1
Score: 4
Using any awk in any shell on every Unix box:
$ cat tst.awk
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print > out }
$ awk -v X=3 -f tst.awk input.csv
$ head [0-9]*.csv
==> 1.csv <==
Text ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD
==> 2.csv <==
Text ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE
==> 3.csv <==
Text ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF
If `X` were a large enough number that you exceed your system limit for concurrently open files and you start getting a "too many open files" error, then you'd need to use GNU awk, as it handles that internally, or change the code to only have 1 file open at a time:
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    close(out)
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print >> out }
or implement your own way of managing how many files are open concurrently.
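If you were porting this approach to Python rather than awk, one way to manage how many files are open concurrently is a small cache of handles that closes the least-recently-used one once a cap is hit. This is a rough sketch, not taken from the answer; MAX_OPEN and get_handle are hypothetical names:

```python
from collections import OrderedDict

MAX_OPEN = 100  # hypothetical cap, kept below the OS per-process limit

_handles = OrderedDict()

def get_handle(path):
    """Return an append-mode handle for path, keeping at most MAX_OPEN files open."""
    if path in _handles:
        _handles.move_to_end(path)          # mark as most recently used
        return _handles[path]
    if len(_handles) >= MAX_OPEN:
        _, oldest = _handles.popitem(last=False)
        oldest.close()                      # evict the least-recently-used handle
    f = open(path, 'a')                     # append mode, so earlier writes are kept
    _handles[path] = f
    return f
```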
EDIT: here's what the suggestion by @PaulHodges in the comments would look like as a script:
NR == 1 {
    for ( i=1; i <= X; i++ ) {
        print > (i ".csv")
    }
    next
}
$NF != prev {
    out = (((NR-1) % X) + 1) ".csv"
    prev = $NF
}
{ print > out }
Answer 2
Score: 3
You may use this `awk` solution:
awk -v X=3 '
FNR == 1 {                       # save 1st record as header
    hdr = $0
    next
}
p != $NF {                       # ID field changes, move to new output csv file
    close(fn)
    fn = ((n++ % X) + 1)".csv"   # construct new file name
}
!seen[fn]++ {                    # do we need to print header
    print hdr > fn
}
{
    print >> fn                  # append each record to output
    p = $NF                      # save last field in variable p
}' file
Answer 3
Score: 2
With your shown samples, please try the following code. As mentioned, this assumes the last column is sorted, as shown in the samples.
awk -v x="3" '
BEGIN{
count=1
outFile=count".csv"
}
FNR==1{
print
next
}
prev!=$NF && prev{
close(outFile)
count++
outFile=count".csv"
}
{
print >> (outFile)
prev=$NF
}
x==count{ count=1 }
' Input_file
Answer 4
Score: 0
Using `groupby` and `factorize` modulo the number of desired groups (`N`):
N = 3
for i, g in df.groupby(pd.factorize(df['ID'])[0] % N):
    g.to_csv(f'chunk{i+1}.csv', index=False)
Output files:
# chunk1.csv
Text,ID
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text4,DDDD
this is sample text4,DDDD
# chunk2.csv
Text,ID
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text5,EEEE
this is sample text5,EEEE
# chunk3.csv
Text,ID
this is sample text3,CCCC
this is sample text6,FFFF
this is sample text6,FFFF
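To see why this gives a round-robin assignment of groups to files, it may help to inspect what factorize returns on the sample data (a quick illustrative check, assuming the sample above has been loaded as df):

```python
import pandas as pd

# factorize assigns consecutive integer codes to the IDs in order of first
# appearance; taking those codes modulo N cycles the groups 0, 1, 2, 0, 1, ...
codes = pd.factorize(df['ID'])[0]
print(codes)      # [0 0 0 0 0 1 1 1 2 3 3 4 4 5 5]
print(codes % 3)  # [0 0 0 0 0 1 1 1 2 0 0 1 1 2 2]
```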
timings
Tested on 14 million rows:
15.8 s ± 687 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
~14s of which is due to I/O
Comparison with other answers (using `time` in the shell):
# @mozway as a python script including imports and reading the file
real 0m20,834s
# @RavinderSingh13
real 1m22,952s
# @anubhava
real 1m23,790s
# @Ed Morton (updated code, original solution was 2m58,171s)
real 0m8,599s
As a function:
import pandas as pd

def split_csv(filename, N=3, id_col='ID', out_basename='chunk'):
    df = pd.read_csv(filename)
    for i, g in df.groupby(pd.factorize(df[id_col])[0] % N):
        g.to_csv(f'{out_basename}{i+1}.csv', index=False)

split_csv('my_file.csv', N=3)
Answer 5
Score: 0
Here
group.to_csv(str(file_) + '.csv', index=False, header=False, mode='a')
you are providing a string as the 1st argument; however, the `to_csv` method allows you to provide a file-like object as the 1st argument, in which case you can avoid doing the file-opening-related work multiple times. Consider the following simple comparison:
import os
import time
import pandas as pd

REPEAT = 1000

df = pd.DataFrame({'col1': range(100)})

t1 = time.time()
for _ in range(REPEAT):
    df.to_csv('file.csv', index=False, header=False, mode='a')
t2 = time.time()

os.remove('file.csv')

t3 = time.time()
with open('file.csv', 'a') as f:
    for _ in range(REPEAT):
        df.to_csv(f, index=False, header=False)
t4 = time.time()

print('Using filename', t2 - t1)
print('Using filehandle', t4 - t3)
which gives the output:
Using filename 0.35850977897644043
Using filehandle 0.2669696807861328
Observe that the 2nd way takes around 75% of the time of the 1st way, so whilst it is faster, it is still in the same order of magnitude.
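Applied to the original groupby loop, that could look roughly like the sketch below (assuming df and the three-way split from the question; each output file is opened once and its handle is reused):

```python
import pandas as pd

num_files = 3

# rough sketch: open each output file once, then round-robin the groups
# across the existing handles instead of reopening a file for every group
handles = [open(f'{i + 1}.csv', 'w') for i in range(num_files)]
try:
    for i, (name, group) in enumerate(df.groupby(by=['ID'], sort=False)):
        group.to_csv(handles[i % num_files], index=False, header=False)
finally:
    for h in handles:
        h.close()
```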