What is a more efficient way to remove duplicates from a CSV file based on specific fields using a batch script, (and gawk, if needed)?


Question

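I understand that you want to compare the filename field (column 3) of the two CSV files and either delete the duplicate records or write only the records whose filename is not duplicated to a new file. You tried nested for /f loops, but they take too long on files of this size, and you also looked at findstr and gawk.

Here is an example that uses gawk:

gawk -F, "NR==FNR{seen[$3]=1;next} !seen[$3]" alreadyinarchive.csv copytoarchive.csv > output.csv

The command works as follows:

  • -F,: sets the field separator to a comma.
  • NR==FNR: a common awk idiom that is true only while the first input file (alreadyinarchive.csv) is being read.
  • seen[$3]=1: stores the filename (column 3) of each alreadyinarchive.csv record as an array key so that it can be looked up later.
  • next: skips the remaining rules and moves on to the next line.
  • !seen[$3]: while the second file (copytoarchive.csv) is being read, prints the line only if its filename is not in the array.
  • alreadyinarchive.csv copytoarchive.csv: the input files, in that order; replace them with your own file paths.
  • > output.csv: writes the result to a new file, output.csv.

The resulting output.csv contains every copytoarchive.csv record whose filename does not appear in alreadyinarchive.csv. The header row is not written, because the value "name" occurs in column 3 of both files.

Make sure gawk is installed and replace the file names with your actual paths. The program is in double quotes because cmd.exe does not treat single quotes as quoting characters. If the command runs inside a batch file with delayed expansion enabled, cmd consumes the !; in that case the program can be saved to a file and run with gawk -f instead. Each file is read only once, so this should be far faster than the nested for /f loops and should finish quickly for files of around 2,000 records.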
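The same lookup can also be kept in a script file, which sidesteps quoting differences between shells. The following is a minimal sketch, not a definitive implementation: the file name dedupe.awk and the abbreviated sample rows are hypothetical, the output name filestocopy.csv is borrowed from the batch script in the question, and the driver lines are written for a POSIX shell only so that the sample can be reproduced. It uses only POSIX awk features, so gawk should run it unchanged. Unlike the one-liner, it keeps the header row and tolerates CRLF line endings.

```shell
# Abbreviated, hypothetical rows in the same four-column layout as the question.
cat > alreadyinarchive.csv <<'EOF'
Owner,id,name,Parent
user@domain.com,idA1,Preschool-20230504-Meeting Recording.mp4,parentA1
user@domain.com,idA2,ParentTraining-20230510-Meeting Recording.mp4,parentA2
EOF

cat > copytoarchive.csv <<'EOF'
Owner,id,name,Parent
user@domain.com,idC1,Preschool-20230504-Meeting Recording.mp4,parentC1
user@domain.com,idC2,TeacherTraining-20230510-Meeting Recording.mp4,parentC2
EOF

# dedupe.awk (hypothetical name): the program lives in a file, so no shell quoting is involved.
cat > dedupe.awk <<'EOF'
BEGIN { FS = "," }                        # fields are comma-separated
{ sub(/\r$/, "") }                        # strip a trailing CR so CRLF files compare cleanly
FNR == 1 { if (NR != FNR) print; next }   # drop the archive header, keep the copy-list header
NR == FNR { seen[$3] = 1; next }          # first file: remember every archived name
!($3 in seen)                             # second file: print rows whose name is new
EOF

awk -f dedupe.awk alreadyinarchive.csv copytoarchive.csv > filestocopy.csv
cat filestocopy.csv
```

With these sample rows the output is the header line followed by the TeacherTraining record only. From cmd.exe the last step would be gawk -f dedupe.awk alreadyinarchive.csv copytoarchive.csv > filestocopy.csv, with the real file paths substituted.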

I have two csv documents that contain lists of files from a source and destination in Google Drive generated by GAM. One is called copytoarchive.csv and lists all relevant files in the source. The other is alreadyinarchive.csv and lists all relevant files already in the destination.

The way Google Drive works is to assign a UID to each file regardless of its name. The csv file list shows the file UID in one column and the file name in another.

Here is an example of what copytoarchive.csv looks like:

Owner,id,name,Parent
user@domain.com,1gyKqu_P0h3j1Vn-6EwUv_99q,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wqUm
user@domain.com,14-sg-qSnn5GDAuftANdLxDXp,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1L9mQBJ6d3DIPbiIEyV7akArV,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUrotr
user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1_ewCEh37sZYpqZlr3TC8u2Yl,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1tOb7xV5OCMMebn2ab2KdXGvc
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi

Here is what alreadyinarchive.csv looks like:

Owner,id,name,Parent
user@domain.com,1Zlnhqf6fSxTRT2JEmQS91cCX,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUro
user@domain.com,1Lg2W0w8YGJytSgJl2JblBly3,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1Q_K0D1RgZlz-LMlDUVrV0gGi,ParentTrainingLesson-20230510_000950-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wq
user@domain.com,1LIrRoTGtADjQRg9IRmIlJ3oV,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1xVuHbE3pcWN1l7X109qTsIYZK
user@domain.com,1OHkH9Cg7i2-O-ZHXBr4wIYGZ,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1U7Y2Xh4Qi3atCcVL262
user@domain.com,1jZsXB5TT0H0TRrvvZu5A3N1S,DifferentLessonName-20230503_052614-Meeting Recording.mp4,1eVS3QF_Sk_6fQkwF8PvTKQf

The data in the Owner and Parent fields is irrelevant for this part of the batch file.

How can I search the csv(s) for duplicates in just the filename field (field 3), then delete the whole record OR output to a new file only records that don't contain duplicates in field 3?

For example:

  1. Notice that the filename from the first non header record of copytoarchive.csv matches the first non header record of alreadyinarchive.csv even though the UIDs are different. This would be marked as a duplicate.

  2. Also notice that the filename in the second non header record of copytoarchive.csv matches the filename in the fifth non header record of alreadyinarchive.csv even though the UIDs are different. This would also be marked as a duplicate.

  3. Further, notice that although the timestamp of TeacherTrainingLesson... (4th non header record) of copytoarchive.csv matches ParentTrainingLesson... (3rd non header record) of alreadyinarchive.csv, because the filenames don't fully match these should not be considered a duplicate.

Of approx. 2,000 records only around 300 are not duplicates.

I am happy to manipulate copytoarchive.csv and alreadyinarchive.csv into one file if needed.

At first I tried to achieve this with nested for /f loops, where the first for /f would read copytoarchive.csv one line at a time and compare the relevant token (tokens=3) against the relevant token of EVERY line of the alreadyinarchive.csv through a 2nd nested for /f loop.

As requested, here is the for /f loop that I tried:

setlocal enabledelayedexpansion

rem This code block takes the info from copytoarchive.csv and alreadyinarchive.csv, deletes any matching lines (e.g., files already in the archive), and generates filestocopy.csv which contains the old and new parent IDs for only those files needing to be copied to the archive.

set /a filenum=0
set /a totalfiles=0

for /f "delims=, tokens=2-4" %%k in (C:\path\copytoarchive.csv) do (
	set /a filenum+=1
	set /a totalfiles+=1
	call set fileID[!filenum!]=%%k
	call set filename[!filenum!]=%%l
	call :checkifexists
)

set oldfileID[1]=OldParent
set newparentID[1]=NewParent
if exist c:\path\filestocopy.csv del c:\path\filestocopy.csv
for /l %%q in (1,1,%totalfiles%) do (
	echo !newowner[%%q]!,!newparentID[%%q]!,!oldfileID[%%q]!,!newparentname[%%q]! >> c:\path\filestocopy.csv
)

exit /b

:checkifexists
	for /f "delims=, tokens=3" %%n in (C:\path\alreadyinarchive.csv) do (
		if not !filename[%filenum%]!==%%n (
			set fileparentID[%filenum%]=%%m
		)
	)
	goto :eof

While this does technically work, because each list is almost 2,000 lines long, this creates approximately 4,000,000 iterations which takes WAY TOO LONG (>10 mins on my i9-12900 PC) to complete. Please note that this is only one section of a larger batch file.

I have read up on findstr but I couldn't find a way to use it to only search one field.

I also have GNU CoreUtils installed, including gawk. Reading gawk's PDF manual, it appears that this is likely the best path to take, but I am still struggling to find the correct parameters to pass to gawk to get it to only search the "name" fields.

I found @perl's answer here which looks tantalizingly close to a solution, but don't understand enough to know how to translate it to my use case.

Any help is appreciated.

Answer 1

Score: 2

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION 
rem The following settings for the source directory and filenames are names
rem that I use for testing and deliberately include names which include spaces to make sure
rem that the process works using such names. These will need to be changed to suit your situation.

SET "sourcedir=u:\your files"
SET "destdir=u:\your results"
:: The first file is copytoarchive.csv
SET "filename1=%sourcedir%\q76378302.txt"
:: The second file is alreadyinarchive.csv
SET "filename2=%sourcedir%\q76378302_2.txt"
SET "outfile=%destdir%\outfile.txt"

(
FOR /f "usebackq skip=1 delims=" %%e IN ("%filename1%") DO (
 rem %%e has each line in turn
 rem NOTE: the next FOR splits on a space; for comma-separated input use "tokens=2* delims=,"
 FOR /f "tokens=2* delims= " %%b IN ("%%e") DO (
  rem %%c has column 3 to eol
  SET "line=%%c"
  FOR /f "delims=?" %%o IN ("!line:.mp4=?!") DO FIND "%%o.mp4" "%filename2%" >nul & IF ERRORLEVEL 1 ECHO %%e?"%%o.mp4"
 )
)
)>"%outfile%"

TYPE "%outfile%"

GOTO :EOF

rem Always verify against a test directory before applying to real data.

Well, the comments within the code explain it for the most part.

The magic is the !line:.mp4=?! clause.

  • Replace .mp4 in the current value of line with ?.

Batch cannot substring metavariables like %%e, so it is necessary to transfer %%c to a user variable (line) to perform the substring. The current value of line is required, so the substitution is performed in delayedexpansion mode, hence the !var! format (see Stephan's DELAYEDEXPANSION link: https://stackoverflow.com/a/30284028/2128947).

I used ? because it cannot appear in a filename.

It's not clear whether you want the actual output line or just the filename, so I've provided both separated by ?

Answer 2

Score: 1


Mmm... First, a couple comments about your request:

  • From your question, it seems that the UID field is also irrelevant and that the duplicates are based on the filename column only, so your description of the UID just makes this confusing.
  • In your for /f "delims=, tokens=2-4" %%k in ( ... command you included the comma as delimiter, but there is not a single comma in your files! Moreover:
  • The filenames included a space! And
  • This line in your code: echo !newowner[%%q]!,!newparentID[%%q]!,!oldfileID[%%q]!,!newparentname[%%q]! indicates that the output fields are separated by a comma, with no spaces...

You have not described nor posted the real format of your files, although this was requested a couple times in the comments...

For all these reasons I assumed that your posted files are not in the right format, and that the right files should be these:

copytoarchive.csv:

Owner,id,name,Parent
user@domain.com,1gyKqu_P0h3j1Vn-6EwUv_99q,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wqUm
user@domain.com,14-sg-qSnn5GDAuftANdLxDXp,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1L9mQBJ6d3DIPbiIEyV7akArV,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUrotr
user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1_ewCEh37sZYpqZlr3TC8u2Yl,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1tOb7xV5OCMMebn2ab2KdXGvc
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi

alreadyinarchive.csv:

Owner,id,name,Parent
user@domain.com,1Zlnhqf6fSxTRT2JEmQS91cCX,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUro
user@domain.com,1Lg2W0w8YGJytSgJl2JblBly3,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1Q_K0D1RgZlz-LMlDUVrV0gGi,ParentTrainingLesson-20230510_000950-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wq
user@domain.com,1LIrRoTGtADjQRg9IRmIlJ3oV,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1xVuHbE3pcWN1l7X109qTsIYZK
user@domain.com,1OHkH9Cg7i2-O-ZHXBr4wIYGZ,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1U7Y2Xh4Qi3atCcVL262
user@domain.com,1jZsXB5TT0H0TRrvvZu5A3N1S,DifferentLessonName-20230503_052614-Meeting Recording.mp4,1eVS3QF_Sk_6fQkwF8PvTKQf

Ok. A process like this should avoid the use of any external (.exe) command (like findstr or find) in order to run faster. Environment variables are enough to solve this problem:

@echo off
setlocal EnableDelayedExpansion

rem Load alreadyinarchive's name field in "already" array putting the value *in the subscript*
rem in order to quickly check for it via IF DEFINED command
rem changing spaces in the name for underscores
for /F "tokens=3 delims=," %%a in (AlreadyInArchive.csv) do (
   set "name=%%a"
   set "name=!name: =_!"
   set "already[!name!]=1"
)

rem Process copytoarchive file and output lines with no duplicates
(for /F "tokens=1-4 delims=," %%a in (CopyToArchive.csv) do (
   set "name=%%c"
   set "name=!name: =_!"
   if not defined already[!name!] echo %%a,%%b,%%c,%%d
)) > filestocopy.csv

filestocopy.csv:

user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi
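
As a cross-check, the two rows above can be reproduced with awk, whose associative arrays offer the same "value as subscript" lookup. This is only a sketch under stated assumptions: it recreates the two sample files from the question through a POSIX shell, uses plain awk (gawk should accept the same program), and writes to check.csv, a hypothetical name chosen so the real output is not overwritten.

```shell
cat > alreadyinarchive.csv <<'EOF'
Owner,id,name,Parent
user@domain.com,1Zlnhqf6fSxTRT2JEmQS91cCX,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUro
user@domain.com,1Lg2W0w8YGJytSgJl2JblBly3,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1Q_K0D1RgZlz-LMlDUVrV0gGi,ParentTrainingLesson-20230510_000950-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wq
user@domain.com,1LIrRoTGtADjQRg9IRmIlJ3oV,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1xVuHbE3pcWN1l7X109qTsIYZK
user@domain.com,1OHkH9Cg7i2-O-ZHXBr4wIYGZ,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1U7Y2Xh4Qi3atCcVL262
user@domain.com,1jZsXB5TT0H0TRrvvZu5A3N1S,DifferentLessonName-20230503_052614-Meeting Recording.mp4,1eVS3QF_Sk_6fQkwF8PvTKQf
EOF

cat > copytoarchive.csv <<'EOF'
Owner,id,name,Parent
user@domain.com,1gyKqu_P0h3j1Vn-6EwUv_99q,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wqUm
user@domain.com,14-sg-qSnn5GDAuftANdLxDXp,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1L9mQBJ6d3DIPbiIEyV7akArV,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUrotr
user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1_ewCEh37sZYpqZlr3TC8u2Yl,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1tOb7xV5OCMMebn2ab2KdXGvc
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi
EOF

# First file: store each archived name as an array key (the header value "name" included).
# Second file: print only the rows whose name is not already a key.
awk -F, 'NR == FNR { seen[$3]; next } !($3 in seen)' alreadyinarchive.csv copytoarchive.csv > check.csv
cat check.csv
```

Because the header value "name" is stored as a key along with the filenames, the header row of copytoarchive.csv is filtered out too, which matches the batch result.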

PS - You describe with detail irrelevant data, and don't describe enough the important points... :(
