How to delete duplicates from one specific column in Python with 20 million rows
Question
"我想从一个大的CSV文件中删除重复项。我有这样的CSV数据格式
client_id;gender;age;profese;addr_cntry;NAZOKRESU;prijem_AVG_6M_pasmo;cont_id;main_prod_id;bal_actl_am_pasmo
388713248;F;80;důchodce;CZ;Czech;;5715125;39775;
27953927;M;28;Dělník;CZ;Opavia;22;4427292;39075;
我需要删除所有重复的client_id。
我无法在Python中使用Pandas处理这个大文件。我尝试了Dask,但结果一样。只是等待了无限的时间,什么都没有发生。
这是我的最新代码版本
import dask.dataframe as dd
import chardet
from dask.diagnostics import ProgressBar
with open('bigData.csv', 'rb') as f:
result = chardet.detect(f.read())
df = dd.read_csv('bigData.csv', encoding=result['encoding'], sep=';')
total_rows = df.shape[0].compute()
df = df.drop_duplicates(subset=['client_id'], keep=False, Inplace=True)
df.to_csv('bigData.csv', sep=';', index=False)
total_duplicates = total_rows - df.shape[0].compute()
print(f'Was deleted {total_duplicates} duplicated rows.')
```"
希望这对您有所帮助!
Answer 1

Score: 1
You might be able to get away with a very simple Python program that stores every new ID it sees in a dict and skips writing a subsequent row if it finds that row's ID already in the dict. It should require about 2GB of RAM.

```python
import csv

reader = csv.reader(open("input.csv", newline=""))
writer = csv.writer(open("output.csv", "w", newline=""))

writer.writerow(next(reader))  # transfer header, if you have one

ids = {}
for row in reader:
    if row[0] not in ids:
        writer.writerow(row)
        ids[row[0]] = None  # add ID to the "list" of already written IDs
```

This approach:

- Uses a dict, `ids`, to hold all IDs the program has already encountered and written; dicts can do very fast lookups/checks of their keys (your IDs).
- Keeps the original ordering of the rows.
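As a side note, an equivalent variant of the same technique (my own sketch, not part of the answer above; the output file name and the UTF-8 encoding are placeholders) keeps the seen IDs in a `set` and passes `delimiter=";"` so it matches the semicolon-separated file from the question, where `client_id` is the first column:

```python
import csv

seen = set()
with open("bigData.csv", newline="", encoding="utf-8") as fin, \
     open("bigData_dedup.csv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.reader(fin, delimiter=";")  # the question's file is ';'-separated
    writer = csv.writer(fout, delimiter=";")
    writer.writerow(next(reader))            # copy the header row unchanged
    for row in reader:
        if row[0] not in seen:               # row[0] is client_id
            writer.writerow(row)
            seen.add(row[0])
```

A set and a dict with `None` values behave the same here: both give constant-time membership checks on the IDs.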
I mocked up a CSV with 20M rows (with randomly generated IDs between 0 and 20M) that looks something like this:
| id | i |
|----------|---|
| 2266768 | 0 |
| 15245359 | 1 |
| 16304974 | 2 |
| 4801643 | 3 |
| 9612409 | 4 |
| 17659151 | 5 |
| 15824934 | 6 |
| 4101873 | 7 |
| 12282127 | 8 |
| 5172219 | 9 |
I ran it through that program and ended up with 12.6M rows. On my MacBook Air M1 (dual-channel SSD) that took 14 seconds and consumed 1.5GB of RAM; the RAM is needed to hold all of the previously seen IDs.
Also, I see you are reading the entire file up front just to detect the character encoding:

- Have you tried running `chardetect` from the command line, e.g. `chardetect input.csv`, and simply hard-coding the value it returns?
- Have you experimented with reading a much smaller portion of the file and checking what result and confidence you get?

```python
with open("input.csv", "rb") as f:
    input_enc = chardet.detect(f.read(1024 * 64))  # only read the first 64K

print(input_enc)  # {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
```
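To tie this back to the question's code, a minimal sketch (mine, not the answer author's; it assumes the detected or hard-coded encoding is then reused as-is) might look like:

```python
import chardet
import dask.dataframe as dd

# Detect the encoding from a small prefix of the file once...
with open("bigData.csv", "rb") as f:
    enc = chardet.detect(f.read(1024 * 64))["encoding"]

# ...or hard-code the value reported by `chardetect bigData.csv` instead.
df = dd.read_csv("bigData.csv", encoding=enc, sep=";")
```

Either way, the 20M-row file is no longer read in full just to guess its encoding.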
Answer 2

Score: 1
The same task using AWK. This is not what the OP asked for, just to complete the comment above; do not accept this as the answer.

```awk
BEGIN {
    FS=","        # set the field separator to comma
}
!seen[$2]++ {     # is field 2 not seen before?
    print $0
}
```

Sample data:

```none
RowNum,ID
1,5220607
2,8632078
3,8323076
..
```

Run it from the command line as `c:\>awk -f script.awk input.csv > uniquevalues.csv`. (For the question's actual file you would set `FS=";"` and key on `$1`, the client_id column.)

This outputs about 12 million rows and consumes 1.8GB of memory in about 18 seconds (i7, Windows).

The Python script from @zach-young above, run on the same computer and file, took about 35 seconds but used less memory.