Compare the latest CSV with all other CSVs in a directory, remove the matching rows from the latest, and write the new rows to a new file with Python

Question

The code does not work properly when the files have certain names.

For example, when the file is named carre123.csv, it won't compare correctly. But when I rename the file to test123.csv, it works fine.

Here is the code:

import os
import pandas as pd

# Set the directory where the CSV files are stored
directory = '/PATH/csv-files'

# Get a list of all the CSV files in the directory
csv_files = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.csv')]
#print(csv_files)

# Sort the CSV files by modification time and select the last file as the latest file
latest_file = sorted(csv_files, key=os.path.getmtime)[-1]
#print(latest_file)

# Read the contents of the latest CSV file into a pandas DataFrame
latest_data = pd.read_csv(latest_file)
#print(latest_data)

# Iterate over all the previous CSV files
for csv_file in csv_files[:-1]:
    # Read the contents of the previous CSV file into a pandas DataFrame
    prev_data = pd.read_csv(csv_file)
    #print(prev_data)

    # Identify the rows in the latest CSV file that match the rows in the previous CSV file
    matches = latest_data.isin(prev_data.to_dict('list')).all(axis=1)
    print(matches)

    # Remove the matching rows from the latest CSV file
    latest_data = latest_data[~matches]

# Write the remaining rows in the latest CSV file to a new file
latest_data.to_csv('/NEWPATH/diff.csv', index=False)

Answer 1

Score: 1

I think your code has a bug that may be causing the problem. The for loop iterates over csv_files[:-1], which is not sorted by modification time: sorted() returns a new list and leaves csv_files in whatever order os.listdir returned. So, depending on the file names, the loop may include latest_file, in which case the latest file is compared against itself. Try storing the sorted list, sorted(csv_files, key=os.path.getmtime), then select its last element as latest_file and loop over the remaining files. There may be other issues too, but based on the example you provided, this is the only obvious problem I can see.
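The suggestion above could be sketched roughly as follows. This is only a minimal illustration: `split_latest` is a hypothetical helper name that does not appear in the question, and it assumes the directory contains at least one .csv file.

```python
import os


def split_latest(directory):
    """Return (latest_file, previous_files) for the CSV files in `directory`.

    `split_latest` is a hypothetical helper, not part of the original code.
    """
    csv_files = [
        os.path.join(directory, f)
        for f in os.listdir(directory)
        if f.endswith('.csv')
    ]
    if not csv_files:
        raise FileNotFoundError(f'no CSV files found in {directory}')

    # Sort once by modification time and keep the sorted list, so that
    # slicing off the last element really excludes the newest file.
    sorted_files = sorted(csv_files, key=os.path.getmtime)
    return sorted_files[-1], sorted_files[:-1]
```

With a helper like this, the loop from the question would iterate over the returned previous_files instead of csv_files[:-1], so the newest file is never compared against itself regardless of how the files are named.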

Posted by huangapple on March 7, 2023 at 15:29:53
Link to this post: https://go.coder-hub.com/75659071.html