Compare the latest CSV with all other CSVs in a directory, remove the matching rows from the latest, and write the new rows to a new file with Python
Question
The code does not work properly for some file names.
For example, when the file name is carre123.csv, it won't compare correctly, but when I change the file name to test123.csv it works fine.
Here is the code:
import os
import pandas as pd

# Set the directory where the CSV files are stored
directory = '/PATH/csv-files'

# Get a list of all the CSV files in the directory
csv_files = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.csv')]
#print(csv_files)

# Sort the CSV files by modification time and select the last file as the latest file
latest_file = sorted(csv_files, key=os.path.getmtime)[-1]
#print(latest_file)

# Read the contents of the latest CSV file into a pandas DataFrame
latest_data = pd.read_csv(latest_file)
#print(latest_data)

# Iterate over all the previous CSV files
for csv_file in csv_files[:-1]:
    # Read the contents of the previous CSV file into a pandas DataFrame
    prev_data = pd.read_csv(csv_file)
    #print(prev_data)
    # Identify the rows in the latest CSV file that match the rows in the previous CSV file
    matches = latest_data.isin(prev_data.to_dict('list')).all(axis=1)
    print(matches)
    # Remove the matching rows from the latest CSV file
    latest_data = latest_data[~matches]

# Write the remaining rows in the latest CSV file to a new file
latest_data.to_csv('/NEWPATH/diff.csv', index=False)
Answer 1
Score: 1
I think your code has a bug, which may be what is causing the problem. The for loop runs over csv_files[:-1], which is not sorted by modification time (it is in whatever order os.listdir returned), so depending on the file names the loop may include latest_file. Try storing the sorted list, sorted(csv_files, key=os.path.getmtime), then select the last element as latest_file and loop over the remaining files. There may be something else wrong too, but based on the example you provided, this is the only obvious problem I can see.
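As a minimal sketch of that suggestion, the file selection could be pulled into one place so that latest_file and the list of previous files always come from the same sorted list. The helper name split_latest is hypothetical and not part of the original code, and the sketch only covers choosing the files; the pandas comparison from the question is left as it is.

```python
import os


def split_latest(directory):
    """Return (latest_file, previous_files) for the CSV files in `directory`.

    `split_latest` is a hypothetical helper name used for illustration only.
    """
    csv_files = [
        os.path.join(directory, f)
        for f in os.listdir(directory)
        if f.endswith(".csv")
    ]
    if not csv_files:
        raise ValueError(f"no CSV files found in {directory!r}")
    # Sort once by modification time and reuse that same list for both results,
    # so the newest file can never end up among the previous files.
    sorted_files = sorted(csv_files, key=os.path.getmtime)
    return sorted_files[-1], sorted_files[:-1]
```

In the script from the question, this would replace the separate sorted(...)[-1] and csv_files[:-1] expressions: call latest_file, previous_files = split_latest(directory) and then loop with for csv_file in previous_files. The result then no longer depends on how the files happen to be named.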