Python Pandas merge multiple CSV files with similar structure
Question
You are reading multiple CSV files into a single dataframe with pandas, and dropping the rows with missing values empties it.
In the provided code, you dropped rows with df.dropna(axis=0, how='any', inplace=True), which resulted in an empty dataframe.
how='any' removes a row as soon as one of its cells is NaN. Because the files are read with 402 column names, the short leading rows are padded with NaN, and if every full-length row also has at least one empty cell somewhere in those 402 columns, no row survives.
One way to address this is to drop the incomplete rows file by file before concatenating. Here's an example of how you can do this:
import pandas as pd
import os

path = '/path1234'
n_cols = 402  # number of columns in a complete row (taken from the question)

frames = []
for file in os.listdir(path):
    file_path = os.path.join(path, file)
    # Fixed column names let pandas pad the short leading rows with NaN;
    # without names, rows longer than the first row can raise a ParserError
    temp_df = pd.read_csv(file_path, sep='\t', header=None, names=range(n_cols))
    frames.append(temp_df.dropna(how='any'))  # drop rows with missing values

# Concatenate the cleaned frames into a single dataframe
df = pd.concat(frames, ignore_index=True)
print(df)
This code reads each file, drops the rows with missing values, and concatenates the rest into a single dataframe. If the result is still empty, the full-length rows themselves contain empty cells, and how='any' is too strict for this data.
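To check whether how='any' is the culprit, it can help to count the missing cells per row before dropping anything. The following is a minimal sketch on made-up data (the tab-separated string `raw` and the four column names are assumptions for illustration, not the asker's files). It compares the strict drop with a drop that only looks at the last column, which is empty exactly in the short rows.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated data: two short junk rows, then three full rows,
# one of which has a single empty cell in the third column.
raw = "0\t0\n0\t5\n1\t2\t3\t4\n1\t2\t\t4\n1\t2\t3\t4\n"
cols = [f"col{i}" for i in range(4)]
df = pd.read_csv(StringIO(raw), sep="\t", names=cols)

# Number of missing cells in each row
nan_per_row = df.isna().sum(axis=1)

# how="any" keeps only the rows without a single NaN
strict = df.dropna(how="any")

# Checking only the last column removes just the short rows
by_last = df.dropna(subset=[cols[-1]])

print(nan_per_row.tolist())
print(len(strict), len(by_last))
```

If `nan_per_row` is greater than zero for every row of the real data, an empty result from how='any' is expected, and a narrower `subset` (or the `thresh` parameter of `dropna`) may be a better fit.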
I have tried all the instructions from here and just cannot get it to work.
I want to use pandas to read all csv files from a folder and write them to a single dataframe.
The csv files are all almost the same, but in certain columns there is nothing in the first x rows.
I want to delete these rows anyway. The number of such rows can vary, though.
The whole thing seems to cause problems, because csv files with different structures cannot be combined.
How could I work around this problem?
To illustrate, this is roughly what the datasets look like:
Dataset 1:
0 0 0 0 0 dont need this
0 0 5 0 0 dont need this
3 0 0 1 0 dont need this
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
Dataset 2:
0 0 0 0 0 dont need this
0 0 2 0 0 dont need this
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
Do you understand what I mean?
I have tried every instruction here, but nothing has worked.
EDIT:
I tried the example from @Tranbi.
Here is what I got:
import pandas as pd
import os

path = '/path1234'
filelist = os.listdir(path)
dataset_list = [os.path.join(path, file) for file in filelist]
cols = [f"col{i}" for i in range(0, 402)]  # column names (402 columns expected)
df = pd.concat([
    pd.read_csv(dataset, sep="\t", names=cols)  # set separator according to your actual data (default=',')
    for dataset in dataset_list
])
#df.dropna(axis = 0,how ='any',inplace=True)
print(df)
Output:
col0 col1 col2 col3 ... col398 col399 col400 col401
0 0 0 0 0 ... NaN NaN NaN NaN
1 0 0 0 12 ... NaN NaN NaN NaN
2 0 18 0 0 ... NaN NaN NaN NaN
3 30 0 89 0 ... NaN NaN NaN NaN
4 0 0 0 36 ... NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
206 30 22:17:04 22:23:14 0 ... 0.0 0.0 0.0 0.0
207 43 22:20:47 22:27:16 0 ... 0.0 0.0 0.0 0.0
208 43 22:24:28 22:30:57 0 ... 0.0 0.0 0.0 0.0
209 49 22:27:28 22:33:39 0 ... 0.0 0.0 0.0 0.0
210 43 22:34:44 22:41:13 0 ... 0.0 0.0 0.0 0.0
[1375 rows x 402 columns]
If I use
df.dropna(axis = 0,how ='any',inplace=True)
the result is an empty dataframe:
Empty DataFrame
Columns: [col0, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25, col26, col27, col28, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col59, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col94, col95, col96, col97, col98, col99, ...]
Index: []
[0 rows x 402 columns]
Where is the mistake?
Answer 1
Score: 1
import glob

if __name__ == "__main__":
    combined_dataset = []
    for filename in glob.glob("*.csv"):
        with open(filename, "r", encoding="utf-8") as file:
            data = file.read().split("\n")
        # keep only the lines that have exactly 9 comma-separated fields
        data = [elt for elt in data if len(elt.split(",")) == 9]
        combined_dataset.extend(data)
    print(combined_dataset)
I created 2 csv files containing the following:
0,0,0,0,0
0,0,5,0,0
3,0,0,1,0
1,2,3,4,5,6,7,8,9
2,2,3,4,5,6,7,8,9
3,2,3,4,5,6,7,8,9
0,0,0,0,0
0,0,2,0,0
4,2,3,4,5,6,7,8,9
5,2,3,4,5,6,7,8,9
6,2,3,4,5,6,7,8,9
7,2,3,4,5,6,7,8,9
8,2,3,4,5,6,7,8,9
The result is:
['1,2,3,4,5,6,7,8,9', '2,2,3,4,5,6,7,8,9', '3,2,3,4,5,6,7,8,9', '4,2,3,4,5,6,7,8,9', '5,2,3,4,5,6,7,8,9', '6,2,3,4,5,6,7,8,9', '7,2,3,4,5,6,7,8,9', '8,2,3,4,5,6,7,8,9']
Is this satisfactory?
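The result above is a plain list of strings, while the question asks for a dataframe. As a rough sketch, the filtered lines could be handed back to pandas by joining them into one text buffer. The short `combined_dataset` list below is a stand-in for the real result and is an assumption for illustration.

```python
from io import StringIO

import pandas as pd

# Stand-in for the list of filtered lines produced by the loop
combined_dataset = ["1,2,3,4,5,6,7,8,9", "4,2,3,4,5,6,7,8,9"]

# Join the lines into one CSV text and let pandas parse it;
# header=None because the lines carry no column names
df = pd.read_csv(StringIO("\n".join(combined_dataset)), header=None)

print(df.shape)
```

Since every remaining line has the same number of fields, pandas can infer the columns without a `names` argument.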
Answer 2
Score: 0
You can concatenate the CSV files directly as pandas reads them:
import pandas as pd
from io import StringIO
dataset1 = StringIO("""0 0 0 0 0
0 0 5 0 0
3 0 0 1 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")
dataset2 = StringIO("""0 0 0 0 0
0 0 2 0 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")
dataset_list = [dataset1, dataset2]  # your files in a list (use glob to get all csv files, for example)
cols = [f"col{i}" for i in range(1, 10)]  # column names (here we expect 9 columns)
df = pd.concat([
    pd.read_csv(dataset, sep=" ", names=cols)  # set separator according to your actual data (default=',')
    for dataset in dataset_list
]).dropna()
Output:
col1 col2 col3 col4 col5 col6 col7 col8 col9
3 1 2 3 4 5 6.0 7.0 8.0 9.0
4 1 2 3 4 5 6.0 7.0 8.0 9.0
5 1 2 3 4 5 6.0 7.0 8.0 9.0
2 1 2 3 4 5 6.0 7.0 8.0 9.0
3 1 2 3 4 5 6.0 7.0 8.0 9.0
4 1 2 3 4 5 6.0 7.0 8.0 9.0
5 1 2 3 4 5 6.0 7.0 8.0 9.0
6 1 2 3 4 5 6.0 7.0 8.0 9.0
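In the output above, the index values repeat across files and the last four columns come back as floats, because they held NaN before the drop. If a clean integer frame is wanted, one possible follow-up is to renumber the index and cast back to integers. This is a sketch on shortened sample data and assumes every remaining value really is a whole number.

```python
from io import StringIO

import pandas as pd

# Shortened versions of the two sample datasets
dataset1 = StringIO("0 0 0 0 0\n3 0 0 1 0\n1 2 3 4 5 6 7 8 9\n")
dataset2 = StringIO("0 0 2 0 0\n1 2 3 4 5 6 7 8 9\n1 2 3 4 5 6 7 8 9\n")
cols = [f"col{i}" for i in range(1, 10)]

df = (
    pd.concat(
        [pd.read_csv(d, sep=" ", names=cols) for d in (dataset1, dataset2)],
        ignore_index=True,
    )
    .dropna()                # remove the short rows that were padded with NaN
    .reset_index(drop=True)  # renumber the remaining rows from 0
    .astype(int)             # cast the float columns back to integers
)

print(df)
```

The cast would fail or lose information on columns holding text or fractional values, such as the time stamps in the asker's real data, so it only applies to purely numeric files.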
Answer 3
Score: 0
Assuming you always have a defined number of columns (N = 9 for example) and a list of your csv files in filelist, you can use:
import pandas as pd

N = 9
sep = r'\s+'  # csv separator (here whitespace)
out = pd.concat([pd.read_csv(f, names=range(N), sep=sep).dropna()
                 for f in filelist], ignore_index=True)
This merges the data from all CSV files in filelist; each file is expected to have 9 columns separated by whitespace.
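The snippet takes filelist as given. One way to build it is with glob, sketched below on two throwaway files written to a temporary folder. The file names `a.csv` and `b.csv` and their contents are made up for illustration; in practice `folder` would be the directory that holds the real files.

```python
import glob
import os
import tempfile

import pandas as pd

N = 9
sep = r"\s+"  # whitespace-separated values

with tempfile.TemporaryDirectory() as folder:
    # Two hypothetical sample files with short leading rows
    samples = {
        "a.csv": "0 0 0 0 0\n1 2 3 4 5 6 7 8 9\n",
        "b.csv": "0 0 2 0 0\n1 2 3 4 5 6 7 8 9\n1 2 3 4 5 6 7 8 9\n",
    }
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as fh:
            fh.write(text)

    # Collect every csv file in the folder, in a stable order
    filelist = sorted(glob.glob(os.path.join(folder, "*.csv")))

    out = pd.concat(
        [pd.read_csv(f, names=range(N), sep=sep).dropna() for f in filelist],
        ignore_index=True,
    )

print(out)
```

Sorting the glob result keeps the row order reproducible, since glob itself does not guarantee any particular order.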