Python Pandas合并具有相似结构的多个CSV文件

huangapple go评论73阅读模式
英文:

Python Pandas merge multiple CSV files with similar structure

问题

I understand your question. It seems you are trying to read multiple CSV files into a single dataframe using pandas, but you're encountering an issue when trying to drop rows with missing values.

In the provided code, you attempted to drop rows with missing values using df.dropna(axis=0, how='any', inplace=True), but this resulted in an empty dataframe.

The issue could be related to the fact that your CSV files have different structures. When you concatenate them, the resulting dataframe contains columns from all the CSV files, and for rows with missing values in one file, they might not be missing in another, causing the issue.

To address this problem, you can modify your code to handle columns with varying structures. Here's an example of how you can do this:

import pandas as pd
import os

path = '/path1234'
filelist = os.listdir(path)

# Initialize an empty dataframe
df = pd.DataFrame()

for file in filelist:
    file_path = os.path.join(path, file)
    temp_df = pd.read_csv(file_path, sep='\t', header=None)  # Read the CSV file
    temp_df.dropna(how='any', inplace=True)  # Drop rows with missing values
    df = pd.concat([df, temp_df], ignore_index=True)  # Concatenate with the main dataframe

print(df)

This code reads each CSV file, drops rows with missing values in each file, and then concatenates them into a single dataframe. This approach should handle varying structures in your CSV files more gracefully.

英文:

I have tried all the instructions from here and just do not get on.
I want to use pandas to read all csv files from a folder and write them to a single dataframe.
The csv files are all almost the same, but in certain columns there is nothing in the first x-rows.
But I want to delete these lines anyway. The number of lines can vary though.
The whole thing seems to cause problems, because csv files with different structure can not be combined.
how could i work around this problem?

To illustrate, this is roughly what the dataset look like:

Dataset 1:
0 0 0 0 0 dont need this
0 0 5 0 0 dont need this
3 0 0 1 0 dont need this
1 2 3 4 5 6 7 8 9 
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9

Dataset 2:
0 0 0 0 0 dont need this
0 0 2 0 0 dont need this
1 2 3 4 5 6 7 8 9 
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9

Do you understand what i mean?

I have tried every instruction here, but nothing has worked.

EDIT:
I tried the example of @Tranbi.
Here is what i got:

    import pandas as pd
    from io import StringIO
    import os
    import io

path = '/path1234'
filelist = os.listdir(path)


dataset_list = [path + file for file in filelist]

cols = [f"col{i}" for i in range(0, 402)] # column names (here we expect 9 columns)

df = pd.concat([
    pd.read_csv(dataset, sep="\t", names=cols) # set separator according to your actual data (default=',')
    for dataset in dataset_list
    ]
)

#df.dropna(axis = 0,how ='any',inplace=True)

print(df)

Output:
col0 col1 col2 col3 ... col398 col399 col400 col401
0 0 0 0 0 ... NaN NaN NaN NaN
1 0 0 0 12 ... NaN NaN NaN NaN
2 0 18 0 0 ... NaN NaN NaN NaN
3 30 0 89 0 ... NaN NaN NaN NaN
4 0 0 0 36 ... NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
206 30 22:17:04 22:23:14 0 ... 0.0 0.0 0.0 0.0
207 43 22:20:47 22:27:16 0 ... 0.0 0.0 0.0 0.0
208 43 22:24:28 22:30:57 0 ... 0.0 0.0 0.0 0.0
209 49 22:27:28 22:33:39 0 ... 0.0 0.0 0.0 0.0
210 43 22:34:44 22:41:13 0 ... 0.0 0.0 0.0 0.0

[1375 rows x 402 columns]

If i use

df.dropna(axis = 0,how ='any',inplace=True)

The output doesnt work:
Empty DataFrame
Columns: [col0, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25, col26, col27, col28, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col59, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col94, col95, col96, col97, col98, col99, ...]
Index: []

[0 rows x 402 columns]

Where is the mistake?

答案1

得分: 1

I created 2 csv files containing the following:

0,0,0,0,0
0,0,5,0,0
3,0,0,1,0
1,2,3,4,5,6,7,8,9
2,2,3,4,5,6,7,8,9
3,2,3,4,5,6,7,8,9
0,0,0,0,0
0,0,2,0,0
4,2,3,4,5,6,7,8,9
5,2,3,4,5,6,7,8,9
6,2,3,4,5,6,7,8,9
7,2,3,4,5,6,7,8,9
8,2,3,4,5,6,7,8,9

The result is:

['1,2,3,4,5,6,7,8,9', '2,2,3,4,5,6,7,8,9', '3,2,3,4,5,6,7,8,9', '4,2,3,4,5,6,7,8,9', '5,2,3,4,5,6,7,8,9', '6,2,3,4,5,6,7,8,9', '7,2,3,4,5,6,7,8,9', '8,2,3,4,5,6,7,8,9']
英文:
import glob


if __name__ == "__main__":
    combined_dataset = []
    for filename in glob.glob("*.csv"):
        with open(filename, "r", encoding="utf-8") as file:
            data = file.read().split("\n")
        data = [elt for elt in data if len(elt.split(",")) == 9]
        combined_dataset.extend(data)

    print(combined_dataset)

I created 2 csv files containing the following:

0,0,0,0,0
0,0,5,0,0
3,0,0,1,0
1,2,3,4,5,6,7,8,9
2,2,3,4,5,6,7,8,9
3,2,3,4,5,6,7,8,9
0,0,0,0,0
0,0,2,0,0
4,2,3,4,5,6,7,8,9
5,2,3,4,5,6,7,8,9
6,2,3,4,5,6,7,8,9
7,2,3,4,5,6,7,8,9
8,2,3,4,5,6,7,8,9

The result is:

['1,2,3,4,5,6,7,8,9', '2,2,3,4,5,6,7,8,9', '3,2,3,4,5,6,7,8,9', '4,2,3,4,5,6,7,8,9', '5,2,3,4,5,6,7,8,9', '6,2,3,4,5,6,7,8,9', '7,2,3,4,5,6,7,8,9', '8,2,3,4,5,6,7,8,9']

Is this satisfactory?

答案2

得分: 0

你可以直接使用pandas将读取的csv拼接在一起:

import pandas as pd
from io import StringIO

dataset1 = StringIO("""
0 0 0 0 0
0 0 5 0 0
3 0 0 1 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")

dataset2 = StringIO("""
0 0 0 0 0
0 0 2 0 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")

dataset_list = [dataset1, dataset2] # 将你的文件放在一个列表中 (例如使用glob获取所有csv文件)
cols = [f"col{i}" for i in range(1, 10)] # 列名 (这里我们期望有9列)

df = pd.concat([
    pd.read_csv(dataset, sep=" ", names=cols) # 根据你的实际数据设置分隔符 (默认为',')
    for dataset in dataset_list
    ]
).dropna()

输出:

   col1  col2  col3  col4  col5  col6  col7  col8  col9
3     1     2     3     4     5   6.0   7.0   8.0   9.0
4     1     2     3     4     5   6.0   7.0   8.0   9.0
5     1     2     3     4     5   6.0   7.0   8.0   9.0
2     1     2     3     4     5   6.0   7.0   8.0   9.0
3     1     2     3     4     5   6.0   7.0   8.0   9.0
4     1     2     3     4     5   6.0   7.0   8.0   9.0
5     1     2     3     4     5   6.0   7.0   8.0   9.0
6     1     2     3     4     5   6.0   7.0   8.0   9.0
英文:

You can concat the read csv directly with pandas:

import pandas as pd
from io import StringIO

dataset1 = StringIO("""0 0 0 0 0
0 0 5 0 0
3 0 0 1 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")

dataset2 = StringIO("""0 0 0 0 0
0 0 2 0 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")

dataset_list = [dataset1, dataset2] # your files in a list (use glob to get all csv files for exemple)
cols = [f"col{i}" for i in range(1, 10)] # column names (here we expect 9 columns)

df = pd.concat([
    pd.read_csv(dataset, sep=" ", names=cols) # set separator according to your actual data (default=',')
    for dataset in dataset_list
    ]
).dropna()

Output:

   col1  col2  col3  col4  col5  col6  col7  col8  col9
3     1     2     3     4     5   6.0   7.0   8.0   9.0
4     1     2     3     4     5   6.0   7.0   8.0   9.0
5     1     2     3     4     5   6.0   7.0   8.0   9.0
2     1     2     3     4     5   6.0   7.0   8.0   9.0
3     1     2     3     4     5   6.0   7.0   8.0   9.0
4     1     2     3     4     5   6.0   7.0   8.0   9.0
5     1     2     3     4     5   6.0   7.0   8.0   9.0
6     1     2     3     4     5   6.0   7.0   8.0   9.0

答案3

得分: 0

假设您始终有一定数量的列(例如 N = 9),以及一个名为 filelist 的CSV文件列表,您可以使用以下代码:

N = 9
sep = r'\s+'  # CSV分隔符(这里是空格)

out = pd.concat([pd.read_csv(f, names=range(N), sep=sep).dropna()
                 for f in filelist], ignore_index=True)

这段代码将合并 filelist 中所有CSV文件的数据,每个文件应该有9列,并使用空格作为分隔符。

英文:

Assuming you always have a defined number of columns (N = 9 for example) and a list of your csv files in filelist, you can use:

N = 9
sep= r'\s+' # csv separator (here spaces)

out = pd.concat([pd.read_csv(f, names=range(N), sep=sep).dropna()
                 for f in filelist], ignore_index=True)

huangapple
  • 本文由 发表于 2023年3月15日 21:39:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/75745507.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定