Python Pandas合并具有相似结构的多个CSV文件

huangapple go评论99阅读模式
英文:

Python Pandas merge multiple CSV files with similar structure

问题

I understand your question. It seems you are trying to read multiple CSV files into a single dataframe using pandas, but you're encountering an issue when trying to drop rows with missing values.

In the provided code, you attempted to drop rows with missing values using df.dropna(axis=0, how='any', inplace=True), but this resulted in an empty dataframe.

The issue could be related to the fact that your CSV files have different structures. When you concatenate them, the resulting dataframe contains columns from all the CSV files, and for rows with missing values in one file, they might not be missing in another, causing the issue.

To address this problem, you can modify your code to handle columns with varying structures. Here's an example of how you can do this:

  1. import pandas as pd
  2. import os
  3. path = '/path1234'
  4. filelist = os.listdir(path)
  5. # Initialize an empty dataframe
  6. df = pd.DataFrame()
  7. for file in filelist:
  8. file_path = os.path.join(path, file)
  9. temp_df = pd.read_csv(file_path, sep='\t', header=None) # Read the CSV file
  10. temp_df.dropna(how='any', inplace=True) # Drop rows with missing values
  11. df = pd.concat([df, temp_df], ignore_index=True) # Concatenate with the main dataframe
  12. print(df)

This code reads each CSV file, drops rows with missing values in each file, and then concatenates them into a single dataframe. This approach should handle varying structures in your CSV files more gracefully.

英文:

I have tried all the instructions from here and just do not get on.
I want to use pandas to read all csv files from a folder and write them to a single dataframe.
The csv files are all almost the same, but in certain columns there is nothing in the first x-rows.
But I want to delete these lines anyway. The number of lines can vary though.
The whole thing seems to cause problems, because csv files with different structure can not be combined.
how could i work around this problem?

To illustrate, this is roughly what the dataset look like:

  1. Dataset 1:
  2. 0 0 0 0 0 dont need this
  3. 0 0 5 0 0 dont need this
  4. 3 0 0 1 0 dont need this
  5. 1 2 3 4 5 6 7 8 9
  6. 1 2 3 4 5 6 7 8 9
  7. 1 2 3 4 5 6 7 8 9
  8. Dataset 2:
  9. 0 0 0 0 0 dont need this
  10. 0 0 2 0 0 dont need this
  11. 1 2 3 4 5 6 7 8 9
  12. 1 2 3 4 5 6 7 8 9
  13. 1 2 3 4 5 6 7 8 9
  14. 1 2 3 4 5 6 7 8 9
  15. 1 2 3 4 5 6 7 8 9

Do you understand what i mean?

I have tried every instruction here, but nothing has worked.

EDIT:
I tried the example of @Tranbi.
Here is what i got:

  1. import pandas as pd
  2. from io import StringIO
  3. import os
  4. import io
  5. path = '/path1234'
  6. filelist = os.listdir(path)
  7. dataset_list = [path + file for file in filelist]
  8. cols = [f"col{i}" for i in range(0, 402)] # column names (here we expect 9 columns)
  9. df = pd.concat([
  10. pd.read_csv(dataset, sep="\t", names=cols) # set separator according to your actual data (default=',')
  11. for dataset in dataset_list
  12. ]
  13. )
  14. #df.dropna(axis = 0,how ='any',inplace=True)
  15. print(df)

Output:
col0 col1 col2 col3 ... col398 col399 col400 col401
0 0 0 0 0 ... NaN NaN NaN NaN
1 0 0 0 12 ... NaN NaN NaN NaN
2 0 18 0 0 ... NaN NaN NaN NaN
3 30 0 89 0 ... NaN NaN NaN NaN
4 0 0 0 36 ... NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
206 30 22:17:04 22:23:14 0 ... 0.0 0.0 0.0 0.0
207 43 22:20:47 22:27:16 0 ... 0.0 0.0 0.0 0.0
208 43 22:24:28 22:30:57 0 ... 0.0 0.0 0.0 0.0
209 49 22:27:28 22:33:39 0 ... 0.0 0.0 0.0 0.0
210 43 22:34:44 22:41:13 0 ... 0.0 0.0 0.0 0.0

[1375 rows x 402 columns]

If i use

  1. df.dropna(axis = 0,how ='any',inplace=True)

The output doesnt work:
Empty DataFrame
Columns: [col0, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25, col26, col27, col28, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col59, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col94, col95, col96, col97, col98, col99, ...]
Index: []

[0 rows x 402 columns]

Where is the mistake?

答案1

得分: 1

I created 2 csv files containing the following:

  1. 0,0,0,0,0
  2. 0,0,5,0,0
  3. 3,0,0,1,0
  4. 1,2,3,4,5,6,7,8,9
  5. 2,2,3,4,5,6,7,8,9
  6. 3,2,3,4,5,6,7,8,9
  1. 0,0,0,0,0
  2. 0,0,2,0,0
  3. 4,2,3,4,5,6,7,8,9
  4. 5,2,3,4,5,6,7,8,9
  5. 6,2,3,4,5,6,7,8,9
  6. 7,2,3,4,5,6,7,8,9
  7. 8,2,3,4,5,6,7,8,9

The result is:

  1. ['1,2,3,4,5,6,7,8,9', '2,2,3,4,5,6,7,8,9', '3,2,3,4,5,6,7,8,9', '4,2,3,4,5,6,7,8,9', '5,2,3,4,5,6,7,8,9', '6,2,3,4,5,6,7,8,9', '7,2,3,4,5,6,7,8,9', '8,2,3,4,5,6,7,8,9']
英文:
  1. import glob
  2. if __name__ == "__main__":
  3. combined_dataset = []
  4. for filename in glob.glob("*.csv"):
  5. with open(filename, "r", encoding="utf-8") as file:
  6. data = file.read().split("\n")
  7. data = [elt for elt in data if len(elt.split(",")) == 9]
  8. combined_dataset.extend(data)
  9. print(combined_dataset)

I created 2 csv files containing the following:

  1. 0,0,0,0,0
  2. 0,0,5,0,0
  3. 3,0,0,1,0
  4. 1,2,3,4,5,6,7,8,9
  5. 2,2,3,4,5,6,7,8,9
  6. 3,2,3,4,5,6,7,8,9
  1. 0,0,0,0,0
  2. 0,0,2,0,0
  3. 4,2,3,4,5,6,7,8,9
  4. 5,2,3,4,5,6,7,8,9
  5. 6,2,3,4,5,6,7,8,9
  6. 7,2,3,4,5,6,7,8,9
  7. 8,2,3,4,5,6,7,8,9

The result is:

  1. ['1,2,3,4,5,6,7,8,9', '2,2,3,4,5,6,7,8,9', '3,2,3,4,5,6,7,8,9', '4,2,3,4,5,6,7,8,9', '5,2,3,4,5,6,7,8,9', '6,2,3,4,5,6,7,8,9', '7,2,3,4,5,6,7,8,9', '8,2,3,4,5,6,7,8,9']

Is this satisfactory?

答案2

得分: 0

你可以直接使用pandas将读取的csv拼接在一起:

  1. import pandas as pd
  2. from io import StringIO
  3. dataset1 = StringIO("""
  4. 0 0 0 0 0
  5. 0 0 5 0 0
  6. 3 0 0 1 0
  7. 1 2 3 4 5 6 7 8 9
  8. 1 2 3 4 5 6 7 8 9
  9. 1 2 3 4 5 6 7 8 9
  10. """)
  11. dataset2 = StringIO("""
  12. 0 0 0 0 0
  13. 0 0 2 0 0
  14. 1 2 3 4 5 6 7 8 9
  15. 1 2 3 4 5 6 7 8 9
  16. 1 2 3 4 5 6 7 8 9
  17. 1 2 3 4 5 6 7 8 9
  18. 1 2 3 4 5 6 7 8 9
  19. """)
  20. dataset_list = [dataset1, dataset2] # 将你的文件放在一个列表中 (例如使用glob获取所有csv文件)
  21. cols = [f"col{i}" for i in range(1, 10)] # 列名 (这里我们期望有9列)
  22. df = pd.concat([
  23. pd.read_csv(dataset, sep=" ", names=cols) # 根据你的实际数据设置分隔符 (默认为',')
  24. for dataset in dataset_list
  25. ]
  26. ).dropna()

输出:

  1. col1 col2 col3 col4 col5 col6 col7 col8 col9
  2. 3 1 2 3 4 5 6.0 7.0 8.0 9.0
  3. 4 1 2 3 4 5 6.0 7.0 8.0 9.0
  4. 5 1 2 3 4 5 6.0 7.0 8.0 9.0
  5. 2 1 2 3 4 5 6.0 7.0 8.0 9.0
  6. 3 1 2 3 4 5 6.0 7.0 8.0 9.0
  7. 4 1 2 3 4 5 6.0 7.0 8.0 9.0
  8. 5 1 2 3 4 5 6.0 7.0 8.0 9.0
  9. 6 1 2 3 4 5 6.0 7.0 8.0 9.0
英文:

You can concat the read csv directly with pandas:

  1. import pandas as pd
  2. from io import StringIO
  3. dataset1 = StringIO("""0 0 0 0 0
  4. 0 0 5 0 0
  5. 3 0 0 1 0
  6. 1 2 3 4 5 6 7 8 9
  7. 1 2 3 4 5 6 7 8 9
  8. 1 2 3 4 5 6 7 8 9
  9. """)
  10. dataset2 = StringIO("""0 0 0 0 0
  11. 0 0 2 0 0
  12. 1 2 3 4 5 6 7 8 9
  13. 1 2 3 4 5 6 7 8 9
  14. 1 2 3 4 5 6 7 8 9
  15. 1 2 3 4 5 6 7 8 9
  16. 1 2 3 4 5 6 7 8 9
  17. """)
  18. dataset_list = [dataset1, dataset2] # your files in a list (use glob to get all csv files for exemple)
  19. cols = [f"col{i}" for i in range(1, 10)] # column names (here we expect 9 columns)
  20. df = pd.concat([
  21. pd.read_csv(dataset, sep=" ", names=cols) # set separator according to your actual data (default=',')
  22. for dataset in dataset_list
  23. ]
  24. ).dropna()

Output:

  1. col1 col2 col3 col4 col5 col6 col7 col8 col9
  2. 3 1 2 3 4 5 6.0 7.0 8.0 9.0
  3. 4 1 2 3 4 5 6.0 7.0 8.0 9.0
  4. 5 1 2 3 4 5 6.0 7.0 8.0 9.0
  5. 2 1 2 3 4 5 6.0 7.0 8.0 9.0
  6. 3 1 2 3 4 5 6.0 7.0 8.0 9.0
  7. 4 1 2 3 4 5 6.0 7.0 8.0 9.0
  8. 5 1 2 3 4 5 6.0 7.0 8.0 9.0
  9. 6 1 2 3 4 5 6.0 7.0 8.0 9.0

答案3

得分: 0

假设您始终有一定数量的列(例如 N = 9),以及一个名为 filelist 的CSV文件列表,您可以使用以下代码:

  1. N = 9
  2. sep = r'\s+' # CSV分隔符(这里是空格)
  3. out = pd.concat([pd.read_csv(f, names=range(N), sep=sep).dropna()
  4. for f in filelist], ignore_index=True)

这段代码将合并 filelist 中所有CSV文件的数据,每个文件应该有9列,并使用空格作为分隔符。

英文:

Assuming you always have a defined number of columns (N = 9 for example) and a list of your csv files in filelist, you can use:

  1. N = 9
  2. sep= r'\s+' # csv separator (here spaces)
  3. out = pd.concat([pd.read_csv(f, names=range(N), sep=sep).dropna()
  4. for f in filelist], ignore_index=True)

huangapple
  • 本文由 发表于 2023年3月15日 21:39:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/75745507.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定