March 15, 2023, 21:39:37

Python Pandas merge multiple CSV files with similar structure

Question

You are reading multiple CSV files into a single dataframe with pandas and run into trouble when dropping rows with missing values.

In the provided code, calling df.dropna(axis=0, how='any', inplace=True) on the combined dataframe left it empty.

With how='any', a row is dropped as soon as any one of its columns is NaN. The combined dataframe has as many columns as the names you passed to read_csv, so every row that is shorter than that is padded with NaN and counts as incomplete. If every row ends up with at least one NaN, for example because some column is empty throughout, nothing survives.

To address this, you can drop the incomplete rows from each file before concatenating. Here's an example of how you can do this:

import os
import pandas as pd

path = '/path1234'
n_cols = 9  # assumed number of columns in a complete row; adjust to your files

frames = []
for file in sorted(os.listdir(path)):
    file_path = os.path.join(path, file)
    # Explicit names let short rows be padded with NaN instead of raising a parser error
    temp_df = pd.read_csv(file_path, sep='\t', header=None, names=range(n_cols))
    frames.append(temp_df.dropna(how='any'))  # Drop rows with missing values
# Concatenate once at the end rather than growing a dataframe inside the loop
df = pd.concat(frames, ignore_index=True)
print(df)

This code reads each CSV file, drops rows with missing values in each file, and then concatenates them into a single dataframe. This approach should handle varying structures in your CSV files more gracefully.
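
To make the effect of how='any' concrete, here is a minimal sketch on a made-up frame. The column names and the idea that one column is empty in every row are assumptions, since the real files are not shown. It only illustrates how a single all-NaN column empties the result, and two possible ways around that.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row is a short preamble row,
# and "col3" is empty in every row.
df = pd.DataFrame(
    {
        "col0": [0, 1, 1],
        "col1": [np.nan, 2, 2],
        "col2": [np.nan, 3, 3],
        "col3": [np.nan, np.nan, np.nan],
    }
)

# how="any" drops a row if any column is NaN, so the all-NaN column removes every row.
print(len(df.dropna(how="any")))

# Option 1: drop columns that are entirely empty first, then drop incomplete rows.
cleaned = df.dropna(axis=1, how="all").dropna(axis=0, how="any")
print(cleaned)

# Option 2: only require values in the columns you actually care about.
by_subset = df.dropna(subset=["col1", "col2"])
print(by_subset)
```

Which option fits depends on whether the empty columns carry any meaning in your data; if they do, the subset variant keeps them.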

I have tried all the instructions from here and just cannot get it to work.
I want to use pandas to read all CSV files from a folder and write them to a single dataframe.
The CSV files are all almost the same, but in certain columns the first x rows are empty.
I want to delete these rows anyway; their number can vary from file to file.
The whole thing seems to cause problems, because CSV files with different structures cannot be combined.
How could I work around this problem?

To illustrate, this is roughly what the datasets look like:

Dataset 1:
0 0 0 0 0 dont need this
0 0 5 0 0 dont need this
3 0 0 1 0 dont need this
1 2 3 4 5 6 7 8 9 
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
Dataset 2:
0 0 0 0 0 dont need this
0 0 2 0 0 dont need this
1 2 3 4 5 6 7 8 9 
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9

Do you understand what I mean?

I have tried every instruction here, but nothing has worked.

EDIT:
I tried the example of @Tranbi.
Here is what I got:

import os
import pandas as pd

path = '/path1234'
filelist = os.listdir(path)
dataset_list = [os.path.join(path, file) for file in filelist]
cols = [f"col{i}" for i in range(0, 402)]  # column names (402 columns expected here)
df = pd.concat([
    pd.read_csv(dataset, sep="\t", names=cols)  # set separator according to your actual data (default=',')
    for dataset in dataset_list
    ]
)
#df.dropna(axis = 0,how ='any',inplace=True)
print(df)

Output:
     col0      col1      col2  col3  ...  col398  col399  col400  col401
0       0         0         0     0  ...     NaN     NaN     NaN     NaN
1       0         0         0    12  ...     NaN     NaN     NaN     NaN
2       0        18         0     0  ...     NaN     NaN     NaN     NaN
3      30         0        89     0  ...     NaN     NaN     NaN     NaN
4       0         0         0    36  ...     NaN     NaN     NaN     NaN
..    ...       ...       ...   ...  ...     ...     ...     ...     ...
206    30  22:17:04  22:23:14     0  ...     0.0     0.0     0.0     0.0
207    43  22:20:47  22:27:16     0  ...     0.0     0.0     0.0     0.0
208    43  22:24:28  22:30:57     0  ...     0.0     0.0     0.0     0.0
209    49  22:27:28  22:33:39     0  ...     0.0     0.0     0.0     0.0
210    43  22:34:44  22:41:13     0  ...     0.0     0.0     0.0     0.0

[1375 rows x 402 columns]

If I use

df.dropna(axis = 0,how ='any',inplace=True)

the result is an empty dataframe:
Empty DataFrame
Columns: [col0, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25, col26, col27, col28, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col59, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col94, col95, col96, col97, col98, col99, ...]
Index: []

[0 rows x 402 columns]

Where is the mistake?

Answer 1

Score: 1

import glob

if __name__ == "__main__":
    combined_dataset = []
    for filename in glob.glob("*.csv"):
        with open(filename, "r", encoding="utf-8") as file:
            data = file.read().split("\n")
        data = [elt for elt in data if len(elt.split(",")) == 9]
        combined_dataset.extend(data)
    print(combined_dataset)

I created 2 csv files containing the following:

0,0,0,0,0
0,0,5,0,0
3,0,0,1,0
1,2,3,4,5,6,7,8,9
2,2,3,4,5,6,7,8,9
3,2,3,4,5,6,7,8,9

0,0,0,0,0
0,0,2,0,0
4,2,3,4,5,6,7,8,9
5,2,3,4,5,6,7,8,9
6,2,3,4,5,6,7,8,9
7,2,3,4,5,6,7,8,9
8,2,3,4,5,6,7,8,9

The result is:

['1,2,3,4,5,6,7,8,9', '2,2,3,4,5,6,7,8,9', '3,2,3,4,5,6,7,8,9', '4,2,3,4,5,6,7,8,9', '5,2,3,4,5,6,7,8,9', '6,2,3,4,5,6,7,8,9', '7,2,3,4,5,6,7,8,9', '8,2,3,4,5,6,7,8,9']

Is this satisfactory?
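
Since the question asks for a dataframe rather than a list of strings, the filtered lines could be handed to pandas afterwards. This is only a sketch: the three sample lines are taken from the result above, and parsing them through StringIO is one possible approach, not part of the original answer.

```python
from io import StringIO

import pandas as pd

# A few lines as produced by the filtering step (sample values from the result above).
combined_dataset = ["1,2,3,4,5,6,7,8,9", "2,2,3,4,5,6,7,8,9", "4,2,3,4,5,6,7,8,9"]

# Join the kept lines back into CSV text and let pandas parse them.
df = pd.read_csv(StringIO("\n".join(combined_dataset)), header=None)
print(df)
```

With header=None the columns are simply numbered 0 to 8; pass names=... if you want labels.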

Answer 2

Score: 0

You can concatenate the CSV files directly as you read them with pandas:

import pandas as pd
from io import StringIO

dataset1 = StringIO("""0 0 0 0 0
0 0 5 0 0
3 0 0 1 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")
dataset2 = StringIO("""0 0 2 0 0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
""")
dataset_list = [dataset1, dataset2]  # your files in a list (use glob to get all csv files, for example)
cols = [f"col{i}" for i in range(1, 10)]  # column names (here we expect 9 columns)
df = pd.concat([
    pd.read_csv(dataset, sep=" ", names=cols)  # set separator according to your actual data (default=',')
    for dataset in dataset_list
    ]
).dropna()
print(df)

Output:

   col1  col2  col3  col4  col5  col6  col7  col8  col9
3     1     2     3     4     5   6.0   7.0   8.0   9.0
4     1     2     3     4     5   6.0   7.0   8.0   9.0
5     1     2     3     4     5   6.0   7.0   8.0   9.0
2     1     2     3     4     5   6.0   7.0   8.0   9.0
3     1     2     3     4     5   6.0   7.0   8.0   9.0
4     1     2     3     4     5   6.0   7.0   8.0   9.0
5     1     2     3     4     5   6.0   7.0   8.0   9.0
6     1     2     3     4     5   6.0   7.0   8.0   9.0
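
For real files, the list could be built with glob as the comment above suggests. The sketch below writes two small sample files to a temporary folder so that it runs on its own; the file names and contents are made up. It also resets the index, since the output above repeats each file's own row numbers.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small sample files to a temporary folder (stand-ins for the real CSVs).
folder = tempfile.mkdtemp()
samples = {
    "a.csv": "0 0 0 0 0\n1 2 3 4 5 6 7 8 9\n",
    "b.csv": "0 0 2 0 0\n1 2 3 4 5 6 7 8 9\n1 2 3 4 5 6 7 8 9\n",
}
for name, text in samples.items():
    with open(os.path.join(folder, name), "w", encoding="utf-8") as fh:
        fh.write(text)

cols = [f"col{i}" for i in range(1, 10)]  # 9 column names, as in the answer
dataset_list = sorted(glob.glob(os.path.join(folder, "*.csv")))

df = (
    pd.concat([pd.read_csv(f, sep=" ", names=cols) for f in dataset_list])
    .dropna()                 # remove the short rows, which were padded with NaN
    .reset_index(drop=True)   # renumber the remaining rows 0..n-1
)
print(df)
```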

Answer 3

Score: 0

Assuming you always have a defined number of columns (N = 9 for example) and a list of your csv files in filelist, you can use:

import pandas as pd

N = 9
sep = r'\s+'  # csv separator (here any run of whitespace)
out = pd.concat([pd.read_csv(f, names=range(N), sep=sep).dropna()
                 for f in filelist], ignore_index=True)
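
To try this without touching the file system, filelist can hold in-memory buffers instead of paths. The sample data below is hypothetical; the second buffer mixes tabs and double spaces to show that the r'\s+' separator treats any run of whitespace as one delimiter.

```python
from io import StringIO

import pandas as pd

N = 9
sep = r"\s+"  # any run of spaces or tabs separates two fields

# In-memory stand-ins for the files in filelist (made-up sample data).
filelist = [
    StringIO("0 0 0 0 0\n3 0 0 1 0\n1 2 3 4 5 6 7 8 9\n"),
    StringIO("0\t0\t2\t0\t0\n4  2  3  4  5  6  7  8  9\n"),
]

# Short rows are padded with NaN up to N columns and then dropped per file.
out = pd.concat(
    [pd.read_csv(f, names=range(N), sep=sep).dropna() for f in filelist],
    ignore_index=True,
)
print(out)
```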
