How can we merge column headers from multiple CSVs into one dataframe, and list source file names for each file in one column?


Question

Here is the code that I am testing.

# import necessary libraries
import pandas as pd
import os
import glob
  
  
# use glob to get all the csv files 
# in the folder
path = 'C:\\Users\\'
csv_files = glob.glob(os.path.join(path, "*.csv"))
csv_files


df_headers = pd.DataFrame()


# loop over the list of csv files
for f in csv_files:
    #print(type(f))

    # read the csv file
    df = pd.read_csv(f, nrows=1)
    print(df.shape)
    
    
    df_headers = pd.concat([df_headers, df], axis=0)
    df_headers['file_name'] = f
      

df_headers.to_csv('C:\\Users\\ryans\\Desktop\\out.csv')

This almost works, but every row of df_headers['file_name'] ends up holding the name of the last file the loop processed, so only that one file is listed in 'file_name'.

Answer 1

Score: 1

This happens because assigning a single value to a whole column repeats that value in every row.

For example, suppose your list of CSV files is:

csv_files = ['1.csv', '2.csv', '3.csv']

1st iteration

Your df will look like this:

|Some columns......|'file_name'|
|Some values.......|'1.csv'|

2nd iteration

Your df will look like this:

|Some columns......|'file_name'|
|Some values.......|'2.csv'|
|Some values.......|'2.csv'|

and so on.

When you assign a single value to a column, every row in that column receives that value. Put simply, given a dataframe like

|A|B|
|1|2|
|3|4|

if you run

df['B'] = 5

every value in B becomes 5, so the dataframe becomes

|A|B|
|1|5|
|3|5|
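The broadcasting behaviour described above can be checked with a few lines. This is only a minimal sketch using the same two-row A/B dataframe; the variable names `scalar_result` and `list_result` are made up for the illustration.

```python
import pandas as pd

# Same two-row frame as in the example above.
df = pd.DataFrame({"A": [1, 3], "B": [2, 4]})

# Assigning a scalar broadcasts it to every row of the column.
df["B"] = 5
scalar_result = df["B"].tolist()

# Assigning a list sets one value per row; its length must match the row count.
df["B"] = ["first", "second"]
list_result = df["B"].tolist()

print(scalar_result)
print(list_result)
```

The second assignment is the behaviour the fix relies on: a list with one entry per row gives each row its own value instead of a repeated one.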

A solution for your case could be:

# import necessary libraries
import pandas as pd
import os
import glob


# use glob to get all the csv files
# in the folder
path = 'C:\\Users\\'
csv_files = glob.glob(os.path.join(path, "*.csv"))
csv_files


df_headers = pd.DataFrame()


# loop over the list of csv files
for f in csv_files:
    # read the first data row of the csv file
    df = pd.read_csv(f, nrows=1)
    print(df.shape)

    df_headers = pd.concat([df_headers, df], axis=0)

# assign the file names once, after the loop: one name per row
# (this assumes every file contributed exactly one row)
df_headers['file_name'] = csv_files


df_headers.to_csv('C:\\Users\\ryans\\Desktop\\out.csv')
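Assigning the whole list after the loop only lines up when every file contributes exactly one row; a CSV with a header but no data rows would make the lengths differ and the assignment fail. One possible alternative, sketched here and not taken from the original answer, is to tag each file's rows before concatenating. The helper name `collect_first_rows` and the temporary demo files are hypothetical and exist only for this illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def collect_first_rows(folder):
    """Return the first data row of every CSV in `folder`, tagged with its file name."""
    frames = []
    for f in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df = pd.read_csv(f, nrows=1)
        # Tag the rows of this file only, before combining them with the others.
        frames.append(df.assign(file_name=os.path.basename(f)))
    if not frames:
        # No CSV files found: return an empty frame with the expected column.
        return pd.DataFrame(columns=["file_name"])
    return pd.concat(frames, ignore_index=True)


# Demonstration with two throwaway files (hypothetical names and columns).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(os.path.join(tmp, "1.csv"), index=False)
    pd.DataFrame({"b": [5], "c": [6]}).to_csv(os.path.join(tmp, "2.csv"), index=False)
    result = collect_first_rows(tmp)

print(result)
```

Because the name is attached per file, the rows stay correctly labelled even if the files have different columns, and collecting the frames in a list and concatenating once avoids rebuilding the combined dataframe on every iteration.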

huangapple
  • Published on May 22, 2023, 08:26:18
  • Please keep this link when reposting: https://go.coder-hub.com/76302435.html