How can we merge column headers from multiple CSVs into one dataframe, and list source file names for each file in one column?
Question
Here is the code that I am testing.
# import necessary libraries
import pandas as pd
import os
import glob

# use glob to get all the csv files in the folder
path = 'C:\\Users\\'
csv_files = glob.glob(os.path.join(path, "*.csv"))
csv_files

df_headers = pd.DataFrame()

# loop over the list of csv files
for f in csv_files:
    #print(type(f))
    # read the csv file
    df = pd.read_csv(f, nrows=1)
    print(df.shape)
    df_headers = pd.concat([df_headers, df], axis=0)
    df_headers['file_name'] = f

df_headers.to_csv('C:\\Users\\ryans\\Desktop\\out.csv')
This almost works, but it writes the same file name into every row of df_headers['file_name'], so only the last file the loop processes is actually listed in 'file_name'.
Answer 1
Score: 1
This happens because assigning a single value to a whole column repeats that value in every row.
For example, suppose your csv files are:
csv_files = ['1.csv', '2.csv', '3.csv']
1st iteration
Your df will look like this:
|Some columns......|'file_name'|
|Some values.......|'1.csv'|
2nd iteration
Your df will look like this:
|Some columns......|'file_name'|
|Some values.......|'2.csv'|
|Some values.......|'2.csv'|
and so on. Each pass through the loop overwrites the whole column, including the rows added by earlier files.
Put simply, when you assign a single value to a column, every row in that column receives it. Given a dataframe like:
|A|B|
|1|2|
|3|4|
if you run
df['B'] = 5
all values in B become 5, so your dataframe becomes:
|A|B|
|1|5|
|3|5|
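The broadcasting behaviour described above can be checked directly. This is a minimal sketch using the same two-row dataframe; the variable name `after_scalar` is only introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3], "B": [2, 4]})

# Assigning a scalar broadcasts it to every row of the column.
df["B"] = 5
after_scalar = df["B"].tolist()  # both rows now hold 5
print(df)

# Assigning a list sets one value per row instead;
# its length must equal the number of rows in the dataframe.
df["B"] = [7, 8]
print(df)
```

The second assignment is the pattern the fix relies on: a list with one entry per row, not a single value.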
A solution for your case could be:
# import necessary libraries
import pandas as pd
import os
import glob

# use glob to get all the csv files in the folder
path = 'C:\\Users\\'
csv_files = glob.glob(os.path.join(path, "*.csv"))
csv_files

df_headers = pd.DataFrame()

# loop over the list of csv files
for f in csv_files:
    #print(type(f))
    # read the first data row of the csv file
    df = pd.read_csv(f, nrows=1)
    print(df.shape)
    df_headers = pd.concat([df_headers, df], axis=0)

# after the loop, assign the whole list: one file name per row
# (this only lines up if every file contributed exactly one row)
df_headers['file_name'] = csv_files

df_headers.to_csv('C:\\Users\\ryans\\Desktop\\out.csv')
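Assigning the whole list after the loop depends on every file contributing exactly one row; a CSV with a header but no data rows would make the lengths differ and raise an error. A possible alternative, sketched below, is to tag each file's rows before concatenating. The function name `collect_headers` and the `sorted` call are assumptions added for illustration, not part of the original code.

```python
import glob
import os

import pandas as pd


def collect_headers(folder):
    """Return the first row of every CSV in `folder` as one DataFrame,
    with a 'file_name' column holding the path each row came from."""
    frames = []
    # sorted() only makes the row order predictable; glob order is arbitrary
    for f in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df = pd.read_csv(f, nrows=1)
        # The scalar is broadcast to this file's rows only,
        # so earlier files are never overwritten.
        df["file_name"] = f
        frames.append(df)
    if not frames:
        # No CSV files found: return an empty frame instead of failing in concat
        return pd.DataFrame()
    # A single concat at the end avoids re-copying the result on every iteration
    return pd.concat(frames, axis=0, ignore_index=True)
```

With this in place, something like `collect_headers(path).to_csv('C:\\Users\\ryans\\Desktop\\out.csv', index=False)` should write the same output file as before. Here the scalar assignment is exactly what is wanted, because it happens on the small per-file dataframe, not on the accumulated one.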