Reading 3-dimensional data from many files
Question
I have many text files with data written in the following structure:
#ABTMTY
mdjkls 993583.17355
ebgtas 899443.47380
udenhr 717515.59788
paomen 491385.80901
gneavc 275411.91025
wesuii 119744.95306
ploppm 59145.56233
#MNTGHP
mdjkls 5668781.68669
ebgtas 3852468.72569
.
.
.
The file name, e.g. "ang_001", "ang_002", etc., is the third dimension. I have to build a 3D matrix of the values, but I don't know how to do this in an efficient way.
I thought about the following approach (a rough sketch of it follows the list):
1. Iterate over each file so I can get the filename (variable_1).
2. Go into each file and count how many times the 6-capital-letter code (variable_2) appears. Then cut out the "table" parts with the lowercase code (variable_3) and the value, and paste them into a DataFrame.
3. End up with a series of DataFrames, each corresponding to a different variable_1.
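A minimal sketch of this plan, assuming the files match a pattern like `ang_*.txt` and treating the dict-of-DataFrames layout and column names below as placeholders rather than final code:

```python
import glob

import pandas as pd

frames = {}                                   # variable_1 (file name) -> DataFrame
for path in sorted(glob.glob("ang_*.txt")):   # step 1: iterate over the files
    records = []
    current_header = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue                      # skip the blank line between tables
            if line.startswith("#"):
                current_header = line[1:]     # step 2: 6-capital-letter code (variable_2)
                continue
            code, value = line.split()        # step 2: lowercase code (variable_3) and its value
            records.append((current_header, code, float(value)))
    # step 3: one DataFrame per file
    frames[path] = pd.DataFrame(records, columns=["header", "code", "value"])
```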
For now I have tried to iterate over a single file. First I count the occurrences of the 6-capital-letter codes, since all of them start with "#":
```python
for ang_file in ang_all:
    file = open(ang_file, "r")
    text = file.read()
    count = text.count("#")  # each table begins with a "#XXXXXX" header line
```
Then I iterate over the data tables contained in a single file and append each new table to the main DataFrame. Each table is 101 lines long, and the tables are separated by a single blank line.
```python
import pandas as pd

n = 0
for header in range(count):
    df_temp = pd.read_csv("ang_001.txt", delim_whitespace=True, skipinitialspace=True,
                          nrows=101, skiprows=1 + n * header, names=["code", "value"])
    df = pd.concat([df, df_temp], axis=0)
    n += 100
```
The problem is that there are around 1000 such files, each of them above 20 MB. This one short loop already takes a lot of time to complete, and I will still have to work with the data in the DataFrame somehow. Is there a better way to do this? Are there any Python packages that specialize in reading text files efficiently?
# Answer 1
**Score**: 1
Since version 1.4.0 of pandas, there is a new experimental engine for `read_csv` that relies on the Arrow library's multithreaded CSV parser instead of the default C parser.
Also, you should avoid concatenating inside the for loop and append the results to a list instead.
So, refactoring your code like this should speed things up:
```python
n = 0
dfs = [df]
for header in range(count):
df_temp = pd.read_csv(
"ang_001.txt",
engine="pyarrow",
delim_whitespace=True,
skipinitialspace=True,
nrows=101,
skiprows=1 + n * header,
names=["code", "value"],
)
dfs.append(df_temp)
n += 100
df = pd.concat(dfs, axis=0)
```
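The question ultimately asks for a 3D matrix of values (files × capital-letter codes × lowercase codes). As a rough sketch only, assuming per-file DataFrames with `header`, `code` and `value` columns (like the hypothetical `frames` dict sketched in the question above) and that every file contains exactly the same codes, the stack could be built like this:

```python
import numpy as np

# One 2D table per file: rows = capital-letter codes, columns = lowercase codes.
# Assumes `frames` maps each file name to a DataFrame with
# ["header", "code", "value"] columns and that all files share the same codes.
file_names = sorted(frames)
values_3d = np.stack([
    frames[name]
    .pivot(index="header", columns="code", values="value")
    .to_numpy()
    for name in file_names
])
print(values_3d.shape)   # (n_files, n_capital_codes, n_lowercase_codes)
```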