Reading 3-dimensional data from many files

# Question

I have many text files with data written in the following structure:

```
#ABTMTY
mdjkls 993583.17355
ebgtas 899443.47380
udenhr 717515.59788
paomen 491385.80901
gneavc 275411.91025
wesuii 119744.95306
ploppm 59145.56233

#MNTGHP
mdjkls 5668781.68669
ebgtas 3852468.72569
.
.
.
```


The filenames, such as "ang_001", "ang_002", etc., form the third dimension. I have to build a 3D matrix of the values, but I don't know how to do this efficiently.

I considered the following approach (see the sketch after this list):
1. Iterate over the files to get each filename (variable_1).
2. Go into each file and count how many times the 6-capital-letter code appears (variable_2). Then cut out the "table" parts with the lowercase code (variable_3) and the value, and paste them into a DataFrame.
3. End up with a series of DataFrames, each corresponding to a different variable_1.
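
As a rough illustration, here is a minimal sketch of that idea, assuming every table starts with a `#`-prefixed header and tables are separated by single blank lines; the helper name `parse_file` and the dict layout are illustrative choices, not part of the original post:

```python
import glob
import pandas as pd

def parse_file(path):
    """Return a dict mapping each 6-capital-letter header code to a
    DataFrame holding that table's (code, value) rows."""
    tables = {}
    with open(path) as f:
        # tables are separated by blank lines; each starts with "#CODE"
        for block in f.read().split("\n\n"):
            lines = block.strip().splitlines()
            if not lines or not lines[0].startswith("#"):
                continue
            rows = [line.split() for line in lines[1:]]
            df = pd.DataFrame(rows, columns=["code", "value"])
            tables[lines[0].lstrip("#")] = df.astype({"value": float})
    return tables

# one dict of tables per file; the filename is the third dimension
all_tables = {path: parse_file(path) for path in sorted(glob.glob("ang_*"))}
```

If every file contains the same headers and row codes in the same order, the values could then be stacked into a 3D NumPy array with `numpy.stack`.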

For now, I have tried iterating over a single file. First I count the occurrences of the 6-capital-letter codes, since they all start with "#":

```python
for ang_file in ang_all:
    with open(ang_file, "r") as f:  # context manager ensures the file is closed
        text = f.read()
    # each table header starts with "#", so this counts the tables
    count = text.count("#")
```
Then I iterate over the data tables contained in a single file and append each new table to the main DataFrame. Each table is 101 lines long, and the tables are separated by a single blank line.

```python
n = 0
for header in range(count):
    df_temp = pd.read_csv("ang_001.txt", delim_whitespace=True, skipinitialspace=True,
                          nrows=101, skiprows=1 + n * header, names=["code", "value"])
    df = pd.concat([df, df_temp], axis=0)  # df_tmp was a typo for df_temp
    n += 100
```

The problem is that there are around 1,000 such files, each over 20 MB. Even this short loop already takes a long time to run, and I will still have to process the data in the DataFrame somehow. Is there a better way to do this? Are there any Python packages that specialize in reading text files efficiently?




# Answer 1
**Score**: 1

Since Pandas 1.4.0 there is a new experimental engine for [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), which relies on the Arrow library's multithreaded CSV parser instead of the default C parser.

Also, you should avoid concatenating inside the for loop; append the intermediate results to a list instead and concatenate once at the end.

So refactoring your code like this should speed things up:
```python
n = 0
dfs = [df]
for header in range(count):
    df_temp = pd.read_csv(
        "ang_001.txt",
        engine="pyarrow",
        delim_whitespace=True,
        skipinitialspace=True,
        nrows=101,
        skiprows=1 + n * header,
        names=["code", "value"],
    )
    dfs.append(df_temp)
    n += 100

df = pd.concat(dfs, axis=0)
```
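
One caveat on the design: even with the pyarrow engine, this loop still re-reads the file from the top once per table because of `skiprows`. Since the tables have a fixed length and are separated by blank lines, a further variation (my own untested sketch, not part of the original answer) is to read each file once and feed the blocks to `read_csv` through `StringIO`:

```python
import io
import pandas as pd

with open("ang_001.txt") as f:
    text = f.read()

dfs = []
for block in text.split("\n\n"):  # one block per "#CODE" table
    lines = block.strip().splitlines()
    if not lines:
        continue
    body = "\n".join(lines[1:])  # drop the "#CODE" header line
    dfs.append(pd.read_csv(io.StringIO(body), sep=r"\s+", names=["code", "value"]))

df = pd.concat(dfs, axis=0)
```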
