Reading 3-dimensional data from many files
Question
I have many text files with data written in the following structure:
#ABTMTY
mdjkls 993583.17355
ebgtas 899443.47380
udenhr 717515.59788
paomen 491385.80901
gneavc 275411.91025
wesuii 119744.95306
ploppm 59145.56233
#MNTGHP
mdjkls 5668781.68669
ebgtas 3852468.72569
.
.
.
The file name, e.g. "ang_001", "ang_002", etc., is the third dimension. I have to build a 3D matrix of the values, but I don't know how to do this in an efficient way.
I thought about the following approach (a rough sketch of it follows the list):
1. Iterate over each file so I can get the filename (variable_1).
2. Go into each file and count how many times the 6-capital-letter code (variable_2) appears. Then cut out the "table" parts with the lowercase code (variable_3) and the value, and paste them into a DataFrame.
3. End up with a series of DataFrames, each corresponding to a different variable_1.
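A minimal sketch of this plan, assuming the files match a pattern like `ang_*.txt` and treating the dict-of-DataFrames layout and column names below as placeholders rather than final code:

```python
import glob

import pandas as pd

frames = {}                                   # variable_1 (file name) -> DataFrame
for path in sorted(glob.glob("ang_*.txt")):   # step 1: iterate over the files
    records = []
    current_header = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue                      # skip the blank line between tables
            if line.startswith("#"):
                current_header = line[1:]     # step 2: 6-capital-letter code (variable_2)
                continue
            code, value = line.split()        # step 2: lowercase code (variable_3) and its value
            records.append((current_header, code, float(value)))
    # step 3: one DataFrame per file
    frames[path] = pd.DataFrame(records, columns=["header", "code", "value"])
```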
For now I have tried to iterate over a single file. First I count the occurrences of the 6-capital-letter codes, since all of them start with "#":
```python
for ang_file in ang_all:
    file = open(ang_file, "r")
    text = file.read()
    count = text.count("#")  # each table begins with a "#XXXXXX" header line
```
Then I iterate over the data tables contained in a single file and append each new table to the main DataFrame. Each table is 101 lines long, and the tables are separated by a single blank line.
```python
import pandas as pd

n = 0
for header in range(count):
    df_temp = pd.read_csv("ang_001.txt", delim_whitespace=True, skipinitialspace=True,
                          nrows=101, skiprows=1 + n * header, names=["code", "value"])
    df = pd.concat([df, df_temp], axis=0)
    n += 100
```
The problem is that there are around 1000 such files, each of them above 20 MB. This one short loop already takes a lot of time to complete, and I will still have to work with the data in the DataFrame somehow. Is there a better way to do this? Are there any Python packages that specialize in reading text files efficiently?
# Answer 1
**Score**: 1
Since version 1.4.0 of pandas, there is a new experimental engine for `read_csv` that relies on the Arrow library's multithreaded CSV parser instead of the default C parser.
Also, you should avoid concatenating inside the for loop and append the results to a list instead.
So, refactoring your code like this should speed things up:
```python
n = 0
dfs = [df]
for header in range(count):
df_temp = pd.read_csv(
"ang_001.txt",
engine="pyarrow",
delim_whitespace=True,
skipinitialspace=True,
nrows=101,
skiprows=1 + n * header,
names=["code", "value"],
)
dfs.append(df_temp)
n += 100
df = pd.concat(dfs, axis=0)
```
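The question ultimately asks for a 3D matrix of values (files × capital-letter codes × lowercase codes). As a rough sketch only, assuming per-file DataFrames with `header`, `code` and `value` columns (like the hypothetical `frames` dict sketched in the question above) and that every file contains exactly the same codes, the stack could be built like this:

```python
import numpy as np

# One 2D table per file: rows = capital-letter codes, columns = lowercase codes.
# Assumes `frames` maps each file name to a DataFrame with
# ["header", "code", "value"] columns and that all files share the same codes.
file_names = sorted(frames)
values_3d = np.stack([
    frames[name]
    .pivot(index="header", columns="code", values="value")
    .to_numpy()
    for name in file_names
])
print(values_3d.shape)   # (n_files, n_capital_codes, n_lowercase_codes)
```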