January 9, 2023, 19:12:33

Database from multiple text files in Python

Question

I am trying to improve some Python code I recently wrote. It opens a text file containing a list of energies, like this:

and, for each entry, checks whether that energy is present in any of the files in a dedicated folder (which have the same format). If the energy is found, it returns the energy and the name of the file where it was found.

The problem is that there are a lot of files (more than 1,000), and each one contains many energies to search through.

The code works, but it is very slow because it has to open every file for every entry it searches for. I understand it would be much faster if I could load all the files into a database and then query that.

However, I have never worked with databases, and I have no idea how to create one from the thousand-plus files (while keeping track of the file names), or how to search it once it has been created.

If someone could give me a hand, I would be very grateful.

Thanks

Answer 1

Score: 0

As an alternative to creating a database and querying it with SQL, you could store the data in a pandas DataFrame (and save it locally as a .csv or .xlsx file).

The feasibility depends on how many energy entries each file contains, but pandas can process millions of rows very quickly.

Your dataframe could have two columns, with the first column storing the filenames and the second column storing the energy values:

Filename	Energy values
filename1	6.36271
filename1	5.37679
filename1	165.742
filename1	6.53952
filename2	7.3
filename2	6.36271
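
One way to build such a table is sketched below. The question's sample file format is not shown here, so this assumes each file holds plain numeric energies separated by whitespace or newlines; the function name `build_energy_table` and the folder name in the usage comment are hypothetical.

```python
from pathlib import Path

import pandas as pd


def build_energy_table(folder):
    """Read every file in `folder` once and return a two-column DataFrame.

    Assumption: files contain numeric energies separated by whitespace.
    Tokens that cannot be parsed as numbers are skipped.
    """
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        with path.open() as handle:
            for line in handle:
                for token in line.split():
                    try:
                        value = float(token)
                    except ValueError:
                        continue  # not a number, e.g. a header word
                    rows.append((path.name, value))
    return pd.DataFrame(rows, columns=["Filename", "Energy values"])


# Hypothetical usage: read the folder once, then cache the result on disk.
# table = build_energy_table("energy_files")
# table.to_csv("energies.csv", index=False)
```

Because each file is opened only once, the expensive part (disk access) no longer scales with the number of energies you search for.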

You can then iterate through your list of energy entries and, for each entry, filter the DataFrame to show only the rows in which that entry was found.

For example, a search for 6.36271 would return this DataFrame:

Filename	Energy values
filename1	6.36271
filename2	6.36271

And then you have all the files containing the energy value in the Filename column.
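
The filtering step could look roughly like this. It is only a sketch using the sample values from the table above; the helper names `find_energy` and `find_energies` are made up for illustration, and the comparison is an exact float match, which works when both the table and the search list were parsed from identically written numbers.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "Filename": ["filename1"] * 4 + ["filename2"] * 2,
    "Energy values": [6.36271, 5.37679, 165.742, 6.53952, 7.3, 6.36271],
})


def find_energy(table, energy):
    """Return the rows whose energy equals `energy` exactly."""
    return table[table["Energy values"] == energy]


def find_energies(table, energies):
    """Return the rows matching any value in `energies`, in one pass."""
    return table[table["Energy values"].isin(energies)]


matches = find_energy(df, 6.36271)
print(matches["Filename"].tolist())
```

If the list of energies to look up is long, `find_energies` avoids a Python-level loop by checking all of them in a single `isin` call.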

If you post a minimal working example of your code, I could update the answer with a possible implementation.
