Creating a database from multiple text files in Python

Question

I am trying to improve a Python script I recently wrote that opens a text file containing a list of energies, like this:

6.36271
5.37679
165.742
6.53952
...

For each entry, it checks whether the energy is present in any of the files in a dedicated folder (these files have the same format). If the energy is found, it returns the name of the file where it was found and the energy itself.

The problem is that the number of files is very large (>1000), and each one has a lot of energies to look into.

The software works, but it is very slow because it has to open every file each time it searches for an entry. I understand that it would work a lot faster if I could load all the files into a database and then query it.

The problem is that I have never worked with databases, and I have no idea how to create such a database from the thousands of files (keeping track of the file names) or how to search it once it has been created.

If someone could give me a hand, I would be very grateful.

Thanks.

Answer 1

Score: 0

As an alternative to creating a SQL database and querying it, you could store the data in a pandas DataFrame (and save it locally as a .csv or .xlsx file).

The feasibility depends on how many energy entries you have in each file, but pandas is capable of processing millions of rows very quickly.

Your dataframe could have two columns, with the first column storing the filenames and the second column storing the energy values:

Filename     Energy values
filename1    6.36271
filename1    5.37679
filename1    165.742
filename1    6.53952
filename2    7.3
filename2    6.36271
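
A minimal sketch of how such a two-column DataFrame could be built, assuming each file holds one energy per line. The function name `load_energies`, the column names `filename` and `energy`, and the `*.txt` pattern are illustrative assumptions, not something stated in the question:

```python
from pathlib import Path

import pandas as pd


def load_energies(folder, pattern="*.txt"):
    """Read every matching file in `folder` into one two-column DataFrame.

    Assumes one energy value per line; `pattern` is a guess at the file
    extension and should be adjusted to the real file names.
    """
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        with path.open() as handle:
            for line in handle:
                text = line.strip()
                if text:  # skip blank lines
                    rows.append((path.name, float(text)))
    return pd.DataFrame(rows, columns=["filename", "energy"])
```

Each file is opened only once, when the DataFrame is built. All later searches then run against the DataFrame in memory instead of reopening the files.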

You can then iterate through your list of energy entries and, for each entry, filter the DataFrame to show only the rows in which that entry was found.

For example, a search for 6.36271 would return this DataFrame:

Filename     Energy values
filename1    6.36271
filename2    6.36271

The Filename column then lists all the files containing that energy value.
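
The filtering step could look roughly like this. It is a sketch using the example values from the table above; the helper name `find_energy` and the column names are assumptions made for illustration:

```python
import pandas as pd

# Example data copied from the table above.
df = pd.DataFrame({
    "filename": ["filename1"] * 4 + ["filename2"] * 2,
    "energy": [6.36271, 5.37679, 165.742, 6.53952, 7.3, 6.36271],
})


def find_energy(frame, energy):
    """Return the names of the files whose rows contain `energy`."""
    matches = frame[frame["energy"] == energy]
    return matches["filename"].tolist()


print(find_energy(df, 6.36271))  # ['filename1', 'filename2']

# Several energies can also be looked up in one pass with isin().
wanted = [6.36271, 7.3, 1.0]
hits = df[df["energy"].isin(wanted)]
print(hits)
```

Exact equality works here because both sides are parsed from the same decimal text. If the values might differ in rounding, comparing with `numpy.isclose` or keeping the energies as strings may be safer.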

If you post a minimal working example of your code, I could update the answer with a possible implementation.
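
For comparison, the database route the question asks about could be sketched with Python's built-in `sqlite3` module. The table name `energies`, the helper names, and the `*.txt` pattern are assumptions. Energies are stored as text here so that a lookup matches the value exactly as written in the files:

```python
import sqlite3
from pathlib import Path


def build_database(folder, db_path=":memory:", pattern="*.txt"):
    """Load every matching file into a table of (filename, energy) rows.

    The default ":memory:" database is lost when the connection closes;
    pass a file path to keep it. Calling this twice on the same database
    file would insert the rows twice.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS energies (filename TEXT, energy TEXT)"
    )
    for path in sorted(Path(folder).glob(pattern)):
        with path.open() as handle:
            rows = [(path.name, line.strip()) for line in handle if line.strip()]
        conn.executemany("INSERT INTO energies VALUES (?, ?)", rows)
    # The index lets lookups avoid scanning the whole table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_energy ON energies (energy)")
    conn.commit()
    return conn


def files_containing(conn, energy):
    """Return the names of the files that contain `energy` (a string)."""
    cursor = conn.execute(
        "SELECT filename FROM energies WHERE energy = ? ORDER BY filename",
        (energy,),
    )
    return [row[0] for row in cursor]
```

Because the comparison is on text, "6.36271" and "6.362710" would not match. Storing the column as REAL and inserting `float(...)` values is an alternative if numeric comparison is wanted.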

huangapple
  • Published on January 9, 2023, 19:12:33
  • When reposting, please keep this link: https://go.coder-hub.com/75056448.html