英文:
Optimizing the Django model for large tables
问题
考虑一个拥有100万用户和10亿交易的银行。
所有10亿交易会放在一个表中吗?
1-20亿条记录是正常的还是太多了?
如果我们想要查看一个用户的交易,我们每次都要搜索整个表吗?
是这样运作的吗,还是有更好的方法?
英文:
Consider a bank with 1 million users and 1 billion transactions.
Will all one billion transactions be placed in one table?
Is 1-2 billion records normal or too much?
If we want to see the transactions of a user, do we search this entire table every time?
Is this the way it works or is there a better way?
答案1
得分: 2
No, 数据库保留索引。索引允许在数据库中高效搜索。对于ForeignKey
,它会自动创建索引。
索引的实现取决于数据库,但通常是类似B树的树状结构,这意味着我们可以在*O(log n)*时间内找到要提取的行,这在记录数量较多时性能相当好。因此,数据库无需进行线性搜索。
如果您想要检索用户过去n年的索引,通常会创建一个复合索引,使得模型如下:
from django.conf import settings
from django.db import models
class Transaction(models.Model):
user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
timestamp = models.DateTimeField(auto_now_add=True)
class Meta:
indexes = [
models.<b>Index(</b>fields=['user', 'timestamp']<b>)</b>,
]
这将timestamp
和user
字段合并为一个索引,使得检索给定用户过去一年的记录通常非常高效。
数据库被设计为尽可能高效地检索数据,因此只有在最后的情况下才会执行“线性扫描”,即数据库必须枚举所有记录。
然而,通常隐私规则要求银行在某个时候删除旧记录。如果我记得没错,在欧盟,银行被要求在一定年限后删除交易记录(尽管它们也被要求在某些年限内保留这些记录)。
英文:
> If we want to see the transactions of a user, do we search this entire table every time? Is this the way it works or is there a better way?
No, a database keeps indexes. Indexes allow searching in a database efficiently. For ForeignKey
s, it makes indexes automatically.
The implementation of indexes depends on the database, but typically it is a tree-like structure like a B-tree, this means that we detect which rows to fetch in O(log n), which thus does scale quite well with the number of records. The database thus has not to do a linear search.
If you want to retrieve the indexes of a user for the last n years, you typically make a composite index, so the model looks like:
<pre><code>from django.conf import settings
class Transaction(models.Model):
user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
timestamp = models.DateTimeField(auto_now_add=True)
class Meta:
indexes = [
models.<b>Index(</b>fields=['user', 'timestamp']<b>)</b>,
]</code></pre>
This combines the timestamp
and user
fields in an index, making queries to retrieve the records for the last year for a given user usually quite efficient.
Databases are designed to try to retrieve data as efficient as possible, and thus will thus only as a last resort do a "linear scan" where the database has to enumerate over all records.
often however privacy rules require the bank to delete old records anyway. If I recall correctly, in the EU, banks are required to remove the transactions after a certain number of years (although they are required to keep these for a certain number of years as well).
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论