SQL: Select Running Total For Each Category In Transactional Table, Sorted By Date
# Question
<details>
<summary>English:</summary>
I have a transactional table with an alphanumeric primary key, a timestamp, a user_id, an in/out string column, and an amount column:
id | time | user_id | io | amount |
---|---|---|---|---|
38hw | 2019-10-18 18:35:09 | 2 | in | 1 |
nv49 | 2019-10-18 18:35:10 | 3 | in | 50 |
83ha | 2019-10-18 18:35:11 | 5 | in | 2 |
ja03 | 2019-10-18 18:35:12 | 4 | out | 2 |
019c | 2019-10-18 18:35:13 | 1 | out | 75 |
ac5r | 2019-10-18 18:35:14 | 3 | in | 20 |
as30 | 2019-10-18 18:35:15 | 3 | in | 3 |
34ds | 2019-10-18 18:35:16 | 4 | in | 7 |
12my | 2019-10-18 18:35:17 | 2 | in | 50 |
dk20 | 2019-10-18 18:35:18 | 4 | in | 50 |
sk18 | 2019-10-18 18:35:19 | 1 | in | 7 |
am35 | 2019-10-18 18:35:20 | 2 | in | 3 |
mc92 | 2019-10-18 18:35:21 | 2 | out | 8 |
alov | 2019-10-18 18:35:22 | 3 | in | 4 |
ap34 | 2019-10-18 18:35:23 | 1 | out | 6 |
I am trying to create another column that provides the running total for each user_id every time it shows up. A user_id never appears with an initial 0 amount, so a starting balance of 0 must be assumed the first time it shows up.
I suspect this is possible with some helper columns. My thought process is something like this:
- Create a column to indicate how many times the user_id shows up prior to the value in the `time` column, maybe called `occurence_num`.
- Create a column that makes the amount easier to work with, like `(case when io='in' then amount else -1*amount end) as balance_adjust`.
- Group by `user_id` and (on each row) `sum()` all the `balance_adjust` values where `occurence_num` is less than the current record's.
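
The helper-column idea above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `sqlite3` module with an in-memory copy of a few sample rows, it needs an SQLite build with window-function support (3.25 or newer), and it uses `<=` rather than a strict "less than" so that the current row is included, which is what the desired output implies.

```python
import sqlite3

# Hypothetical in-memory copy of a few sample rows; table and column names
# follow the question, everything else is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(id TEXT PRIMARY KEY, time TEXT, user_id INTEGER, io TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        ("nv49", "2019-10-18 18:35:10", 3, "in", 50),
        ("ja03", "2019-10-18 18:35:12", 4, "out", 2),
        ("ac5r", "2019-10-18 18:35:14", 3, "in", 20),
        ("as30", "2019-10-18 18:35:15", 3, "in", 3),
        ("34ds", "2019-10-18 18:35:16", 4, "in", 7),
    ],
)

helper_sql = """
WITH helper AS (
    SELECT id, time, user_id,
           -- signed amount: 'in' rows positive, all other rows negative
           CASE WHEN io = 'in' THEN amount ELSE -amount END AS balance_adjust,
           -- 1-based position of the row within its user, ties broken by id
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time, id) AS occurence_num
    FROM transactions
)
SELECT cur.id, cur.user_id, cur.occurence_num,
       (SELECT SUM(prev.balance_adjust)
        FROM helper AS prev
        WHERE prev.user_id = cur.user_id
          AND prev.occurence_num <= cur.occurence_num) AS running_total
FROM helper AS cur
ORDER BY cur.time, cur.id
"""
helper_rows = conn.execute(helper_sql).fetchall()
for row in helper_rows:
    print(row)
```

Because the sketch already relies on a window function for `occurence_num`, and the correlated subquery re-reads each user's earlier rows for every row, a direct windowed `SUM` is likely to be both simpler and cheaper on a large table.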
I'm having a hard time testing these ideas out though. I'm working in a fairly large database, SQLite with 22 million rows. The table can be altered/updated as needed. It was stored this way in favor of keeping the ETL as simple as possible because there was a lot of data to pull and a lot of pages to pull from. My desired output would look something like this:
id | time | user_id | io | amount | running_total |
---|---|---|---|---|---|
38hw | 2019-10-18 18:35:09 | 2 | in | 1 | 1 |
nv49 | 2019-10-18 18:35:10 | 3 | in | 50 | 50 |
83ha | 2019-10-18 18:35:11 | 5 | in | 2 | 2 |
ja03 | 2019-10-18 18:35:12 | 4 | out | 2 | -2 |
019c | 2019-10-18 18:35:13 | 1 | out | 75 | -75 |
ac5r | 2019-10-18 18:35:14 | 3 | in | 20 | 70 |
as30 | 2019-10-18 18:35:15 | 3 | in | 3 | 73 |
34ds | 2019-10-18 18:35:16 | 4 | in | 7 | 5 |
12my | 2019-10-18 18:35:17 | 2 | in | 50 | 51 |
dk20 | 2019-10-18 18:35:18 | 4 | in | 50 | 55 |
sk18 | 2019-10-18 18:35:19 | 1 | in | 7 | -68 |
am35 | 2019-10-18 18:35:20 | 2 | in | 3 | 54 |
mc92 | 2019-10-18 18:35:21 | 2 | out | 8 | 46 |
alov | 2019-10-18 18:35:22 | 3 | in | 4 | 77 |
ap34 | 2019-10-18 18:35:23 | 1 | out | 6 | -74 |
I can get the overall total of each user this way, but it takes a few minutes:
```sql
SELECT
    user_id,
    SUM(CASE WHEN io = 'in' THEN amount ELSE -1 * amount END) AS balance
FROM
    transactions
GROUP BY
    user_id
```
I think expanding on this with an `OVER`/`PARTITION BY` clause would be a good call, but I'm not sure it's the right one given the size of this database.
Thanks for the help.
Edit: I should mention, the real data may include duplicates in the time column. Transactions could have occurred at the same time, as it's only granular to the second.
</details>
# Answer 1
**Score**: 1
A small tweak to your attempt should do it. Turn your sum into a *running sum* with the corresponding window function: it computes the running amount by partitioning on `user_id` and ordering by `time`.
If you have tied times, add `id` to the ordering; it breaks the tie so that each row gets its own running total.
```sql
SELECT *,
       SUM(CASE WHEN io = 'in'
                THEN amount
                ELSE -amount
           END) OVER(PARTITION BY user_id ORDER BY time, id) AS balance
FROM transactions
ORDER BY time
```
Output:
id | time | user_id | io | amount | balance |
---|---|---|---|---|---|
38hw | 2019-10-18 18:35:09 | 2 | in | 1 | 1 |
nv49 | 2019-10-18 18:35:10 | 3 | in | 50 | 50 |
83ha | 2019-10-18 18:35:11 | 5 | in | 2 | 2 |
ja03 | 2019-10-18 18:35:12 | 4 | out | 2 | -2 |
019c | 2019-10-18 18:35:13 | 1 | out | 75 | -75 |
ac5r | 2019-10-18 18:35:14 | 3 | in | 20 | 70 |
as30 | 2019-10-18 18:35:15 | 3 | in | 3 | 73 |
34ds | 2019-10-18 18:35:16 | 4 | in | 7 | 5 |
12my | 2019-10-18 18:35:17 | 2 | in | 50 | 51 |
dk20 | 2019-10-18 18:35:18 | 4 | in | 50 | 55 |
sk18 | 2019-10-18 18:35:19 | 1 | in | 7 | -68 |
am35 | 2019-10-18 18:35:20 | 2 | in | 3 | 54 |
mc92 | 2019-10-18 18:35:21 | 2 | out | 8 | 46 |
alov | 2019-10-18 18:35:22 | 3 | in | 4 | 77 |
ap34 | 2019-10-18 18:35:23 | 1 | out | 6 | -74 |
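
To see why `id` belongs in the window's `ORDER BY`, here is a small sketch using Python's standard `sqlite3` module and three made-up rows (not from the question) where two transactions share the same second. With `ORDER BY time` alone, SQLite's default window frame treats rows with equal `time` as peers, so both tied rows receive the total that includes both of them. The helper name `running_totals` is hypothetical, and SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(id TEXT PRIMARY KEY, time TEXT, user_id INTEGER, io TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        # Made-up rows: the first two share the same timestamp.
        ("a1", "2019-10-18 18:35:09", 1, "in", 10),
        ("a2", "2019-10-18 18:35:09", 1, "out", 4),
        ("a3", "2019-10-18 18:35:10", 1, "in", 1),
    ],
)


def running_totals(order_by):
    """Return (id, balance) pairs for the given window ORDER BY column list."""
    sql = f"""
        SELECT id,
               SUM(CASE WHEN io = 'in' THEN amount ELSE -amount END)
                   OVER (PARTITION BY user_id ORDER BY {order_by}) AS balance
        FROM transactions
        ORDER BY time, id
    """
    return conn.execute(sql).fetchall()


time_only = running_totals("time")         # tied rows are peers
time_then_id = running_totals("time, id")  # id breaks the tie

print(time_only)     # [('a1', 6), ('a2', 6), ('a3', 7)]
print(time_then_id)  # [('a1', 10), ('a2', 6), ('a3', 7)]
```

Ordering by `id` makes the result deterministic, but since the ids are alphanumeric it does not necessarily reflect the true order in which same-second transactions happened.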
Note: the final `ORDER BY time` clause is not needed; it's just for visualization purposes.
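
On the question's concern about table size, one thing that may be worth trying is an index whose column order mirrors the window definition. The sketch below is an assumption rather than a measured result: the index name `idx_transactions_user_time_id` is hypothetical, the data is a tiny in-memory sample, and whether SQLite actually uses the index to avoid a sort depends on the version and the query, so the plan is printed for inspection rather than relied upon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(id TEXT PRIMARY KEY, time TEXT, user_id INTEGER, io TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        ("nv49", "2019-10-18 18:35:10", 3, "in", 50),
        ("ja03", "2019-10-18 18:35:12", 4, "out", 2),
        ("ac5r", "2019-10-18 18:35:14", 3, "in", 20),
        ("34ds", "2019-10-18 18:35:16", 4, "in", 7),
    ],
)

# Hypothetical index; its columns follow PARTITION BY user_id ORDER BY time, id.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_transactions_user_time_id "
    "ON transactions (user_id, time, id)"
)

query = """
SELECT id,
       SUM(CASE WHEN io = 'in' THEN amount ELSE -amount END)
           OVER (PARTITION BY user_id ORDER BY time, id) AS balance
FROM transactions
"""

# The plan text differs between SQLite versions, so it is only printed here.
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)

balances = dict(conn.execute(query).fetchall())
print(balances)
```

Comparing the printed plan and the run time with and without the index on the real 22-million-row table is the only reliable way to tell whether it helps.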