2023-05-18 02:03:28

SQL: Select Running Total For Each Category In Transactional Table, Sorted By Date

Question

<details>
<summary>English:</summary>

I have a transactional table with the following columns:
an alphanumeric PK, a timestamp, a user_id, an in/out string column, and an amount column.

| id | time | user_id | io | amount |
|----|------|---------|----|--------|
| 38hw | 2019-10-18 18:35:09 | 2 | in | 1 |
| nv49 | 2019-10-18 18:35:10 | 3 | in | 50 |
| 83ha | 2019-10-18 18:35:11 | 5 | in | 2 |
| ja03 | 2019-10-18 18:35:12 | 4 | out | 2 |
| 019c | 2019-10-18 18:35:13 | 1 | out | 75 |
| ac5r | 2019-10-18 18:35:14 | 3 | in | 20 |
| as30 | 2019-10-18 18:35:15 | 3 | in | 3 |
| 34ds | 2019-10-18 18:35:16 | 4 | in | 7 |
| 12my | 2019-10-18 18:35:17 | 2 | in | 50 |
| dk20 | 2019-10-18 18:35:18 | 4 | in | 50 |
| sk18 | 2019-10-18 18:35:19 | 1 | in | 7 |
| am35 | 2019-10-18 18:35:20 | 2 | in | 3 |
| mc92 | 2019-10-18 18:35:21 | 2 | out | 8 |
| alov | 2019-10-18 18:35:22 | 3 | in | 4 |
| ap34 | 2019-10-18 18:35:23 | 1 | out | 6 |


I am trying to create another column that provides the running total for that user_id each time it shows up. These user_ids do not initially show up with a 0 amount, so a starting balance of 0 must be assumed the first time each one appears.

I think it's likely possible to do this with some helper columns. My thought process is something like this:

 - Create a column to indicate how many times the user_id shows up prior to the value in the `time` column. Maybe called `occurence_num`.
 - Create a column that makes amount easier to work with, like `(case when io='in' then amount else -1*amount end) as balance_adjust`
 - Group by `user_id` and (on each row) `sum()` all the `balance_adjust` values where `occurence_num` is less than the current record's.

I'm having a hard time testing these ideas out though. I'm working in a fairly large SQLite database with 22 million rows. The table can be altered/updated as needed. It was stored this way to keep the ETL as simple as possible, because there was a lot of data to pull and a lot of pages to pull from. My desired output would look something like this:

| id | time | user_id | io | amount | running_total |
|----|------|---------|----|--------|---------------|
| 38hw | 2019-10-18 18:35:09 | 2 | in | 1 | 1 |
| nv49 | 2019-10-18 18:35:10 | 3 | in | 50 | 50 |
| 83ha | 2019-10-18 18:35:11 | 5 | in | 2 | 2 |
| ja03 | 2019-10-18 18:35:12 | 4 | out | 2 | -2 |
| 019c | 2019-10-18 18:35:13 | 1 | out | 75 | -75 |
| ac5r | 2019-10-18 18:35:14 | 3 | in | 20 | 70 |
| as30 | 2019-10-18 18:35:15 | 3 | in | 3 | 73 |
| 34ds | 2019-10-18 18:35:16 | 4 | in | 7 | 5 |
| 12my | 2019-10-18 18:35:17 | 2 | in | 50 | 51 |
| dk20 | 2019-10-18 18:35:18 | 4 | in | 50 | 55 |
| sk18 | 2019-10-18 18:35:19 | 1 | in | 7 | -68 |
| am35 | 2019-10-18 18:35:20 | 2 | in | 3 | 54 |
| mc92 | 2019-10-18 18:35:21 | 2 | out | 8 | 46 |
| alov | 2019-10-18 18:35:22 | 3 | in | 4 | 77 |
| ap34 | 2019-10-18 18:35:23 | 1 | out | 6 | -74 |


I can get the overall total of each user this way, but it takes a few minutes:

    SELECT
        user_id,
        sum(case when io='in' then amount else -1*amount end) as balance
    FROM
        transactions
    GROUP BY
        user_id

I think the way to expand on this is an `OVER`/`PARTITION BY` clause, but I'm not sure it's the right call given the size of this database.

Thanks for the help.

Edit: I should mention, the real data may include duplicates in the time column. Transactions could have occurred at the same time, as it's only granular to the second.

</details>


# Answer 1
**Score**: 1

A small tweak to your attempt should do it. Turn your sum into a "running sum" by using `SUM` as a window function: partitioning on `user_id` and ordering on `time` makes it compute the running amount per user.

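If you have tied times, add `id` as a second ordering key. Rows that tie on every ordering key are treated as peers and all receive the same total, so ordering by `id` as well breaks the tie and makes the sum advance one row at a time.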

    SELECT *,
           SUM(CASE WHEN io = 'in'
                    THEN amount
                    ELSE -amount
               END) OVER(PARTITION BY user_id ORDER BY time, id) as balance
    FROM transactions
    ORDER BY time

Output:

| id | time | user_id | io | amount | balance |
|----|------|---------|----|--------|---------|
| 38hw | 2019-10-18 18:35:09 | 2 | in | 1 | 1 |
| nv49 | 2019-10-18 18:35:10 | 3 | in | 50 | 50 |
| 83ha | 2019-10-18 18:35:11 | 5 | in | 2 | 2 |
| ja03 | 2019-10-18 18:35:12 | 4 | out | 2 | -2 |
| 019c | 2019-10-18 18:35:13 | 1 | out | 75 | -75 |
| ac5r | 2019-10-18 18:35:14 | 3 | in | 20 | 70 |
| as30 | 2019-10-18 18:35:15 | 3 | in | 3 | 73 |
| 34ds | 2019-10-18 18:35:16 | 4 | in | 7 | 5 |
| 12my | 2019-10-18 18:35:17 | 2 | in | 50 | 51 |
| dk20 | 2019-10-18 18:35:18 | 4 | in | 50 | 55 |
| sk18 | 2019-10-18 18:35:19 | 1 | in | 7 | -68 |
| am35 | 2019-10-18 18:35:20 | 2 | in | 3 | 54 |
| mc92 | 2019-10-18 18:35:21 | 2 | out | 8 | 46 |
| alov | 2019-10-18 18:35:22 | 3 | in | 4 | 77 |
| ap34 | 2019-10-18 18:35:23 | 1 | out | 6 | -74 |
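
The output above can be reproduced locally. The following is a minimal sketch using Python's built-in `sqlite3` module against an in-memory database. The column types in `CREATE TABLE` and the helper name `running_totals` are assumptions made for illustration (the question does not show the real schema), and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows copied from the question: (id, time, user_id, io, amount).
ROWS = [
    ("38hw", "2019-10-18 18:35:09", 2, "in", 1),
    ("nv49", "2019-10-18 18:35:10", 3, "in", 50),
    ("83ha", "2019-10-18 18:35:11", 5, "in", 2),
    ("ja03", "2019-10-18 18:35:12", 4, "out", 2),
    ("019c", "2019-10-18 18:35:13", 1, "out", 75),
    ("ac5r", "2019-10-18 18:35:14", 3, "in", 20),
    ("as30", "2019-10-18 18:35:15", 3, "in", 3),
    ("34ds", "2019-10-18 18:35:16", 4, "in", 7),
    ("12my", "2019-10-18 18:35:17", 2, "in", 50),
    ("dk20", "2019-10-18 18:35:18", 4, "in", 50),
    ("sk18", "2019-10-18 18:35:19", 1, "in", 7),
    ("am35", "2019-10-18 18:35:20", 2, "in", 3),
    ("mc92", "2019-10-18 18:35:21", 2, "out", 8),
    ("alov", "2019-10-18 18:35:22", 3, "in", 4),
    ("ap34", "2019-10-18 18:35:23", 1, "out", 6),
]

# Same window expression as the answer; the final ORDER BY is only for display.
RUNNING_TOTAL_SQL = """
SELECT id, time, user_id, io, amount,
       SUM(CASE WHEN io = 'in' THEN amount ELSE -amount END)
           OVER (PARTITION BY user_id ORDER BY time, id) AS balance
FROM transactions
ORDER BY time, id
"""


def running_totals(rows):
    """Load rows into an in-memory table and return the running-total result."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: the real table's column types are not given.
        conn.execute(
            "CREATE TABLE transactions ("
            "id TEXT PRIMARY KEY, time TEXT, user_id INTEGER, "
            "io TEXT, amount INTEGER)"
        )
        conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(RUNNING_TOTAL_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_totals(ROWS):
        print(*row, sep="\t")
```

Running the script prints one tab-separated line per transaction, with the per-user balance as the last field.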

Check the demo here.

Note: The last ORDER BY clause is not needed: it's just for visualization purposes.
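
On the tie-breaking point: with an `ORDER BY` in the window and no explicit frame, SQLite's default frame runs from the start of the partition through the current row and all of its peers, so rows that share a `time` get the same sum. The sketch below illustrates this with three made-up rows (the ids `a1` to `a3`, the amounts, and the helper name `balances` are hypothetical, as is the assumed schema), two of which fall in the same second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, for illustration only.
conn.execute(
    "CREATE TABLE transactions ("
    "id TEXT PRIMARY KEY, time TEXT, user_id INTEGER, io TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        ("a1", "2019-10-18 18:35:09", 1, "in", 10),
        ("a2", "2019-10-18 18:35:09", 1, "in", 5),   # same second as a1
        ("a3", "2019-10-18 18:35:10", 1, "out", 3),
    ],
)


def balances(window_order):
    """Return (id, balance) pairs for the given window ORDER BY list.

    window_order is interpolated into the SQL text, so pass only fixed,
    trusted strings such as "time" or "time, id".
    """
    sql = f"""
        SELECT id,
               SUM(CASE WHEN io = 'in' THEN amount ELSE -amount END)
                   OVER (PARTITION BY user_id ORDER BY {window_order}) AS balance
        FROM transactions
        ORDER BY time, id
    """
    return conn.execute(sql).fetchall()


time_only = balances("time")         # a1 and a2 are peers and share one total
time_then_id = balances("time, id")  # id breaks the tie, one step per row

print(time_only)     # [('a1', 15), ('a2', 15), ('a3', 12)]
print(time_then_id)  # [('a1', 10), ('a2', 15), ('a3', 12)]
```

Both variants agree on the last row of each group of tied rows; only the intermediate balances inside a tie differ.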
