使用每个表的WHERE子句连接多个表。

huangapple go评论105阅读模式
英文:

Joining multiple tables with WHERE clause for each table

问题

  1. 选择 *
  2. A
  3. 加入 B A.idx = B.idx
  4. 加入 C A.idx = C.idx
  5. 其中 A.create_dt A.last_dt '2023-05-01' '2023-05-31' 之间
  6. B.create_dt B.last_dt '2023-05-01' '2023-05-31' 之间
  7. C.create_dt C.last_dt '2023-05-01' '2023-05-31' 之间;

这是一个 PostgreSQL 数据库。

在多表联接时,我只想选择那些每个表的 create_dtlast_dt 是最新的数据。然而,存在速度太慢的问题。在这种情况下如何解决?

当我只保留 A 表的 WHERE 条件时,性能是正常的。

英文:

Ex.

  1. SELECT *
  2. FROM A
  3. JOIN B ON A.idx = B.idx
  4. JOIN C ON A.idx = C.idx
  5. WHERE A.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  6. OR A.last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  7. OR B.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  8. OR B.last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  9. OR C.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  10. OR C.last_dt BETWEEN '2023-05-01' AND '2023-05-31';

It is a PostgreSQL DB

While multi-joining the tables, I want to SELECT only the data where the create_dt or last_dt of each table is the latest.
However, there is a problem that the speed is too slow. How to solve in this case?

Performance is fine when I leave only the A table WHERE condition.

答案1

得分: 2

我只会翻译代码部分,以下是翻译好的代码部分:

  1. 我要 `SELECT` 只有每个表的 `create_dt` `last_dt` 是最新的数据。
  2. 然后你肯定不能在 `idx` 上连接,因为每个表的最新数据不会共享相同的 `idx` 值。
  3. 你的查询不是你说的那样。这个是:
  4. ~~~pgsql
  5. SELECT *
  6. FROM (
  7. (
  8. SELECT GREATEST(create_dt, last_dt) AS latest_a, *
  9. FROM a
  10. WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  11. ORDER BY create_dt DESC
  12. LIMIT 1
  13. )
  14. UNION ALL
  15. (
  16. SELECT GREATEST(create_dt, last_dt) AS latest_a, *
  17. FROM a
  18. WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  19. ORDER BY last_dt DESC
  20. LIMIT 1
  21. )
  22. ORDER BY latest_a
  23. LIMIT 1
  24. ) a
  25. CROSS JOIN (
  26. (
  27. SELECT GREATEST(create_dt, last_dt) AS latest_b, *
  28. FROM b
  29. WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  30. ORDER BY create_dt DESC
  31. LIMIT 1
  32. )
  33. UNION ALL
  34. (
  35. SELECT GREATEST(create_dt, last_dt) AS latest_b, *
  36. FROM b
  37. WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  38. ORDER BY last_dt DESC
  39. LIMIT 1
  40. )
  41. ORDER BY latest_b
  42. LIMIT 1
  43. ) b
  44. CROSS JOIN (
  45. (
  46. SELECT GREATEST(create_dt, last_dt) AS latest_c, *
  47. FROM a
  48. WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  49. ORDER BY create_dt DESC
  50. LIMIT 1
  51. )
  52. UNION ALL
  53. (
  54. SELECT GREATEST(create_dt, last_dt) AS latest_c, *
  55. FROM a
  56. WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  57. ORDER BY last_dt DESC
  58. LIMIT 1
  59. )
  60. ORDER BY latest_c
  61. LIMIT 1
  62. ) c -- USING (idx);
  63. ~~~
  64. 所有括号都是必需的。
  65. 有点啰嗦,但这是速度最快的方式 - 前提是你有这些索引:
  66. ~~~pgsql
  67. CREATE INDEX a_create_dt_idx ON A (create_dt);
  68. CREATE INDEX a_last_dt_idx ON A (last_dt);
  69. CREATE INDEX b_create_dt_idx ON B (create_dt);
  70. CREATE INDEX b_last_dt_idx ON B (last_dt);
  71. CREATE INDEX c_create_dt_idx ON C (create_dt);
  72. CREATE INDEX c_last_dt_idx ON C (last_dt);
  73. ~~~
  74. 每个表格将进行两次索引查找,直接选择符合条件的一行。
  75. 我使用了无条件的 `CROSS JOIN` 进行连接,因为每个子查询只返回一行,前提是至少有一个符合条件的。
  76. 如果其中一个子查询找不到行,结果将为空。也许你真的希望使用 `FULL OUTER JOIN` 来保留其他表的结果,如果有一个为空。或者只返回3行结果。
  77. 再次说明,如果你没有确切地说明你真正需要什么,我不会感到惊讶。我的猜测是你想要类似这样的东西(每个表的每个 `idx` 的最新行):
  78. ```sql
  79. (
  80. SELECT DISTINCT ON (idx) 'a_latest_create_dt' AS what, *
  81. FROM a
  82. WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  83. ORDER BY idx DESC, create_dt DESC, last_dt DESC
  84. )
  85. UNION ALL
  86. (
  87. SELECT DISTINCT ON (idx) 'a_latest_last_dt' AS what, *
  88. FROM a
  89. WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  90. ORDER BY idx DESC, last_dt DESC, create_dt DESC
  91. )
  92. UNION ALL
  93. (
  94. SELECT DISTINCT ON (idx) 'b_latest_create_dt' AS what, *
  95. FROM b
  96. WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  97. ORDER BY idx DESC, create_dt DESC, last_dt DESC
  98. )
  99. UNION ALL
  100. (
  101. SELECT DISTINCT ON (idx) 'b_latest_last_dt' AS what, *
  102. FROM b
  103. WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  104. ORDER BY idx DESC, last_dt DESC, create_dt DESC
  105. )
  106. UNION ALL
  107. (
  108. SELECT DISTINCT ON (idx) 'c_latest_create_dt' AS what, *
  109. FROM c
  110. WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  111. ORDER BY idx DESC, create_dt DESC, last_dt DESC
  112. )
  113. UNION ALL
  114. (
  115. SELECT DISTINCT ON (idx) 'c_latest_last_dt' AS what, *
  116. FROM c
  117. WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  118. ORDER BY idx DESC, last_dt DESC, create_dt DESC
  119. )
  120. ORDER BY idx, what;
  121. ~~~
  122. 这次我列出了每个表的每个 `idx` 的最新行:一个是 `create_dt`,一个是 `last_dt`。如果在每个类别中都有一个符合时间范围的行,就会有6行。
  123. 关于 `DISTINCT ON`
  124. - https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
  125. 假设所有涉及的列都是 `NOT NULL`。否则你可能需要做更多的工作...
  126. 匹配的索引:
  127. ~~~pgsql
  128. CREATE INDEX a_create_dt_idx ON A (idx, create_dt);
  129. CREATE INDEX a_last_dt_idx ON A (idx, last_dt);
  130. CREATE INDEX b_create_dt_idx ON B (idx, create_dt);
  131. CREATE INDEX b_last_dt_idx ON B (idx
  132. <details>
  133. <summary>英文:</summary>
  134. &gt; I want to `SELECT` only the data where the `create_dt` or `last_dt` of each table is the latest.
  135. Then you certainly cannot join on `idx` since the latest per table won&#39;t share the same `idx` value.
  136. Your query does not do what you say. This one does:
  137. ~~~pgsql
  138. SELECT *
  139. FROM (
  140. (
  141. SELECT GREATEST(create_dt, last_dt) AS latest_a, *
  142. FROM a
  143. WHERE create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  144. ORDER BY create_dt DESC
  145. LIMIT 1
  146. )
  147. UNION ALL
  148. (
  149. SELECT GREATEST(create_dt, last_dt) AS latest_a, *
  150. FROM a
  151. WHERE last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  152. ORDER BY last_dt DESC
  153. LIMIT 1
  154. )
  155. ORDER BY latest_a
  156. LIMIT 1
  157. ) a
  158. CROSS JOIN (
  159. (
  160. SELECT GREATEST(create_dt, last_dt) AS latest_b, *
  161. FROM b
  162. WHERE create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  163. ORDER BY create_dt DESC
  164. LIMIT 1
  165. )
  166. UNION ALL
  167. (
  168. SELECT GREATEST(create_dt, last_dt) AS latest_b, *
  169. FROM b
  170. WHERE last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  171. ORDER BY last_dt DESC
  172. LIMIT 1
  173. )
  174. ORDER BY latest_b
  175. LIMIT 1
  176. ) b
  177. CROSS JOIN (
  178. (
  179. SELECT GREATEST(create_dt, last_dt) AS latest_c, *
  180. FROM a
  181. WHERE create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  182. ORDER BY create_dt DESC
  183. LIMIT 1
  184. )
  185. UNION ALL
  186. (
  187. SELECT GREATEST(create_dt, last_dt) AS latest_c, *
  188. FROM a
  189. WHERE last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  190. ORDER BY last_dt DESC
  191. LIMIT 1
  192. )
  193. ORDER BY latest_c
  194. LIMIT 1
  195. ) c -- USING (idx);
  196. ~~~
  197. All parentheses required.
  198. A bit verbose. But it&#39;s as fast as this gets - provided you have these indexes:
  199. ~~~pgsql
  200. CREATE INDEX a_create_dt_idx ON A (create_dt);
  201. CREATE INDEX a_last_dt_idx ON A (last_dt);
  202. CREATE INDEX b_create_dt_idx ON B (create_dt);
  203. CREATE INDEX b_last_dt_idx ON B (last_dt);
  204. CREATE INDEX c_create_dt_idx ON C (create_dt);
  205. CREATE INDEX c_last_dt_idx ON C (last_dt);
  206. ~~~
  207. It will be two index seeks per table, directly picking the one qualifying row every time.
  208. I am joining with an unconditional `CROSS JOIN`, since each subquery returns exactly *one* row, provided at least one qualifies.
  209. If one of the subqueries finds no row, the result is empty. Maybe you really want a `FULL OUTER JOIN` to preserve results from the other tables if one comes up empty. Or just 3 result rows.
  210. Then again, I wouldn&#39;t be surprised if you didn&#39;t exactly say what you really need. My educated guess: you want something like this (the latest row **per idx** from each table):
  211. ~~~pgsql
  212. (
  213. SELECT DISTINCT ON (idx) &#39;a_latest_create_dt&#39; AS what, *
  214. FROM a
  215. WHERE create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  216. ORDER BY idx DESC, create_dt DESC, last_dt DESC
  217. )
  218. UNION ALL
  219. (
  220. SELECT DISTINCT ON (idx) &#39;a_latest_last_dt&#39; AS what, *
  221. FROM a
  222. WHERE last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  223. ORDER BY idx DESC, last_dt DESC, create_dt DESC
  224. )
  225. UNION ALL
  226. (
  227. SELECT DISTINCT ON (idx) &#39;b_latest_create_dt&#39; AS what, *
  228. FROM b
  229. WHERE create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  230. ORDER BY idx DESC, create_dt DESC, last_dt DESC
  231. )
  232. UNION ALL
  233. (
  234. SELECT DISTINCT ON (idx) &#39;b_latest_last_dt&#39; AS what, *
  235. FROM b
  236. WHERE last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  237. ORDER BY idx DESC, last_dt DESC, create_dt DESC
  238. )
  239. UNION ALL
  240. (
  241. SELECT DISTINCT ON (idx) &#39;c_latest_create_dt&#39; AS what, *
  242. FROM c
  243. WHERE create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  244. ORDER BY idx DESC, create_dt DESC, last_dt DESC
  245. )
  246. UNION ALL
  247. (
  248. SELECT DISTINCT ON (idx) &#39;c_latest_last_dt&#39; AS what, *
  249. FROM c
  250. WHERE last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
  251. ORDER BY idx DESC, last_dt DESC, create_dt DESC
  252. )
  253. ORDER BY idx, what;
  254. ~~~
  255. This time I list the latest rows per idx for each table: one for `create_dt`, one for `last_dt`. Makes 6 rows if there is one in the time frame for every flavor.
  256. About `DISTINCT ON`:
  257. - https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
  258. Assuming all involved columns are `NOT NULL`. Else you may have to do more ...
  259. Matching indexes:
  260. ~~~pgsql
  261. CREATE INDEX a_create_dt_idx ON A (idx, create_dt);
  262. CREATE INDEX a_last_dt_idx ON A (idx, last_dt);
  263. CREATE INDEX b_create_dt_idx ON B (idx, create_dt);
  264. CREATE INDEX b_last_dt_idx ON B (idx, last_dt);
  265. CREATE INDEX c_create_dt_idx ON C (idx, create_dt);
  266. CREATE INDEX c_last_dt_idx ON C (idx, last_dt);
  267. ~~~
  268. </details>
  269. # 答案2
  270. **得分**: 1
  271. 尝试向三个表中添加以下索引:
  272. ```sql
  273. CREATE INDEX idx_a ON A (idx, create_dt, last_dt);
  274. CREATE INDEX idx_b ON B (idx, create_dt, last_dt);
  275. CREATE INDEX idx_c ON C (idx, create_dt, last_dt);

如果使用这些索引,应该会加速查询中的连接操作。

英文:

You might try adding the following indices to the three tables:

<!-- language: sql -->

  1. CREATE INDEX idx_a ON A (idx, create_dt, last_dt);
  2. CREATE INDEX idx_b ON B (idx, create_dt, last_dt);
  3. CREATE INDEX idx_c ON C (idx, create_dt, last_dt);

These indices, if used, should speed up the joins in your query.

huangapple
  • 本文由 发表于 2023年6月2日 10:01:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76386704.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定