如何创建一个正常运行的数据透视表而不使系统负荷过重

huangapple go评论40阅读模式
英文:

How to create a working pivot table without killing the system

问题

让我们假设你有一个客户表,这个表只包含4列:

  • customerCode 数值型(7,0)
  • customerName 字符型(50)
  • customerVATNumber 字符型(11)
  • customerLocation 字符型(35)

请记住,客户表包含300万行数据,因为包含了过去40年的所有客户数据,但活跃客户只有98,000人。

接下来,我们假设有一个名为Sales的表,结构如下:

  • saleID 整数
  • customerCode 数值型(7,0)
  • agentID 数值型(6,0)
  • productID 字符型(2)
  • dateBeginSale 日期
  • dateEndSale 日期

这个表中大约有350万行数据(同样包含了40年前的数据),但各种产品的当前供应总共有100万。公司只销售4种产品。每个客户最多可以购买4种产品,甚至可以来自4个不同的代理商。大多数客户(90%)只购买一种产品,剩下的客户从2到4种产品(完整的产品组合只有4位客户)。

我被要求构建一个数据透视表,显示每个客户的名称和位置,以及他购买的产品和购买代理商。

提议的数据透视表布局如下:

  • customerCode
  • customerName
  • customerLocation
  • productID1
  • agentID1
  • saleID1
  • dateBeginSale1
  • dateEndSale1
  • productID2
  • agentID2
  • saleID2
  • dateBeginSale2
  • dateEndSale2
  • productID3
  • agentID3
  • saleID3
  • dateBeginSale3
  • dateEndSale3
  • productID4
  • agentID4
  • saleID4
  • dateBeginSale4
  • dateEndSale4

我通过创建一个视图来构建这个数据透视表。

首先,我创建了4个视图,每个视图对应Sales表中的一个产品ID,这对于其他统计和报告目的也很有用。

View1:

  • customerCode1
  • productID1
  • agentID1
  • saleID1
  • dateBeginSale1
  • dateEndSale1

View2:

  • customerCode2
  • productID2
  • agentID2
  • saleID2
  • dateBeginSale2
  • dateEndSale2

以此类推,一直到View4。

然后,我将这4个视图与客户表进行连接,并创建了所需的PivotView。

现在,"Select * from PivotView" 可以正常工作。

"Select * from PivotView Where customerLocation='NEW YORK CITY'" 也可以正常工作。

然而,对于任何其他请求,比如选择和计算居住在洛杉矶的客户,这些客户是否从相同的代理商或不同的销售代理商那里购买产品,通常会导致内存占用增加(可能是由于构建某些临时表或视图),并经常导致查询执行崩溃。

但是,如果我在表而不是视图上创建相同的数据透视表,各种选择的时间都会减少,即使数据量很大(仍然需要扫描大约100万条记录来验证各种条件的存在),也能够接受。

肯定是我犯了一些错误,或者有更好的方法来实现这个结果:在在线数据而不是每夜提取的数据上构建数据透视表。

我很愿意阅读您的评论和建议。

英文:

Let's say you have a Customer table, a simple customer table with just 4 columns:

  • customerCode numeric(7,0)
  • customerName char(50)
  • customerVATNumber char(11)
  • customerLocation char(35)

Keep in mind that the customers table contains 3 million rows because there are all the customers of the last 40 years, but the active ones are only 980000.

Suppose we then have a table called Sales structured in this way:

  • saleID integer
  • customerCode numeric(7,0)
  • agentID numeric(6,0)
  • productID char(2)
  • dateBeginSale date
  • dateEndSale date

There are about three and a half million rows in this table (here too we have stuff from 40 years ago), but the current supplies for the various products are a total of one million. The company only sells 4 products. Each customer can purchase up to 4 products with 4 different contracts even from 4 different agents. Most (90%) buy only one, the remaining from two to 4 (those who make the complete assortment are just 4 cats).

I was asked to build a pivot table showing for each customer with it's name and location all the product he purchased and from which agent.

The proposed layout for this pivot table is:

  • customerCode
  • customerName
  • customerLocation
  • productID1
  • agentID1
  • saleID1
  • dateBeginSale1
  • dateEndSale1
  • productID2
  • agentID2
  • saleID2
  • dateBeginSale2
  • dateEndSale2
  • productID3
  • agentID3
  • saleID3
  • dateBeginSale3
  • dateEndSale3
  • productID4
  • agentID4
  • saleID4
  • dateBeginSale4
  • dateEndSale4

I built the pivot with a view.

First I created 4 views, one for each product id on the Sales table, also useful for other statistical and reporting purposes

View1 as

  • customerCode1
  • productID1
  • agentID1
  • saleID1
  • dateBeginSale1
  • dateEndSale1

View2 as

  • customerCode2
  • productID2
  • agentID2
  • saleID2
  • dateBeginSale2
  • dateEndSale2

and so on till View4

Then i joined the 4 views with the customer table and created the PivotView i needed.

Now Select * from PivotView works perfectly.

Also Select * from PivotView Where customerLocation='NEW YORK CITY' too.

Any other request, for example: we select and count the customers residing in LOS ANGELES who have purchased the products from the same agent or from different sales agents, literally makes the machine sit down, I see the memory occupation grow (probably due to the construction of some temporary table or view) and often the execution of the query crashes.

However, if I create the same pivot on a table instead of a view the times of the various selections collapse and even if heavy (there are always about a million records to scan to verify the existence of the various conditions) they become acceptable.

For sure i am mistaking something and/or there must to be a better way to achieve the result: having a pivot built on on line data istead of one from data extracted nightly.

I'll be happy to read your comments and suggestion.

答案1

得分: 1

我不太明白你的数据布局和需求。但我要说的是,在IBM i上使用Db2进行数据透视的常见问题是没有内置的动态透视数据的方法。

考虑到你只有4种产品,上述限制实际上并不适用。

你的问题似乎是通过在同一表上创建4个视图,你正在重复处理记录。相反,尽量只处理数据一次。

create view PivotSales as
select 
   customerCode, 
   -- 产品1
   max(case productID when '01' then productID end) as productID1,
   max(case productID when '01' then agentID end) as agentID1,
   max(case productID when '01' then saleID end) as saleID1,         
   max(case productID when '01' then dateBeginSale end) as dateBeginSale1,
   max(case productID when '01' then dateEndSale end) as dateEndSale1,
   -- 产品2
   max(case productID when '02' then productID end) as productID2,
   max(case productID when '02' then agentID end) as agentID2,
   max(case productID when '02' then saleID end) as saleID2,         
   max(case productID when '02' then dateBeginSale end) as dateBeginSale2,
   max(case productID when '02' then dateEndSale end) as dateEndSale2,
   -- 重复处理产品3和4
from Sales
group by customerCode;

现在你可以创建一个CustomerSales视图:

create view CustomerSales as
select *
from Customers join PivotSales using (customerCode);

运行你的查询,使用Visual Explain来查看系统建议需要的索引。至少,你应该有以下索引:

  • Customer (customerCode)
  • Customer (location, customerCode)
  • Sales (customerCode)

我怀疑在Sales和Customer的各种列上创建一些Encoded Vector Indexes (EVI)可能会有帮助,特别是因为你提到了“计数”。EVI可以跟踪符号的计数,所以计数是“免费”的。一个例子:

create encoded vector index customerLocEvi
 on Customers (location);
    
-- 这不需要读取任何行
select count(*) 
from customer 
where location = 'LOS ANGELES';

关于你是否犯了错误或是否有更好的方法来实现结果,不要太肯定。最好支持商业智能类型查询的数据库结构通常不符合典型的事务数据结构。周期性的“提取、转换、加载(ETL)”通常很典型。

对于你的特定用例,你可以将CustomerSales转换成Materialized Query Table (MQT),为它构建一些支持的索引,然后直接在上面运行查询。每夜的重建将简单地是REFRESH CustomerSales;

或者,如果你愿意,因为IBM i上的Db2不支持SYSTEM MAINTAINED MQTs,你可以在Sales上创建一个触发器,自动将数据传播到CustomerSales,而不是每夜重建它。

英文:

I don't clearly understand your data layout and what you need. But I'll say that the usual problem with pivoting data on Db2 for IBM i is that there's no built in way to dynamically pivot the data.

Given that you only have 4 products, the above limitation doesn't really apply.

Your problem would seem to be that by creating 4 views over the same table, you're processing records repeatedly. Instead, try to touch the data one time.

create view PivotSales as
  select 
     customerCode, 
     -- product 1
     max(case productID when '01' then productID end) as productID1,
     max(case productID when '01' then agentID end) as agentID1,
     max(case productID when '01' then saleID end) as saleID1,         
     max(case productID when '01' then dateBeginSale end) as dateBeginSale1,
     max(case productID when '01' then dateEndSale end) as dateEndSale1,
     -- product 2
     max(case productID when '02' then productID end) as productID2,
     max(case productID when '02' then agentID end) as agentID2,
     max(case productID when '02' then saleID end) as saleID2,         
     max(case productID when '02' then dateBeginSale end) as dateBeginSale2,
     max(case productID when '02' then dateEndSale end) as dateEndSale2,
     -- repeat for product 3 and 4
  from Sales
  group by customerCode;

Now you can have a CustomerSales view:

create view CustomerSales as
  select *
  from Customers join SalesPivot using (customerCode);

Run your queries, using Visual Explain to see what indexes the system suggests are needed. At minimum, you should have an indexes:

  • Customer (customerCode)
  • Customer (location, customerCode)
  • Sales (customerCode)

I suspect that some Encoded Vector Indexes (EVI) over various columns in Sales and Customer would prove helpful. Especially since you mention "counting". An EVI keeps track of the counts of the symbols. So counting is "free". An example:

create encoded vector index customerLocEvi
 on Customers (location);

-- this doesn't have to read any rows in customer
select count(*) 
from customer 
where location = 'LOS ANGELES';

> For sure I am mistaking something and/or there must to be a better way
> to achieve the result: having a pivot built on on line data istead of
> one from data extracted nightly.

Don't be too sure about that. The DB structure that best supports Business Intelligence type queries usually doesn't match the typical transactional data structure. A periodic "extract, transform, load (ETL)" is pretty typical.

For your particular use case, you could turn CustomerSales into a Materalized Query Table (MQT), build some supporting indexes for it and just run queries directly over it. Nightly rebuild would be as simple as REFRESH CustomerSales;

Or if you wanted too, since Db2 for IBM i doesn't support SYSTEM MAINTAINED MQTs, a trigger over Sales could automatically propagate data to CustomerSales instead of rebuilding it nightly.

huangapple
  • 本文由 发表于 2023年2月9日 00:14:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/75388642.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定