MS Power Query 使用索引 +/- 1 占用了太多时间和内存。

huangapple go评论137阅读模式
英文:

MS Power Query usage of Index +/- 1 costly too much Time & Memory

问题

我正在尝试使用Power Query更改我的Excel表格格式,以便在Power BI上更轻松使用,但我的查询太费时和占用空间(70KB文件需要几GB...),这是不可接受的,我无法找出原因和如何修复它。

我有一个Excel文件,记录了执行程序/项目的需求。
前两列是'程序'和'技能',而后续列是月份:
因此,每行显示了在那个月份需要多少“人员”来执行该程序的特定技能;
Excel文件如下所示:

我的目标是从这个格式中获取一个表格,其中包含以下列:程序,技能,标签,开始,停止。

其中'开始'和'停止'是特定资源使用的开始和结束日期,'标签'用于将相同的程序-技能资源拆分为不同的行。例如,从起始表格中,Program_1的前两个技能(Skill_A和Skill_B)将如下所示:

我用3个查询完成了这个操作,第一个仅用于打开Excel文件,
第二个只是以不同的格式清除所有程序需求,
最后一个是我请求支持的查询(命名为Gantt),其中可能使用[Index]-1[Index]+1生成了这种不需要的行为。

File_pointer

let
    Source = Excel.Workbook(File.Contents("C:\Users\xxx\Desktop\Forecast_Resources.xlsx"), null, true),
    Foglio1_Sheet = Source{[Item="Foglio1",Kind="Sheet"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(Foglio1_Sheet,{"Column1", "Column2", ... }),  // (将"..."替换为实际的列名称)
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Other Columns", [PromoteAllScalars=true])
in
    #"Promoted Headers"

Programs

let
    Source = File_pointer,
    #"Removed Columns" = Table.RemoveColumns(Source,{"Usage"}),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Removed Columns", {"Program", "Skill"}, "Attribute", "Value"),
    #"Changed Type" = Table.TransformColumnTypes(#"Unpivoted Columns",{{"Program", type text}, {"Skill", type text}, {"Attribute", type text}, {"Value", Int64.Type}}),
    #"Inserted Year" = Table.AddColumn(#"Changed Type", "Year", each Date.Year(Date.From([Attribute])), Int64.Type)
in
    #"Inserted Year"

Gantt

let
    Source = Programs,
    addColumnList = List.Accumulate({1..13}, Source, (state, current) => Table.AddColumn(state, "v" & Number.ToText(current), each [Value] - current)),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(addColumnList, {"Program", "Skill", "Attribute", "Value", "Year"}, "Attribute.1", "Value.1"),
    #"Filtered Rows1" = Table.SelectRows(#"Unpivoted Columns", each ([Value.1] >= 0)),
    ...
    (以下部分未完全包括,由于长度限制,无法完全显示)
in
    #"Renamed Columns_2"

所以,有没有更加优雅的方法来实现我的目标?

英文:

I'm trying to change the format of my excel table with power query, for a easy usage on Power BI, but my query takes too much time and space (GygaBytes for a file of 70KB...) which is unaccetable, and I cannot figure it the reason and how can I fix it.

I have an excel file which records needs for execute programs/projects.
First two columns are 'Program' and 'Skill', while next columns are the months:
so each rows shows how many "person" are necessary of that skill for that program on that month;
excel file looks like this:
MS Power Query 使用索引 +/- 1 占用了太多时间和内存。

My Goal is to obtain from this format a table whith following columns: Program, Skill, Label, Start, Stop.

Where 'Start' and 'Stop' are the dates of the begininng and the end of the usage of that specific resource, and 'Label' is used to split same program-skill resources in different rows. For example from starting table, first two skill (Skill_A and Skill_B) of Program_1 will be:

MS Power Query 使用索引 +/- 1 占用了太多时间和内存。

I did it with 3 queries, first one just is usefull for opening the excel file,
secondone just, clean all programs needs in a different format,
last one is the query which I request support (named Gantt) where probably the usage of [Index]-1 and [Index]+1 generates this unwanted behaviour.

File_pointer

let
    Source = Excel.Workbook(File.Contents("C:\Users\xxx\Desktop\Forecast_Resources.xlsx"), null, true),
    Foglio1_Sheet = Source{[Item="Foglio1",Kind="Sheet"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(Foglio1_Sheet,{"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29", "Column30", "Column31", "Column32", "Column33", "Column34", "Column35", "Column36", "Column37", "Column38", "Column39", "Column40", "Column41", "Column42", "Column43", "Column44", "Column45", "Column46", "Column47", "Column48", "Column49", "Column50", "Column51", "Column52", "Column53", "Column54", "Column55", "Column56", "Column57", "Column58", "Column59", "Column60", "Column61", "Column62", "Column63", "Column64", "Column65", "Column66", "Column67", "Column68", "Column69"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Other Columns", [PromoteAllScalars=true])
in
    #"Promoted Headers"

Programs

let
    Source = File_pointer,
    #"Removed Columns" = Table.RemoveColumns(Source,{"Usage"}),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Removed Columns", {"Program", "Skill"}, "Attribute", "Value"),
    #"Changed Type" = Table.TransformColumnTypes(#"Unpivoted Columns",{{"Program", type text}, {"Skill", type text}, {"Attribute", type text}, {"Value", Int64.Type}}),
    #"Inserted Year" = Table.AddColumn(#"Changed Type", "Year", each Date.Year(Date.From([Attribute])), Int64.Type)
in
    #"Inserted Year"

Gantt

let
    Source = Programs,
    addColumnList = List.Accumulate({1..13}, Source, (state, current) => Table.AddColumn(state, "v" & Number.ToText(current), each [Value] - current)),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(addColumnList, {"Program", "Skill", "Attribute", "Value", "Year"}, "Attribute.1", "Value.1"),
    #"Filtered Rows1" = Table.SelectRows(#"Unpivoted Columns", each ([Value.1] >= 0)),
    #"Sorted Rows1" = Table.Sort(#"Filtered Rows1",{{"Attribute", Order.Ascending}}),
	
    #"Replaced Value" = Table.ReplaceValue(#"Sorted Rows1","v","",Replacer.ReplaceText,{"Attribute.1"}),    
    #"Inserted Merged Column" = Table.AddColumn(#"Replaced Value", "Merged", each Text.Combine({[Program], "_", [Attribute.1]}), type text),
    #"Removed Columns" = Table.RemoveColumns(#"Inserted Merged Column",{"Attribute.1", "Value.1"}),
    #"Inserted Start of Month" = Table.AddColumn(#"Removed Columns", "Start of Month", each Date.StartOfMonth(Date.From([Attribute])), type date),
    #"Inserted End of Month" = Table.AddColumn(#"Inserted Start of Month", "End of Month", each Date.EndOfMonth(Date.From([Attribute])), type date),
    #"Sorted Rows" = Table.Sort(#"Inserted End of Month",{{"Program", Order.Ascending}, {"Skill", Order.Ascending}, {"Merged", Order.Ascending}, {"Attribute", Order.Ascending}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Sorted Rows",{"Attribute", "Value", "Year"}),
	
    Label = Table.AddColumn(#"Removed Columns1", "Label", each Text.Combine({[Skill], "__", [Merged]}), type text),
    #"EOM_plus" = Table.AddColumn(Label, "EOM_plus", each Date.AddDays([End of Month],1) , type date),
    #"Added Index" = Table.AddIndexColumn(#"EOM_plus", "Index", 0, 1, Int64.Type),
    #"BEGINNING" = Table.AddColumn(#"Added Index", "Beginning", each [Start of Month]<>#"Added Index"[EOM_plus]{[Index]-1} or [Label]<>#"Added Index"[Label]{[Index]-1}, type logical),
    #"ENDING" = Table.AddColumn(#"BEGINNING", "Ending", each [EOM_plus]<>#"Added Index"[Start of Month]{[Index]+1} or [Label]<>#"Added Index"[Label]{[Index]+1}, type logical),
    #"Replaced Errors" = Table.ReplaceErrorValues(ENDING, {{"Beginning", true},{"Ending",true}}),
    #"toKeep" = Table.AddColumn(#"Replaced Errors", "toKeep", each [Beginning] or [Ending] , type logical),
    #"Filtered Rows" = Table.SelectRows(toKeep, each ([toKeep] = true)),
    #"Removed Columns2" = Table.RemoveColumns(#"Filtered Rows",{"Index", "toKeep"}),
    #"Added Index 2" = Table.AddIndexColumn(#"Removed Columns2", "Index", 0, 1, Int64.Type),
    #"Added Custom 1" = Table.AddColumn(#"Added Index 2", "Stop", each if not[Ending] then #"Filtered Rows"[End of Month]{[Index]+1} else [End of Month]),
    #"Filtered Rows 2" = Table.SelectRows(#"Added Custom 1", each ([Ending] = false)),
    #"Removed Columns3" = Table.RemoveColumns(#"Filtered Rows 2",{"Index", "Beginning", "Ending", "EOM_plus", "End of Month", "Label"}),
    #"Renamed Columns_1" = Table.RenameColumns(#"Removed Columns3",{{"Start of Month", "Start"}}),
	#"Renamed Columns_2" = Table.RenameColumns(#"Renamed Columns_1",{{"Merged", "Label"}})


in
    #"Renamed Columns_2"

So, is there any more elegant way to obtain my target?

答案1

得分: 1

以下是您提供的代码的翻译部分:

注意没有人想重新输入您的示例数据在提问时您需要提供示例数据

以下对程序和技能进行分组并找到这些组的第一个和最后一个日期您的其余解释让人感到困惑

let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"未旋转其他列" = Table.UnpivotOtherColumns(Source, {"Program", "Skill"}, "Attribute", "Value"),
#"分组的行" = Table.Group(#"未旋转其他列", {"Program", "Skill"}, {
{"开始", each List.First(Table.Sort(_,{{"Attribute", Order.Ascending}})[Attribute])},
{"停止", each List.Last(Table.Sort(_,{{"Attribute", Order.Ascending}})[Attribute])}
})
in #"分组的行"


let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"分组的行" = Table.Group(Source, {"Program", "Skill"}, {{"数据", each 
let b=Table.UnpivotOtherColumns(_, {"Program", "Skill"}, "Attribute", "Value"),
c = Table.Sort(b,{{"Attribute", Order.Ascending}}),
d= #table({"开始","停止"}, {{List.First(c[Attribute]),List.Last(c[Attribute])}})
in d, type table }}),
#"扩展的数据" = Table.ExpandTableColumn(#"分组的行", "数据", {"开始", "停止"}, {"开始", "停止"})
in #"扩展的数据"

希望这对您有所帮助。

英文:

Note: No one wants to retype your sample data. You need to provide that when asking a question

Below groups on Program and Skill and finds the first and last dates for those groups. The rest of your explanation is confusing

let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(Source, {"Program", "Skill"}, "Attribute", "Value"),
#"Grouped Rows" = Table.Group(#"Unpivoted Other Columns", {"Program", "Skill"}, {
    {"Start", each List.First(Table.Sort(_,{{"Attribute", Order.Ascending}})[Attribute])},
    {"Stop", each List.Last(Table.Sort(_,{{"Attribute", Order.Ascending}})[Attribute])}
    })
in  #"Grouped Rows"

or

let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Grouped Rows" = Table.Group(Source, {"Program", "Skill"}, {{"data", each 
    let b=Table.UnpivotOtherColumns(_, {"Program", "Skill"}, "Attribute", "Value"),
    c = Table.Sort(b,{{"Attribute", Order.Ascending}}),
    d= #table({"Start","Stop"}, {{List.First(c[Attribute]),List.Last(c[Attribute])}})
in d, type table }}),
#"Expanded data" = Table.ExpandTableColumn(#"Grouped Rows", "data", {"Start", "Stop"}, {"Start", "Stop"})
in  #"Expanded data"

MS Power Query 使用索引 +/- 1 占用了太多时间和内存。

Program Skill 2023-07 2023-08 2023-09 2023-10 2023-11 2023-12
P1 S1 1 1 2 2 2 2
P1 S2 3 3 2 2
P1 S3 1 1 1
P1 S4 2 2 2 2 3 3
P1 S5 1 1
P2 S1 1 1 1 1 1
P2 S3 2 2 2
P3 S3 1 1 1 1
P3 S4 1 1
P4 S2 1 1 1

答案2

得分: 1

I think this will do what you want, but I don't know how fast it will be.

Caveats:

  • 我创建了一个Label列,但它与你在结果中显示的标签不匹配。
  • 我没有将你的Skills从数据表中的内容翻译成结果表中的内容。

如果你能提供明确的翻译规则,可能是可行的,但可能不值得。

英文:

I think this will do what you want, but I don't know how fast it will be.

Caveats:

  • I created a Label column but it doesn't match with the labels you show in your results.
  • I did not translate your Skills from what you have in your data table to what you have in your results table.

If you can supply clear cut rules for the translations, it is probably doable, but might not be worthwhile.

let

//Change next line to reflect actual data source
    Source = Excel.CurrentWorkbook(){[Name="Table19"]}[Content],
     colTypes = List.Zip({Table.ColumnNames(Source), {type text, type text} & List.Repeat({Int64.Type}, Table.ColumnCount(Source)-2)}),
    #"Changed Type" = Table.TransformColumnTypes(Source,colTypes),

//Group by Program and Skill
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Program", "Skill"}, {
        {"Start/Stop", (t)=>
            let 

        //Transpose to create two column table for each subgroup
                #"Demote & Transpose" = Table.Transpose(Table.DemoteHeaders(Table.RemoveColumns(t,{"Program","Skill"}))),

        //Make dates out of Column1
                #"Make Date" = Table.TransformColumns(#"Demote & Transpose", {"Column1", each Date.From(_)}),

        //Add columns for each Program/Skill until we run out of people
                x = List.Accumulate(
                        {1..List.Max(#"Make Date"[Column2])-1},
                        #"Make Date", (state, current)=>
                        Table.AddColumn(state,"x" & Text.From(current), 
                            each if ([Column2]??0) > current then [Column2] - current else null)),

        //Stack the table into a single two column table
                colNames = List.RemoveFirstN(Table.ColumnNames(x),2),
                #"2 Columns" = List.Accumulate(colNames,
                    Table.SelectColumns(x,{"Column1","Column2"}), (state, current)=>
                    Table.Combine({state,#table(null,{{null}}),
                        Table.RenameColumns(Table.SelectColumns(x,{"Column1", current}),{current,"Column2"}) })),

        //Transform Column 2 into nulls or `1` for grouping
                #"Transform" = Table.TransformColumns(#"2 Columns", {"Column2", each if not (_ = null) then 1 else null}),

        //Get start and stop from each subgroups earliest and latest dates
        //  Changing the stop date to the end of the month
                #"Group" = Table.Group(#"Transform",{"Column2"}, {
                    {"Start", each List.Min([Column1]), type date},
                    {"Stop", each Date.EndOfMonth(List.Max([Column1])), type date}
                        }, GroupKind.Local),

        //Remove the null groups
                #"Remove Nulls" = Table.SelectRows(#"Group", each [Column2] <> null),

        //Add an Index column to be able to create labels
                #"Add Index" =Table.TransformColumns(Table.AddIndexColumn(#"Remove Nulls","Index",1,1),{"Index", each Text.From(_)})
            in 
                #"Add Index"}}),

    #"Expanded Start/Stop" = Table.ExpandTableColumn(#"Grouped Rows", "Start/Stop", {"Start", "Stop", "Index"}),
    #"Add Label Column" = Table.AddColumn(#"Expanded Start/Stop", "Label", each Text.Combine({[Program],[Index]},"_"), type text),

//Cleanup
    #"Reordered Columns" = Table.ReorderColumns(#"Add Label Column",{"Program", "Skill", "Label", "Start", "Stop", "Index"}),
    #"Removed Columns" = Table.RemoveColumns(#"Reordered Columns",{"Index"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns",{{"Start", type date}, {"Stop", type date}})
in
    #"Changed Type1"

Data

Program Skill 2023-07 2023-08 2023-09 2023-10 2023-11 2023-12 2024-01 2024-02 2024-03 2024-04 2024-05 2024-06 2024-07 2024-08 2024-09 2024-10 2024-11 2024-12
Program_1 Skill_1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1
Program_1 Skill_2 3 3 2 2 2 2 2 2 1 1 1 1 1 1
Program_1 Skill_3 1 1 1 1 1 1 1
Program_1 Skill_4 2 2 2 2 3 3 3 3 3 4 4 4 2 2 2 2 2 2
Program_1 Skill_5 1 1 1 1 1 1 1 1 1 1
Program_2 Skill_1 1 1 1 1 1 1 1 1 1 1
Program_2 Skill_3 2 2 2 2 2 1 1 1
Program_3 Skill_3 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
Program_3 Skill_4 1 1 1 1 2 2 2 1 1 1 1
Program_4 Skill_2 1 1 1 1 1 1 1 1 1 1 1

Results
MS Power Query 使用索引 +/- 1 占用了太多时间和内存。

答案3

得分: 0

问题可能是您在每一行中引用了整个表格 #"Added Index"。因此,如果您有 1,000 行,您会将数据从 #"Added Index" 步骤重新加载 1,000 次。而且我至少数到您已经使用了这种方法至少 5 次。

不要在每一行中引用表格,而是使用 Table.Buffer。这会在内存中保留该步骤的缓冲副本,因此您可以引用该缓冲副本而不是重新加载数据。

所以,不是这样:

#"Added Index" = Table.AddIndexColumn(#"EOM_plus", "Index", 0, 1, Int64.Type),
#"BEGINNING" = Table.AddColumn(#"Added Index", "Beginning", each [Start of Month]<>#"Added Index"[EOM_plus]{[Index]-1} or [Label]<>#"Added Index"[Label]{[Index]-1}, type logical),
#"ENDING" = Table.AddColumn(#"BEGINNING", "Ending", each [EOM_plus]<>#"Added Index"[Start of Month]{[Index]+1} or [Label]<>#"Added Index"[Label]{[Index]+1}, type logical),

可以这样做:

#"Added Index" = Table.AddIndexColumn(#"EOM_plus", "Index", 0, 1, Int64.Type),
Buffer = Table.Buffer(#"Added Index"),
#"BEGINNING" = Table.AddColumn(#"Added Index", "Beginning", each [Start of Month]<>Buffer[EOM_plus]{[Index]-1} or [Label]<>Buffer[Label]{[Index]-1}, type logical),
#"ENDING" = Table.AddColumn(#"BEGINNING", "Ending", each [EOM_plus]<>Buffer[Start of Month]{[Index]+1} or [Label]<>Buffer[Label]{[Index]+1}, type logical),

而不是这样:

#"Filtered Rows" = Table.SelectRows(toKeep, each ([toKeep] = true)),
#"Removed Columns2" = Table.RemoveColumns(#"Filtered Rows",{"Index", "toKeep"}),
#"Added Index 2" = Table.AddIndexColumn(#"Removed Columns2", "Index", 0, 1, Int64.Type),
#"Added Custom 1" = Table.AddColumn(#"Added Index 2", "Stop", each if not[Ending] then #"Filtered Rows"[End of Month]{[Index]+1} else [End of Month]),

可以这样做:

#"Filtered Rows" = Table.SelectRows(toKeep, each ([toKeep] = true)),
#"Filtered Rows Buffer" = Table.Buffer(#"Filtered Rows"),
#"Removed Columns2" = Table.RemoveColumns(#"Filtered Rows",{"Index", "toKeep"}),
#"Added Index1" = Table.AddIndexColumn(#"Removed Columns2", "Index", 0, 1, Int64.Type),
#"Added Custom 1" = Table.AddColumn(#"Added Index1", "Stop", each if not[Ending] then #"Filtered Rows Buffer"[End of Month]{[Index]+1} else [End of Month]),

这可能会加速操作,但是 @horseyride 提供了更适合您情况的更好方法,可以避免这个问题。

英文:

The problem could be that you are referencing the entire table #"Added Index" for each row in your table. So if you have 1,000 rows, you are re-loading the data from the #"Added Index" step 1,000 times. And I count at least 5 times you've used this approach.

Instead of referencing the table for each row, use Table.Buffer. This keeps a buffered copy of that step in memory, so you can refer to that buffered copy instead of reloading the data.

So instead of

#"Added Index" = Table.AddIndexColumn(#"EOM_plus", "Index", 0, 1, Int64.Type),
#"BEGINNING" = Table.AddColumn(#"Added Index", "Beginning", each [Start of Month]<>#"Added Index"[EOM_plus]{[Index]-1} or [Label]<>#"Added Index"[Label]{[Index]-1}, type logical),
#"ENDING" = Table.AddColumn(#"BEGINNING", "Ending", each [EOM_plus]<>#"Added Index"[Start of Month]{[Index]+1} or [Label]<>#"Added Index"[Label]{[Index]+1}, type logical),

you can do

#"Added Index" = Table.AddIndexColumn(#"EOM_plus", "Index", 0, 1, Int64.Type),
Buffer = Table.Buffer(#"Added Index"),
#"BEGINNING" = Table.AddColumn(#"Added Index", "Beginning", each [Start of Month]<>Buffer[EOM_plus]{[Index]-1} or [Label]<>Buffer[Label]{[Index]-1}, type logical),
#"ENDING" = Table.AddColumn(#"BEGINNING", "Ending", each [EOM_plus]<>Buffer[Start of Month]{[Index]+1} or [Label]<>Buffer[Label]{[Index]+1}, type logical),

and instead of

#"Filtered Rows" = Table.SelectRows(toKeep, each ([toKeep] = true)),
#"Removed Columns2" = Table.RemoveColumns(#"Filtered Rows",{"Index", "toKeep"}),
#"Added Index 2" = Table.AddIndexColumn(#"Removed Columns2", "Index", 0, 1, Int64.Type),
#"Added Custom 1" = Table.AddColumn(#"Added Index 2", "Stop", each if not[Ending] then #"Filtered Rows"[End of Month]{[Index]+1} else [End of Month]),

you can do

#"Filtered Rows" = Table.SelectRows(toKeep, each ([toKeep] = true)),
#"Filtered Rows Buffer" = Table.Buffer(#"Filtered Rows"),
#"Removed Columns2" = Table.RemoveColumns(#"Filtered Rows",{"Index", "toKeep"}),
#"Added Index1" = Table.AddIndexColumn(#"Removed Columns2", "Index", 0, 1, Int64.Type),
#"Added Custom 1" = Table.AddColumn(#"Added Index1", "Stop", each if not[Ending] then #"Filtered Rows Buffer"[End of Month]{[Index]+1} else [End of Month]),

This would probably speed things up, but @horseyride has given a better approach for your scenario that avoids the problem altogether.

huangapple
  • 本文由 发表于 2023年8月4日 23:36:09
  • 转载请务必保留本文链接:https://go.coder-hub.com/76837367.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定