导入不同格式的PDF中的各种表格到Excel中

huangapple go评论90阅读模式
英文:

Import Various Tables with Different Formats from PDF into Excel

问题

I am trying to import multiple PDF tables into Excel using Power Query but the tables appear on different pages with 3 different formats of the table. Sometimes some PDFs in the directory do not have the tables at all.

我正在尝试使用Power Query将多个PDF表格导入Excel,但这些表格出现在不同的页面上,有三种不同的表格格式。有时候,目录中的某些PDF文件根本没有这些表格。

I plan to import from multiple PDF tables with various formats and the values I need to extract are shown in the attached image in red.

我计划从多个具有不同格式的PDF表格中导入数据,我需要提取的数值在附加的红色图像中显示。

The results that I need are shown in the excel file format. The results in blue are from the first table format, the results in yellow are from the second table format, and the results in green are from the third table format. I just color coded it for the sake of explaining my problem.

我需要的结果以Excel文件格式显示。蓝色的结果来自第一种表格格式,黄色的结果来自第二种表格格式,绿色的结果来自第三种表格格式。我只是为了解释我的问题而对其进行了颜色编码。

Overall, I plan to read through a folder that has about 700 pdf files with these tables, but these tables do not appear on specific pages, and some of the PDFs may have multiple other tables that are not relevant.
我计划查看一个包含约700个PDF文件的文件夹,其中包含这些表格,但这些表格不会出现在特定页面上,而且其中一些PDF文件可能包含多个与问题无关的其他表格。

I want the Query to be able to skip all those irrelevant tables and pick the table formats that I specified.

我希望查询能够跳过所有这些与问题无关的表格,并选择我指定的表格格式。

Thank you in advance.

提前感谢您。

英文:

I am trying to import multiple PDF tables into Excel using Power Query but the tables appear on different pages with 3 different formats of the table. Sometimes some PDF's in the directory do not have the tables at all

I plan to import from multiple PDF tables with various formats and the values I need to extract are shown in the attached image in red.

The results that I need are shown in the excel file format. The results in blue are from the first table format, the results in yellow are from the second table format and the results in green are from the third table format. I just color coded it for the sake of explaining the my problem

Overall, I plan to read through a folder that has about 700 pdf files with these tables but these tables do not appear on specific pages and some of the PDFs may have multiple other tables that are not relevant.
I want the Query to be able to skip all those irrelevant tables and pick the table formats that I specified.

Thank you in advance.

I tried using Power Query but I am still a newbie in the application

答案1

得分: 1

根据在其他地方发布的示例数据文件,如果实际数据格式与该文件相同,那么这个方法可以帮助您合并指定目录中的所有PDF文件。它实际上不会跳过文件中的任何表格,因为您的示例数据没有包含任何这样的数据,供测试,但如果需要,可以通过在“Added Custom”步骤之前添加一些筛选器来删除不需要的表格。

let 

group= (filename)=>
//打开每个文件并提取所有表格
let Source = Pdf.Tables(File.Contents(filename), [Implementation="1.3"]),
List = List.Union(List.Transform(Source[Data], each Table.ColumnNames(_))), 
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", List,List),
#"Filtered Rows" = Table.SelectRows(#"Expanded Data", each Text.StartsWith([Name], "Table")),
#"Grouped Rows" = Table.Group(#"Filtered Rows", {"Name"}, {{"data", each _, type table}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each split([data]))
in #"Added Custom",

//处理每个表格
split= (variable)=>  let
#"Filtered Rows1" = Table.SelectRows(variable,  each ([Column1] <> "" and [Column1]<>null and [Column1]<>"Final" and [Column2] <> "" and [Column2]<>null)),
#"Promoted Headers" = Table.PromoteHeaders(#"Filtered Rows1", [PromoteAllScalars=true]),
#"Removed Columns" = Table.RemoveColumns(#"Promoted Headers",List.FirstN(Table.ColumnNames(#"Promoted Headers"),3)), 
Names=List.RemoveNulls(List.Transform(Table.ColumnNames(#"Removed Columns"), each  if Text.Contains(_,"Volume") or Text.Contains(_,"Unit") then _ else null)),
#"Removed Other Columns" = Table.SelectColumns(#"Removed Columns",Names),
FirstName=List.First(Table.ColumnNames(#"Removed Other Columns")),
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Removed Other Columns",{FirstName} , "Attribute", "Value"),
#"Renamed Columns" = Table.RenameColumns(#"Unpivoted Other Columns",{{FirstName, "Category"}}),
Transform = Table.TransformColumns(#"Renamed Columns",{{"Attribute",each Text.BeforeDelimiter(_,"_"), type text}}),
//日期
RepeatCount=Table.RowCount(#"Removed Columns"),
Dates = Record.ToList(variable{1}),
b = List.RemoveNulls(List.Transform(Dates, each try if Text.Contains (_,"(") and not Text.Contains (_,"Table")  then Text.BeforeDelimiter(_, "(")  else null otherwise null)),
a={List.Repeat(b,RepeatCount + 2)},
//距离
#"Promoted Headers5" = Table.PromoteHeaders(variable, [PromoteAllScalars=true]),
Distance = List.Repeat(List.RemoveNulls(List.Transform(Table.ColumnNames(#"Promoted Headers5"), each try if Text.Contains(_,"Distance") then Number.From(Text.Select(_,{"0".."9"})) else null otherwise null)),List.Count(Names)-1),DistanceTable= Table.FromList(Distance, Splitter.SplitByNothing(), {"Value"}, null, ExtraValues.Error),
#"Added Custom3" = Table.AddColumn(DistanceTable, "Category", each "Distance"),
//总数
#"Filtered Rows5" = Table.SelectRows(variable, each ([Column1] = "Final")){0},
r=Record.ToList(#"Filtered Rows5"),
s=List.PositionOf(r,"Final"),
rr=List.Transform(r, each if _="" then null else _),
t=List.RemoveNulls(List.RemoveFirstN(rr,s+1)),
TotalTable= Table.FromList(t, Splitter.SplitByNothing(), {"Value"}, null, ExtraValues.Error),
#"Added Custom4" = Table.AddColumn(TotalTable, "Category", each "Final"),
//组合所有内容
h = Table.FromColumns(Table.ToColumns(Transform&#"Added Custom3"&#"Added Custom4"&a),
#"Removed Columns1" = Table.RemoveColumns(h,{"Column2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Columns1",{{"Column1", type text}, {"Column3", type number}, {"Column4", type date}}),
#"Pivoted Column" = Table.Pivot(#"Changed Type", List.Distinct(#"Changed Type"[Column1]), "Column1", "Column3", List.Sum)
in #"Pivoted Column",

Source2 = Folder.Files("c:\temp\"),
#"Filtered Rows" = Table.SelectRows(Source2, each ([Extension] = ".pdf")),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Data", each group([Folder Path]&[Name])),
#"Renamed Columns" = Table.RenameColumns(#"Added Custom",{{"Name", "zName"}}),
#"Expanded Data" = Table.ExpandTableColumn(#"Renamed Columns", "Data",{"Name","Custom"},{"Name","Custom"}),
List = List.Union(List.Transform(#"Expanded Data"[Custom], each Table.ColumnNames(_))), 
#"Expanded Data2" = Table.ExpandTableColumn(#"Expanded Data", "Custom", List,List),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Data2",{{"Column4", type date}})  
in  #"Changed Type";

导入不同格式的PDF中的各种表格到Excel中

编辑以添加错误移除

Source2 = Folder.Files("c:\temp\"),
#"Filtered Rows" = Table.SelectRows(Source2, each ([Extension] = ".pdf")),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Data", each group([Folder Path]&[Name])),
#"Renamed Columns" = Table.RenameColumns(#"Added Custom",{{"Name", "zName"}}),
#"Expanded Data" = Table.ExpandTableColumn(#"Renamed Columns", "Data",{"Name","Custom"},{"Name","Custom"}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Expanded Data", {"Custom"}),
List = List.Union(List.Transform(#"Removed Errors"[Custom], each Table.ColumnNames(_))), 
#"Expanded Data2" = Table.ExpandTableColumn(#"Removed Errors", "Custom", List,List),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Data2",{{"Column4", type date}})  
in  #"Changed Type";

编辑最终说明

[Column1]<> "Final" 更改为 not Text.Contains([Column1], "Final")

each ([Column1] = "Final") 更改为 each (Text.Contains([Column1],"Final"))

s=List.PositionOf(r,"Final") 更改为 `s=if List.PositionOf(r,"

英文:

Based on sample data file posted elsewhere, if actual data format was identical to that, this would work for you to combine all of those PDF files from a specified directory. It does not actually skip any tables in a file since your sample data did not include any data like that to test out, but should be modifiable if necessary by putting some filters before the #"Added Custom" step to remove the non-desirable tables

let 

group= (filename)=&gt;
//open each file and pull out all the tables
let Source = Pdf.Tables(File.Contents(filename), [Implementation=&quot;1.3&quot;]),
List = List.Union(List.Transform(Source[Data], each Table.ColumnNames(_))), 
#&quot;Expanded Data&quot; = Table.ExpandTableColumn(Source, &quot;Data&quot;, List,List),
#&quot;Filtered Rows&quot; = Table.SelectRows(#&quot;Expanded Data&quot;, each Text.StartsWith([Name], &quot;Table&quot;)),
#&quot;Grouped Rows&quot; = Table.Group(#&quot;Filtered Rows&quot;, {&quot;Name&quot;}, {{&quot;data&quot;, each _, type table}}),
#&quot;Added Custom&quot; = Table.AddColumn(#&quot;Grouped Rows&quot;, &quot;Custom&quot;, each split([data]))
in #&quot;Added Custom&quot;,

//process each table
split= (variable)=&gt;  let
#&quot;Filtered Rows1&quot; = Table.SelectRows(variable,  each ([Column1] &lt;&gt; &quot;&quot; and [Column1]&lt;&gt;null and [Column1]&lt;&gt;&quot;Final&quot; and [Column2] &lt;&gt; &quot;&quot; and [Column2]&lt;&gt;null)),
#&quot;Promoted Headers&quot; = Table.PromoteHeaders(#&quot;Filtered Rows1&quot;, [PromoteAllScalars=true]),
#&quot;Removed Columns&quot; = Table.RemoveColumns(#&quot;Promoted Headers&quot;,List.FirstN(Table.ColumnNames(#&quot;Promoted Headers&quot;),3)), 
Names=List.RemoveNulls(List.Transform(Table.ColumnNames(#&quot;Removed Columns&quot;), each  if Text.Contains(_,&quot;Volume&quot;) or Text.Contains(_,&quot;Unit&quot;) then _ else null)),
#&quot;Removed Other Columns&quot; = Table.SelectColumns(#&quot;Removed Columns&quot;,Names),
FirstName=List.First(Table.ColumnNames(#&quot;Removed Other Columns&quot;)),
#&quot;Unpivoted Other Columns&quot; = Table.UnpivotOtherColumns(#&quot;Removed Other Columns&quot;,{FirstName} , &quot;Attribute&quot;, &quot;Value&quot;),
#&quot;Renamed Columns&quot; = Table.RenameColumns(#&quot;Unpivoted Other Columns&quot;,{{FirstName, &quot;Category&quot;}}),
Transform = Table.TransformColumns(#&quot;Renamed Columns&quot;,{{&quot;Attribute&quot;,each Text.BeforeDelimiter(_,&quot;_&quot;), type text}}),
//Dates
RepeatCount=Table.RowCount(#&quot;Removed Columns&quot;),
Dates = Record.ToList(variable{1}),
b = List.RemoveNulls(List.Transform(Dates, each try if Text.Contains (_,&quot;(&quot;) and not Text.Contains (_,&quot;Table&quot;)  then Text.BeforeDelimiter(_, &quot;(&quot;)  else null otherwise null)),
a={List.Repeat(b,RepeatCount + 2)},
//Distance
#&quot;Promoted Headers5&quot; = Table.PromoteHeaders(variable, [PromoteAllScalars=true]),
Distance = List.Repeat(List.RemoveNulls(List.Transform(Table.ColumnNames(#&quot;Promoted Headers5&quot;), each try if Text.Contains(_,&quot;Distance&quot;) then Number.From(Text.Select(_,{&quot;0&quot;..&quot;9&quot;})) else null otherwise null)),List.Count(Names)-1),DistanceTable= Table.FromList(Distance, Splitter.SplitByNothing(), {&quot;Value&quot;}, null, ExtraValues.Error),
#&quot;Added Custom3&quot; = Table.AddColumn(DistanceTable, &quot;Category&quot;, each &quot;Distance&quot;),
//Total
#&quot;Filtered Rows5&quot; = Table.SelectRows(variable, each ([Column1] = &quot;Final&quot;)){0},
r=Record.ToList(#&quot;Filtered Rows5&quot;),
s=List.PositionOf(r,&quot;Final&quot;),
rr=List.Transform(r, each if _=&quot;&quot; then null else _),
t=List.RemoveNulls(List.RemoveFirstN(rr,s+1)),
TotalTable= Table.FromList(t, Splitter.SplitByNothing(), {&quot;Value&quot;}, null, ExtraValues.Error),
#&quot;Added Custom4&quot; = Table.AddColumn(TotalTable, &quot;Category&quot;, each &quot;Final&quot;),
//combine it all
h = Table.FromColumns(Table.ToColumns(Transform&amp;#&quot;Added Custom3&quot;&amp; #&quot;Added Custom4&quot;)&amp;a),
#&quot;Removed Columns1&quot; = Table.RemoveColumns(h,{&quot;Column2&quot;}),
#&quot;Changed Type&quot; = Table.TransformColumnTypes(#&quot;Removed Columns1&quot;,{{&quot;Column1&quot;, type text}, {&quot;Column3&quot;, type number}, {&quot;Column4&quot;, type date}}),
#&quot;Pivoted Column&quot; = Table.Pivot(#&quot;Changed Type&quot;, List.Distinct(#&quot;Changed Type&quot;[Column1]), &quot;Column1&quot;, &quot;Column3&quot;, List.Sum)
in #&quot;Pivoted Column&quot;,

Source2 = Folder.Files(&quot;c:\temp\&quot;),
#&quot;Filtered Rows&quot; = Table.SelectRows(Source2, each ([Extension] = &quot;.pdf&quot;)),
#&quot;Added Custom&quot; = Table.AddColumn(#&quot;Filtered Rows&quot;, &quot;Data&quot;, each group([Folder Path]&amp;[Name])),
#&quot;Renamed Columns&quot; = Table.RenameColumns(#&quot;Added Custom&quot;,{{&quot;Name&quot;, &quot;zName&quot;}}),
#&quot;Expanded Data&quot; = Table.ExpandTableColumn(#&quot;Renamed Columns&quot;, &quot;Data&quot;,{&quot;Name&quot;,&quot;Custom&quot;},{&quot;Name&quot;,&quot;Custom&quot;}),
List = List.Union(List.Transform(#&quot;Expanded Data&quot;[Custom], each Table.ColumnNames(_))), 
#&quot;Expanded Data2&quot; = Table.ExpandTableColumn(#&quot;Expanded Data&quot;, &quot;Custom&quot;, List,List),
#&quot;Changed Type&quot; = Table.TransformColumnTypes(#&quot;Expanded Data2&quot;,{{&quot;Column4&quot;, type date}})  
in  #&quot;Changed Type&quot;

导入不同格式的PDF中的各种表格到Excel中

EDIT to add error removal

Source2 = Folder.Files(&quot;c:\temp\&quot;),
#&quot;Filtered Rows&quot; = Table.SelectRows(Source2, each ([Extension] = &quot;.pdf&quot;)),
#&quot;Added Custom&quot; = Table.AddColumn(#&quot;Filtered Rows&quot;, &quot;Data&quot;, each group([Folder Path]&amp;[Name])),
#&quot;Renamed Columns&quot; = Table.RenameColumns(#&quot;Added Custom&quot;,{{&quot;Name&quot;, &quot;zName&quot;}}),
#&quot;Expanded Data&quot; = Table.ExpandTableColumn(#&quot;Renamed Columns&quot;, &quot;Data&quot;,{&quot;Name&quot;,&quot;Custom&quot;},{&quot;Name&quot;,&quot;Custom&quot;}),
#&quot;Removed Errors&quot; = Table.RemoveRowsWithErrors(#&quot;Expanded Data&quot;, {&quot;Custom&quot;}),
List = List.Union(List.Transform(#&quot;Removed Errors&quot;[Custom], each Table.ColumnNames(_))), 
#&quot;Expanded Data2&quot; = Table.ExpandTableColumn(#&quot;Removed Errors&quot;, &quot;Custom&quot;, List,List),
#&quot;Changed Type&quot; = Table.TransformColumnTypes(#&quot;Expanded Data2&quot;,{{&quot;Column4&quot;, type date}})  
in  #&quot;Changed Type&quot;

EDIT final notes

change [Column1]&lt;&gt;&quot;Final&quot; to not Text.Contains([Column1], &quot;Final&quot;)


change  each ([Column1] = &quot;Final&quot;) to each (Text.Contains([Column1],&quot;Final&quot;))


change s=List.PositionOf(r,&quot;Final&quot;) to s=if List.PositionOf(r,&quot;Final&quot;) = -1 then List.PositionOf(r,&quot;Pre-Flow Final&quot;) else  List.PositionOf(r,&quot;Final&quot;)

huangapple
  • 本文由 发表于 2023年8月11日 02:17:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76878368.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定