How to use Power Query to find a word in a PDF document and pull a table out

huangapple go评论97阅读模式
英文:

How to use Power Query to find a word in a PDF document and pull a table out

问题

我正在尝试将多个PDF表格导入到PowerQuery中,但这些表格出现在不同的页面上。在一个PDF中,表格可能出现在第5页,而在另一个PDF中,表格可能出现在第10页。在这种情况下,表格的表头是相同的。我该如何使用PowerQuery来查找这些表格的表头并将它们导入到Excel中呢?

谢谢您的帮助。

英文:

I am trying to import multiple PDF tables into PowerQuery but the table appear on different pages. In one pdf, the table can appear on Page 5 while the other the table can appear on page 10. In this case the table headers are the same. How can I use PowerQuery to find this table header and pull these tables into excel?

Thanks for the anticipated help.

答案1

得分: 1

这似乎适用于您的三个示例文件

let PullPDF = (variable)=> let
Source = Pdf.Tables(File.Contents(variable)),
List = List.Union(List.Transform(Source[Data], each Table.ColumnNames(_))), 
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", List,List)   ,
// 假设所有包含“Volume”一词的行上方的数据都是垃圾
Row=Table.RemoveFirstN(#"Expanded Data",List.PositionOf(#"Expanded Data"[Column2],"Volume")),
#"Removed Other Columns" = Table.SelectColumns(Row,{"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7"}),
#"Promoted Headers" = Table.PromoteHeaders(#"Removed Other Columns", [PromoteAllScalars=true]),
// 根据实际列名和格式更改或删除下一行
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"A3 Unit", type text}, {"Volume", Int64.Type}, {"∆V", Int64.Type}, {"Method", type text}, {"Volume_1", Int64.Type}, {"∆V_2", Int64.Type}, {"Delta#(lf)Volume", Int64.Type}})
in  #"Changed Type",


Source2 = Folder.Files("c:\temp\"),
#"Filtered Rows" = Table.SelectRows(Source2, each ([Extension] = ".pdf")),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Data", each PullPDF([Folder Path]&[Name])),
List = List.Union(List.Transform(#"Added Custom"[Data], each Table.ColumnNames(_))), 
#"Expanded Data" = Table.ExpandTableColumn(#"Added Custom", "Data", List,List)   
in  #"Expanded Data"

我在另一个网站上查看了一下,因为我很无聊,但实际上,要求人们登录另一个网站来帮助您在这个网站上并不是最佳选择。

英文:

This seems to work on your three sample files

let PullPDF = (variable)=> let
Source = Pdf.Tables(File.Contents(variable)),
List = List.Union(List.Transform(Source[Data], each Table.ColumnNames(_))), 
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", List,List)   ,
// assume all data above row containing word Volume is garbage
Row=Table.RemoveFirstN(#"Expanded Data",List.PositionOf(#"Expanded Data"[Column2],"Volume")),
#"Removed Other Columns" = Table.SelectColumns(Row,{"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7"}),
#"Promoted Headers" = Table.PromoteHeaders(#"Removed Other Columns", [PromoteAllScalars=true]),
// change or remove next row as needed based on real column names and formats
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"A3 Unit", type text}, {"Volume", Int64.Type}, {"∆V", Int64.Type}, {"Method", type text}, {"Volume_1", Int64.Type}, {"∆V_2", Int64.Type}, {"Delta#(lf)Volume", Int64.Type}})
in  #"Changed Type",


Source2 = Folder.Files("c:\temp\"),
#"Filtered Rows" = Table.SelectRows(Source2, each ([Extension] = ".pdf")),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Data", each PullPDF([Folder Path]&[Name])),
List = List.Union(List.Transform(#"Added Custom"[Data], each Table.ColumnNames(_))), 
#"Expanded Data" = Table.ExpandTableColumn(#"Added Custom", "Data", List,List)   
in  #"Expanded Data"

I looked on the other site since I was bored, but really, requiring people to log into another site to help you on this one is not optimal

huangapple
  • 本文由 发表于 2023年8月9日 11:57:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/76864474.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定