How to retrieve tables that exist in a PDF using AWS Textract in Java

Question

I found the article below, which shows how to do this in Python.

https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html

I also used the article below to extract text.

https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html

However, that article only helped me get the text. I also called "block.getBlockType()" on each Block, but no block returned the type "CELL", even though the image/PDF contains tables.

Please help me find a Java library similar to "boto3" that can extract all tables.

Answer 1

Score: 5

What I did was create a model for each dataset in the JSON response; these models can then be used to build a table view in JSF.

public static List<TableModel> getTablesFromTextract(TextractModel textractModel) {
    List<TableModel> tables = null;

    try {
        if (textractModel != null) {
            tables = new ArrayList<>();
            List<BlockModel> tableBlocks = new ArrayList<>();
            Map<String, BlockModel> blockMap = new HashMap<>();

            for (BlockModel block : textractModel.getBlocks()) {
                if (block.getBlockType().equals("TABLE")) {
                    tableBlocks.add(block);
                }
                blockMap.put(block.getId(), block);
            }

            for (BlockModel blockModel : tableBlocks) {
                Map<Long, Map<Long, String>> rowMap = new HashMap<>();

                for (RelationshipModel relationship : blockModel.getRelationships()) {
                    if (relationship.getType().equals("CHILD")) {
                        for (String id : relationship.getIds()) {
                            BlockModel cell = blockMap.get(id);

                            if (cell.getBlockType().equals("CELL")) {
                                long rowIndex = cell.getRowIndex();
                                long columnIndex = cell.getColumnIndex();

                                if (!rowMap.containsKey(rowIndex)) {
                                    rowMap.put(rowIndex, new HashMap<>());
                                }

                                Map<Long, String> columnMap = rowMap.get(rowIndex);
                                columnMap.put(columnIndex, getCellText(cell, blockMap));
                            }
                        }
                    }
                }
                tables.add(new TableModel(blockModel, rowMap));
            }
            System.out.println("row Map " + tables.toString());
        }
    } catch (Exception e) {
        LOG.error("Could not get table from textract model", e);
    }
    return tables;
}

private static String getCellText(BlockModel cell, Map<String, BlockModel> blockMap) {
    String text = "";

    try {
        if (cell != null && CollectionUtils.isNotEmpty(cell.getRelationships())) {
            for (RelationshipModel relationship : cell.getRelationships()) {
                if (relationship.getType().equals("CHILD")) {
                    for (String id : relationship.getIds()) {
                        BlockModel word = blockMap.get(id);
                        if (word.getBlockType().equals("WORD")) {
                            text += word.getText() + " ";
                        } else if (word.getBlockType().equals("SELECTION_ELEMENT")) {
                            if (word.getSelectionStatus().equals("SELECTED")) {
                                text += "X ";
                            }
                        }
                    }
                }
            }
        }
    } catch (Exception e) {
        LOG.error("Could not get cell text of table", e);
    }
    return text;
}

The TableModel used to create the view:

public class TableModel {
    private BlockModel table;
    private Map<Long, Map<Long, String>> rowMap;

    public TableModel(BlockModel table, Map<Long, Map<Long, String>> rowMap) {
        this.table = table;
        this.rowMap = rowMap;
    }

    public BlockModel getTable() {
        return table;
    }

    public void setTable(BlockModel table) {
        this.table = table;
    }

    public Map<Long, Map<Long, String>> getRowMap() {
        return rowMap;
    }

    public void setRowMap(Map<Long, Map<Long, String>> rowMap) {
        this.rowMap = rowMap;
    }

    @Override
    public String toString() {
        return table.getId() + " - " + rowMap.toString();
    }
}
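
For readers who want the CSV output produced by the Python example linked in the question, a rowMap like the one above could be flattened roughly as follows. This is a minimal sketch and not part of the original answer: TableCsvSketch and its methods are hypothetical names, it assumes 1-based row and column indexes as Textract reports them, and it pads missing cells with empty fields.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not from the original answer): turns a
// rowIndex -> (columnIndex -> text) map into CSV text.
class TableCsvSketch {

    // Trims the cell text and quotes it when it contains a comma,
    // a double quote, or a line break.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        String s = field.trim();
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    static String toCsv(Map<Long, Map<Long, String>> rowMap) {
        // Find the widest column index so every line has the same number of fields.
        long maxCol = 0;
        for (Map<Long, String> cols : rowMap.values()) {
            for (Long c : cols.keySet()) {
                maxCol = Math.max(maxCol, c);
            }
        }
        // A TreeMap iterates rows in ascending index order; a HashMap gives no such guarantee.
        Map<Long, Map<Long, String>> sorted = new TreeMap<>(rowMap);
        StringBuilder csv = new StringBuilder();
        for (Map<Long, String> cols : sorted.values()) {
            // Assumption: column indexes start at 1.
            for (long c = 1; c <= maxCol; c++) {
                if (c > 1) {
                    csv.append(',');
                }
                csv.append(escape(cols.get(c)));
            }
            csv.append('\n');
        }
        return csv.toString();
    }
}
```

With the TableModel above, this would be called as TableCsvSketch.toCsv(tableModel.getRowMap()). Trimming is applied because getCellText appends a trailing space after each word.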

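Answer 2

Score: 0

I have something similar. Note that it calls AnalyzeDocument with FeatureType.TABLES; DetectDocumentText only returns PAGE, LINE, and WORD blocks, which is why no TABLE or CELL blocks show up there: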

public class AnalyzeDocument {

	public DocumentModel startProcess(byte[] content) {

		Region region = Region.EU_WEST_2;
		TextractClient textractClient = TextractClient.builder().region(region)
				.credentialsProvider(EnvironmentVariableCredentialsProvider.create()).build();

		return analyzeDoc(textractClient, content);
	}

	public DocumentModel analyzeDoc(TextractClient textractClient, byte[] content) {

		try {
			SdkBytes sourceBytes = SdkBytes.fromByteArray(content);
			Util util = new Util();
			Document myDoc = Document.builder().bytes(sourceBytes).build();

			List<FeatureType> featureTypes = new ArrayList<FeatureType>();
			featureTypes.add(FeatureType.FORMS);
			featureTypes.add(FeatureType.TABLES);

			AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder().featureTypes(featureTypes)
					.document(myDoc).build();

			AnalyzeDocumentResponse analyzeDocument = textractClient.analyzeDocument(analyzeDocumentRequest);
			List<Block> docInfo = analyzeDocument.blocks();
//			util.displayBlockInfo(docInfo);
			PageModel pageModel = util.getTableResults(docInfo);
			DocumentModel documentModel = new DocumentModel();
			// getTableResults returns null when the document contains no tables
			if (pageModel != null) {
				documentModel.getPages().add(pageModel);
			}

			Iterator<Block> blockIterator = docInfo.iterator();

			while (blockIterator.hasNext()) {
				Block block = blockIterator.next();
				log.debug("The block type is " + block.blockType().toString());
			}
			return documentModel;
		} catch (TextractException e) {

			System.err.println(e.getMessage());
		}
		return null;
	}

}

And this is the util file:

public PageModel getTableResults(List<Block> blocks) {
	List<Block> tableBlocks = new ArrayList<>();
	Map<String, Block> blockMap = new HashMap<>();
	for (Block block : blocks) {
		blockMap.put(block.id(), block);
		if (block.blockType().equals(BlockType.TABLE)) {
			tableBlocks.add(block);
			log.debug("added table: " + block.text());
		}
	}
	PageModel page = new PageModel();

	if (tableBlocks.size() == 0) {
		return null;
	}
	int i = 0;
	for (Block table : tableBlocks) {
		page.getTables().add(generateTable(table, blockMap, i++));
	}
	return page;
}

private TableModel generateTable(Block table, Map<String, Block> blockMap, int index) {
	TableModel model = new TableModel();
	Map<Integer, Map<Integer, String>> rows = getRowsColumnsMap(table, blockMap);
	model.setTableId("Table_" + index);
	for (Map.Entry<Integer, Map<Integer, String>> entry : rows.entrySet()) {
		RowModel rowModel = new RowModel();
		Map<Integer, String> value = entry.getValue();
		// Textract column indexes are 1-based, so iterate from 1 to size inclusive
		for (int i = 1; i <= value.size(); i++) {
			rowModel.getCells().add(value.get(i));
		}
		model.getRows().add(rowModel);
	}

	return model;
}

private Map<Integer, Map<Integer, String>> getRowsColumnsMap(Block block, Map<String, Block> blockMap) {

	// TreeMap (java.util.TreeMap) keeps rows in ascending index order; HashMap does not guarantee it
	Map<Integer, Map<Integer, String>> rows = new TreeMap<>();

	for (Relationship relationship : block.relationships()) {
		if (relationship.type().equals(RelationshipType.CHILD)) {
			for (String childId : relationship.ids()) {
				Block cell = blockMap.get(childId);
				if (cell != null) {
					int rowIndex = cell.rowIndex();
					int colIndex = cell.columnIndex();
					if (rows.get(rowIndex) == null) {
						Map<Integer, String> row = new HashMap<>();
						rows.put(rowIndex, row);
					}
					rows.get(rowIndex).put(colIndex, getText(cell, blockMap));
				}
			}
		}
	}
	return rows;
}

public String getText(Block block, Map<String, Block> blockMap) {
	String text = "";
	if (block.relationships() != null && block.relationships().size() > 0) {
		for (Relationship relationship : block.relationships()) {
			if (relationship.type().equals(RelationshipType.CHILD)) {
				for (String childId : relationship.ids()) {
					Block wordBlock = blockMap.get(childId);
					if (wordBlock != null && wordBlock.blockType() != null) {
						if (wordBlock.blockType().equals(BlockType.WORD)) {
							text += wordBlock.text() + " ";
						}
					}
				}
			}
		}
	}

	return text;
}
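
The TABLE -> CELL -> WORD walk that getRowsColumnsMap and getText perform can be tried without AWS credentials. The sketch below is illustrative only: DemoBlock and BlockGraphDemo are hypothetical stand-ins, not AWS SDK classes, and the sample data is made up. It assumes 1-based row and column indexes, as Textract reports them, and uses TreeMap so rows and columns come out in index order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a Textract block; NOT the AWS SDK Block class.
class DemoBlock {
    final String id;
    final String type;        // "TABLE", "CELL" or "WORD"
    final String text;        // only set for WORD blocks
    final int rowIndex;       // 1-based, only meaningful for CELL blocks
    final int columnIndex;    // 1-based, only meaningful for CELL blocks
    final List<String> childIds;

    DemoBlock(String id, String type, String text, int rowIndex, int columnIndex, String... childIds) {
        this.id = id;
        this.type = type;
        this.text = text;
        this.rowIndex = rowIndex;
        this.columnIndex = columnIndex;
        this.childIds = Arrays.asList(childIds);
    }
}

class BlockGraphDemo {

    // Joins the text of all WORD children of a cell with single spaces.
    static String cellText(DemoBlock cell, Map<String, DemoBlock> byId) {
        StringBuilder sb = new StringBuilder();
        for (String id : cell.childIds) {
            DemoBlock child = byId.get(id);
            if (child != null && "WORD".equals(child.type)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(child.text);
            }
        }
        return sb.toString();
    }

    // Walks TABLE -> CELL -> WORD and returns the cells keyed by row, then column.
    static TreeMap<Integer, TreeMap<Integer, String>> tableGrid(DemoBlock table, Map<String, DemoBlock> byId) {
        TreeMap<Integer, TreeMap<Integer, String>> grid = new TreeMap<>();
        for (String id : table.childIds) {
            DemoBlock cell = byId.get(id);
            if (cell == null || !"CELL".equals(cell.type)) {
                continue; // skip anything that is not a table cell
            }
            grid.computeIfAbsent(cell.rowIndex, k -> new TreeMap<>())
                .put(cell.columnIndex, cellText(cell, byId));
        }
        return grid;
    }

    // Made-up sample: a 2x2 table whose last cell is empty.
    static Map<String, DemoBlock> sample() {
        List<DemoBlock> blocks = Arrays.asList(
            new DemoBlock("t1", "TABLE", null, 0, 0, "c1", "c2", "c3", "c4"),
            new DemoBlock("c1", "CELL", null, 1, 1, "w1"),
            new DemoBlock("c2", "CELL", null, 1, 2, "w2", "w3"),
            new DemoBlock("c3", "CELL", null, 2, 1, "w4"),
            new DemoBlock("c4", "CELL", null, 2, 2),
            new DemoBlock("w1", "WORD", "Name", 0, 0),
            new DemoBlock("w2", "WORD", "Unit", 0, 0),
            new DemoBlock("w3", "WORD", "price", 0, 0),
            new DemoBlock("w4", "WORD", "Apple", 0, 0));
        Map<String, DemoBlock> byId = new HashMap<>();
        for (DemoBlock b : blocks) {
            byId.put(b.id, b);
        }
        return byId;
    }

    public static void main(String[] args) {
        Map<String, DemoBlock> byId = sample();
        System.out.println(tableGrid(byId.get("t1"), byId));
    }
}
```

The lookup map keyed by block ID plays the same role as blockMap in the util code above: relationships only carry IDs, so every child has to be resolved through that map.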


huangapple
  • Posted on April 8, 2020, 02:30:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61086945.html