How to retrieve tables that exist in a PDF using AWS Textract in Java
Question
I found the article below, which shows how to do this in Python.
https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html
I also used the article below to extract text.
https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html
However, that article only helped me get the text. I also called "block.getBlockType()"
on each Block, but no block returned its type as "CELL", even though the image/PDF contains tables.
Please help me find a Java library similar to "boto3" that can extract all tables.
Answer 1
Score: 5
I created model classes for each data structure in the JSON response; these models can then be used to build a table view in JSF.
public static List<TableModel> getTablesFromTextract(TextractModel textractModel) {
    List<TableModel> tables = null;
    try {
        if (textractModel != null) {
            tables = new ArrayList<>();
            List<BlockModel> tableBlocks = new ArrayList<>();
            Map<String, BlockModel> blockMap = new HashMap<>();
            for (BlockModel block : textractModel.getBlocks()) {
                if (block.getBlockType().equals("TABLE")) {
                    tableBlocks.add(block);
                }
                blockMap.put(block.getId(), block);
            }
            for (BlockModel blockModel : tableBlocks) {
                Map<Long, Map<Long, String>> rowMap = new HashMap<>();
                for (RelationshipModel relationship : blockModel.getRelationships()) {
                    if (relationship.getType().equals("CHILD")) {
                        for (String id : relationship.getIds()) {
                            BlockModel cell = blockMap.get(id);
                            // Skip ids missing from the map and any child that is not a CELL.
                            if (cell != null && "CELL".equals(cell.getBlockType())) {
                                long rowIndex = cell.getRowIndex();
                                long columnIndex = cell.getColumnIndex();
                                // Create the row's column map on first use.
                                Map<Long, String> columnMap = rowMap.computeIfAbsent(rowIndex, k -> new HashMap<>());
                                columnMap.put(columnIndex, getCellText(cell, blockMap));
                            }
                        }
                    }
                }
                tables.add(new TableModel(blockModel, rowMap));
            }
            System.out.println("row Map " + tables.toString());
        }
    } catch (Exception e) {
        LOG.error("Could not get table from textract model", e);
    }
    return tables;
}
private static String getCellText(BlockModel cell, Map<String, BlockModel> blockMap) {
    String text = "";
    try {
        if (cell != null && CollectionUtils.isNotEmpty(cell.getRelationships())) {
            for (RelationshipModel relationship : cell.getRelationships()) {
                if (relationship.getType().equals("CHILD")) {
                    for (String id : relationship.getIds()) {
                        BlockModel word = blockMap.get(id);
                        if (word.getBlockType().equals("WORD")) {
                            text += word.getText() + " ";
                        } else if (word.getBlockType().equals("SELECTION_ELEMENT")) {
                            if (word.getSelectionStatus().equals("SELECTED")) {
                                text += "X ";
                            }
                        }
                    }
                }
            }
        }
    } catch (Exception e) {
        LOG.error("Could not get cell text of table", e);
    }
    return text;
}
TableModel to create the view from:
public class TableModel {
    private BlockModel table;
    private Map<Long, Map<Long, String>> rowMap;
    public TableModel(BlockModel table, Map<Long, Map<Long, String>> rowMap) {
        this.table = table;
        this.rowMap = rowMap;
    }
    public BlockModel getTable() {
        return table;
    }
    public void setTable(BlockModel table) {
        this.table = table;
    }
    public Map<Long, Map<Long, String>> getRowMap() {
        return rowMap;
    }
    public void setRowMap(Map<Long, Map<Long, String>> rowMap) {
        this.rowMap = rowMap;
    }
    @Override
    public String toString() {
        return table.getId() + " - " + rowMap.toString();
    }
}
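As a rough illustration of how the rowMap built above might be consumed outside JSF, the sketch below renders a rowIndex -> (columnIndex -> text) map as CSV. The names TableCsvSketch, toCsv and quote are hypothetical and not part of the answer or of any AWS library; the sketch only assumes the map shape used by TableModel and that Textract row and column indexes start at 1.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper, not part of the answer's code or of the AWS SDK.
class TableCsvSketch {

    // Renders rowIndex -> (columnIndex -> text) as CSV, in ascending index order.
    static String toCsv(Map<Long, Map<Long, String>> rowMap) {
        // Find the widest column index so every row gets the same number of fields.
        long maxColumn = 0;
        for (Map<Long, String> columns : rowMap.values()) {
            for (long columnIndex : columns.keySet()) {
                maxColumn = Math.max(maxColumn, columnIndex);
            }
        }
        StringBuilder csv = new StringBuilder();
        // TreeMap iterates keys in ascending order; HashMap gives no such guarantee.
        for (Map<Long, String> columns : new TreeMap<>(rowMap).values()) {
            StringJoiner line = new StringJoiner(",");
            // Assumption: column indexes start at 1; missing cells become empty fields.
            for (long column = 1; column <= maxColumn; column++) {
                line.add(quote(columns.getOrDefault(column, "")));
            }
            csv.append(line).append('\n');
        }
        return csv.toString();
    }

    // Trims the trailing space left by word concatenation and escapes double quotes.
    static String quote(String value) {
        return "\"" + value.trim().replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, String>> rowMap = new HashMap<>();
        rowMap.computeIfAbsent(2L, k -> new HashMap<>()).put(1L, "Alice ");
        rowMap.computeIfAbsent(1L, k -> new HashMap<>()).put(2L, "Age ");
        rowMap.computeIfAbsent(1L, k -> new HashMap<>()).put(1L, "Name ");
        System.out.print(toCsv(rowMap));
    }
}
```

With a TableModel from the code above, this could be called as TableCsvSketch.toCsv(tableModel.getRowMap()). Sorting through a TreeMap is a deliberate choice here, since the answer stores rows in a HashMap, whose iteration order is not guaranteed.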
Answer 2
Score: 0
I have something similar:
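import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.textract.TextractClient;
import software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest;
import software.amazon.awssdk.services.textract.model.AnalyzeDocumentResponse;
import software.amazon.awssdk.services.textract.model.Block;
import software.amazon.awssdk.services.textract.model.Document;
import software.amazon.awssdk.services.textract.model.FeatureType;
import software.amazon.awssdk.services.textract.model.TextractException;

// DocumentModel, PageModel, Util and the log field are the author's own classes and are not shown here.
public class AnalyzeDocument {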
	public DocumentModel startProcess(byte[] content) {
		Region region = Region.EU_WEST_2;
		TextractClient textractClient = TextractClient.builder().region(region)
				.credentialsProvider(EnvironmentVariableCredentialsProvider.create()).build();
		return analyzeDoc(textractClient, content);
	}
	public DocumentModel analyzeDoc(TextractClient textractClient, byte[] content) {
		try {
			SdkBytes sourceBytes = SdkBytes.fromByteArray(content);
			Util util = new Util();
			Document myDoc = Document.builder().bytes(sourceBytes).build();
			List<FeatureType> featureTypes = new ArrayList<FeatureType>();
			featureTypes.add(FeatureType.FORMS);
			featureTypes.add(FeatureType.TABLES);
			AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder().featureTypes(featureTypes)
					.document(myDoc).build();
			AnalyzeDocumentResponse analyzeDocument = textractClient.analyzeDocument(analyzeDocumentRequest);
			List<Block> docInfo = analyzeDocument.blocks();
//			util.displayBlockInfo(docInfo);
			PageModel pageModel = util.getTableResults(docInfo);
			DocumentModel documentModel = new DocumentModel();
			documentModel.getPages().add(pageModel);
			Iterator<Block> blockIterator = docInfo.iterator();
			while (blockIterator.hasNext()) {
				Block block = blockIterator.next();
				log.debug("The block type is " + block.blockType().toString());
			}
			return documentModel;
		} catch (TextractException e) {
			System.err.println(e.getMessage());
		}
		return null;
	}
}
And this is the util file:
	public PageModel getTableResults(List<Block> blocks) {
		List<Block> tableBlocks = new ArrayList<>();
		Map<String, Block> blockMap = new HashMap<>();
		for (Block block : blocks) {
			blockMap.put(block.id(), block);
			if (block.blockType().equals(BlockType.TABLE)) {
				tableBlocks.add(block);
				log.debug("added table: " + block.text());
			}
		}
		PageModel page = new PageModel();
		if (tableBlocks.size() == 0) {
			return null;
		}
		int i = 0;
		for (Block table : tableBlocks) {
			page.getTables().add(generateTable(table, blockMap, i++));
		}
		return page;
	}
	private TableModel generateTable(Block table, Map<String, Block> blockMap, int index) {
		TableModel model = new TableModel();
		Map<Integer, Map<Integer, String>> rows = getRowsColumnsMap(table, blockMap);
		model.setTableId("Table_" + index);
		// Textract row and column indexes start at 1, so iterate the keys in sorted order instead of counting from 0.
		for (Map<Integer, String> value : new java.util.TreeMap<>(rows).values()) {
			RowModel rowModel = new RowModel();
			for (String cellText : new java.util.TreeMap<>(value).values()) {
				rowModel.getCells().add(cellText);
			}
			model.getRows().add(rowModel);
		}
		return model;
	}
	private Map<Integer, Map<Integer, String>> getRowsColumnsMap(Block block, Map<String, Block> blockMap) {
		Map<Integer, Map<Integer, String>> rows = new HashMap<>();
		for (Relationship relationship : block.relationships()) {
			if (relationship.type().equals(RelationshipType.CHILD)) {
				for (String childId : relationship.ids()) {
					Block cell = blockMap.get(childId);
					if (cell != null && cell.blockType() == BlockType.CELL) {
						int rowIndex = cell.rowIndex();
						int colIndex = cell.columnIndex();
						if (rows.get(rowIndex) == null) {
							Map<Integer, String> row = new HashMap<>();
							rows.put(rowIndex, row);
						}
						rows.get(rowIndex).put(colIndex, getText(cell, blockMap));
					}
				}
			}
		}
		return rows;
	}
	public String getText(Block block, Map<String, Block> blockMap) {
		String text = "";
		if (block.relationships() != null && block.relationships().size() > 0) {
			for (Relationship relationship : block.relationships()) {
				if (relationship.type().equals(RelationshipType.CHILD)) {
					for (String childId : relationship.ids()) {
						Block wordBlock = blockMap.get(childId);
						if (wordBlock != null && wordBlock.blockType() != null) {
							if (wordBlock.blockType().equals(BlockType.WORD)) {
								text += wordBlock.text() + " ";
							}
						}
					}
				}
			}
		}
		return text;
	}