Python ProcessPoolExecutor issue
Question
I'm trying to process the PDF files in a folder using tesseract. The function works fine when I run it on the files synchronously, but when I try to use ProcessPoolExecutor, nothing happens. I would truly appreciate some help or insight.
System details:
- Ubuntu 22.04, i5, Python 3.10.6 (venv), VS Code
Code:
# scrape text from PDFs and store the content in files for NLP analysis
# tried both camelot and tabular; neither package could scrape the required table contents
# this script implements OCR using tesseract
from glob import glob
import pytesseract
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
from pdf2image import convert_from_path as pdf2img
import pathlib as pl
import multiprocessing as mpc

def ProcessPDF(par_FilePath):
    lstImages = pdf2img(par_FilePath)
    intImgs = len(lstImages)
    strOCRd = ''
    for it, im in enumerate(lstImages):
        npg = '='*50 + f'Pg:{it+1}' + '='*50 + '\n'  # end each page
        pgText = pytesseract.image_to_string(im)  # perform OCR
        strOCRd += pgText + '\n' + npg  # add to string
        print(f'Processing: {pl.Path(par_FilePath).name} : {int(it/intImgs*100)}%')
    fStem = pl.Path(par_FilePath).stem
    fDir = str(pl.Path(par_FilePath).parent) + '/'
    with open(fDir + fStem + '.txt', 'w') as fobj:  # save file
        fobj.write(strOCRd)
    return f'Completed: {pl.Path(par_FilePath).name}'

if __name__ == '__main__':
    strFolderPDF = r'/home/*****/proj/rfp_model/pdfFiles/'
    lstFiles = glob(strFolderPDF + '*.pdf')
    numFiles = len(lstFiles)
    numCPUs = mpc.cpu_count()
    print(f'Starting pool executor, processing {numFiles} files with {numCPUs} workers.')
    with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
        ftrResults = ppe.map(ProcessPDF, lstFiles)
        for ftrResult in as_completed(ftrResults):
            print(ftrResult)
    # this works
    # for ipath in lstFiles:
    #     ProcessPDF(ipath)
Current Outcome:
- When I run it in the debugger, I can see each of the threads spin up in the call stack, but then nothing happens. There is no activity in the console. The system monitor shows all 4 of my CPUs at 100%. I experimented with 1 and 2 workers, without success. I waited 10 minutes before hitting interrupt.
Expected Outcome:
- I should see 4 processes kick off and print statements in the console as the function works through each page of each PDF.
Answer 1
Score: 1
ProcessPoolExecutor's map() method does not return Future objects. It returns an iterator that yields the results themselves, in the order of the inputs.
Change:
with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
    ftrResults = ppe.map(ProcessPDF, lstFiles)
    for ftrResult in as_completed(ftrResults):
        print(ftrResult)
...to...
with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
    ftrResults = ppe.map(ProcessPDF, lstFiles)
    for ftrResult in ftrResults:
        print(ftrResult)
Your current code will raise an AttributeError, because as_completed() expects Future objects and is being handed the result strings from map() instead.
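If the goal is to print each result as soon as its file finishes, as_completed() can still be used, but it needs the Future objects that submit() returns rather than the output of map(). The following is a minimal sketch of that pattern, not a drop-in replacement for the script above. The names fake_ocr and run_with_submit are hypothetical, fake_ocr only stands in for ProcessPDF and does no OCR, and the executor_cls parameter is an assumption added so the same helper can be tried with a different executor.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def fake_ocr(name):
    # Hypothetical stand-in for ProcessPDF: no OCR, just returns a status string.
    return f'Completed: {name}'


def run_with_submit(worker, items, executor_cls=ProcessPoolExecutor, max_workers=2):
    """Submit one task per item and collect the results in completion order."""
    results = []
    with executor_cls(max_workers=max_workers) as pool:
        # submit() returns one Future per task, which is what as_completed() needs.
        futures = [pool.submit(worker, item) for item in items]
        for future in as_completed(futures):
            # result() re-raises any exception that was raised inside the worker.
            results.append(future.result())
    return results


if __name__ == '__main__':
    for line in run_with_submit(fake_ocr, ['a.pdf', 'b.pdf', 'c.pdf']):
        print(line)
```

With this pattern the results arrive in whatever order the workers finish, whereas iterating over map() yields them in input order. Either way, calling result() (or iterating map()) in the parent process is what surfaces exceptions raised in the workers.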