Python ProcessPoolExecutor issue

Question

I'm trying to process PDF files in a folder using Tesseract. The function works fine when I run it on the files synchronously, but when I try to use ProcessPoolExecutor, nothing happens. I would truly appreciate some help or insight.

System details:

  • Ubuntu 22.04, i5, Python 3.10.6 (venv), VS Code

Code:

# scrap text from pdf's and store content in files for nlp analysis
# tried to use both camelot and tabular and both packages could not scrap the required table contents
# this script implements ocr using tesseract

from glob import glob
import pytesseract
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
from pdf2image import convert_from_path as pdf2img
import pathlib as pl
import multiprocessing as mpc

def ProcessPDF(par_FilePath):
    lstImages = pdf2img(par_FilePath)
    intImgs = len(lstImages)
    strOCRd = ''
    for it, im in enumerate(lstImages):
        npg = '='*50 + f'Pg:{it+1}' + '='*50 + '\n'  # end each page
        pgText = pytesseract.image_to_string(im)  # perform ocr
        strOCRd += pgText + '\n' + npg  # add to string
        print(f'Processing: {pl.Path(par_FilePath).name} : {int(it/intImgs*100)}%')
    fStem = pl.Path(par_FilePath).stem
    fDir = str(pl.Path(par_FilePath).parent) + '/'

    with open(fDir + fStem + '.txt', 'w') as fobj:  # save file
        fobj.write(strOCRd)
    return f'Completed: {pl.Path(par_FilePath).name}'

if __name__ == '__main__':
    strFolderPDF = r'/home/*****/proj/rfp_model/pdfFiles/'
    lstFiles = glob(strFolderPDF + '*.pdf')
    numFiles = len(lstFiles)
    numCPUs = mpc.cpu_count()
    print(f'Starting pool executor, processing {numFiles} files with {numCPUs} workers.')
    with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
        ftrResults = ppe.map(ProcessPDF, lstFiles)

        for ftrResult in as_completed(ftrResults):
            print(ftrResult)

    # this works
    # for ipath in lstFiles:
    #    ProcessPDF(ipath)

Current Outcome:

  • When run in the debugger, I can see each of the threads spin up in the call stack, but then nothing. There is no activity in the console. The system monitor shows each of my 4 CPUs hitting 100%. I experimented by setting 1 and 2 workers, but with no success. I waited for 10 minutes before hitting interrupt.

Expected Outcome:

  • Should see 4 processes kick off and print statements in the console as the function works through each page of each PDF.

Answer 1

Score: 1

ProcessPoolExecutor's map method does not return Future objects; it returns an iterator over the results of the calls, in the order of the inputs.
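
To make the difference concrete, here is a minimal sketch that compares what map and submit hand back. The `square` function is a hypothetical stand-in for ProcessPDF, and ThreadPoolExecutor is used only to keep the example small; both executor classes share the same map and submit interface.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def square(n):
    """Hypothetical worker standing in for ProcessPDF."""
    return n * n


with ThreadPoolExecutor(max_workers=2) as executor:
    mapped = executor.map(square, [1, 2, 3])  # iterator over results, not Futures
    submitted = executor.submit(square, 4)    # a single Future
    results = list(mapped)                    # results arrive in input order

print(results)                         # plain return values
print(isinstance(submitted, Future))   # submit() is what yields a Future
print(submitted.result())
```

Only the object returned by submit can be passed to as_completed; the values produced by map are already the workers' return values.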

Change:

with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
    ftrResults = ppe.map(ProcessPDF, lstFiles)
    for ftrResult in as_completed(ftrResults):
        print(ftrResult)

...to...

with ProcessPoolExecutor(max_workers=numCPUs) as ppe:
    ftrResults = ppe.map(ProcessPDF, lstFiles)
    for ftrResult in ftrResults:
        print(ftrResult)

Your current code will raise an AttributeError, because as_completed expects Future objects and map yields the return values of ProcessPDF instead.
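
If the goal is to print each file's status as soon as it finishes rather than in input order, one option is to keep as_completed and switch from map to submit. The following is only a sketch under assumptions: `fake_ocr` is a hypothetical placeholder for ProcessPDF, and `run_in_completion_order` with its `executor_cls` parameter is a helper invented here for illustration, not part of the original script.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
# ThreadPoolExecutor is imported only so it can be passed as executor_cls.


def fake_ocr(path):
    """Hypothetical stand-in for ProcessPDF: returns a status string."""
    return f"Completed: {path}"


def run_in_completion_order(worker, items, executor_cls=ProcessPoolExecutor,
                            max_workers=None):
    """Submit one task per item; collect (item, message) pairs as tasks finish."""
    finished = []
    with executor_cls(max_workers=max_workers) as executor:
        # submit() returns a Future; remember which input each one belongs to.
        future_to_item = {executor.submit(worker, item): item for item in items}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                finished.append((item, future.result()))
            except Exception as exc:  # the worker raised, or the task failed to run
                finished.append((item, f"Failed: {exc!r}"))
    return finished


if __name__ == "__main__":
    # The file names are placeholders; completion order may vary between runs.
    for item, message in run_in_completion_order(fake_ocr, ["a.pdf", "b.pdf", "c.pdf"]):
        print(message)
```

Calling future.result() inside a try block also surfaces exceptions raised in a worker, which would otherwise go unnoticed until the result is read.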

  • Posted by huangapple on March 12, 2023, 14:29:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75711427.html