Extremely slow POST times



I'm working on a "PDF manager" that takes in a PDF file and parses its sentences. I store the PDFs in object storage and the sentences in a DB.

My problem is that POST requests are taking an extremely long time. This was not an issue before, and even with version control I could not find exactly what is causing it.

I'm using Azure SQL DB for the database and Azure object storage. I initially thought the DB was the issue, but when I ran the same thing against a local DB I got a connection timeout because of how long it was taking.

The API's method:

    //(POST) API to post a PDF to the database and parse its sentences
    public async Task<ActionResult<List<PDF>>> PostPDF(List<IFormFile> Files, bool? WithImages, bool? WithTxtFile)
    {
        List<PDF> ListPDF = new List<PDF>();
        bool _WithImages = WithImages ?? false;
        bool _WithTxtFile = WithTxtFile ?? false;
        foreach(var file in Files)
        {
            if(System.IO.Path.GetExtension(file.FileName) == ".pdf") //Ensure the file being sent is a PDF
            {
                var FileInBytes = ManipulatorPDF.LoadBytePDF(file); //Byte array representing the PDF
                PdfDocument FileLoader = ManipulatorPDF.LoadPDF(file);
                List<Sentences> Sentences = ManipulatorPDF.GetSentences(FileLoader, _WithImages);
                //Create a new instance of a PDF with all the required parameters.
                //Save the file instance and sentences in Azure blob storage using AzureServices.SaveFile()
                string SentencesInTxt = (_WithTxtFile) ? await AzureServices.SaveFile(ManipulatorPDF.SentencesToText(Sentences), file.FileName.Substring(0, file.FileName.Length - 4)+"_Sentence.txt", "sentences-container")
                : "This file was not posted with 'WithTxtFile' enabled";
                PDF FileInstance = new PDF(file.FileName, file.Length, FileLoader.Pages.Count, Sentences, await AzureServices.SaveFile(FileInBytes, file.FileName, "pdf-container"), SentencesInTxt);

                DB.Add(FileInstance); //Add an instance of the PDF class to the database
                ListPDF.Add(FileInstance); //Add instance to a list
            }
            else
            {
                return BadRequest("Bad request: Only PDFs are accepted. File(s) sent is not a PDF"); //Return bad request message if a non-PDF is sent
            }
        }
        Cache.Remove("ListPDF"); //Unload the cache after a new post
        await DB.SaveChangesAsync();
        return ListPDF; //Return the list of PDFs representing a successful POST
    }

There are two methods here which I initially thought were causing the issue. The first one parses the sentences and places them into a text file, which is then stored in object storage. But I made that optional through the WithTxtFile boolean, and even when that is false the issue persists. The other was WithImages, which runs OCR on the images in the PDF. But again, even when that is turned off through the WithImages boolean, the issue persists.

I'm sure it's not due to the sentence parsing logic since I haven't changed it since starting development, and the issue only started recently.

I also think that the code reaches await DB.SaveChangesAsync() and that a new row is actually created, considering I'm getting the following message in the terminal:

info: Microsoft.EntityFrameworkCore.Database.Command[20101]
      Executed DbCommand (58ms) [Parameters=[@p0='?' (Size = 4000), @p1='?' (DbType = Double), @p2='?' (Size = 4000), @p3='?' (DbType = Int32), @p4='?' (Size = 4000), @p5='?' (DbType = DateTime2)], CommandType='Text', CommandTimeout='30']
      INSERT INTO [PDFs] ([FileLink], [FileSize], [Name], [NumberOfPages], [SentencesLinkTxt], [TimeOfUpload])

Which - from my understanding - indicates that a row has been created and inputted with the required values. But I'm not sure of this point, I could be wrong.

Moreover, the PDFs I'm uploading are rather small. One is 950 KB and contains ~5000 words. Even this takes a while when using Azure storage, and it hits the DB timeout when using a local DB. The only file for which the process finishes successfully is a 3 KB file that contains nothing but basic text and no images.

Any help would be extremely appreciated!


Score: 1





Parsing a file for text and images is not something I would recommend doing within the scope of a request. While you have flags set to potentially skip these, you are still calling code like ManipulatorPDF.GetSentences even if you have the flags set to false.
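To check which stage is actually slow, each call in the action could be wrapped in a timer. This is a minimal sketch, assuming a hypothetical `StepTimer` helper that is not part of the original code; it only uses `System.Diagnostics.Stopwatch` and writes to the console.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper (not from the question's code): logs how long a step takes.
public static class StepTimer
{
    // Times a synchronous step, e.g. the sentence parsing call.
    public static T Time<T>(string label, Func<T> step)
    {
        var sw = Stopwatch.StartNew();
        try { return step(); }
        finally { Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms"); }
    }

    // Times an asynchronous step, e.g. a blob upload or SaveChangesAsync.
    public static async Task<T> TimeAsync<T>(string label, Func<Task<T>> step)
    {
        var sw = Stopwatch.StartNew();
        try { return await step(); }
        finally { Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms"); }
    }
}

// Possible use inside PostPDF (names taken from the question):
//   var Sentences = StepTimer.Time("GetSentences",
//       () => ManipulatorPDF.GetSentences(FileLoader, _WithImages));
```

If `GetSentences` dominates even with both flags off, that would support the point above that the parsing itself is the expensive part of the request.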

When doing something like a file parse based on an upload, this should ideally be treated as a more asynchronous background process. The user uploads a file which saves the uploaded file to storage in a temporary capacity and creates a unique identifier key to send back to the client immediately. A job is then added to a queue to be picked up by a background worker which goes through submitted jobs to process the uploads. The Web client gets back that key along with a current status like "Submitted" or "Pending Processing" and can make another request to the server using that key periodically to get a current status. As the background job processes through the queue and gets to that request, the current status can be updated to "Processing", then finally "Complete" or an "Error" with available details.
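The submit, poll, and process flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a drop-in implementation: `UploadJobQueue`, `JobStatus`, and the `process` callback are hypothetical names, and the status is kept in memory, whereas a real service would likely persist it in the DB.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public enum JobStatus { Submitted, Processing, Complete, Error }

// Hypothetical in-memory job queue; names are illustrative only.
public sealed class UploadJobQueue
{
    private readonly Channel<(Guid Id, string BlobName)> _queue =
        Channel.CreateUnbounded<(Guid Id, string BlobName)>();
    private readonly ConcurrentDictionary<Guid, JobStatus> _status = new();

    // Called from the POST action after the raw file is stored:
    // records the job and returns its key immediately.
    public Guid Enqueue(string blobName)
    {
        var id = Guid.NewGuid();
        _status[id] = JobStatus.Submitted;
        _queue.Writer.TryWrite((id, blobName));
        return id;
    }

    // Called from a status endpoint that the client polls with the key.
    public JobStatus? GetStatus(Guid id) =>
        _status.TryGetValue(id, out var s) ? s : (JobStatus?)null;

    // Signals that no more jobs will be added, so the worker loop can end.
    public void Complete() => _queue.Writer.Complete();

    // Background worker loop: runs outside any request, one job at a time.
    public async Task RunWorkerAsync(Func<string, Task> process, CancellationToken ct = default)
    {
        await foreach (var job in _queue.Reader.ReadAllAsync(ct))
        {
            _status[job.Id] = JobStatus.Processing;
            try
            {
                await process(job.BlobName); // e.g. parse sentences and save them
                _status[job.Id] = JobStatus.Complete;
            }
            catch (Exception)
            {
                _status[job.Id] = JobStatus.Error;
            }
        }
    }
}
```

In ASP.NET Core, a loop like `RunWorkerAsync` would typically be hosted in a `BackgroundService`, with the queue registered as a singleton so the controller and the worker share it.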

The problem with heavyweight jobs like this is not only that a job might not finish within a single request, depending on the size and complexity of the uploaded file, but also that several users could kick off such requests at around the same time. This can quickly cause issues with memory use, CPU load, or licensing limits of the libraries used to parse the PDF, to name a few.
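One way to keep simultaneous uploads from exhausting memory or CPU is to cap how many parses run at once. The sketch below assumes a hypothetical `ParseThrottle` class built on `SemaphoreSlim`; the limit is an arbitrary value to tune, not something taken from the question.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: caps how many parse jobs run at the same time.
public sealed class ParseThrottle
{
    private readonly SemaphoreSlim _slots;

    public ParseThrottle(int maxConcurrent) =>
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    public async Task<T> RunAsync<T>(Func<Task<T>> work, CancellationToken ct = default)
    {
        await _slots.WaitAsync(ct); // waits here while all slots are busy
        try
        {
            return await work();
        }
        finally
        {
            _slots.Release();       // free the slot even if the work throws
        }
    }
}
```

A single shared instance, for example `new ParseThrottle(2)`, would wrap the parsing call in each background job so that at most two PDFs are parsed concurrently.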

  • Posted on April 17, 2023 at 06:29:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76030634.html



