
Are instances of Azure Functions sharing variables?

Question


Not sure if the question makes sense, but it's what I'm observing. My Azure Function uses a BlobTrigger to process PDF files that are uploaded to a Blob Storage. Things work fine, until I upload several blobs at once, in which case, using the code below I observe the following:

  • The first context.getLogger() correctly logs each blob that triggers the Function.

  • In the Azure File Share, each PDF file is correctly saved.

  • The second context.getLogger() in many cases returns incorrect results (from one of the other files), as if variables are being shared between instances of my Function. Note that lines[19] is unique for each PDF.

  • I notice similar behavior later on in my code where data from the wrong PDF is logged.

EDIT: to be clear, I understand logs aren't going to be in order when multiple instances run in parallel. However, rather than getting 10 unique results for lines[19] when I upload 10 files, the majority of the results are duplicates and this issue worsens later on in my code when based on X I want to do Y, and 9 out of 10 invocations produce garbage data.

Main.class

public class main {
    @FunctionName("veninv")
    @StorageAccount("Storage")
    public void blob(
            @BlobTrigger(
                    name = "blob",
                    dataType = "binary",
                    path = "veninv/{name}")
                byte[] content,
            @BindingName("name") String blobname,
            final ExecutionContext context
            ) {

        context.getLogger().info("BlobTrigger by: " + blobname + " (" + content.length + " bytes)");

        // Writing byte[] to a file in Azure Functions file storage
        File tempfile = new File(tempdir, blobname);
        OutputStream os = new FileOutputStream(tempfile);
        os.write(content);
        os.close();

        String[] lines = Pdf.getLines(tempfile);
        context.getLogger().info(lines[19]);
    }
}

Pdf.class

public static String[] getLines(File PDF) throws Exception {
    PDDocument doc = PDDocument.load(PDF);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(doc);
    lines = text.split(System.getProperty("line.separator"));
    doc.close();
    return lines;
}

I don't really understand what's going on here, so hoping for some assistance.

Answer 1

Score: 5


Yes, Azure Function invocations can share variables. I'd need to see all the code to be 100% certain, but it looks like the lines object is declared as static, and so it can be shared across invocations. Let's try changing the static String[] to a local String[] and see if the problem goes away.

Azure Functions are easy to get off the ground, but it's easy to forget about the execution environment: your function invocations aren't as isolated as they appear. There is a parent thread calling your function, and many static variables aren't thread-safe. A static variable represents global state, so it is globally accessible, and it is not attached to any particular object instance. The "staticness" of a variable relates to the memory space it sits in, not its value, so the same variable is accessible from all instances of the class in which it is referenced.

PS: You've solved the issue in your own answer here by reducing concurrency, but that may come at a cost to scalability, so I'd recommend load testing. Static variables can also be useful: many are thread-safe, and you do want some of them in Azure Functions, such as your HttpClient or SQL client DB connections. Give number three a read, here.
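[Editor's note] To make the failure mode concrete, here is a minimal, hypothetical sketch (not the asker's actual code) of how a static field like lines produces cross-contaminated results. The two "invocations" are simulated sequentially so the interleaving is deterministic; in the real function the same overwrite happens whenever invocation B assigns the field between A's write and a later read:

```java
public class StaticShareDemo {
    // Shared across ALL invocations in the same JVM, exactly like the
    // static 'lines' field in the Pdf class from the question.
    static String[] lines;

    // Same shape as Pdf.getLines(): writes its result into the static field.
    static String[] getLines(String text) {
        lines = text.split("\n");
        return lines;
    }

    // Simulates invocation A being preempted by invocation B between
    // A's write to 'lines' and a later read of the shared field.
    static String simulateRace() {
        getLines("A-line1\nA-line2");  // invocation A writes its PDF's lines
        getLines("B-line1\nB-line2");  // invocation B preempts and overwrites
        return lines[0];               // invocation A reads the field: B's data
    }

    public static void main(String[] args) {
        // Prints "B-line1" even though invocation A processed the "A" PDF.
        System.out.println(simulateRace());
    }
}
```

With a local String[] lines, each invocation gets its own array and the overwrite cannot happen.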

Answer 2

Score: 1


No, it's quite hard to believe that Functions could have such a serious issue. I see some potential problems which might be causing this in your case:

  1. Are you sure you are uploading to a different, unique blob for each file every time? You can check by logging the blobname param.
  2. Since you store the file in the temp directory (File tempfile = new File(tempdir, blobname);), if the blob name is the same as in #1, the file would be overwritten (last write wins). If it's possible to construct the PDF directly from bytes or a stream, consider that instead of creating an intermediate file in the filesystem. If I'm not wrong, you are using PDFBox, which supports loading from byte[]: https://pdfbox.apache.org/docs/2.0.3/javadocs/index.html?org/apache/pdfbox/pdmodel/PDDocument.html (check the load overload which accepts byte[]). I have also answered another question of yours related to this.
  3. Check whether you have a static field causing this.
  4. You don't need the separate queue you are thinking of introducing (though you won't need it at all once the actual issue is fixed): the blob trigger already uses an internal queue. The default concurrency is 24, but you can configure it in host.json. https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob-trigger?tabs=java#concurrency-and-memory-usage
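[Editor's note] As a sketch of point #2: PDFBox 2.x exposes a PDDocument.load overload that accepts a byte[], so the trigger payload can be parsed directly and the temp file (and its name-collision risk) disappears. This is a hedged rewrite assuming PDFBox 2.x on the classpath, not the asker's actual code:

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Pdf {
    // Takes the byte[] from the BlobTrigger binding directly; no temp file,
    // so concurrent invocations cannot overwrite each other's files.
    public static String[] getLines(byte[] content) throws Exception {
        // try-with-resources closes the document even if extraction throws
        try (PDDocument doc = PDDocument.load(content)) {
            String text = new PDFTextStripper().getText(doc);
            // 'lines' stays local to the method, never shared
            return text.split(System.getProperty("line.separator"));
        }
    }
}
```

The function body would then call Pdf.getLines(content) directly instead of writing tempfile first.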

UPDATE:

Looks like in your Pdf class you declared 'lines' somewhere outside the method as static, which is the root cause of this problem. It has nothing to do with Functions, but with the devil of static.

Below is the correct code (note that the 'lines' variable is now local to the method):

public static String[] getLines(File PDF) throws Exception {
    PDDocument doc = PDDocument.load(PDF);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(doc);
    String[] lines = text.split(System.getProperty("line.separator"));
    doc.close();
    return lines;
}

Answer 3

Score: 0


Just wanting to share that changing host.json to the following, to stop concurrent function invocation, appears to have fixed my issue:

{
    "version": "2.0",
    "extensions": {
        "queues": {
            "batchSize": 1,
            "newBatchThreshold": 0
        }
    }
}

Massive thanks to @KrishnenduGhosh-MSFT for their help. I'm still unsure why concurrent function invocation led to the issues I was experiencing, but given that my program also connects to a SQL database and a SharePoint site (both of which are throttled), sequential processing is the best solution regardless.

Posted by huangapple on 2020-08-29 05:41:44.
Original link: https://go.coder-hub.com/63641197.html