How to deserialize huge JSON members

Question

I'm calling an API that returns its responses as JSON objects. One of the members of the JSON objects can have a really long (10MiB to 3GiB+) base-64 encoded value. For example:

{
    "name0": "value0",
    "name1": "value1",
    "data": "(very very long base-64 value here)",
    "name2": "value2"
}

I need the data and the other names/values from the body. How do I get this data?

I'm currently using Newtonsoft.Json to (de)serialize JSON data in this application, and for smaller chunks of data, I would usually have a Data property of type byte[], but this data can be more than 2GiB and even if it's smaller than that, there may be so many responses coming back that we could run out of memory.

I'm hoping there is a way to write a custom JsonConverter or something to serialize/deserialize the data gradually as a System.IO.Stream, but I'm not sure how to read a single string "token" that cannot itself fit into memory. Any suggestions?

Answer 1

Score: 0

<details>
<summary>English:</summary>

A 3GiB+ string value is too large to fit in a .NET string, as it will exceed the [maximum .NET string length](https://stackoverflow.com/q/140468).  Thus you **cannot use Json.NET to read your JSON response** because Json.NET's `JsonTextReader` will always fully materialize property values as it reads, [even when skipping them](https://github.com/JamesNK/Newtonsoft.Json/issues/1021).

As for deserializing to a `Stream` or `byte[]` array, as noted in [comments](https://stackoverflow.com/questions/76730752/how-to-deserialize-huge-json-members#comment135276174_76730752) by [Panagiotis Kanavos](https://stackoverflow.com/users/134204/panagiotis-kanavos):

> Neither JSON.NET's JsonTextReader nor System.Text.Json's Utf8JsonReader have a method that retrieves a node as a stream. **All the byte-related methods return the entire content at once.**

Thus for sufficiently large `data` values you will exceed the [maximum .NET array length](https://stackoverflow.com/q/1391672).
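
To make these limits concrete, here is a rough back-of-the-envelope check. It is only a sketch: `SizeCheck` and its members are names made up for illustration, `Array.MaxLength` requires .NET 6 or later, and the string check is an optimistic upper bound (the real maximum string length is lower than `int.MaxValue`).

```csharp
using System;

public static class SizeCheck
{
    public const long GiB = 1L << 30;

    // Base64 encodes every 3 bytes as 4 characters (padding ignored here).
    public static long DecodedLength(long base64Chars) => base64Chars / 4 * 3;

    // string.Length is an int, so this is an upper bound on what a string can hold.
    public static bool FitsInString(long charCount) => charCount <= int.MaxValue;

    // Array.MaxLength is the largest element count the runtime allows for a single array.
    public static bool FitsInByteArray(long byteCount) => byteCount <= Array.MaxLength;
}
```

For a 3 GiB Base64 value, `DecodedLength` gives about 2.25 GiB of binary data, so neither the encoded string nor the decoded `byte[]` can be materialized in one piece.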

So what are your options?

**Firstly**, I would encourage you to try to change the response format.  JSON isn't an ideal format for huge Base64-encoded property values because, in general, most JSON serializers will fully materialize each property.  Instead, as suggested by Panagiotis Kanavos, send the binary data in the response body and the remaining properties as custom headers.  Or see *https://stackoverflow.com/q/53407860* for additional options.  If you do that, you will be able to copy directly from the response body stream to some intermediate stream.
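
If the API could be changed along those lines, the client side might look roughly like the sketch below. The header name `X-Name0` and the `BinaryBodyClient` class are assumptions made purely for illustration; the real API would define its own headers.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class BinaryBodyClient
{
    // "X-Name0" is a hypothetical header name used only for this sketch.
    public static async Task<string> SaveBodyAsync(HttpResponseMessage response, Stream destination)
    {
        response.EnsureSuccessStatusCode();

        // Small values travel as headers instead of JSON members.
        string name0 = response.Headers.TryGetValues("X-Name0", out var values)
            ? values.FirstOrDefault() ?? string.Empty
            : string.Empty;

        // The body is the raw binary payload; it is copied in chunks, never held as one array.
        using Stream body = await response.Content.ReadAsStreamAsync();
        await body.CopyToAsync(destination);
        return name0;
    }
}
```

When calling a real endpoint, request the response with `HttpCompletionOption.ResponseHeadersRead` (for example `client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead)`) so that `HttpClient` does not buffer the whole body before returning.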

**Secondly**, you could attempt to generalize the code from [this answer](https://stackoverflow.com/a/55429664/3744182) by [mtosh](https://stackoverflow.com/users/7217527/mtosh) to *https://stackoverflow.com/q/54983533/3744182*.  That answer shows how to iterate through a stream token-by-token using `Utf8JsonReader` from System.Text.Json.  You could attempt to rewrite that answer to support reading of **individual string values** incrementally.  However, I must admit that I do not know whether `Utf8JsonReader` actually supports reading portions of a property value in chunks without loading the entire value.  As such, I can't recommend this approach.
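
As a small experiment on that open question, the sketch below feeds `Utf8JsonReader` a buffer that ends in the middle of a string value, with `isFinalBlock: false`. `Utf8JsonReaderProbe` and `CountCompleteTokens` are hypothetical names. As far as I can tell, `Read()` returns `false` until the buffer holds the complete token, which would mean the caller has to buffer the entire string value anyway; treat this as something to verify on your target runtime.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class Utf8JsonReaderProbe
{
    // Counts how many tokens the reader can surface from a buffer that may be
    // cut off mid-token (isFinalBlock: false tells the reader more data may follow).
    public static int CountCompleteTokens(string partialJson)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(partialJson);
        var reader = new Utf8JsonReader(utf8, isFinalBlock: false, state: default);
        int count = 0;
        while (reader.Read())
            count++;
        return count;
    }
}
```

With the input `{"name0":"value0","data":"AAAA` (no closing quote), the reader reports the start of the object, the `name0` property and its value, and the `data` property name, but not the unfinished string value.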

**Thirdly**, you could adopt the approach from [this answer](https://stackoverflow.com/a/66095518/3744182) to *https://stackoverflow.com/q/66092495/3744182* and use the reader returned by [`JsonReaderWriterFactory.CreateJsonReader()`](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.serialization.json.jsonreaderwriterfactory.createjsonreader) to manually parse your JSON.  This factory returns an [`XmlDictionaryReader`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmldictionaryreader) that transcodes from JSON to XML on the fly, and thus supports incremental reading of Base64 properties via [`XmlReader.ReadContentAsBase64(Byte[], Int32, Int32)`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreader.readcontentasbase64?#System_Xml_XmlReader_ReadContentAsBase64_System_Byte___System_Int32_System_Int32_).  This is the reader used by WCF's [`DataContractJsonSerializer`](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.serialization.json.datacontractjsonserializer), which is not recommended for new development but has been ported to .NET Core, so it can be used when no other options present themselves.
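
Stripped of any model binding, the core of this technique can be sketched as follows. `Base64MemberCopier` and `CopyBase64Member` are hypothetical names, and the sketch only handles a Base64 member directly under the root object.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Xml;

public static class Base64MemberCopier
{
    // Copies the decoded bytes of one root-level Base64 member to `destination`
    // in fixed-size chunks. Returns the byte count, or -1 if the member is absent.
    public static long CopyBase64Member(Stream json, string memberName, Stream destination, int bufferSize = 8192)
    {
        using var reader = JsonReaderWriterFactory.CreateJsonReader(json, XmlDictionaryReaderQuotas.Max);
        if (reader.MoveToContent() != XmlNodeType.Element)
            throw new XmlException("Expected a JSON object at the root.");

        long total = -1;
        while (reader.Read() && reader.NodeType != XmlNodeType.EndElement)
        {
            if (reader.NodeType != XmlNodeType.Element)
                continue;
            bool isTarget = reader.LocalName == memberName;
            using (var sub = reader.ReadSubtree())
            {
                sub.MoveToContent();
                if (isTarget)
                {
                    total = 0;
                    var buffer = new byte[bufferSize];
                    int read;
                    // Each call decodes at most one buffer's worth of Base64 text.
                    while ((read = sub.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
                    {
                        destination.Write(buffer, 0, read);
                        total += read;
                    }
                }
                else
                {
                    // Drain members we are not interested in.
                    while (sub.Read()) { }
                }
            }
        }
        return total;
    }
}
```

In real use, `destination` would be a temporary file or similar stream rather than anything backed by a single array.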

So, how would this work?  First define a model corresponding to your JSON as follows, with your `Data` property represented as a `Stream`:

	using System;
	using System.IO;
	using System.Threading;

	public partial class Model : IDisposable
	{
		Stream data;
	
		public string Name0 { get; set; }
		public string Name1 { get; set; }
		[System.Text.Json.Serialization.JsonIgnore] // Added for debugging purposes
		public Stream Data { get => data; set => this.data = value; }
		public string Name2 { get; set; }
		
		// Atomically detach the stream so that repeated Dispose() calls are safe.
		public virtual void Dispose() => Interlocked.Exchange(ref data, null)?.Dispose();
	}

Next, define the following static helper methods:

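	using System;
	using System.Diagnostics;
	using System.IO;
	using System.Reflection;
	using System.Runtime.Serialization;
	using System.Runtime.Serialization.Json;
	using System.Xml;

	public class JsonReaderWriterExtensions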
	{
		const int BufferSize = 8192;
		private static readonly Microsoft.IO.RecyclableMemoryStreamManager manager = new ();

		public static Stream CreateTemporaryStream() =>
			// Create some temporary stream to hold the deserialized binary data.  
			// Could be a FileStream created with FileOptions.DeleteOnClose or a Microsoft.IO.RecyclableMemoryStream
			// File.Create(Path.GetTempFileName(), BufferSize, FileOptions.DeleteOnClose);
			manager.GetStream();
		
		public static T DeserializeModelWithStreams<T>(Stream inputStream) where T : new() =>
			PopulateModelWithStreams(inputStream, new T());

		public static T PopulateModelWithStreams<T>(Stream inputStream, T model)
		{
			ArgumentNullException.ThrowIfNull(inputStream);
			ArgumentNullException.ThrowIfNull(model);

			var type = model.GetType();
			
			using (var reader = JsonReaderWriterFactory.CreateJsonReader(inputStream, XmlDictionaryReaderQuotas.Max))
			{
				// TODO: Stream-valued properties not at the root level.
				if (reader.MoveToContent() != XmlNodeType.Element)
					throw new XmlException();
				while (reader.Read() && reader.NodeType != XmlNodeType.EndElement)
				{
					switch (reader.NodeType)
					{
						case XmlNodeType.Element:
							var name = reader.LocalName;
							// TODO:
							// Here we could use DataMemberAttribute.Name or other attributes to build a contract mapping the type to the JSON.
							var property = type.GetProperty(name, BindingFlags.IgnoreCase | BindingFlags.Public | BindingFlags.Instance);
							if (property == null || !property.CanWrite || property.GetIndexParameters().Length > 0 || Attribute.IsDefined(property, typeof(IgnoreDataMemberAttribute)))
								continue;
							// Deserialize the value
							using (var subReader = reader.ReadSubtree())
							{
								subReader.MoveToContent();
								if (typeof(Stream).IsAssignableFrom(property.PropertyType))
								{
									var streamValue = CreateTemporaryStream();	
									byte[] buffer = new byte[BufferSize];
									int readBytes = 0;
									while ((readBytes = subReader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
										streamValue.Write(buffer, 0, readBytes);
									if (streamValue.CanSeek)
										streamValue.Position = 0;
									property.SetValue(model, streamValue);
								}
								else
								{
									var settings = new DataContractJsonSerializerSettings
									{
										RootName = name,
										// Modify other settings as required e.g. DateTimeFormat.
									};
									var serializer = new DataContractJsonSerializer(property.PropertyType, settings);
									var value = serializer.ReadObject(subReader);
									if (value != null)
										property.SetValue(model, value);
								}
							}
							Debug.Assert(reader.NodeType == XmlNodeType.EndElement);
							break;
						default:
							reader.Skip();
							break;
					}
				}
			}

			return model;
		}
	}

And now you could deserialize your model as follows:

    using var model = JsonReaderWriterExtensions.DeserializeModelWithStreams<Model>(responseStream);

Notes:


 1. Since the value of `data` may be arbitrarily large, you cannot deserialize its contents into a `MemoryStream`, which is backed by a single `byte[]` and so is subject to the same array length limit.  Alternatives include:

     - A temporary `FileStream` e.g. as returned by `File.Create(Path.GetTempFileName(), BufferSize, FileOptions.DeleteOnClose)`.
     - A [`RecyclableMemoryStream`](https://github.com/microsoft/Microsoft.IO.RecyclableMemoryStream) as returned by MSFT's [`Microsoft.IO.RecyclableMemoryStream`](https://www.nuget.org/packages/Microsoft.IO.RecyclableMemoryStream/) NuGet package.

    The demo code above uses `RecyclableMemoryStream` but you could change it to use a `FileStream` if you prefer.  Either way you will need to dispose of it after you are done.

 2. I am using reflection to bind C# properties to JSON properties by name, ignoring case.  For properties whose value type is not a `Stream`, I am using `DataContractJsonSerializer` to deserialize their values.  This serializer has many quirks, such as a funky default `DateTime` format, so you may need to play around with your [`DataContractJsonSerializerSettings`](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.serialization.json.datacontractjsonserializersettings), or deserialize certain properties manually.

 3. My method `JsonReaderWriterExtensions.DeserializeModelWithStreams()` only supports `Stream`-valued properties at the root level.  If you have nested huge Base64-valued properties you will need to rewrite `JsonReaderWriterExtensions.PopulateModelWithStreams()` to be recursive (which basically would amount to writing your own serializer).

 4. For a discussion of how the reader returned by `JsonReaderWriterFactory` transcodes from JSON to XML, see *https://stackoverflow.com/q/59839437/3744182* and *[Mapping Between JSON and XML](https://learn.microsoft.com/en-us/dotnet/framework/wcf/feature-details/mapping-between-json-and-xml)*.
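
To get a feel for that JSON-to-XML mapping, a small sketch like the following lists the nodes the reader produces for a given JSON document. `JsonXmlMappingDemo` and `DescribeNodes` are hypothetical names; the sketch only records element names and text values and ignores attributes.

```csharp
using System.Collections.Generic;
using System.Runtime.Serialization.Json;
using System.Text;
using System.Xml;

public static class JsonXmlMappingDemo
{
    // Returns one "NodeType:name-or-value" entry per node surfaced by the JSON reader.
    public static List<string> DescribeNodes(string json)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(json);
        var nodes = new List<string>();
        using (var reader = JsonReaderWriterFactory.CreateJsonReader(utf8, XmlDictionaryReaderQuotas.Max))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        nodes.Add("Element:" + reader.LocalName);
                        break;
                    case XmlNodeType.EndElement:
                        nodes.Add("EndElement:" + reader.LocalName);
                        break;
                    case XmlNodeType.Text:
                        nodes.Add("Text:" + reader.Value);
                        break;
                }
            }
        }
        return nodes;
    }
}
```

The JSON object itself appears as an element named `root`, and each member appears as a child element named after the member, which is why the deserialization code above can match `reader.LocalName` against property names.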


Demo fiddle [here](https://dotnetfiddle.net/4pcEPb).

</details>



huangapple
  • Published on July 20, 2023 at 22:09:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76730752.html