How to deserialize a huge, complex JSON file

Question

I have a huge (approx. 50 GB) JSON file to deserialize. The JSON file consists of 14 arrays, and a short example of it can be found here.

I wrote my POCO file, declaring 15 classes (one for each array, and a root class) and now I am trying to get my data in. Since the original data are huge and come in a zip file I am trying not to unpack the whole thing. Hence, the use of IO.Compression in the following code.

using System.IO.Compression;
using System.Text.Json;
using System.Text.Json.Nodes;

namespace read_and_parse
{
    internal class Program
    {
        static void Main() 
        {
            var fc = new Program();

            string zip_path = @"C:\Projects\BBR\Download_Total\example_json.zip";
            using FileStream file = File.OpenRead(zip_path);
            using (var zip = new ZipArchive(file, ZipArchiveMode.Read))
            {
                foreach (ZipArchiveEntry entry in zip.Entries)
                {

                    string[] name_split = entry.Name.Split('_');
                    string name = name_split.Last().Substring(0, name_split.Last().Length - 5);
                    bool canConvert = long.TryParse(name, out long number1);
                    if (canConvert == true)
                    {
                        Task task = fc.ParseJsonFromZippedFile(entry);
                    }
                }
            }
        }

        private async Task ParseJsonFromZippedFile(ZipArchiveEntry entry)
        {
            JsonSerializerOptions options = new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };
            await using Stream entryStream = entry.Open();

            IAsyncEnumerable<JsonNode?> enumerable = JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(entryStream, options);
            await foreach (JsonNode? obj in enumerable)
            {
                // Parse only subset of the object
                JsonNode? bbrSagNode = obj?["BBRSaglist"];
                if (bbrSagNode is null) continue;
                else
                {
                    var bbrSag = bbrSagNode.Deserialize<BBRSagList>();
                }
            }

        }

    }
}

Unfortunately, I get nothing out of it: it fails in the foreach loop of the task, with a System.Threading.Tasks.VoidTaskResult.

How do I get the data deserialized?

Answer 1

Score: 3

Your root JSON container is not an array, it's an object:

{
    "BBRSagList": [ /* Contents of BBRSagList */ ],
    "BygningList": [ /* Contents of BygningList */ ]
}

You will not be able to use JsonSerializer.DeserializeAsyncEnumerable<T> to deserialize such JSON because this method only supports async streaming deserialization of JSON arrays, not objects. Unfortunately, System.Text.Json does not directly support streaming deserialization of objects, or even streaming in general; it supports pipelining instead. If you need to stream through a file using System.Text.Json you will need to build on this answer by mtosh to https://stackoverflow.com/q/54983533/3744182.
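To make the root-shape restriction concrete, here is a minimal sketch (the class and method names are made up for illustration, and it assumes .NET 6 or later). It counts the elements that DeserializeAsyncEnumerable yields from an in-memory stream. A root-level array is enumerated item by item, while a root-level object such as the one above should be rejected with a JsonException rather than streamed.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;
using System.Threading.Tasks;

public static class RootShapeSketch
{
    // Counts the elements that DeserializeAsyncEnumerable yields for the given JSON text.
    public static async Task<int> CountElementsAsync(string json)
    {
        using var stream = new MemoryStream(Encoding.UTF8.GetBytes(json));
        int count = 0;
        await foreach (JsonNode? element in JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(stream))
            count++;
        return count;
    }

    // Tries both shapes: a root array, then a root object like the one in the question.
    public static async Task RunAsync()
    {
        Console.WriteLine(await CountElementsAsync("[{\"id\":1},{\"id\":2}]"));
        try
        {
            await CountElementsAsync("{\"BBRSagList\":[{\"id\":1}]}");
        }
        catch (JsonException ex)
        {
            // Expected path for a root object: the method only accepts a root-level array.
            Console.WriteLine("Root object rejected: " + ex.Message);
        }
    }
}
```

Calling RootShapeSketch.RunAsync() from an async Main is enough to try both cases.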

As an alternative, you could use Json.NET which is designed for streaming via JsonTextReader. Your JSON object consists of multiple array-valued properties, and using Json.NET you will be able to stream through your entryStream asynchronously, load each array value into a JToken, then call some callback for each token.

First, introduce the following extension methods:

// Beyond the implicit usings of a .NET 6+ project, the code needs these namespaces:
using System.Runtime.CompilerServices; // [EnumeratorCancellation]
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static partial class JsonExtensions
{
    /// <summary>
    /// Asynchronously stream through a stream containing a JSON object whose properties have array values and call some callback for each value specified by property name.
    /// The reader must be positioned on an object or an exception will be thrown.
    /// </summary>
    public static async Task StreamJsonObjectArrayPropertyValues(Stream stream, Dictionary<string, Action<JToken>> itemActions, FloatParseHandling? floatParseHandling = default, DateParseHandling? dateParseHandling = default, CancellationToken cancellationToken = default)
    {
        // StreamReader and JsonTextReader do not implement IAsyncDisposable so let the caller dispose the stream.
        using (var textReader = new StreamReader(stream, leaveOpen: true))
        using (var reader = new JsonTextReader(textReader) { CloseInput = false })
        {
            if (floatParseHandling != null)
                reader.FloatParseHandling = floatParseHandling.Value;
            if (dateParseHandling != null)
                reader.DateParseHandling = dateParseHandling.Value;
            await StreamJsonObjectArrayPropertyValues(reader, itemActions, cancellationToken).ConfigureAwait(false);
        }
    }

    /// <summary>
    /// Asynchronously stream through a given JSON object whose properties have array values and call some callback for each value specified by property name.
    /// The reader must be positioned on an object or an exception will be thrown.
    /// </summary>
    public static async Task StreamJsonObjectArrayPropertyValues(JsonReader reader, Dictionary<string, Action<JToken>> actions, CancellationToken cancellationToken = default)
    {
        var loadSettings = new JsonLoadSettings { LineInfoHandling = LineInfoHandling.Ignore }; // For performance do not load line info.
        (await reader.MoveToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).AssertTokenType(JsonToken.StartObject);
        while ((await reader.ReadToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).TokenType != JsonToken.EndObject)
        {
            if (reader.TokenType != JsonToken.PropertyName)
                throw new JsonReaderException();
            var name = (string)reader.Value!;
            // Advance from the property name to its value.
            await reader.ReadToContentAndAssertAsync(cancellationToken).ConfigureAwait(false);
            if (actions.TryGetValue(name, out var action) && reader.TokenType == JsonToken.StartArray)
            {
                // Load one array item at a time so the whole array is never held in memory.
                await foreach (var token in reader.LoadAsyncEnumerable(loadSettings, cancellationToken).ConfigureAwait(false))
                {
                    action(token);
                }
            }
            else
            {
                // No callback registered (or not an array): skip the value without loading it.
                await reader.SkipAsync(cancellationToken).ConfigureAwait(false);
            }
        }
    }

    /// <summary>
    /// Asynchronously load and return JToken values from a stream containing a JSON array.
    /// The reader must be positioned on an array or an exception will be thrown.
    /// </summary>
    public static async IAsyncEnumerable<JToken> LoadAsyncEnumerable(this JsonReader reader, JsonLoadSettings? settings = default, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        (await reader.MoveToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).AssertTokenType(JsonToken.StartArray);
        cancellationToken.ThrowIfCancellationRequested();
        while ((await reader.ReadToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).TokenType != JsonToken.EndArray)
        {
            cancellationToken.ThrowIfCancellationRequested();
            yield return await JToken.LoadAsync(reader, settings, cancellationToken).ConfigureAwait(false);
        }
        cancellationToken.ThrowIfCancellationRequested();
    }

    public static JsonReader AssertTokenType(this JsonReader reader, JsonToken tokenType) =>
        reader.TokenType == tokenType ? reader : throw new JsonSerializationException(string.Format("Unexpected token {0}, expected {1}", reader.TokenType, tokenType));

    public static async Task<JsonReader> ReadToContentAndAssertAsync(this JsonReader reader, CancellationToken cancellationToken = default) =>
        await (await reader.ReadAndAssertAsync(cancellationToken).ConfigureAwait(false)).MoveToContentAndAssertAsync(cancellationToken).ConfigureAwait(false);

    public static async Task<JsonReader> MoveToContentAndAssertAsync(this JsonReader reader, CancellationToken cancellationToken = default)
    {
        if (reader == null)
            throw new ArgumentNullException(nameof(reader));
        if (reader.TokenType == JsonToken.None)       // Skip past beginning of stream.
            await reader.ReadAndAssertAsync(cancellationToken).ConfigureAwait(false);
        while (reader.TokenType == JsonToken.Comment) // Skip past comments.
            await reader.ReadAndAssertAsync(cancellationToken).ConfigureAwait(false);
        return reader;
    }

    public static async Task<JsonReader> ReadAndAssertAsync(this JsonReader reader, CancellationToken cancellationToken = default)
    {
        if (reader == null)
            throw new ArgumentNullException(nameof(reader));
        if (!await reader.ReadAsync(cancellationToken).ConfigureAwait(false))
            throw new JsonReaderException("Unexpected end of JSON stream.");
        return reader;
    }
}

And now you will be able to do the following, to process the entries in the "BBRSagList" array:

private static async Task ParseJsonFromZippedFile(ZipArchiveEntry entry)
{
    await using Stream entryStream = entry.Open();
    Dictionary<string, Action<JToken>> actions = new ()
    {
        ["BBRSagList"] = ProcessBBRSagList,
    };
    // Let each individual action recognize dates and times.
    await JsonExtensions.StreamJsonObjectArrayPropertyValues(entryStream, actions, dateParseHandling: DateParseHandling.None);
}

static void ProcessBBRSagList(JToken token)
{
    var brsagList = token.ToObject<BBRSagList>();

    // Handle each BBRSagList however you want.
    Console.WriteLine("Deserialized {0}, result = {1}", brsagList, JsonConvert.SerializeObject(brsagList));
}
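To experiment with the streaming idea without the zip file, the following is a simplified synchronous sketch of the same technique, not a replacement for the async extension methods above. The class and method names are invented for illustration, and it assumes the Newtonsoft.Json package is referenced. It walks the root object with JsonTextReader, loads one array item at a time for the requested property, and skips everything else.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class SyncStreamingSketch
{
    // Lazily yields the items of the array-valued property `propertyName` of the root object.
    public static IEnumerable<JToken> StreamArrayProperty(TextReader text, string propertyName)
    {
        using var reader = new JsonTextReader(text) { DateParseHandling = DateParseHandling.None };
        if (!reader.Read() || reader.TokenType != JsonToken.StartObject)
            throw new JsonReaderException("Expected a JSON object at the root.");

        // Each iteration starts on a property name; the loop ends at the root EndObject.
        while (reader.Read() && reader.TokenType == JsonToken.PropertyName)
        {
            var name = (string)reader.Value!;
            if (!reader.Read()) // move from the property name to its value
                throw new JsonReaderException("Unexpected end of JSON stream.");

            if (name == propertyName && reader.TokenType == JsonToken.StartArray)
            {
                // Only the current item is materialized, never the whole array.
                while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                    yield return JToken.Load(reader);
            }
            else
            {
                reader.Skip(); // skip the children of an unwanted array or object
            }
        }
    }
}
```

For example, enumerating SyncStreamingSketch.StreamArrayProperty(new StringReader(json), "BBRSagList") in a foreach loop visits each entry of that array in turn while the other 13 arrays are skipped without being loaded.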

Notes:

  • As observed by Fildor-standswithMods in comments, you must also declare your Main() method as public static async Task Main() and await ParseJsonFromZippedFile(entry):

    public static async Task Main()
    {
        string zip_path = @"C:\Projects\BBR\Download_Total\example_json.zip";
        using FileStream file = File.OpenRead(zip_path);
        using (var zip = new ZipArchive(file, ZipArchiveMode.Read))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                string[] name_split = entry.Name.Split('_');
                string name = name_split.Last().Substring(0, name_split.Last().Length - 5);
                bool canConvert = long.TryParse(name, out long number1);
                if (canConvert)
                {
                    await ParseJsonFromZippedFile(entry);
                }
            }
        }
    }
    

    (I made ParseJsonFromZippedFile() a static method so there is no reason to allocate a Program instance.)
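The reason the await matters can be shown with a small sketch in which a short delay stands in for the real parsing work (ParseAsync and RunAsync are hypothetical names). An async method returns its Task as soon as it reaches its first incomplete await, so a caller that never awaits the task can move on, and in the question's code leave the using block that owns the ZipArchive, before any parsing has finished.

```csharp
using System;
using System.Threading.Tasks;

public static class AwaitSketch
{
    // Stand-in for ParseJsonFromZippedFile: completes only after an asynchronous pause.
    public static async Task<string> ParseAsync(string entryName)
    {
        await Task.Delay(100);
        return "parsed " + entryName;
    }

    public static async Task<(bool doneWithoutAwait, string result)> RunAsync()
    {
        Task<string> task = ParseAsync("example.json"); // started, but not awaited yet
        bool doneWithoutAwait = task.IsCompleted;       // normally false: the work is still pending
        string result = await task;                     // waits for completion and surfaces any exception
        return (doneWithoutAwait, result);
    }
}
```

Awaiting the task is also what surfaces exceptions thrown inside the async method; without it they stay stored on the unobserved Task.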

Demo fiddle here.

Answer 2

Score: 0

Because of my company's security policy, I can't access the data example. Please check whether your JSON has no root element, i.e. it is just a JSON array/list.

I made an example. Try the ToListAsync extension below; you can then deserialize each object and add it to a main list, and so on.

void Main()
{
    JsonSerializerOptions options = new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };
    var jobj = JsonObject.Parse("[{\"name\":\"Tom Cruise\",\"age\":56,\"Born At\":\"Syracuse, NY\",\"Birthdate\":\"July 3, 1962\",\"photo\":\"https://jsonformatter.org/img/tom-cruise.jpg\"},{\"name\":\"Robert Downey Jr.\",\"age\":53,\"Born At\":\"New York City, NY\",\"Birthdate\":\"April 4, 1965\",\"photo\":\"https://jsonformatter.org/img/Robert-Downey-Jr.jpg\"}]");

    var jStream = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes(jobj.ToJsonString()));

    var _enumerable = Task.Run(() => System.Text.Json.JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(jStream, options).ToListAsync());
    foreach (JsonNode obj in _enumerable.Result)
    {
        // Dump() is a LINQPad extension method; use Console.WriteLine outside LINQPad.
        obj.Dump(obj["name"].ToString());
    }
}

public static class AsyncEnumerableExtensions
{
    public static async Task<List<T>> ToListAsync<T>(this IAsyncEnumerable<T> items,
        CancellationToken cancellationToken = default)
    {
        var results = new List<T>();
        await foreach (var item in items.WithCancellation(cancellationToken)
                                        .ConfigureAwait(false))
            results.Add(item);
        return results;
    }
}

huangapple
  • Posted on July 12, 2023 at 22:07:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76671482.html