Converting complex JSON to a DataFrame

Question

I have 100 URLs; opening any of them returns a JSON file.
The JSON is somewhat complicated and looks like this:

    {
      "release": [
        {
          "id": "1234",
          "version": "1.0",
          "releaseDate": "2023-07-31",
          "xxx": "ssss",
          "yyy": "uuuu"
        },
        {
          "id": "2345",
          "version": "1.1",
          "releaseDate": "2023-05-12",
          "xxx": "sssss"
        },
        ...
      ],
      "user": false
    }

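I want to count the releases from the past 6 months, but because the JSON is nested, the usual approaches (json.loads, pd.read_json, normalize, ...) do not work for me.

Also, the "..." above stands for further fields, some of which contain HTML tags like the one below, so it would be best to filter on "releaseDate" only.

    "att": "<p><em>as Alice</em> for...."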

What I tried

I can count the releases for all time with this:

    releases = len(json_data['release'])

But how can I limit the count to the past 6 months?
Any help is much appreciated!

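Answer 1

Score: 1

Create a string that contains the date from six months ago, for example:

    six_months_ago = "2023-02-28"

Then use len() with a list comprehension that keeps only the items released on or after that date. Because the dates are in ISO format (YYYY-MM-DD), comparing them as strings gives the same result as comparing them as dates:

    releases = len([r for r in json_data["release"] if r["releaseDate"] >= six_months_ago])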
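
The cutoff string does not have to be typed in by hand. Below is a minimal sketch, using only the standard library, of one way to derive it from a reference date. The helper names months_ago and count_recent_releases are made up for this illustration, the sample data is taken from this thread, and clamping the day to the end of a shorter month is an assumption about how "six months ago" should behave.

```python
import calendar
from datetime import date


def months_ago(today: date, months: int) -> date:
    """Return the date `months` calendar months before `today`.

    The day is clamped to the last day of the target month,
    so 31 August minus 6 months gives 28 (or 29) February.
    """
    # Work in "months since year 0" so crossing a year boundary is plain arithmetic.
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


def count_recent_releases(json_data: dict, today: date, months: int = 6) -> int:
    """Count the releases dated on or after the cutoff."""
    # ISO dates (YYYY-MM-DD) sort identically as strings and as dates,
    # so a string comparison is enough here.
    cutoff = months_ago(today, months).isoformat()
    return len([r for r in json_data["release"] if r["releaseDate"] >= cutoff])


# Sample data in the shape shown in the question.
json_data = {
    "release": [
        {"id": "1234", "version": "1.0", "releaseDate": "2023-07-31"},
        {"id": "2345", "version": "1.1", "releaseDate": "2023-05-12"},
        {"id": "485", "version": "1.2", "releaseDate": "2022-05-12"},
    ],
    "user": False,
}

print(count_recent_releases(json_data, today=date(2023, 7, 31)))  # prints 2
```

In real use, passing today=date.today() keeps the window moving with the calendar; the fixed date above only makes the example reproducible.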

Answer 2

Score: 1

Consider this example:

    import json

    import pandas as pd

    json_string = r"""{
      "release": [
        {
          "id": "1234",
          "version": "1.0",
          "releaseDate": "2023-07-31",
          "xxx": "ssss",
          "yyy": "uuuu"
        },
        {
          "id": "2345",
          "version": "1.1",
          "releaseDate": "2023-05-12",
          "xxx": "sssss"
        },
        {
          "id": "485",
          "version": "1.2",
          "releaseDate": "2022-05-12",
          "xxx": "sssss"
        }
      ],
      "user": false
    }"""

    # Parse the JSON and build the DataFrame from the "release" list only.
    data = json.loads(json_string)
    df = pd.DataFrame(data["release"])

    # Convert the date strings to datetimes so they can be compared.
    df["releaseDate"] = pd.to_datetime(df["releaseDate"], dayfirst=False)
    print(df)

Prints:

         id version releaseDate    xxx   yyy
    0  1234     1.0  2023-07-31   ssss  uuuu
    1  2345     1.1  2023-05-12  sssss   NaN
    2   485     1.2  2022-05-12  sssss   NaN

Then, to filter this DataFrame, you can do:

    now_minus_6_months = pd.Timestamp.now() - pd.DateOffset(months=6)
    print(df[df["releaseDate"] > now_minus_6_months])

The result depends on the current date. Run when this answer was posted (July 31, 2023), it prints:

         id version releaseDate    xxx   yyy
    0  1234     1.0  2023-07-31   ssss  uuuu
    1  2345     1.1  2023-05-12  sssss   NaN
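
The question needs only a count, and it needs one for each of the 100 URLs. Here is a rough sketch of how that could look without pandas, assuming every URL returns JSON in the shape shown in the question. The function names count_since and count_for_urls are invented for this example, and skipping entries that have no "releaseDate" is an assumption rather than something the thread specifies.

```python
import json
from datetime import date
from typing import Dict, Iterable
from urllib.request import urlopen


def count_since(payload: dict, cutoff: date) -> int:
    """Count entries under "release" whose releaseDate is on or after `cutoff`."""
    count = 0
    for item in payload.get("release", []):
        raw = item.get("releaseDate")
        if not raw:
            continue  # assumption: entries without a date are skipped, not errors
        if date.fromisoformat(raw) >= cutoff:
            count += 1
    return count


def count_for_urls(urls: Iterable[str], cutoff: date) -> Dict[str, int]:
    """Download each URL and map it to its number of recent releases."""
    counts = {}
    for url in urls:
        with urlopen(url, timeout=10) as response:
            counts[url] = count_since(json.load(response), cutoff)
    return counts


# Only "releaseDate" is read, so HTML inside other fields is harmless.
sample = json.loads(r"""
{
  "release": [
    {"id": "1234", "version": "1.0", "releaseDate": "2023-07-31"},
    {"id": "2345", "version": "1.1", "releaseDate": "2023-05-12",
     "att": "<p><em>as Alice</em> for...."},
    {"id": "485", "version": "1.2", "releaseDate": "2022-05-12"}
  ],
  "user": false
}
""")

print(count_since(sample, date(2023, 1, 31)))  # prints 2
```

Calling count_for_urls(my_urls, cutoff), where my_urls stands in for the list of 100 URLs, returns one count per URL; sum(result.values()) gives the overall total.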

huangapple
  • Posted on July 31, 2023 at 23:17:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76804983.html