Complicated JSON to DataFrame

Question

I have 100 URLs, and each one returns a JSON file when opened.
The JSON is a little complicated; it looks like this:

{
  "release": [
    {
     "id":"1234",
     "version":"1.0",
     "releaseDate":"2023-07-31",
     "xxx": "ssss",
     "yyy": "uuuu"
    },
    {
     "id" :"2345",
     "version": "1.1",
     "releaseDate":"2023-05-12",
     "xxx":"sssss",
     .....
    }
  ],
  "user":false
}

I want to count the releases from the past 6 months, but because the JSON is complicated, the usual json.loads ... pd.read_json ... normalize ... approaches do not work.

Also, the "....." actually contains some HTML tags like the one below, so it would be better to select only "releaseDate" for filtering.

"att":"<p><em>as Alice</em> for.....

What I tried

I can use this to count the releases for all time:

releases = len(json_data['release'])

But how can I limit it to the past 6 months?
Any help is really appreciated!

Answer 1

Score: 1

Create a string that contains the date from six months ago:

six_months_ago = "2023-02-28"

Then use len() with a list comprehension that keeps only the items released on or after that date. Comparing the strings directly works because dates in YYYY-MM-DD format sort in chronological order:

releases = len([r for r in json_data["release"] if r["releaseDate"] >= six_months_ago])
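
If the cutoff should follow today's date instead of being typed by hand, it can be computed with the standard library. The following is only a sketch under a few assumptions: the helper names `months_ago` and `count_recent` are made up for illustration, "six months ago" is taken to mean the same day of the month six calendar months earlier (clamped to the end of shorter months), and entries without a "releaseDate" key are simply not counted.

```python
import calendar
from datetime import date


def months_ago(today: date, months: int) -> date:
    """Return the date `months` calendar months before `today`.

    The day is clamped to the last day of the target month,
    so 31 August minus 6 months gives 28 (or 29) February.
    """
    # Work in "months since year 0" so the subtraction can cross year boundaries.
    total = today.year * 12 + (today.month - 1) - months
    year, month = divmod(total, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


def count_recent(releases: list, cutoff: date) -> int:
    """Count entries whose "releaseDate" is on or after `cutoff`."""
    cutoff_str = cutoff.isoformat()  # "YYYY-MM-DD" compares correctly as text
    # .get() with a default skips entries that have no "releaseDate" key.
    return sum(1 for r in releases if r.get("releaseDate", "") >= cutoff_str)


# Sample data shaped like the JSON in the question.
json_data = {
    "release": [
        {"id": "1234", "version": "1.0", "releaseDate": "2023-07-31"},
        {"id": "2345", "version": "1.1", "releaseDate": "2023-05-12"},
        {"id": "485", "version": "1.2", "releaseDate": "2022-05-12"},
    ],
    "user": False,
}

cutoff = months_ago(date.today(), 6)
print(cutoff, count_recent(json_data["release"], cutoff))
```

Only "releaseDate" is read here, so the HTML inside the other fields never has to be parsed.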

Answer 2

Score: 1

Consider this example:

import json

import pandas as pd

json_string = r"""{
  "release": [
    {
     "id":"1234",
     "version":"1.0",
     "releaseDate":"2023-07-31",
     "xxx": "ssss",
     "yyy": "uuuu" },
    {
     "id" :"2345",
     "version": "1.1",
     "releaseDate":"2023-05-12",
     "xxx":"sssss"},
    {
     "id" :"485",
     "version": "1.2",
     "releaseDate":"2022-05-12",
     "xxx":"sssss"}
],
"user":false
}"""

data = json.loads(json_string)

# Build the DataFrame from the list under "release" only
df = pd.DataFrame(data["release"])
df["releaseDate"] = pd.to_datetime(df["releaseDate"], dayfirst=False)
print(df)

Prints:

     id version releaseDate    xxx   yyy
0  1234     1.0  2023-07-31   ssss  uuuu
1  2345     1.1  2023-05-12  sssss   NaN
2   485     1.2  2022-05-12  sssss   NaN

Then, to filter this DataFrame, you can do:

now_minus_6_months = pd.Timestamp.now() - pd.DateOffset(months=6)
print(df[df["releaseDate"] > now_minus_6_months])

Prints (when run around the post date, 2023-07-31):

     id version releaseDate    xxx   yyy
0  1234     1.0  2023-07-31   ssss  uuuu
1  2345     1.1  2023-05-12  sssss   NaN
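
Since the question mentions 100 URLs, the same count can be repeated per URL. This is a rough sketch using only the standard library, not a tested client for any particular site: the function names (`fetch_json`, `count_since`, `count_for_urls`) and the example URL are hypothetical, and it assumes every URL returns JSON with the "release" list shown in the question. The fetcher is passed in as a parameter so the loop can be tried offline first.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Download one URL and parse the response body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def count_since(payload: dict, cutoff: str) -> int:
    """Count releases with "releaseDate" >= cutoff (both as YYYY-MM-DD text)."""
    releases = payload.get("release", [])
    return sum(1 for r in releases if r.get("releaseDate", "") >= cutoff)


def count_for_urls(urls, cutoff: str, fetch=fetch_json) -> dict:
    """Return {url: count}; URLs that fail to load or parse are skipped."""
    counts = {}
    for url in urls:
        try:
            counts[url] = count_since(fetch(url), cutoff)
        except (OSError, ValueError) as exc:
            # URLError is an OSError; invalid JSON raises a ValueError subclass.
            print(f"skipping {url}: {exc}")
    return counts


# Offline demo: a dict stands in for the network, keyed by a made-up URL.
fake_pages = {
    "https://example.com/a.json": {
        "release": [
            {"id": "1234", "releaseDate": "2023-07-31"},
            {"id": "485", "releaseDate": "2022-05-12"},
        ],
        "user": False,
    }
}
print(count_for_urls(fake_pages, "2023-01-31", fetch=fake_pages.__getitem__))
```

With real URLs, calling `count_for_urls(list_of_urls, cutoff)` would use `fetch_json` by default. The per-URL counts can then be summed or loaded into a DataFrame if a table is wanted.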

huangapple
  • Posted on July 31, 2023, 23:17:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76804983.html