从 Pandas 中检索类似 JSON 结构的列表中的数值。

huangapple go评论67阅读模式
英文:

Retrieve values from a list resembling json with Pandas

问题

我在我的DataFrame中有类似JSON但不完全相同的值。

我的列看起来像这样:

> ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', nan,
       '"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
       ...,
       '"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
       '"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
       '"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']

通常这是一个缺失值,也经常是一行上的两个值:

Immediatejob 是可以的,但其他的值在这里缺失

'"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"'

在第2行,我有两个不同的值
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"'

我尝试检索 immediate job 的值,但不起作用,因为通常一行中有几行。

DF['col'].str.split('immediateJob : ""', expand=True)[1].str.split('"",', expand=True)[0]

我想要的是首先在有多个值的情况下将其分开为多个行,条件是在另一列中的标识符,然后检索 immediate job、quarter 和 year 的值。

英文:

I have values in my DataFrame that look like JSON but not quite.

My column looks like this:

> ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', nan,
       '"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
       ...,
       '"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
       '"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
       '"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']

Often it is a missing value, and often it is two values on a single line:

Immediatejob is ok but other are missing here

'"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"'

In line 2, I have two differents values
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"'

I tried to retrieve the values of immediate job, but it does not work because there are often several lines in one.

`DF['col'].str.split('immediateJob : ""', expand=True)[1].str.split('"",', expand=True)[0]`

What I would like is first to separate in several rows when there are several values conditional to the identifier that is in another column and then retrieve the values of immediate job, quarter and year

答案1

得分: 1

['DIA',
'Responsable conformité',
'Head of M360 APAC (if this role is created)',
'Change Manager su aree di business diverse da quelle attuali',
'Manager ITGP']

英文:

If you only want to retrieve immediate job, you could manipulate a list like this:

list_test = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', 
         None,
         '"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
         '"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
         '"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
         '"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']

list_test = [s.replace('}{', '''}'","'{''') for s in list_test if s is not None]

[s.split('immediateJob : ""')[1].split('"",')[0] for s in list_test]

Output:

['DIA',
 'Responsable conformité',
 'Head of M360 APAC (if this role is created)',
 'Change Manager su aree di business diverse da quelle attuali',
 'Manager ITGP']

I've removed the Nan value from the beginning, was it necessary to keep it?

答案2

得分: 1

以下是翻译好的部分:

尝试此正则表达式模式和Python代码以查找您想要的每个部分:

"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}"
import re

string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', 
        None,
        '"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
        '"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
        '"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
        '"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']

re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")

for i in string:
    if i is not None:
        print(re_pattern.findall(i))

您的输入结果如下所示:

从 Pandas 中检索类似 JSON 结构的列表中的数值。

然后,如果您想选择“immediate job”值,您应该选择每个列表中元组的第一个组件:

import re

string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', 
        None,
        '"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
        '"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
        '"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
        '"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']

re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")

ans = []
for i in string:
    if i is not None:
        for s in re_pattern.findall(i):
            ans.append(s[0].replace("\"", ""))

print(ans)

输出结果如下:

['DIA', 'Responsable conformité', 'Responsable projet', 'Head of M360 APAC (if this role is created)', 'Head of EM Research APAC', 'Change Manager su aree di business diverse da quelle attuali', 'Manager ITGP']
英文:

try this regex pattern and Python code to find each section you want:

"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}"
import re

string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', 
        None,
        '"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
        '"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
        '"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
        '"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']

re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")

for i in string:
    if i is not None:
        print(re_pattern.findall(i))

result on your input:
从 Pandas 中检索类似 JSON 结构的列表中的数值。

and then, if you want to select "immediate job" values, you should select first component of tuples in each list

import re

string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', 
        None,
        '"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
        '"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
        '"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
        '"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']

re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")

ans = []
for i in string:
    if i is not None:
        for s in re_pattern.findall(i):
            ans.append(s[0].replace("\"", ""))

print(ans)

output

['DIA', 'Responsable conformité', 'Responsable projet', 'Head of M360 APAC (if this role is created)', 'Head of EM Research APAC', 'Change Manager su aree di business diverse da quelle attuali', 'Manager ITGP']

huangapple
  • 本文由 发表于 2023年6月29日 23:49:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/76582674.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定