英文:
Retrieve values from a list resembling json with Pandas
问题
我在我的DataFrame中有类似JSON但不完全相同的值。
我的列看起来像这样:
> ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', nan,
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
...,
'"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
'"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
'"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']
通常这是一个缺失值,也经常是一行上的两个值:
Immediatejob 是可以的,但其他的值在这里缺失
'"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"'
在第2行,我有两个不同的值
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"'
我尝试检索 immediate job 的值,但不起作用,因为通常一行中有几行。
DF['col'].str.split('immediateJob : ""', expand=True)[1].str.split('"",', expand=True)[0]
我想要的是首先在有多个值的情况下将其分开为多个行,条件是在另一列中的标识符,然后检索 immediate job、quarter 和 year 的值。
英文:
I have values in my DataFrame that look like JSON but not quite.
My column looks like this:
> ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"', nan,
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
...,
'"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
'"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
'"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']
Often it is a missing value, and often it is two values on a single line:
Immediatejob is ok but other are missing here
'"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"'
In line 2, I have two differents values
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"'
I tried to retrieve the values of immediate job, but it does not work because there are often several lines in one.
`DF['col'].str.split('immediateJob : ""', expand=True)[1].str.split('"",', expand=True)[0]`
What I would like is first to separate in several rows when there are several values conditional to the identifier that is in another column and then retrieve the values of immediate job, quarter and year
答案1
得分: 1
['DIA',
'Responsable conformité',
'Head of M360 APAC (if this role is created)',
'Change Manager su aree di business diverse da quelle attuali',
'Manager ITGP']
英文:
If you only want to retrieve immediate job
, you could manipulate a list like this:
list_test = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"',
None,
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
'"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
'"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
'"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']
list_test = [s.replace('}{', '''}'","'{''') for s in list_test if s is not None]
[s.split('immediateJob : ""')[1].split('"",')[0] for s in list_test]
Output:
['DIA',
'Responsable conformité',
'Head of M360 APAC (if this role is created)',
'Change Manager su aree di business diverse da quelle attuali',
'Manager ITGP']
I've removed the Nan value from the beginning, was it necessary to keep it?
答案2
得分: 1
以下是翻译好的部分:
尝试此正则表达式模式和Python代码以查找您想要的每个部分:
"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}"
import re
string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"',
None,
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
'"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
'"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
'"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']
re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")
for i in string:
if i is not None:
print(re_pattern.findall(i))
您的输入结果如下所示:
然后,如果您想选择“immediate job”值,您应该选择每个列表中元组的第一个组件:
import re
string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"',
None,
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
'"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
'"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
'"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']
re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")
ans = []
for i in string:
if i is not None:
for s in re_pattern.findall(i):
ans.append(s[0].replace("\"", ""))
print(ans)
输出结果如下:
['DIA', 'Responsable conformité', 'Responsable projet', 'Head of M360 APAC (if this role is created)', 'Head of EM Research APAC', 'Change Manager su aree di business diverse da quelle attuali', 'Manager ITGP']
英文:
try this regex pattern and Python code to find each section you want:
"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}"
import re
string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"',
None,
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
'"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
'"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
'"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']
re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")
for i in string:
if i is not None:
print(re_pattern.findall(i))
and then, if you want to select "immediate job" values, you should select first component of tuples in each list
import re
string = ['"{immediateJob : ""DIA"",Trimester :""Q"",Year :""}"',
None,
'"{immediateJob : ""Responsable conformité"",Trimester :""Q3"",Year :""2022""}{immediateJob : ""Responsable projet"",Trimester :""Q3"",Year :""2022""}"',
'"{immediateJob : ""Head of M360 APAC (if this role is created)"",Trimester :""Q"",Year :""}{immediateJob : ""Head of EM Research APAC"",Trimester :""Q"",Year :""}"',
'"{immediateJob : ""Change Manager su aree di business diverse da quelle attuali"",Trimester :""Q2"",Year :""2024""}"',
'"{immediateJob : ""Manager ITGP"",Trimester :""Q3"",Year :""2025""}"']
re_pattern = re.compile(r"{\s*immediateJob\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Trimester\s*:\s*(\"\"[\w (){}]+\"\"|\"\")\s*,\s*Year\s*:\s*(\"\"\d+\"\"|\"\")}")
ans = []
for i in string:
if i is not None:
for s in re_pattern.findall(i):
ans.append(s[0].replace("\"", ""))
print(ans)
output
['DIA', 'Responsable conformité', 'Responsable projet', 'Head of M360 APAC (if this role is created)', 'Head of EM Research APAC', 'Change Manager su aree di business diverse da quelle attuali', 'Manager ITGP']
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论