From Python's nested dictionary to flat Pandas dataframe
Question
<details>
<summary>English:</summary>
I have a nested dictionary holding public employment-history information for several people, and I would like to construct panel data similar to the following table.
[![enter image description here][1]][1]
Here is the nested dictionary.
The nested dictionary for person 1 in the above table is as follows.
{'basicInformation': {'individualId': 6092353,
'firstName': 'A','middleName': 'ANTHONY','lastName': 'OLIVETTI',
'otherNames': ['ALBERT A OLIVETTI',
'ALBERT ANTHONY OLIVETTI',
'ANTHONY A OLIVETTI',
'ANTHONY OLIVETTI'],
'bcScope': 'Active',
'iaScope': 'Active',
'daysInIndustryCalculatedDate': '10/16/2013'},
'currentEmployments': [{'firmId': 8174,
'firmName': 'UBS FINANCIAL SERVICES INC.',
'iaOnly': 'N',
'registrationBeginDate': '10/17/2013',
'firmBCScope': 'ACTIVE',
'firmIAScope': 'ACTIVE',
'iaSECNumber': '7163',
'iaSECNumberType': '801',
'bdSECNumber': '16267',
'branchOfficeLocations': [{'locatedAtFlag': 'Y',
'supervisedFromFlag': 'N',
'privateResidenceFlag': 'N',
'branchOfficeId': '88789',
'street1': '1251 AVE OF THE AMERICAS',
'street2': '2ND FLOOR',
'city': 'NEW YORK',
'cityAlias': ['MANHATTAN',
'NEW YORK',
'NEW YORK CITY',
'NY',
'NY CITY',
'NYC'],
'state': 'NY',
'country': 'United States',
'zipCode': '10020',
'latitude': '40.758908',
'longitude': '-73.97902',
'geoLocation': '40.758908,-73.97902',
'nonRegisteredOfficeFlag': 'N',
'elaBeginDate': '07/15/2013'}]}],
'currentIAEmployments': [{'firmId': 8174,
'firmName': 'UBS FINANCIAL SERVICES INC.',
'iaOnly': 'Y',
'registrationBeginDate': '2/24/2014',
'firmBCScope': 'ACTIVE',
'firmIAScope': 'ACTIVE',
'iaSECNumber': '7163',
'iaSECNumberType': '801',
'bdSECNumber': '16267',
'branchOfficeLocations': [{'locatedAtFlag': 'Y',
'supervisedFromFlag': 'N',
'privateResidenceFlag': 'N',
'branchOfficeId': '88789',
'street1': '1251 AVE OF THE AMERICAS',
'street2': '2ND FLOOR',
'city': 'NEW YORK',
'cityAlias': ['MANHATTAN',
'NEW YORK',
'NEW YORK CITY',
'NY',
'NY CITY',
'NYC'],
'state': 'NY',
'country': 'United States',
'zipCode': '10020',
'latitude': '40.758908',
'longitude': '-73.97902',
'geoLocation': '40.758908,-73.97902',
'nonRegisteredOfficeFlag': 'N',
'elaBeginDate': '07/15/2013'}]}],
'previousEmployments': [],
'previousIAEmployments': [],
'disclosureFlag': 'N',
'iaDisclosureFlag': 'N',
'disclosures': [],
'examsCount': {'stateExamCount': 1,
'principalExamCount': 0,
'productExamCount': 3},
'stateExamCategory': [{'examCategory': 'Series 66',
'examName': 'Uniform Combined State Law Examination',
'examTakenDate': '2/18/2014',
'examScope': 'BOTH'}],
'principalExamCategory': [],
'productExamCategory': [{'examCategory': 'SIE',
'examName': 'Securities Industry Essentials Examination',
'examTakenDate': '10/1/2018',
'examScope': 'BC'},
{'examCategory': 'Series 3',
'examName': 'National Commodity Futures Examination',
'examTakenDate': '10/27/2014',
'examScope': 'BC'},
{'examCategory': 'Series 7',
'examName': 'General Securities Representative Examination',
'examTakenDate': '10/17/2013',
'examScope': 'BC'}],
'registrationCount': {'approvedSRORegistrationCount': 10,
'approvedFinraRegistrationCount': 1,
'approvedStateRegistrationCount': 7,
'approvedIAStateRegistrationCount': 2},
'registeredStates': [{'state': 'California',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '5/31/2022'},
{'state': 'Connecticut',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '2/26/2014'},
{'state': 'Florida',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '2/26/2014'},
{'state': 'New Jersey',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/23/2014'},
{'state': 'New Jersey',
'regScope': 'IA',
'status': 'APPROVED',
'regDate': '2/24/2014'},
{'state': 'New York',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '2/18/2014'},
{'state': 'New York',
'regScope': 'IA',
'status': 'APPROVED',
'regDate': '10/26/2021'},
{'state': 'North Carolina',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '5/31/2022'},
{'state': 'Pennsylvania',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '2/26/2014'}],
'registeredSROs': [{'sro': 'BOX Exchange LLC', 'status': 'APPROVED'},
{'sro': 'Cboe Exchange, Inc.', 'status': 'APPROVED'},
{'sro': 'FINRA', 'status': 'APPROVED'},
{'sro': 'NYSE American LLC', 'status': 'APPROVED'},
{'sro': 'NYSE Arca, Inc.', 'status': 'APPROVED'},
{'sro': 'NYSE Chicago, Inc.', 'status': 'APPROVED'},
{'sro': 'Nasdaq ISE, LLC', 'status': 'APPROVED'},
{'sro': 'Nasdaq PHLX LLC', 'status': 'APPROVED'},
{'sro': 'Nasdaq Stock Market', 'status': 'APPROVED'},
{'sro': 'New York Stock Exchange', 'status': 'APPROVED'}],
'brokerDetails': {'hasBCComments': 'N',
'hasIAComments': 'N',
'legacyReportStatusDescription': 'Not Requested'}}
The nested dictionary for person 2 in the above table is as follows.
{'basicInformation': {'individualId': 2652161,
'firstName': 'ALBERT',
'middleName': 'B',
'lastName': 'HORMAN',
'otherNames': ['A B HORMAN', 'ALBERT WILLIAM HORMAN', 'BILL HORMAN'],
'bcScope': 'Active',
'iaScope': 'Active',
'daysInIndustryCalculatedDate': '9/17/1995'},
'currentEmployments': [{'firmId': 7784,
'firmName': 'FIDELITY BROKERAGE SERVICES LLC',
'iaOnly': 'N',
'registrationBeginDate': '1/1/2008',
'firmBCScope': 'ACTIVE',
'firmIAScope': 'NOTINSCOPE',
'bdSECNumber': '23292',
'branchOfficeLocations': [{'locatedAtFlag': 'Y',
'supervisedFromFlag': 'N',
'privateResidenceFlag': 'N',
'branchOfficeId': '369366',
'street1': '825 EAST 1180 SOUTH',
'city': 'AMERICAN FORK',
'cityAlias': ['AM FORK', 'AMERICAN FORK', 'HIGHLAND', 'TIMPANOGOS'],
'state': 'UT',
'country': 'United States',
'zipCode': '84003',
'latitude': '40.405984',
'longitude': '-111.82903',
'geoLocation': '40.405984,-111.82903',
'nonRegisteredOfficeFlag': 'N',
'elaBeginDate': '07/04/2022'}]}],
'currentIAEmployments': [{'firmId': 288590,
'firmName': 'FIDELITY PERSONAL AND WORKPLACE ADVISORS',
'iaOnly': 'Y',
'registrationBeginDate': '7/13/2018',
'firmBCScope': 'NOTINSCOPE',
'firmIAScope': 'ACTIVE',
'iaSECNumber': '112027',
'iaSECNumberType': '801',
'branchOfficeLocations': [{'locatedAtFlag': 'Y',
'supervisedFromFlag': 'N',
'privateResidenceFlag': 'N',
'street1': '245 SUMMER STREET, V2A',
'city': 'BOSTON',
'cityAlias': ['BOSTON'],
'state': 'MA',
'country': 'United States',
'zipCode': '02210',
'latitude': '42.346571',
'longitude': '-71.039563',
'geoLocation': '42.346571,-71.039563',
'nonRegisteredOfficeFlag': 'Y',
'elaBeginDate': '07/13/2018'}]}],
'previousEmployments': [{'iaOnly': 'N',
'bdSECNumber': '35097',
'firmId': 17507,
'firmName': 'FIDELITY INVESTMENTS INSTITUTIONAL SERVICES COMPANY, INC.',
'street1': '49 NORTH 400 WEST',
'city': 'SALT LAKE CITY',
'state': 'UT',
'zipCode': '84101',
'registrationBeginDate': '1/3/2001',
'registrationEndDate': '1/1/2008',
'firmBCScope': 'ACTIVE',
'firmIAScope': 'NOTINSCOPE'},
{'iaOnly': 'N',
'bdSECNumber': '23292',
'firmId': 7784,
'firmName': 'FIDELITY BROKERAGE SERVICES LLC',
'street1': '900 SALEM STREET',
'city': 'SMITHFIELD',
'state': 'RI',
'country': 'UNITED STATES',
'zipCode': '02917',
'registrationBeginDate': '9/18/1995',
'registrationEndDate': '1/4/2001',
'firmBCScope': 'ACTIVE',
'firmIAScope': 'NOTINSCOPE'}],
'previousIAEmployments': [{'iaOnly': 'Y',
'iaSECNumber': '13243',
'iaSECNumberType': '801',
'firmId': 104555,
'firmName': 'STRATEGIC ADVISERS LLC',
'street1': '49 NORTH 400 WEST',
'city': 'SALT LAKE CITY',
'state': 'UT',
'country': 'United States',
'zipCode': '84101',
'registrationBeginDate': '2/15/2008',
'registrationEndDate': '7/13/2018',
'firmBCScope': 'NOTINSCOPE',
'firmIAScope': 'ACTIVE'}],
'disclosureFlag': 'N',
'iaDisclosureFlag': 'N',
'disclosures': [],
'examsCount': {'stateExamCount': 2,
'principalExamCount': 0,
'productExamCount': 2},
'stateExamCategory': [{'examCategory': 'Series 66',
'examName': 'Uniform Combined State Law Examination',
'examTakenDate': '2/26/2008',
'examScope': 'BOTH'},
{'examCategory': 'Series 63',
'examName': 'Uniform Securities Agent State Law Examination',
'examTakenDate': '9/7/1995',
'examScope': 'BC'}],
'principalExamCategory': [],
'productExamCategory': [{'examCategory': 'SIE',
'examName': 'Securities Industry Essentials Examination',
'examTakenDate': '10/1/2018',
'examScope': 'BC'},
{'examCategory': 'Series 7',
'examName': 'General Securities Representative Examination',
'examTakenDate': '9/16/1995',
'examScope': 'BC'}],
'registrationCount': {'approvedSRORegistrationCount': 2,
'approvedFinraRegistrationCount': 1,
'approvedStateRegistrationCount': 52,
'approvedIAStateRegistrationCount': 2},
'registeredStates': [{'state': 'Alabama',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Alaska',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Arizona',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Arkansas',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'California',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Colorado',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Connecticut',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Delaware',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'District of Columbia',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Florida',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Georgia',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Hawaii',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Idaho',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Illinois',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Indiana',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Iowa',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Kansas',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Kentucky',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Louisiana',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Maine',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Maryland',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Massachusetts',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Michigan',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Minnesota',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Mississippi',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Missouri',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Montana',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Nebraska',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Nevada',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'New Hampshire',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'New Jersey',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'New Mexico',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'New York',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'North Carolina',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'North Dakota',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Ohio',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Oklahoma',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Oregon',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Pennsylvania',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Puerto Rico',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Rhode Island',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'South Carolina',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'South Dakota',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Tennessee',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Texas',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Texas',
'regScope': 'IA',
'status': 'APPROVED_RES',
'regDate': '7/13/2018'},
{'state': 'Utah',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Utah',
'regScope': 'IA',
'status': 'APPROVED',
'regDate': '7/13/2018'},
{'state': 'Vermont',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Virginia',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Washington',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'West Virginia',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Wisconsin',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'},
{'state': 'Wyoming',
'regScope': 'BC',
'status': 'APPROVED',
'regDate': '1/1/2008'}],
'registeredSROs': [{'sro': 'FINRA', 'status': 'APPROVED'},
{'sro': 'New York Stock Exchange', 'status': 'APPROVED'}],
'brokerDetails': {'hasBCComments': 'N',
'hasIAComments': 'N',
'legacyReportStatusDescription': 'Not Requested'}}
I tried combining `json_normalize` with `flatten_json`. I ran the following code for person 1 and person 2:

```python
import pandas as pds
from flatten_json import flatten
import json

# person_json holds the JSON for one person. There are two persons here,
# so I run this twice to flatten each nested dictionary.
person_temp = pds.json_normalize(flatten(person_json))

# This line of code is credited to Mr. Timeless
data_frame = (
    person_temp
    .set_axis(person_temp.columns.str.split("_", n=1, expand=True), axis=1)
    .stack(1)
    .droplevel(0)
)
data_frame
```
Edit 1: adding a screenshot of `data_frame`.
The sample `data_frame` looks like this. I show only part of it because its full dimension is 111 rows by 16 columns.
[![enter image description here][2]][2]
The code above gives me a data frame, but I am still trying to construct panel data like the first screenshot. The difficulty is extracting 'Year' and 'City' and arranging them into an (unbalanced) panel data set.
How should I do this?
Any suggestions or comments are welcome.
Thank you very much.
[1]: https://i.stack.imgur.com/QQjPP.png
[2]: https://i.stack.imgur.com/Wkr1a.png
</details>
# Answer 1
**Score**: 1
I suggest a different approach.
First, define the following helper functions:
```python
import pandas as pd


def flatten(data, new_data):
    for key, value in data.items():
        if isinstance(value, dict):
            flatten(value, new_data)
        if isinstance(value, str) or isinstance(value, int) or isinstance(value, list):
            new_data[key] = value
    return new_data


def deal_with_dicts(df, columns):
    for col in columns:
        df = pd.concat([df, pd.json_normalize(df[col])], axis=1)
        df = df.drop(columns=col)
    return df


def deal_with_duplicated_column_names(df):
    duplicates = {k: 1 for k in df.columns}
    new_cols = []
    for col in df.columns:
        if col in new_cols:
            new_cols.append(col + f"_{duplicates[col]}")
            duplicates[col] += 1
        else:
            new_cols.append(col)
    df.columns = new_cols
    return df
```
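To make the behaviour of the `flatten` helper concrete, here is a small sketch on a made-up record (the `toy` dictionary is an assumption for illustration, not part of the data in the question). Entries of nested dictionaries are hoisted to the top level, lists are kept as values, and values of other types, such as floats or `None`, are skipped by the `isinstance` check.

```python
def flatten(data, new_data):
    """Same logic as the helper above: hoist nested dict entries to the top level."""
    for key, value in data.items():
        if isinstance(value, dict):
            flatten(value, new_data)
        if isinstance(value, (str, int, list)):
            new_data[key] = value
    return new_data


# Hypothetical toy record, shaped like a tiny slice of the real data
toy = {
    "basicInformation": {"individualId": 1, "firstName": "A"},
    "currentEmployments": [{"firmId": 8174, "city": "NEW YORK"}],
    "examsCount": {"stateExamCount": 1},
}

flat = flatten(toy, {})
print(flat)
```

Because every nested key lands in the same flat namespace, a key that appears at two different depths would be overwritten by whichever occurrence is visited last.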
Then:
```python
from collections import defaultdict

person1_data = flatten(person1, defaultdict(list))
df = pd.json_normalize(person1_data)

# ROUND 1
for col in df.columns:
    df = df.explode(col)  # Deal with lists of dicts
df = df.reset_index(drop=True)
df = deal_with_dicts(
    df,
    [
        "currentEmployments",
        "currentIAEmployments",
        "stateExamCategory",
        "productExamCategory",
        "registeredStates",
        "registeredSROs",
    ],
)
df = deal_with_duplicated_column_names(df)

# ROUND 2
for col in df.columns:
    df = df.explode(col)  # Deal with lists of dicts
df = df.reset_index(drop=True)
df = deal_with_dicts(df, ["branchOfficeLocations", "branchOfficeLocations_1"])
df = deal_with_duplicated_column_names(df)

# ROUND 3
for col in df.columns:
    df = df.explode(col)  # Deal with lists of dicts
df = df.reset_index(drop=True)
```
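A note on the repeated `explode` calls: exploding several independent list columns in turn produces every combination of their elements, so the row counts multiply. For `person1` that is 4 other names × 3 product exams × 9 registered states × 10 SROs × 6 × 6 city aliases = 38880 rows. A minimal sketch on a made-up frame (the column names and values are assumptions for illustration; `ignore_index=True` is used here only to keep the index unique after each step):

```python
import pandas as pd

# Hypothetical one-row frame with two independent list columns
toy_df = pd.DataFrame(
    {"id": [1], "names": [["a", "b"]], "states": [["NY", "NJ", "CT"]]}
)

for col in toy_df.columns:
    # Each explode repeats the existing rows once per list element
    toy_df = toy_df.explode(col, ignore_index=True)

n_rows = len(toy_df)           # 2 names * 3 states
expected_person1_rows = 4 * 3 * 9 * 10 * 6 * 6
print(n_rows, expected_person1_rows)
```

The flattened frame is therefore a cross product of unrelated attributes rather than one row per event, which is worth keeping in mind before aggregating it.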
The three rounds above give you all the data from the `person1` dictionary as a flattened dataframe:
```python
print(df.info())
```

Output:

```
[38880 rows x 88 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38880 entries, 0 to 38879
Data columns (total 88 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   individualId                      38880 non-null  int64
 1   firstName                         38880 non-null  object
 2   middleName                        38880 non-null  object
 3   lastName                          38880 non-null  object
 4   otherNames                        38880 non-null  object
 5   bcScope                           38880 non-null  object
 6   iaScope                           38880 non-null  object
 7   daysInIndustryCalculatedDate      38880 non-null  object
 8   previousEmployments               0 non-null      object
 9   previousIAEmployments             0 non-null      object
 10  disclosureFlag                    38880 non-null  object
 11  iaDisclosureFlag                  38880 non-null  object
 12  disclosures                       0 non-null      object
 13  stateExamCount                    38880 non-null  int64
 14  principalExamCount                38880 non-null  int64
 15  productExamCount                  38880 non-null  int64
 16  principalExamCategory             0 non-null      object
 17  approvedSRORegistrationCount      38880 non-null  int64
 18  approvedFinraRegistrationCount    38880 non-null  int64
 19  approvedStateRegistrationCount    38880 non-null  int64
 20  approvedIAStateRegistrationCount  38880 non-null  int64
 21  hasBCComments                     38880 non-null  object
 22  hasIAComments                     38880 non-null  object
 23  legacyReportStatusDescription     38880 non-null  object
 24  firmId                            38880 non-null  int64
 25  firmName                          38880 non-null  object
 26  iaOnly                            38880 non-null  object
 27  registrationBeginDate             38880 non-null  object
 28  firmBCScope                       38880 non-null  object
 29  firmIAScope                       38880 non-null  object
 30  iaSECNumber                       38880 non-null  object
 31  iaSECNumberType                   38880 non-null  object
 32  bdSECNumber                       38880 non-null  object
 33  firmId_1                          38880 non-null  int64
 34  firmName_1                        38880 non-null  object
 35  iaOnly_1                          38880 non-null  object
 36  registrationBeginDate_1           38880 non-null  object
 37  firmBCScope_1                     38880 non-null  object
 38  firmIAScope_1                     38880 non-null  object
 39  iaSECNumber_1                     38880 non-null  object
 40  iaSECNumberType_1                 38880 non-null  object
 41  bdSECNumber_1                     38880 non-null  object
 42  examCategory                      38880 non-null  object
 43  examName                          38880 non-null  object
 44  examTakenDate                     38880 non-null  object
 45  examScope                         38880 non-null  object
 46  examCategory_1                    38880 non-null  object
 47  examName_1                        38880 non-null  object
 48  examTakenDate_1                   38880 non-null  object
 49  examScope_1                       38880 non-null  object
 50  state                             38880 non-null  object
 51  regScope                          38880 non-null  object
 52  status                            38880 non-null  object
 53  regDate                           38880 non-null  object
 54  sro                               38880 non-null  object
 55  status_1                          38880 non-null  object
 56  locatedAtFlag                     38880 non-null  object
 57  supervisedFromFlag                38880 non-null  object
 58  privateResidenceFlag              38880 non-null  object
 59  branchOfficeId                    38880 non-null  object
 60  street1                           38880 non-null  object
 61  street2                           38880 non-null  object
 62  city                              38880 non-null  object
 63  cityAlias                         38880 non-null  object
 64  state_1                           38880 non-null  object
 65  country                           38880 non-null  object
 66  zipCode                           38880 non-null  object
 67  latitude                          38880 non-null  object
 68  longitude                         38880 non-null  object
 69  geoLocation                       38880 non-null  object
 70  nonRegisteredOfficeFlag           38880 non-null  object
 71  elaBeginDate                      38880 non-null  object
 72  locatedAtFlag_1                   38880 non-null  object
 73  supervisedFromFlag_1              38880 non-null  object
 74  privateResidenceFlag_1            38880 non-null  object
 75  branchOfficeId_1                  38880 non-null  object
 76  street1_1                         38880 non-null  object
 77  street2_1                         38880 non-null  object
 78  city_1                            38880 non-null  object
 79  cityAlias_1                       38880 non-null  object
 80  state_2                           38880 non-null  object
 81  country_1                         38880 non-null  object
 82  zipCode_1                         38880 non-null  object
 83  latitude_1                        38880 non-null  object
 84  longitude_1                       38880 non-null  object
 85  geoLocation_1                     38880 non-null  object
 86  nonRegisteredOfficeFlag_1         38880 non-null  object
 87  elaBeginDate_1                    38880 non-null  object
dtypes: int64(10), object(78)
memory usage: 26.1+ MB
```
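For the person-year panel the question asks about, flattening everything may be more than is needed. The sketch below is an alternative, not part of the approach above: it reads the employment spells directly from each dictionary and expands them to one row per year. Several things are assumptions. The target table is only shown as an image, so the columns (`individualId`, `year`, `firmName`, `city`) are a guess; `last_year=2023` is an arbitrary closing year for jobs that are still current; the city of a current job is taken from the first branch office; in a year with a job change, the job that started last wins; IA employments are ignored. The helper names `year_of`, `employment_spells`, and `to_panel` are hypothetical.

```python
import pandas as pd


def year_of(date_str):
    """Return the year from an 'M/D/YYYY' string such as '10/17/2013'."""
    return int(date_str.split("/")[-1])


def employment_spells(person, last_year):
    """Yield one (id, firm, city, start_year, end_year) tuple per job."""
    pid = person["basicInformation"]["individualId"]
    for emp in person.get("currentEmployments", []):
        # Current jobs store the city inside branchOfficeLocations;
        # taking the first office is an assumption.
        offices = emp.get("branchOfficeLocations") or [{}]
        yield (pid, emp["firmName"], offices[0].get("city"),
               year_of(emp["registrationBeginDate"]), last_year)
    for emp in person.get("previousEmployments", []):
        # Previous jobs carry the city directly on the record.
        yield (pid, emp["firmName"], emp.get("city"),
               year_of(emp["registrationBeginDate"]),
               year_of(emp["registrationEndDate"]))


def to_panel(people, last_year):
    """Build an unbalanced person-year panel from the raw dictionaries."""
    rows = [
        {"individualId": pid, "year": year, "firmName": firm,
         "city": city, "spell_start": start}
        for person in people
        for pid, firm, city, start, end in employment_spells(person, last_year)
        for year in range(start, end + 1)
    ]
    panel = pd.DataFrame(rows).sort_values(["individualId", "year", "spell_start"])
    # Assumption: in a year with a job change, keep the job that started last.
    panel = panel.drop_duplicates(["individualId", "year"], keep="last")
    return panel.drop(columns="spell_start").reset_index(drop=True)


# Trimmed copies of the two records from the question (only the keys used here)
person1 = {
    "basicInformation": {"individualId": 6092353},
    "currentEmployments": [{
        "firmName": "UBS FINANCIAL SERVICES INC.",
        "registrationBeginDate": "10/17/2013",
        "branchOfficeLocations": [{"city": "NEW YORK"}],
    }],
    "previousEmployments": [],
}
person2 = {
    "basicInformation": {"individualId": 2652161},
    "currentEmployments": [{
        "firmName": "FIDELITY BROKERAGE SERVICES LLC",
        "registrationBeginDate": "1/1/2008",
        "branchOfficeLocations": [{"city": "AMERICAN FORK"}],
    }],
    "previousEmployments": [
        {"firmName": "FIDELITY INVESTMENTS INSTITUTIONAL SERVICES COMPANY, INC.",
         "city": "SALT LAKE CITY",
         "registrationBeginDate": "1/3/2001", "registrationEndDate": "1/1/2008"},
        {"firmName": "FIDELITY BROKERAGE SERVICES LLC", "city": "SMITHFIELD",
         "registrationBeginDate": "9/18/1995", "registrationEndDate": "1/4/2001"},
    ],
}

panel = to_panel([person1, person2], last_year=2023)
print(panel)
```

The panel is unbalanced because each person contributes only the years between their first registration and `last_year`. If the real target table uses a different rule for overlapping years or for the closing year, only `to_panel` needs to change.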