How to transform a DataFrame into a data set/object
Question
I have a DataFrame with almost 9 million rows and 30 columns. The further right a column is, the more specific its data, so the values in the first columns are highly repetitive. For example:
park_code | camp_ground | parking_lot |
---|---|---|
acad | campground1 | parking_lot1 |
acad | campground1 | parking_lot2 |
acad | campground2 | parking_lot3 |
bisc | campground3 | parking_lot4 |
I want to feed that information into a result set shaped like an object, for example:
park code: acad
campgrounds: campground1, campground2
parking lots: parking_lot1, parking_lot2, parking_lot3

park code: bisc
campgrounds: campground3, ....
.......

etc.
I'm completely at a loss as to how to do this with pandas. I'm learning as I go, since I'm used to working with SQL and databases rather than pandas. Here is the code that has gotten me this far:
function call:
data_handler.fetch_results(['Wildlife Watching', 'Arts and Culture'], ['Restroom'], ['Acadia National Park'], ['ME'])
def fetch_results(self, activities_selection, amenities_selection, parks_selection, states_selection):
    activities_selection_df = self.activities_df['park_code'][
        self.activities_df['activity_name'].isin(activities_selection)].drop_duplicates()
    amenities_selection_df = self.amenities_parks_df['park_code'][
        self.amenities_parks_df['amenity_name'].isin(amenities_selection)].drop_duplicates()
    states_selection_df = self.activities_df['park_code'][
        self.activities_df['park_states'].isin(states_selection)].drop_duplicates()
    parks_selection_df = self.activities_df['park_code'][
        self.activities_df['park_name'].isin(parks_selection)].drop_duplicates()
    data = activities_selection_df[activities_selection_df.isin(amenities_selection_df) &
                                   activities_selection_df.isin(states_selection_df) &
                                   activities_selection_df.isin(parks_selection_df)].drop_duplicates()
    pandas_select_df = pd.DataFrame(data, columns=['park_code'])
    results_df = pd.merge(pandas_select_df, self.activities_df, on='park_code', how='left')
    results_df = pd.merge(results_df, self.amenities_parks_df[['park_code', 'amenity_name', 'amenity_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.campgrounds_df[['park_code', 'campground_name', 'campground_url',
                                                           'campground_road', 'campground_classification',
                                                           'campground_general_ADA',
                                                           'campground_wheelchair_access',
                                                           'campground_rv_info', 'campground_description',
                                                           'campground_cell_reception', 'campground_camp_store',
                                                           'campground_internet', 'campground_potable_water',
                                                           'campground_toilets',
                                                           'campground_campsites_electric',
                                                           'campground_staff_volunteer']], on='park_code',
                          how='left')
    results_df = pd.merge(results_df, self.places_df[['park_code', 'places_title', 'places_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.parking_lot_df[
        ['park_code', "parking_lots_name", "parking_lots_ADA_facility_description",
         "parking_lots_is_lot_accessible", "parking_lots_number_oversized_spaces",
         "parking_lots_number_ADA_spaces",
         "parking_lots_number_ADA_Step_Free_Spaces", "parking_lots_number_ADA_van_spaces",
         "parking_lots_description"]], on='park_code', how='left')
    # print(self.campgrounds_df.to_string(max_rows=20))
    print(results_df.to_string(max_rows=40))
Any help will be appreciated.
Answer 1
Score: 1
In general, you can group by park_code and collect the other columns into lists, then convert the result to a dictionary:
df.groupby('park_code').agg({'camp_ground': list, 'parking_lot': list}).to_dict(orient='index')
Sample result:
{'acad': {'camp_ground': ['campground1', 'campground1', 'campground2'],
          'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3']},
 'bisc': {'camp_ground': ['campground3'], 'parking_lot': ['parking_lot4']}}
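Note that aggregating with list keeps repeated values (campground1 appears twice for acad), whereas the desired output in the question lists each campground once. A minimal sketch of one way to deduplicate while keeping first-seen order is below. The toy DataFrame just mirrors the question's example table, and unique_in_order is a hypothetical helper name introduced here for illustration, not something from the original answer.

```python
import pandas as pd

# Toy data mirroring the example table in the question.
df = pd.DataFrame({
    "park_code": ["acad", "acad", "acad", "bisc"],
    "camp_ground": ["campground1", "campground1", "campground2", "campground3"],
    "parking_lot": ["parking_lot1", "parking_lot2", "parking_lot3", "parking_lot4"],
})


def unique_in_order(values):
    """Hypothetical helper: distinct non-null values of a Series, in first-seen order."""
    # dict keys are unique and preserve insertion order, so this drops repeats
    # without sorting or otherwise reordering the values.
    return list(dict.fromkeys(values.dropna()))


result = (
    df.groupby("park_code")
      .agg({"camp_ground": unique_in_order, "parking_lot": unique_in_order})
      .to_dict(orient="index")
)
print(result)
```

With the toy data, acad ends up with two campgrounds and three parking lots. On the real 30-column frame the same dict passed to agg would presumably need one entry per column of interest; that part is an assumption about the asker's data rather than something tested here.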