2023年1月9日 04:17:08go评论102阅读模式

英文:

how to transform dataframe into data set/object

问题

我有一个包含近900万行和30列的数据集。随着列数增加，数据变得更具体，导致前几列的数据非常重复。请参见示例：

公园编码	露营地	停车场
acad	露营地1	停车场1
acad	露营地1	停车场2
acad	露营地2	停车场3
bisc	露营地3	停车场4

我想将这些信息提取到一个结果集中，类似于一个对象，例如：

公园编码: acad
露营地: 露营地1, 露营地2
停车场: 停车场1, 停车场2, 停车场3

公园编码: bisc
露营地: 露营地3, ....
.......

等等。

我完全不知道如何使用pandas来做到这一点，因为我习惯于在SQL和数据库中工作，而不是用pandas。如果你想看到让我做到这一步的代码，这里是它：

函数调用：

data_handler.fetch_results(['Wildlife Watching', 'Arts and Culture'], ['Restroom'], ['Acadia National Park'], ['ME'])

def fetch_results(self, activities_selection, amenities_selection, parks_selection, states_selection):
    activities_selection_df = self.activities_df['park_code'][self.activities_df['activity_name'].
    isin(activities_selection)].drop_duplicates()
    amenities_selection_df = self.amenities_parks_df['park_code'][self.amenities_parks_df['amenity_name'].
    isin(amenities_selection)].drop_duplicates()
    states_selection_df = self.activities_df['park_code'][self.activities_df['park_states'].
    isin(states_selection)].drop_duplicates()
    parks_selection_df = self.activities_df['park_code'][self.activities_df['park_name'].
    isin(parks_selection)].drop_duplicates()
    data = activities_selection_df[activities_selection_df.isin(amenities_selection_df) &
                                   activities_selection_df.isin(states_selection_df) & activities_selection_df.
                                   isin(parks_selection_df)].drop_duplicates()
    pandas_select_df = pd.DataFrame(data, columns=['park_code'])
    results_df = pd.merge(pandas_select_df, self.activities_df, on='park_code', how='left')
    results_df = pd.merge(results_df, self.amenities_parks_df[['park_code', 'amenity_name', 'amenity_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.campgrounds_df[['park_code', 'campground_name', 'campground_url',
                                                           'campground_road', 'campground_classification',
                                                           'campground_general_ADA',
                                                           'campground_wheelchair_access',
                                                           'campground_rv_info', 'campground_description',
                                                           'campground_cell_reception', 'campground_camp_store',
                                                           'campground_internet', 'campground_potable_water',
                                                           'campground_toilets',
                                                           'campground_campsites_electric',
                                                           'campground_staff_volunteer']], on='park_code',
                            how='left')
    results_df = pd.merge(results_df, self.places_df[['park_code', 'places_title', 'places_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.parking_lot_df[
        ['park_code', "parking_lots_name", "parking_lots_ADA_facility_description",
         "parking_lots_is_lot_accessible", "parking_lots_number_oversized_spaces",
         "parking_lots_number_ADA_spaces",
         "parking_lots_number_ADA_Step_Free_Spaces", "parking_lots_number_ADA_van_spaces",
         "parking_lots_description"]], on='park_code', how='left')
    # print(self.campgrounds_df.to_string(max_rows=20))
    print(results_df.to_string(max_rows=40))

任何帮助将不胜感激。

英文:

I have a data set in a dataframe that's almost 9 million rows and 30 columns. As the columns count up, the data becomes more specific thus leading the data in the first columns to be very repetitive. See example:

park_code	camp_ground	parking_lot
acad	campground1	parking_lot1
acad	campground1	parking_lot2
acad	campground2	parking_lot3
bisc	campground3	parking_lot4

I'm looking to feed that information in to a result set like an object for example:

park code: acad 
campgrounds: campground 1, campground 2 
parking lots: parking_lot1, parking_lot2, parking_lot3 
 
park code: bisc 
campgrounds: campground3, .... 
....... 
 
etc.

I'm completely at a loss how to do this with pandas, and I'm learning as I go as I'm used to working in SQL and databases not with pandas. If you want to see the code that's gotten me this far, here it is:

function call:

data_handler.fetch_results([&#39;Wildlife Watching&#39;, &#39;Arts and Culture&#39;], [&#39;Restroom&#39;], [&#39;Acadia National 
Park&#39;], [&#39;ME&#39;])

def fetch_results(self, activities_selection, amenities_selection, parks_selection, states_selection):
activities_selection_df = self.activities_df[&#39;park_code&#39;][self.activities_df[&#39;activity_name&#39;].
isin(activities_selection)].drop_duplicates()
amenities_selection_df = self.amenities_parks_df[&#39;park_code&#39;][self.amenities_parks_df[&#39;amenity_name&#39;].
isin(amenities_selection)].drop_duplicates()
states_selection_df = self.activities_df[&#39;park_code&#39;][self.activities_df[&#39;park_states&#39;].
isin(states_selection)].drop_duplicates()
parks_selection_df = self.activities_df[&#39;park_code&#39;][self.activities_df[&#39;park_name&#39;].
isin(parks_selection)].drop_duplicates()
data = activities_selection_df[activities_selection_df.isin(amenities_selection_df) &amp;
activities_selection_df.isin(states_selection_df) &amp; activities_selection_df.
isin(parks_selection_df)].drop_duplicates()
pandas_select_df = pd.DataFrame(data, columns=[&#39;park_code&#39;])
results_df = pd.merge(pandas_select_df, self.activities_df, on=&#39;park_code&#39;, how=&#39;left&#39;)
results_df = pd.merge(results_df, self.amenities_parks_df[[&#39;park_code&#39;, &#39;amenity_name&#39;, &#39;amenity_url&#39;]],
on=&#39;park_code&#39;, how=&#39;left&#39;)
results_df = pd.merge(results_df, self.campgrounds_df[[&#39;park_code&#39;, &#39;campground_name&#39;, &#39;campground_url&#39;,
&#39;campground_road&#39;, &#39;campground_classification&#39;,
&#39;campground_general_ADA&#39;,
&#39;campground_wheelchair_access&#39;,
&#39;campground_rv_info&#39;, &#39;campground_description&#39;,
&#39;campground_cell_reception&#39;, &#39;campground_camp_store&#39;,
&#39;campground_internet&#39;, &#39;campground_potable_water&#39;,
&#39;campground_toilets&#39;,
&#39;campground_campsites_electric&#39;,
&#39;campground_staff_volunteer&#39;]], on=&#39;park_code&#39;,
how=&#39;left&#39;)
results_df = pd.merge(results_df, self.places_df[[&#39;park_code&#39;, &#39;places_title&#39;, &#39;places_url&#39;]],
on=&#39;park_code&#39;, how=&#39;left&#39;)
results_df = pd.merge(results_df, self.parking_lot_df[
[&#39;park_code&#39;, &quot;parking_lots_name&quot;, &quot;parking_lots_ADA_facility_description&quot;,
&quot;parking_lots_is_lot_accessible&quot;, &quot;parking_lots_number_oversized_spaces&quot;,
&quot;parking_lots_number_ADA_spaces&quot;,
&quot;parking_lots_number_ADA_Step_Free_Spaces&quot;, &quot;parking_lots_number_ADA_van_spaces&quot;,
&quot;parking_lots_description&quot;]], on=&#39;park_code&#39;, how=&#39;left&#39;)
# print(self.campgrounds_df.to_string(max_rows=20))
print(results_df.to_string(max_rows=40))

Any help will be appreciated.

答案1

得分: 1

通常，您可以按park_code分组，并将其他列收集到列表中，然后将其转换为字典：

df.groupby('park_code').agg({'camp_ground': list, 'parking_lot': list}).to_dict(orient='index')

示例结果：

{'acad': {'camp_ground': ['campground1', 'campground1', 'campground2'],
  'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3']},
 'bisc': {'camp_ground': ['campground3'], 'parking_lot': ['parking_lot4']}}

英文:

In general, you can group by park_code and collect other columns into lists, then - transform to a dictionary:

df.groupby(&#39;park_code&#39;).agg({&#39;camp_ground&#39;: list, &#39;parking_lot&#39;: list}).to_dict(orient=&#39;index&#39;)

Sample result:

{&#39;acad &#39;: {&#39;camp_ground&#39;: [&#39;campground1 &#39;, &#39;campground1 &#39;, &#39;campground2 &#39;],
&#39;parking_lot&#39;: [&#39;parking_lot1&#39;, &#39;parking_lot2&#39;, &#39;parking_lot3&#39;]},
&#39;bisc &#39;: {&#39;camp_ground&#39;: [&#39;campground3 &#39;], &#39;parking_lot&#39;: [&#39;parking_lot4&#39;]}}

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何将数据框转换为数据集/对象

问题

答案1

代码不总是返回最大和最小数之间不在列表中的最小数。

绕过Python中的交互模式检测

Save file with same label with different type in python.

在使用 pandas 进行分组后出现的计算问题。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。