英文:
How to correctly pivot a dataframe so the values of the first column are my new columns?
问题
I have a file with some random census data, in essence multiple lines of the following:
age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K
I want to transform this into a csv that looks like this:
senior Self-emp-not-inc Bachelors ... >50K
I created the following script that I was hoping would do what I want:
for i in range(df.shape[1]):
temp_df = df.loc[i].str.split(" ", expand=True)
temp_df = temp_df[0].str.split("=", expand=True)
temp_df.columns = ['column_names', 'column_values']
temp_df = temp_df.reset_index(drop=True)
temp_df = temp_df.pivot(index=temp_df.index, columns='column_names', values='column_values')
The last line though is throwing an error, specifically:
KeyError: 0
How can I either fix my pivot
or if this is not correct, what would be a better way to achieve what I want?
英文:
I have a file with some random census data, in essence multiple lines of the following:
age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K
I want to transform this into a csv that looks like this:
senior Self-emp-not-inc Bachelors ... >50K
I created the following script that I was hoping would do what I want:
for i in range(df.shape[1]):
temp_df = df.loc[i].str.split(" ", expand=True)
temp_df = temp_df[0].str.split("=", expand=True)
temp_df.columns = ['column_names', 'column_values']
temp_df = temp_df.reset_index(drop=True)
temp_df = temp_df.pivot(index=temp_df.index, columns='column_names', values='column_values')
The last line though is throwing an error, specifically:
KeyError: 0
How can I either fix my pivot
or if this is not correct, what would be a better way to achieve what I want?
答案1
得分: 1
Assuming your dataframe does indeed look like this.
print(df)
0 1 2 ... 11 12 13
0 age=senior workclass=Self-emp-not-inc education=Bachelors ... hours=half-time country=United-States salary>50K
you could stack
, split
and unstack
to get the columns you need.
df1 = df.stack()
.str.split('=',expand=True)
.reset_index(1,drop=True)
.set_index(0,append=True)
.unstack(1)
print(df1)
0 age country edu_num education gain ... race relationship salary sex workclass
0 senior United-States 13 Bachelors high ... White Not-in-family 50K Male Self-emp-not-inc
英文:
Assuming your dataframe does indeed look like this.
print(df)
0 1 2 ... 11 12 13
0 age=senior workclass=Self-emp-not-inc education=Bachelors ... hours=half-time country=United-States salary>50K
you could stack
, split
and unstack
to get the columns you need.
df1 = df.stack()\
.str.split('=|>|<=|<|>=',expand=True)\
.reset_index(1,drop=True)\
.set_index(0,append=True)\
.unstack(1)
print(df1)
0 age country edu_num education gain ... race relationship salary sex workclass
0 senior United-States 13 Bachelors high ... White Not-in-family 50K Male Self-emp-not-inc
答案2
得分: 0
创建一个处理分割的函数,返回一个列名:列值的字典,然后从这些字典的列表创建一个数据框。
def gen_dic_from_val(x):
l = [re.split('=|>|<', v) for v in x.split(' ')]
return {k[0]: k[1] for k in l}
pd.DataFrame(df.val.apply(lambda x: gen_dic_from_val(x)).to_list())
(这假设df
是一个pd.DataFrame
,其中包含在名为val
的列中的长字符串值。)
英文:
Maybe just create a function that handles the split, returning a dictionary of column_name:column_value pairs, and then create a dataframe from that a list of those dictionaries.
def gen_dic_from_val(x):
l = [re.split('=|>|<',v) for v in x.split(' ')]
return {k[0]:k[1] for k in l}
pd.DataFrame(df.val.apply(lambda x: gen_dic_from_val(x)).to_list())
(This assumes df
is a pd.DataFrame
holding your long string values in a column named val
)
答案3
得分: 0
给定一个带有ID列的数据框:
>>> df = pd.DataFrame({'id': [0], 's': 'age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K'})
从数据的列表版本创建一个新的数据框:
>>> df['sl'] = df['s'].str.split(' ', expand=False)
>>> df1 = df.explode('sl')
从第二次拆分中创建名称和值:
>>> df1[['n', 'v']] = df1['sl'].str.split('=', expand=True)
>>> df1[['id', 'n', 'v']]
然后将其适当地旋转:
>>> df1.pivot(index='id', columns='n', values='v')
由于每个观察值(在df.pivot
中的index
)都是唯一标识的,这在整个数据框上都有效。如果您还没有id
列,可以在最开始通过df.reset_index().rename(columns={'index': 'id'})
来创建一个。
英文:
Given a data frame with an ID column:
>>> df = pd.DataFrame({'id': [0], 's': 'age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K'})
Create a new data frame from the list version of the data:
>>> df['sl'] = df['s'].str.split(' ', expand=False)
>>> df1 = df.explode('sl')
id ... sl
0 0 ... age=senior
0 0 ... workclass=Self-emp-not-inc
0 0 ... education=Bachelors
0 0 ... edu_num=13
0 0 ... marital=Divorced
0 0 ... occupation=Craft-repair
0 0 ... relationship=Not-in-family
0 0 ... race=White
0 0 ... sex=Male
0 0 ... gain=high
0 0 ... loss=none
0 0 ... hours=half-time
0 0 ... country=United-States
0 0 ... salary>50K
[14 rows x 3 columns]
Create your names and values from your second split.
>>> df1[['n', 'v']] = df1['sl'].str.split('=', expand=True)
>>> df1[['id', 'n', 'v']]
id n v
0 0 age senior
0 0 workclass Self-emp-not-inc
0 0 education Bachelors
0 0 edu_num 13
0 0 marital Divorced
0 0 occupation Craft-repair
0 0 relationship Not-in-family
0 0 race White
0 0 sex Male
0 0 gain high
0 0 loss none
0 0 hours half-time
0 0 country United-States
0 0 salary>50K None
Then just pivot into place.
>>> df1.pivot(index='id', columns='n', values='v')
n age country edu_num ... salary>50K sex workclass
id ...
0 senior United-States 13 ... None Male Self-emp-not-inc
Because each observation (the index
in df.pivot
) is uniquely identified, this works over the entire data frame as a whole. If you don't have an id
column already, create one by df.reset_index().rename(columns={'index': 'id'})
at the very start.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论