如何正确地将数据框进行旋转,使第一列的值成为我的新列?

huangapple go评论74阅读模式
英文:

How to correctly pivot a dataframe so the values of the first column are my new columns?

问题

I have a file with some random census data, in essence multiple lines of the following:

age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K

I want to transform this into a csv that looks like this:

senior Self-emp-not-inc Bachelors ... >50K

I created the following script that I was hoping would do what I want:

for i in range(df.shape[1]):
    temp_df = df.loc[i].str.split(" ", expand=True)
    temp_df = temp_df[0].str.split("=", expand=True)

    temp_df.columns = ['column_names', 'column_values']
    temp_df = temp_df.reset_index(drop=True)

    temp_df = temp_df.pivot(index=temp_df.index, columns='column_names', values='column_values')

The last line though is throwing an error, specifically:

KeyError: 0

How can I either fix my pivot or if this is not correct, what would be a better way to achieve what I want?

英文:

I have a file with some random census data, in essence multiple lines of the following:

age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary>50K

I want to transform this into a csv that looks like this:

senior Self-emp-not-inc Bachelors ... >50K

I created the following script that I was hoping would do what I want:

 for i in range(df.shape[1]):
    temp_df = df.loc[i].str.split(" ", expand=True)
    temp_df = temp_df[0].str.split("=", expand=True)    

    temp_df.columns = ['column_names', 'column_values']
    temp_df = temp_df.reset_index(drop=True)

    temp_df = temp_df.pivot(index=temp_df.index, columns='column_names', values='column_values')

The last line though is throwing an error, specifically:

KeyError: 0

How can I either fix my pivot or if this is not correct, what would be a better way to achieve what I want?

答案1

得分: 1

Assuming your dataframe does indeed look like this.

print(df)
0 1 2 ... 11 12 13
0 age=senior workclass=Self-emp-not-inc education=Bachelors ... hours=half-time country=United-States salary>50K

you could stack, split and unstack to get the columns you need.

df1 = df.stack()
.str.split('=',expand=True)
.reset_index(1,drop=True)
.set_index(0,append=True)
.unstack(1)

print(df1)
0 age country edu_num education gain ... race relationship salary sex workclass
0 senior United-States 13 Bachelors high ... White Not-in-family 50K Male Self-emp-not-inc

英文:

Assuming your dataframe does indeed look like this.

print(df)
               0                           1                    2   ...               11                     12          13
0  age=senior  workclass=Self-emp-not-inc  education=Bachelors  ...  hours=half-time  country=United-States  salary>50K

you could stack, split and unstack to get the columns you need.

df1 = df.stack()\
.str.split('=|>|<=|<|>=',expand=True)\
.reset_index(1,drop=True)\
.set_index(0,append=True)\
.unstack(1)

print(df1)
0     age        country edu_num  education  gain  ...   race   relationship salary   sex         workclass
0  senior  United-States      13  Bachelors  high  ...  White  Not-in-family    50K  Male  Self-emp-not-inc

答案2

得分: 0

创建一个处理分割的函数,返回一个列名:列值的字典,然后从这些字典的列表创建一个数据框。

def gen_dic_from_val(x):
    l = [re.split('=|>|<', v) for v in x.split(' ')]
    return {k[0]: k[1] for k in l}

pd.DataFrame(df.val.apply(lambda x: gen_dic_from_val(x)).to_list())

(这假设df是一个pd.DataFrame,其中包含在名为val的列中的长字符串值。)

英文:

Maybe just create a function that handles the split, returning a dictionary of column_name:column_value pairs, and then create a dataframe from that a list of those dictionaries.

def gen_dic_from_val(x):
    l = [re.split(&#39;=|&gt;|&lt;&#39;,v) for v in x.split(&#39; &#39;)]
    return {k[0]:k[1] for k in l}

pd.DataFrame(df.val.apply(lambda x: gen_dic_from_val(x)).to_list())

(This assumes df is a pd.DataFrame holding your long string values in a column named val)

答案3

得分: 0

给定一个带有ID列的数据框:

>>> df = pd.DataFrame({'id': [0], 's': 'age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary&gt;50K'})

从数据的列表版本创建一个新的数据框:

>>> df['sl'] = df['s'].str.split(' ', expand=False)
>>> df1 = df.explode('sl')

从第二次拆分中创建名称和值:

>>> df1[['n', 'v']] = df1['sl'].str.split('=', expand=True)
>>> df1[['id', 'n', 'v']]

然后将其适当地旋转:

>>> df1.pivot(index='id', columns='n', values='v')

由于每个观察值(在df.pivot中的index)都是唯一标识的,这在整个数据框上都有效。如果您还没有id列,可以在最开始通过df.reset_index().rename(columns={'index': 'id'})来创建一个。

英文:

Given a data frame with an ID column:

&gt;&gt;&gt; df = pd.DataFrame({&#39;id&#39;: [0], &#39;s&#39;: &#39;age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary&gt;50K&#39;})

Create a new data frame from the list version of the data:

&gt;&gt;&gt; df[&#39;sl&#39;] = df[&#39;s&#39;].str.split(&#39; &#39;, expand=False)
&gt;&gt;&gt; df1 = df.explode(&#39;sl&#39;)
   id  ...                          sl
0   0  ...                  age=senior
0   0  ...  workclass=Self-emp-not-inc
0   0  ...         education=Bachelors
0   0  ...                  edu_num=13
0   0  ...            marital=Divorced
0   0  ...     occupation=Craft-repair
0   0  ...  relationship=Not-in-family
0   0  ...                  race=White
0   0  ...                    sex=Male
0   0  ...                   gain=high
0   0  ...                   loss=none
0   0  ...             hours=half-time
0   0  ...       country=United-States
0   0  ...                  salary&gt;50K

[14 rows x 3 columns]

Create your names and values from your second split.

&gt;&gt;&gt; df1[[&#39;n&#39;, &#39;v&#39;]] = df1[&#39;sl&#39;].str.split(&#39;=&#39;, expand=True)
&gt;&gt;&gt; df1[[&#39;id&#39;, &#39;n&#39;, &#39;v&#39;]]
   id             n                 v
0   0           age            senior
0   0     workclass  Self-emp-not-inc
0   0     education         Bachelors
0   0       edu_num                13
0   0       marital          Divorced
0   0    occupation      Craft-repair
0   0  relationship     Not-in-family
0   0          race             White
0   0           sex              Male
0   0          gain              high
0   0          loss              none
0   0         hours         half-time
0   0       country     United-States
0   0    salary&gt;50K              None

Then just pivot into place.

&gt;&gt;&gt; df1.pivot(index=&#39;id&#39;, columns=&#39;n&#39;, values=&#39;v&#39;)
n      age        country edu_num  ... salary&gt;50K   sex         workclass
id                                 ...                                   
0   senior  United-States      13  ...       None  Male  Self-emp-not-inc

Because each observation (the index in df.pivot) is uniquely identified, this works over the entire data frame as a whole. If you don't have an id column already, create one by df.reset_index().rename(columns={&#39;index&#39;: &#39;id&#39;}) at the very start.

huangapple
  • 本文由 发表于 2023年2月14日 04:45:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/75441003.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定