2023年2月14日 04:45:12go评论74阅读模式

英文:

How to correctly pivot a dataframe so the values of the first column are my new columns?

问题

I have a file with some random census data, in essence multiple lines of the following:

age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary&gt;50K

I want to transform this into a csv that looks like this:

senior Self-emp-not-inc Bachelors ... &gt;50K

I created the following script that I was hoping would do what I want:

for i in range(df.shape[1]):
    temp_df = df.loc[i].str.split(" ", expand=True)
    temp_df = temp_df[0].str.split("=", expand=True)

    temp_df.columns = ['column_names', 'column_values']
    temp_df = temp_df.reset_index(drop=True)

    temp_df = temp_df.pivot(index=temp_df.index, columns='column_names', values='column_values')

The last line though is throwing an error, specifically:

KeyError: 0

How can I either fix my pivot or if this is not correct, what would be a better way to achieve what I want?

英文:

I have a file with some random census data, in essence multiple lines of the following:

age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary&gt;50K

I want to transform this into a csv that looks like this:

senior Self-emp-not-inc Bachelors ... &gt;50K

I created the following script that I was hoping would do what I want:

 for i in range(df.shape[1]):
    temp_df = df.loc[i].str.split(&quot; &quot;, expand=True)
    temp_df = temp_df[0].str.split(&quot;=&quot;, expand=True)    

    temp_df.columns = [&#39;column_names&#39;, &#39;column_values&#39;]
    temp_df = temp_df.reset_index(drop=True)

    temp_df = temp_df.pivot(index=temp_df.index, columns=&#39;column_names&#39;, values=&#39;column_values&#39;)

The last line though is throwing an error, specifically:

KeyError: 0

How can I either fix my pivot or if this is not correct, what would be a better way to achieve what I want?

答案1

得分: 1

Assuming your dataframe does indeed look like this.

print(df)
0 1 2 ... 11 12 13
0 age=senior workclass=Self-emp-not-inc education=Bachelors ... hours=half-time country=United-States salary>50K

you could stack, split and unstack to get the columns you need.

df1 = df.stack()
.str.split('=',expand=True)
.reset_index(1,drop=True)
.set_index(0,append=True)
.unstack(1)

print(df1)
0 age country edu_num education gain ... race relationship salary sex workclass
0 senior United-States 13 Bachelors high ... White Not-in-family 50K Male Self-emp-not-inc

英文:

Assuming your dataframe does indeed look like this.

print(df)
               0                           1                    2   ...               11                     12          13
0  age=senior  workclass=Self-emp-not-inc  education=Bachelors  ...  hours=half-time  country=United-States  salary&gt;50K

you could stack, split and unstack to get the columns you need.

df1 = df.stack()\
.str.split(&#39;=|&gt;|&lt;=|&lt;|&gt;=&#39;,expand=True)\
.reset_index(1,drop=True)\
.set_index(0,append=True)\
.unstack(1)

print(df1)
0     age        country edu_num  education  gain  ...   race   relationship salary   sex         workclass
0  senior  United-States      13  Bachelors  high  ...  White  Not-in-family    50K  Male  Self-emp-not-inc

答案2

得分: 0

创建一个处理分割的函数，返回一个列名:列值的字典，然后从这些字典的列表创建一个数据框。

def gen_dic_from_val(x):
    l = [re.split('=|>|<', v) for v in x.split(' ')]
    return {k[0]: k[1] for k in l}

pd.DataFrame(df.val.apply(lambda x: gen_dic_from_val(x)).to_list())

（这假设df是一个pd.DataFrame，其中包含在名为val的列中的长字符串值。）

英文:

Maybe just create a function that handles the split, returning a dictionary of column_name:column_value pairs, and then create a dataframe from that a list of those dictionaries.

def gen_dic_from_val(x):
    l = [re.split(&#39;=|&gt;|&lt;&#39;,v) for v in x.split(&#39; &#39;)]
    return {k[0]:k[1] for k in l}

pd.DataFrame(df.val.apply(lambda x: gen_dic_from_val(x)).to_list())

(This assumes df is a pd.DataFrame holding your long string values in a column named val)

答案3

得分: 0

给定一个带有ID列的数据框：

>>> df = pd.DataFrame({'id': [0], 's': 'age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary&gt;50K'})

从数据的列表版本创建一个新的数据框：

>>> df['sl'] = df['s'].str.split(' ', expand=False)
>>> df1 = df.explode('sl')

从第二次拆分中创建名称和值：

>>> df1[['n', 'v']] = df1['sl'].str.split('=', expand=True)
>>> df1[['id', 'n', 'v']]

然后将其适当地旋转：

>>> df1.pivot(index='id', columns='n', values='v')

由于每个观察值（在df.pivot中的index）都是唯一标识的，这在整个数据框上都有效。如果您还没有id列，可以在最开始通过df.reset_index().rename(columns={'index': 'id'})来创建一个。

英文:

Given a data frame with an ID column:

&gt;&gt;&gt; df = pd.DataFrame({&#39;id&#39;: [0], &#39;s&#39;: &#39;age=senior workclass=Self-emp-not-inc education=Bachelors edu_num=13 marital=Divorced occupation=Craft-repair relationship=Not-in-family race=White sex=Male gain=high loss=none hours=half-time country=United-States salary&gt;50K&#39;})

Create a new data frame from the list version of the data:

&gt;&gt;&gt; df[&#39;sl&#39;] = df[&#39;s&#39;].str.split(&#39; &#39;, expand=False)
&gt;&gt;&gt; df1 = df.explode(&#39;sl&#39;)
   id  ...                          sl
0   0  ...                  age=senior
0   0  ...  workclass=Self-emp-not-inc
0   0  ...         education=Bachelors
0   0  ...                  edu_num=13
0   0  ...            marital=Divorced
0   0  ...     occupation=Craft-repair
0   0  ...  relationship=Not-in-family
0   0  ...                  race=White
0   0  ...                    sex=Male
0   0  ...                   gain=high
0   0  ...                   loss=none
0   0  ...             hours=half-time
0   0  ...       country=United-States
0   0  ...                  salary&gt;50K

[14 rows x 3 columns]

Create your names and values from your second split.

&gt;&gt;&gt; df1[[&#39;n&#39;, &#39;v&#39;]] = df1[&#39;sl&#39;].str.split(&#39;=&#39;, expand=True)
&gt;&gt;&gt; df1[[&#39;id&#39;, &#39;n&#39;, &#39;v&#39;]]
   id             n                 v
0   0           age            senior
0   0     workclass  Self-emp-not-inc
0   0     education         Bachelors
0   0       edu_num                13
0   0       marital          Divorced
0   0    occupation      Craft-repair
0   0  relationship     Not-in-family
0   0          race             White
0   0           sex              Male
0   0          gain              high
0   0          loss              none
0   0         hours         half-time
0   0       country     United-States
0   0    salary&gt;50K              None

Then just pivot into place.

&gt;&gt;&gt; df1.pivot(index=&#39;id&#39;, columns=&#39;n&#39;, values=&#39;v&#39;)
n      age        country edu_num  ... salary&gt;50K   sex         workclass
id                                 ...                                   
0   senior  United-States      13  ...       None  Male  Self-emp-not-inc

Because each observation (the index in df.pivot) is uniquely identified, this works over the entire data frame as a whole. If you don't have an id column already, create one by df.reset_index().rename(columns={'index': 'id'}) at the very start.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何正确地将数据框进行旋转，使第一列的值成为我的新列？

问题

答案1

df1 = df.stack()
.str.split('=',expand=True)
.reset_index(1,drop=True)
.set_index(0,append=True)
.unstack(1)

答案2

答案3

如何创建一个具有多级索引的torch数据集？

pip install -r requirements.txt 在虚拟环境中不起作用

“Getting ‘Nonetype is not callable’ error but I don’t know why.”

如何修复 Python 正则表达式的 if 语句？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论

问题

答案1

df1 = df.stack() .str.split('=',expand=True) .reset_index(1,drop=True) .set_index(0,append=True) .unstack(1)

答案2

答案3

发表评论

df1 = df.stack()
.str.split('=',expand=True)
.reset_index(1,drop=True)
.set_index(0,append=True)
.unstack(1)