Pandas DataFrame:将字符串列转换为列表列

huangapple go评论89阅读模式
英文:

Pandas DataFrame: Converting Column of String into Column of Lists

问题

我目前有一个包含以下列的数据框:

print(df.WIN_COUNTRY_CODE[180:200])

       WIN_COUNTRY_CODE

180 IT
181 IT
182 ES
183 DE---UK---UK---UK---UK
184 UK---UK---UK---UK
185 DE---UK---UK---UK
186 UK---UK---DE---UK---UK
187 SI
188 UK
189 FR


该列的每个单元格包含国家代码,每个记录可以有多个国家代码。
由于我想将国家代码从2字母转换为3字母ISO代码,并计算该国家的出现频率,我应用了以下代码:


###1. 我通过3个短横线分隔字符串,将国家代码转换为列表:###

df['WIN_COUNTRY_CODE_2'] = df['WIN_COUNTRY_CODE'].str.split("---")

这将导致该列如下所示:

print(df.WIN_COUNTRY_CODE[180:200])

       WIN_COUNTRY_CODE

180 ['IT']
181 ['IT']
182 ['ES']
183 ['DE', 'UK', 'UK', 'UK', 'UK']
184 ['UK', 'UK', 'UK', 'UK']
185 ['DE', 'UK', 'UK', 'UK']
186 ['UK', 'UK', 'DE', 'UK', 'UK']
187 ['SI']
188 ['UK']
189 ['FR']


###2. 我应用映射方法,从转换表(cattable)将2字母转换为3字母国家代码,并将其转换为字典类型(catdict)###

catdict= dict([(iso2,iso3) for iso2,iso3 in zip(cattable['iso_2_codes'], cattable['iso_3_codes'])])
df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])

然而,每当我应用映射时,它总是返回以下语句:

TypeError Traceback (most recent call last)
<ipython-input-13-df7aad8ca868> in <module>
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

<ipython-input-13-df7aad8ca868> in <listcomp>(.0)
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

TypeError: 'float' object is not iterable

&lt;br /&gt;
似乎代码返回错误,因为WIN_COUNTRY_CODE列中的条目仍然处于字符串格式,而不是字符串列表。通过此代码检查列表中的对象后,我了解到:

df.WIN_COUNTRY_CODE_2[183][0]

它总是返回一个字符,而不是预期的2字母代码作为字符串对象。

'['

而我期望该代码返回&#39;DE&#39;。
&lt;br/ &gt;
&lt;br/ &gt;


###问题:###
如何将```WIN_COUNTRY_CODE```列从列表列转换为列表列?如何找到整个列中出现最频繁的国家?谢谢。

英文:

I currently have a dataframe which contains several columns like this below:

print(df.WIN_COUNTRY_CODE[180:200])

           WIN_COUNTRY_CODE
180                        IT
181                        IT
182                        ES
183    DE---UK---UK---UK---UK
184         UK---UK---UK---UK
185         DE---UK---UK---UK
186    UK---UK---DE---UK---UK
187                        SI
188                        UK
189                        FR

Each cells of the column contain country codes, which can be more than one for each record.
Since I would like to convert the country code from 2-letter into 3-letter iso code and also calculate the appearance frequency for this country, i apply this code:

###1. I split the string by the 3-dash that separates the countrycodes to convert from string to list:###

df[&#39;WIN_COUNTRY_CODE_2&#39;] = df[&#39;WIN_COUNTRY_CODE&#39;].str.split(&quot;---&quot;)

This results in the column to be like this:

print(df.WIN_COUNTRY_CODE[180:200])

           WIN_COUNTRY_CODE
180                            [&#39;IT&#39;]
181                            [&#39;IT&#39;]
182                            [&#39;ES&#39;]
183    [&#39;DE&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;]
184          [&#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;]
185          [&#39;DE&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;]
186    [&#39;UK&#39;, &#39;UK&#39;, &#39;DE&#39;, &#39;UK&#39;, &#39;UK&#39;]
187                            [&#39;SI&#39;]
188                            [&#39;UK&#39;]
189                            [&#39;FR&#39;]

###2. I apply the mapping method to convert from 2-letter to 3-letter country codes from conversion table that (cattable) and make it a dictionary type (catdict)###

catdict= dict([(iso2,iso3) for iso2,iso3 in zip(cattable[&#39;iso_2_codes&#39;], cattable[&#39;iso_3_codes&#39;])])
df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])

However whenever I apply the mapping it always return me this statement:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
&lt;ipython-input-13-df7aad8ca868&gt; in &lt;module&gt;
      1 cattable = pd.ExcelFile(&#39;D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx&#39;).parse(&#39;WIN_COUNTRY_CODE&#39;)
      2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable[&#39;WIN_COUNTRY_CODE&#39;], cattable[&#39;Description&#39;])])
----&gt; 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

&lt;ipython-input-13-df7aad8ca868&gt; in &lt;listcomp&gt;(.0)
      1 cattable = pd.ExcelFile(&#39;D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx&#39;).parse(&#39;WIN_COUNTRY_CODE&#39;)
      2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable[&#39;WIN_COUNTRY_CODE&#39;], cattable[&#39;Description&#39;])])
----&gt; 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

TypeError: &#39;float&#39; object is not iterable

<br />
It seems likely that the code returns an error as the entries in the WIN_COUNTRY_CODE column are still in a string format, instead of a list of strings. This I learn after inspecting the objects within the list by this code:

df.WIN_COUNTRY_CODE_2[183][0]

it always return one character instead of the 2-letter code as a string-object.

&#39;[&#39;

whereas I expect the code to return a 'DE' object.
<br/ >
<br/ >

###Question:###
How to convert the WIN_COUNTRY_CODE column from a column of list into a column of list? And how can I find the most frequent country in the entire column? Thank you.

答案1

得分: 1

The code you provided appears to be in Python and involves DataFrame manipulation. Here's the translated code part:

df1 = df.copy()
df1["WIN_COUNTRY_CODE"] = df['WIN_COUNTRY_CODE'].str.split('---')
df1["Max_code"] = df1["WIN_COUNTRY_CODE"].apply(lambda x: max(set(x), key=x.count))

The provided image link appears to be related to the code output, but I cannot view or describe the image content. If you have any specific questions about the code or need further assistance, please feel free to ask.

英文:
df1=df.copy()
df1[&quot;WIN_COUNTRY_CODE&quot;]=df[&#39;WIN_COUNTRY_CODE&#39;].str.split(&#39;---&#39;)
df1[&quot;Max_code&quot;]=df1[&quot;WIN_COUNTRY_CODE&quot;].apply(lambda x: max(set(x), key = x.count))

output

Pandas DataFrame:将字符串列转换为列表列

答案2

得分: 0

df['new_WIN_COUNTRY_CODE'] = df['WIN_COUNTRY_CODE'].map(lambda x: x.split("---") if "---" in x else [x])

print(df)

英文:

This might help.

df[&#39;new_WIN_COUNTRY_CODE&#39;]=df[&#39;WIN_COUNTRY_CODE&#39;].map(lambda x: x.split(&quot;---&quot;) if &quot;---&quot; in x else [x])

print(df)

huangapple
  • 本文由 发表于 2020年1月3日 21:30:13
  • 转载请务必保留本文链接:https://go.coder-hub.com/59579466.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定