Pandas DataFrame: Converting a Column of Strings into a Column of Lists

Question

I have a DataFrame with a column that looks like this:

    print(df.WIN_COUNTRY_CODE[180:200])

    WIN_COUNTRY_CODE
    180 IT
    181 IT
    182 ES
    183 DE---UK---UK---UK---UK
    184 UK---UK---UK---UK
    185 DE---UK---UK---UK
    186 UK---UK---DE---UK---UK
    187 SI
    188 UK
    189 FR

Each cell of the column contains one or more country codes per record.
I would like to convert the codes from 2-letter to 3-letter ISO codes and also count how often each country appears, so I applied the following code:

### 1. I split each string on the three-dash separator to convert it from a string to a list

    df['WIN_COUNTRY_CODE_2'] = df['WIN_COUNTRY_CODE'].str.split("---")

This makes the column look like this:

    print(df.WIN_COUNTRY_CODE[180:200])

    WIN_COUNTRY_CODE
    180 ['IT']
    181 ['IT']
    182 ['ES']
    183 ['DE', 'UK', 'UK', 'UK', 'UK']
    184 ['UK', 'UK', 'UK', 'UK']
    185 ['DE', 'UK', 'UK', 'UK']
    186 ['UK', 'UK', 'DE', 'UK', 'UK']
    187 ['SI']
    188 ['UK']
    189 ['FR']

### 2. I map the 2-letter codes to 3-letter codes using a conversion table (cattable) turned into a dictionary (catdict)

    catdict = dict([(iso2, iso3) for iso2, iso3 in zip(cattable['iso_2_codes'], cattable['iso_3_codes'])])
    df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])

However, whenever I apply the mapping, I get this error:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-13-df7aad8ca868> in <module>
          1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
          2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
    ----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

    <ipython-input-13-df7aad8ca868> in <listcomp>(.0)
          1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
          2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
    ----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

    TypeError: 'float' object is not iterable
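
Editor's note: the message says that one of the rows being iterated is a float. In pandas a missing cell is usually NaN, which is a float, and `.str.split` leaves it as a missing value, so missing cells are one plausible cause. This is an assumption, since the asker's data is not shown. The sketch below uses a made-up conversion table and sample rows, and a hypothetical helper `map_codes`, to show a mapping that skips non-list cells and unknown codes.

```python
import pandas as pd

# Hypothetical stand-in for the asker's conversion table; the codes are illustrative.
cattable = pd.DataFrame({
    "iso_2_codes": ["IT", "ES", "DE", "UK"],
    "iso_3_codes": ["ITA", "ESP", "DEU", "GBR"],
})
catdict = dict(zip(cattable["iso_2_codes"], cattable["iso_3_codes"]))

# Made-up sample with one missing cell and one code absent from the table.
df = pd.DataFrame({"WIN_COUNTRY_CODE": ["IT", "DE---UK---UK", None, "XX---ES"]})
df["WIN_COUNTRY_CODE_2"] = df["WIN_COUNTRY_CODE"].str.split("---")

def map_codes(codes):
    # A missing cell is not a list after .str.split, so return an empty list for it.
    if not isinstance(codes, list):
        return []
    # Skip codes that have no entry in the conversion table.
    return [catdict[code] for code in codes if code in catdict]

df["mapped"] = df["WIN_COUNTRY_CODE_2"].apply(map_codes)
print(df["mapped"].tolist())
```

Returning an empty list for missing cells is a design choice; keeping them as missing values would work as well if later steps expect that.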

It seems likely that the code raises the error because the entries in the WIN_COUNTRY_CODE column are still strings rather than lists of strings. I concluded this after inspecting an element of the list with this code:

    df.WIN_COUNTRY_CODE_2[183][0]

It always returns a single character instead of the 2-letter code as a string:

    '['

whereas I expected it to return 'DE'.
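
Editor's note: getting `'['` as the first element would be consistent with each cell holding the text of a list (for example `"['DE', 'UK']"`) rather than an actual list, which can happen when a list column is saved to a file and read back. That is only a guess about the asker's data. Under that assumption, a minimal sketch of turning such text back into lists with the standard-library `ast.literal_eval` (the sample values and the helper `parse_cell` are made up):

```python
import ast
import pandas as pd

# Assumed situation: the cells are strings that merely look like lists.
s = pd.Series(["['DE', 'UK', 'UK']", "['IT']", None], name="WIN_COUNTRY_CODE_2")
first_char = s[0][0]  # indexing a string yields its first character

def parse_cell(cell):
    # Only parse strings; treat missing values as an empty list.
    return ast.literal_eval(cell) if isinstance(cell, str) else []

parsed = s.apply(parse_cell)
print(repr(first_char))
print(parsed.tolist())
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of text.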

### Question

How can I convert the WIN_COUNTRY_CODE column from a column of strings into a column of lists? And how can I find the most frequent country in the entire column? Thank you.

Answer 1

Score: 1

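    # Work on a copy so the original DataFrame is left unchanged
    df1 = df.copy()
    # Split each string on the three-dash separator into a list of codes
    df1["WIN_COUNTRY_CODE"] = df['WIN_COUNTRY_CODE'].str.split('---')
    # Most frequent code within each row
    df1["Max_code"] = df1["WIN_COUNTRY_CODE"].apply(lambda x: max(set(x), key=x.count))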
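
Editor's note: `Max_code` holds the most frequent code within each row. For the most frequent country across the entire column, one option is to flatten the lists with `Series.explode` and count with `value_counts`. A minimal sketch on made-up sample rows, assuming missing cells should simply be ignored:

```python
import pandas as pd

# Made-up sample rows in the same format as the question.
df = pd.DataFrame({"WIN_COUNTRY_CODE": [
    "IT", "IT", "ES", "DE---UK---UK---UK---UK", "UK---UK---UK---UK", "SI",
]})

counts = (
    df["WIN_COUNTRY_CODE"]
    .dropna()           # ignore missing cells
    .str.split("---")   # each cell becomes a list of codes
    .explode()          # one row per code
    .value_counts()     # occurrences per code, most frequent first
)
most_frequent = counts.idxmax()
print(counts)
print(most_frequent)
```

This counts every occurrence, so a code repeated within one record is counted each time. To count records instead, the lists could be de-duplicated before `explode`.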


Answer 2

Score: 0

This might help.

    df['new_WIN_COUNTRY_CODE'] = df['WIN_COUNTRY_CODE'].map(lambda x: x.split("---") if "---" in x else [x])
    print(df)
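
Editor's note: the lambda applies `in` and `.split` to every cell, so it would fail on a missing value, because NaN is a float. A minimal sketch that tolerates such cells, assuming they should become empty lists (the sample data and the helper `split_codes` are made up):

```python
import pandas as pd

# Made-up sample with one missing cell.
df = pd.DataFrame({"WIN_COUNTRY_CODE": ["IT", "DE---UK", float("nan")]})

def split_codes(value):
    # str.split already returns a one-element list when the separator is absent,
    # so no separate check for "---" is needed.
    return value.split("---") if isinstance(value, str) else []

df["new_WIN_COUNTRY_CODE"] = df["WIN_COUNTRY_CODE"].map(split_codes)
print(df["new_WIN_COUNTRY_CODE"].tolist())
```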

Posted by huangapple on January 3, 2020 at 21:30:13
Link: https://go.coder-hub.com/59579466.html