2020年1月3日 21:30:13go评论121阅读模式

英文:

Pandas DataFrame: Converting Column of String into Column of Lists

问题

我目前有一个包含以下列的数据框：

print(df.WIN_COUNTRY_CODE[180:200])

       WIN_COUNTRY_CODE

180 IT
181 IT
182 ES
183 DE---UK---UK---UK---UK
184 UK---UK---UK---UK
185 DE---UK---UK---UK
186 UK---UK---DE---UK---UK
187 SI
188 UK
189 FR


该列的每个单元格包含国家代码，每个记录可以有多个国家代码。
由于我想将国家代码从2字母转换为3字母ISO代码，并计算该国家的出现频率，我应用了以下代码：
###1. 我通过3个短横线分隔字符串，将国家代码转换为列表：###

df['WIN_COUNTRY_CODE_2'] = df['WIN_COUNTRY_CODE'].str.split("---")

这将导致该列如下所示：

print(df.WIN_COUNTRY_CODE[180:200])

       WIN_COUNTRY_CODE

180 ['IT']
181 ['IT']
182 ['ES']
183 ['DE', 'UK', 'UK', 'UK', 'UK']
184 ['UK', 'UK', 'UK', 'UK']
185 ['DE', 'UK', 'UK', 'UK']
186 ['UK', 'UK', 'DE', 'UK', 'UK']
187 ['SI']
188 ['UK']
189 ['FR']


###2. 我应用映射方法，从转换表（cattable）将2字母转换为3字母国家代码，并将其转换为字典类型（catdict）###

catdict= dict([(iso2,iso3) for iso2,iso3 in zip(cattable['iso_2_codes'], cattable['iso_3_codes'])])
df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])

然而，每当我应用映射时，它总是返回以下语句：

TypeError Traceback (most recent call last)
<ipython-input-13-df7aad8ca868> in <module>
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

<ipython-input-13-df7aad8ca868> in <listcomp>(.0)
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])

TypeError: 'float' object is not iterable

&lt;br /&gt;
似乎代码返回错误，因为WIN_COUNTRY_CODE列中的条目仍然处于字符串格式，而不是字符串列表。通过此代码检查列表中的对象后，我了解到：

df.WIN_COUNTRY_CODE_2[183][0]

它总是返回一个字符，而不是预期的2字母代码作为字符串对象。

'['

而我期望该代码返回&#39;DE&#39;。
&lt;br/ &gt;
&lt;br/ &gt;
###问题：###
如何将```WIN_COUNTRY_CODE```列从列表列转换为列表列？如何找到整个列中出现最频繁的国家？谢谢。

英文:

I currently have a dataframe which contains several columns like this below:

print(df.WIN_COUNTRY_CODE[180:200])
           WIN_COUNTRY_CODE
180                        IT
181                        IT
182                        ES
183    DE---UK---UK---UK---UK
184         UK---UK---UK---UK
185         DE---UK---UK---UK
186    UK---UK---DE---UK---UK
187                        SI
188                        UK
189                        FR

Each cells of the column contain country codes, which can be more than one for each record.
Since I would like to convert the country code from 2-letter into 3-letter iso code and also calculate the appearance frequency for this country, i apply this code:

###1. I split the string by the 3-dash that separates the countrycodes to convert from string to list:###

df[&#39;WIN_COUNTRY_CODE_2&#39;] = df[&#39;WIN_COUNTRY_CODE&#39;].str.split(&quot;---&quot;)

This results in the column to be like this:

print(df.WIN_COUNTRY_CODE[180:200])
           WIN_COUNTRY_CODE
180                            [&#39;IT&#39;]
181                            [&#39;IT&#39;]
182                            [&#39;ES&#39;]
183    [&#39;DE&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;]
184          [&#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;]
185          [&#39;DE&#39;, &#39;UK&#39;, &#39;UK&#39;, &#39;UK&#39;]
186    [&#39;UK&#39;, &#39;UK&#39;, &#39;DE&#39;, &#39;UK&#39;, &#39;UK&#39;]
187                            [&#39;SI&#39;]
188                            [&#39;UK&#39;]
189                            [&#39;FR&#39;]

###2. I apply the mapping method to convert from 2-letter to 3-letter country codes from conversion table that (cattable) and make it a dictionary type (catdict)###

catdict= dict([(iso2,iso3) for iso2,iso3 in zip(cattable[&#39;iso_2_codes&#39;], cattable[&#39;iso_3_codes&#39;])])
df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])

However whenever I apply the mapping it always return me this statement:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
&lt;ipython-input-13-df7aad8ca868&gt; in &lt;module&gt;
      1 cattable = pd.ExcelFile(&#39;D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx&#39;).parse(&#39;WIN_COUNTRY_CODE&#39;)
      2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable[&#39;WIN_COUNTRY_CODE&#39;], cattable[&#39;Description&#39;])])
----&gt; 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])
&lt;ipython-input-13-df7aad8ca868&gt; in &lt;listcomp&gt;(.0)
      1 cattable = pd.ExcelFile(&#39;D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx&#39;).parse(&#39;WIN_COUNTRY_CODE&#39;)
      2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable[&#39;WIN_COUNTRY_CODE&#39;], cattable[&#39;Description&#39;])])
----&gt; 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])
TypeError: &#39;float&#39; object is not iterable

<br />
It seems likely that the code returns an error as the entries in the WIN_COUNTRY_CODE column are still in a string format, instead of a list of strings. This I learn after inspecting the objects within the list by this code:

df.WIN_COUNTRY_CODE_2[183][0]

it always return one character instead of the 2-letter code as a string-object.

&#39;[&#39;

whereas I expect the code to return a 'DE' object.
<br/ >
<br/ >

###Question:###
How to convert the WIN_COUNTRY_CODE column from a column of list into a column of list? And how can I find the most frequent country in the entire column? Thank you.

答案1

得分: 1

The code you provided appears to be in Python and involves DataFrame manipulation. Here's the translated code part:

df1 = df.copy()
df1["WIN_COUNTRY_CODE"] = df['WIN_COUNTRY_CODE'].str.split('---')
df1["Max_code"] = df1["WIN_COUNTRY_CODE"].apply(lambda x: max(set(x), key=x.count))

The provided image link appears to be related to the code output, but I cannot view or describe the image content. If you have any specific questions about the code or need further assistance, please feel free to ask.

英文:

df1=df.copy()
df1[&quot;WIN_COUNTRY_CODE&quot;]=df[&#39;WIN_COUNTRY_CODE&#39;].str.split(&#39;---&#39;)
df1[&quot;Max_code&quot;]=df1[&quot;WIN_COUNTRY_CODE&quot;].apply(lambda x: max(set(x), key = x.count))

output

答案2

得分: 0

df['new_WIN_COUNTRY_CODE'] = df['WIN_COUNTRY_CODE'].map(lambda x: x.split("---") if "---" in x else [x])

print(df)

英文:

This might help.

df[&#39;new_WIN_COUNTRY_CODE&#39;]=df[&#39;WIN_COUNTRY_CODE&#39;].map(lambda x: x.split(&quot;---&quot;) if &quot;---&quot; in x else [x])
print(df)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Pandas DataFrame：将字符串列转换为列表列

问题

答案1

output

答案2

请问有人可以帮助我理解这个冒号在这里是做什么的吗？

在一个NumPy数组中找到最接近网格的元素的索引。

Value Error (At least one stride in the given numpy array is negative) Remains After Copying Array

如何在Django中添加活动类？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。