英文:
Pandas DataFrame: Converting Column of String into Column of Lists
问题
我目前有一个包含以下列的数据框:
print(df.WIN_COUNTRY_CODE[180:200])
WIN_COUNTRY_CODE
180 IT
181 IT
182 ES
183 DE---UK---UK---UK---UK
184 UK---UK---UK---UK
185 DE---UK---UK---UK
186 UK---UK---DE---UK---UK
187 SI
188 UK
189 FR
该列的每个单元格包含国家代码,每个记录可以有多个国家代码。
由于我想将国家代码从2字母转换为3字母ISO代码,并计算该国家的出现频率,我应用了以下代码:
###1. 我通过3个短横线分隔字符串,将国家代码转换为列表:###
df['WIN_COUNTRY_CODE_2'] = df['WIN_COUNTRY_CODE'].str.split("---")
这将导致该列如下所示:
print(df.WIN_COUNTRY_CODE[180:200])
WIN_COUNTRY_CODE
180 ['IT']
181 ['IT']
182 ['ES']
183 ['DE', 'UK', 'UK', 'UK', 'UK']
184 ['UK', 'UK', 'UK', 'UK']
185 ['DE', 'UK', 'UK', 'UK']
186 ['UK', 'UK', 'DE', 'UK', 'UK']
187 ['SI']
188 ['UK']
189 ['FR']
###2. 我应用映射方法,从转换表(cattable)将2字母转换为3字母国家代码,并将其转换为字典类型(catdict)###
catdict= dict([(iso2,iso3) for iso2,iso3 in zip(cattable['iso_2_codes'], cattable['iso_3_codes'])])
df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])
然而,每当我应用映射时,它总是返回以下语句:
TypeError Traceback (most recent call last)
<ipython-input-13-df7aad8ca868> in <module>
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])
<ipython-input-13-df7aad8ca868> in <listcomp>(.0)
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])
TypeError: 'float' object is not iterable
<br />
似乎代码返回错误,因为WIN_COUNTRY_CODE列中的条目仍然处于字符串格式,而不是字符串列表。通过此代码检查列表中的对象后,我了解到:
df.WIN_COUNTRY_CODE_2[183][0]
它总是返回一个字符,而不是预期的2字母代码作为字符串对象。
'['
而我期望该代码返回'DE'。
<br/ >
<br/ >
###问题:###
如何将```WIN_COUNTRY_CODE```列从列表列转换为列表列?如何找到整个列中出现最频繁的国家?谢谢。
英文:
I currently have a dataframe which contains several columns like this below:
print(df.WIN_COUNTRY_CODE[180:200])
WIN_COUNTRY_CODE
180 IT
181 IT
182 ES
183 DE---UK---UK---UK---UK
184 UK---UK---UK---UK
185 DE---UK---UK---UK
186 UK---UK---DE---UK---UK
187 SI
188 UK
189 FR
Each cells of the column contain country codes, which can be more than one for each record.
Since I would like to convert the country code from 2-letter into 3-letter iso code and also calculate the appearance frequency for this country, i apply this code:
###1. I split the string by the 3-dash that separates the countrycodes to convert from string to list:###
df['WIN_COUNTRY_CODE_2'] = df['WIN_COUNTRY_CODE'].str.split("---")
This results in the column to be like this:
print(df.WIN_COUNTRY_CODE[180:200])
WIN_COUNTRY_CODE
180 ['IT']
181 ['IT']
182 ['ES']
183 ['DE', 'UK', 'UK', 'UK', 'UK']
184 ['UK', 'UK', 'UK', 'UK']
185 ['DE', 'UK', 'UK', 'UK']
186 ['UK', 'UK', 'DE', 'UK', 'UK']
187 ['SI']
188 ['UK']
189 ['FR']
###2. I apply the mapping method to convert from 2-letter to 3-letter country codes from conversion table that (cattable) and make it a dictionary type (catdict)###
catdict= dict([(iso2,iso3) for iso2,iso3 in zip(cattable['iso_2_codes'], cattable['iso_3_codes'])])
df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE_2])
However whenever I apply the mapping it always return me this statement:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-df7aad8ca868> in <module>
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])
<ipython-input-13-df7aad8ca868> in <listcomp>(.0)
1 cattable = pd.ExcelFile('D:/ROBERT LIBRARIES/Documents/ISD - LKPP Project/vardesc2.xlsx').parse('WIN_COUNTRY_CODE')
2 catdict= dict([(catnum,catdesc) for catnum,catdesc in zip(cattable['WIN_COUNTRY_CODE'], cattable['Description'])])
----> 3 df.assign(mapped=[[catdict[k] for k in row if catdict.get(k)] for row in df.WIN_COUNTRY_CODE])
TypeError: 'float' object is not iterable
<br />
It seems likely that the code returns an error as the entries in the WIN_COUNTRY_CODE column are still in a string format, instead of a list of strings. This I learn after inspecting the objects within the list by this code:
df.WIN_COUNTRY_CODE_2[183][0]
it always return one character instead of the 2-letter code as a string-object.
'['
whereas I expect the code to return a 'DE' object.
<br/ >
<br/ >
###Question:###
How to convert the WIN_COUNTRY_CODE
column from a column of list into a column of list? And how can I find the most frequent country in the entire column? Thank you.
答案1
得分: 1
The code you provided appears to be in Python and involves DataFrame manipulation. Here's the translated code part:
df1 = df.copy()
df1["WIN_COUNTRY_CODE"] = df['WIN_COUNTRY_CODE'].str.split('---')
df1["Max_code"] = df1["WIN_COUNTRY_CODE"].apply(lambda x: max(set(x), key=x.count))
The provided image link appears to be related to the code output, but I cannot view or describe the image content. If you have any specific questions about the code or need further assistance, please feel free to ask.
英文:
df1=df.copy()
df1["WIN_COUNTRY_CODE"]=df['WIN_COUNTRY_CODE'].str.split('---')
df1["Max_code"]=df1["WIN_COUNTRY_CODE"].apply(lambda x: max(set(x), key = x.count))
output
答案2
得分: 0
df['new_WIN_COUNTRY_CODE'] = df['WIN_COUNTRY_CODE'].map(lambda x: x.split("---") if "---" in x else [x])
print(df)
英文:
This might help.
df['new_WIN_COUNTRY_CODE']=df['WIN_COUNTRY_CODE'].map(lambda x: x.split("---") if "---" in x else [x])
print(df)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论