如何检查 pandas 数据框的列是否包含列表中的任何值,并返回该值。

huangapple go评论76阅读模式
英文:

How to check if a pandas data frame column contains any value from a list and return that value

问题

我有一个国家列表:

countries = ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda", "Argentina", "Armenia", "Austria", "Azerbaijan", "Bahrain", "Bangladesh", "Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia", "Bosnia and Herzegovina", "Botswana", "Brazil", "Brunei", "Bulgaria", "Burkina Faso", "Burundi", "Cabo Verde", "Cambodia", "Cameroon", "Canada", "Central African Republic", "Chad", "Channel Islands", "Chile", "China", "Colombia", "Comoros", "Congo", "Costa Rica", "Côte d'Ivoire", "Croatia", "Cuba", "Cyprus", "Czech Republic", "Denmark", "Djibouti", "Dominica", "Dominican Republic", "DR Congo", "Ecuador", "Egypt", "El Salvador", "Equatorial Guinea", "Eritrea", "Estonia", "Eswatini", "Ethiopia", "Faeroe Islands", "Finland", "France", "French Guiana", "Gabon", "Gambia", "Georgia", "Germany", "Ghana", "Gibraltar", "Greece", "Grenada", "Guatemala", "Guinea", "Guinea-Bissau", "Guyana", "Haiti", "Holy See", "Honduras", "Hong Kong", "Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq", "Ireland", "Isle of Man", "Israel", "Italy", "Jamaica", "Japan", "Jordan", "Kazakhstan", "Kenya", "Kuwait", "Kyrgyzstan", "Laos", "Latvia", "Lebanon", "Lesotho", "Liberia", "Libya", "Liechtenstein", "Lithuania", "Luxembourg", "Macao", "Madagascar", "Malawi", "Malaysia", "Maldives", "Mali", "Malta", "Mauritania", "Mauritius", "Mayotte", "Mexico", "Moldova", "Monaco", "Mongolia", "Montenegro", "Morocco", "Mozambique", "Myanmar", "Namibia", "Nepal", "Netherlands", "Nicaragua", "Niger", "Nigeria", "North Korea", "North Macedonia", "Norway", "Oman", "Pakistan", "Panama", "Paraguay", "Peru", "Philippines", "Poland", "Portugal", "Qatar", "Réunion", "Romania", "Russia", "Rwanda", "Saint Helena", "Saint Kitts and Nevis", "Saint Lucia", "Saint Vincent and the Grenadines", "San Marino", "Sao Tome & Principe", "Saudi Arabia", "Senegal", "Serbia", "Seychelles", "Sierra Leone", "Singapore", "Slovakia", "Slovenia", "Somalia", "South Africa", "South Korea", "South Sudan", "Spain", "Sri Lanka", "State of Palestine", "Sudan", "Suriname", "Sweden", "Switzerland", "Syria", "Taiwan", "Tajikistan", "Tanzania", "Thailand", "The Bahamas", "Timor-Leste", "Togo", "Trinidad and Tobago", "Tunisia", "Turkey", "Turkmenistan", "Uganda", "Ukraine", "United Arab Emirates", "United Kingdom", "United States", "Uruguay", "Uzbekistan", "Venezuela", "Vietnam", "Western Sahara", "Yemen", "Zambia", "Zimbabwe"]

我还有这个pandas数据帧:

import pandas as pd
import numpy as np

ds1 = {'remarks':["DOB 21 Mar 1974; POB Baghdad, Iraq.","DOB 26 Mar 1969; POB Tunis, Tunisia; Italian Fiscal Code TLLLHR69C26Z352G.","DOB 10 Jun 1970; POB Tunis, Tunisia; nationality Tunisia; Passport L550681 issued 23 Sep 1997 expires 22 Sep 2002; Italian Fiscal Code WDDHBB70H10Z352O."], "Latitude" : [-23.69057,-23.41165,-23.51482]}
df1 = pd.DataFrame(data=ds1)

数据帧看起来像这样:

print(df1)

我需要:

  • 检查列remarks是否包含在名为countries的列表中的任何国家
  • 如果是这样,创建一个新列(称为country),其中包含匹配的国家名称。

从上面的示例中,结果数据帧将如下所示:

英文:

I have list of countries:

countries = ["Afghanistan",	"Albania",	"Algeria",	"Andorra",	"Angola",	"Antigua and Barbuda",	"Argentina",	"Armenia",	"Austria",	"Azerbaijan",	"Bahrain",	"Bangladesh",	"Barbados",	"Belarus",	"Belgium",	"Belize",	"Benin",	"Bhutan",	"Bolivia",	"Bosnia and Herzegovina",	"Botswana",	"Brazil",	"Brunei",	"Bulgaria",	"Burkina Faso",	"Burundi",	"Cabo Verde",	"Cambodia",	"Cameroon",	"Canada",	"Central African Republic",	"Chad",	"Channel Islands",	"Chile",	"China",	"Colombia",	"Comoros",	"Congo",	"Costa Rica",	"Côte d'Ivoire",	"Croatia",	"Cuba",	"Cyprus",	"Czech Republic",	"Denmark",	"Djibouti",	"Dominica",	"Dominican Republic",	"DR Congo",	"Ecuador",	"Egypt",	"El Salvador",	"Equatorial Guinea",	"Eritrea",	"Estonia",	"Eswatini",	"Ethiopia",	"Faeroe Islands",	"Finland",	"France",	"French Guiana",	"Gabon",	"Gambia",	"Georgia",	"Germany",	"Ghana",	"Gibraltar",	"Greece",	"Grenada",	"Guatemala",	"Guinea",	"Guinea-Bissau",	"Guyana",	"Haiti",	"Holy See",	"Honduras",	"Hong Kong",	"Hungary",	"Iceland",	"India",	"Indonesia",	"Iran",	"Iraq",	"Ireland",	"Isle of Man",	"Israel",	"Italy",	"Jamaica",	"Japan",	"Jordan",	"Kazakhstan",	"Kenya",	"Kuwait",	"Kyrgyzstan",	"Laos",	"Latvia",	"Lebanon",	"Lesotho",	"Liberia",	"Libya",	"Liechtenstein",	"Lithuania",	"Luxembourg",	"Macao",	"Madagascar",	"Malawi",	"Malaysia",	"Maldives",	"Mali",	"Malta",	"Mauritania",	"Mauritius",	"Mayotte",	"Mexico",	"Moldova",	"Monaco",	"Mongolia",	"Montenegro",	"Morocco",	"Mozambique",	"Myanmar",	"Namibia",	"Nepal",	"Netherlands",	"Nicaragua",	"Niger",	"Nigeria",	"North Korea",	"North Macedonia",	"Norway",	"Oman",	"Pakistan",	"Panama",	"Paraguay",	"Peru",	"Philippines",	"Poland",	"Portugal",	"Qatar",	"Réunion",	"Romania",	"Russia",	"Rwanda",	"Saint Helena",	"Saint Kitts and Nevis",	"Saint Lucia",	"Saint Vincent and the Grenadines",	"San Marino",	"Sao Tome & Principe",	"Saudi Arabia",	"Senegal",	"Serbia",	"Seychelles",	"Sierra Leone",	"Singapore",	"Slovakia",	"Slovenia",	"Somalia",	"South Africa",	"South Korea",	"South Sudan",	"Spain",	"Sri Lanka",	"State of Palestine",	"Sudan",	"Suriname",	"Sweden",	"Switzerland",	"Syria",	"Taiwan",	"Tajikistan",	"Tanzania",	"Thailand",	"The Bahamas",	"Timor-Leste",	"Togo",	"Trinidad and Tobago",	"Tunisia",	"Turkey",	"Turkmenistan",	"Uganda",	"Ukraine",	"United Arab Emirates",	"United Kingdom",	"United States",	"Uruguay",	"Uzbekistan",	"Venezuela",	"Vietnam",	"Western Sahara",	"Yemen",	"Zambia",	"Zimbabwe"]

I also have this pandas dataframe:

import pandas as pd
import numpy as np

ds1 = {'remarks':["DOB 21 Mar 1974; POB Baghdad, Iraq.","DOB 26 Mar 1969; POB Tunis, Tunisia; Italian Fiscal Code TLLLHR69C26Z352G.","DOB 10 Jun 1970; POB Tunis, Tunisia; nationality Tunisia; Passport L550681 issued 23 Sep 1997 expires 22 Sep 2002; Italian Fiscal Code WDDHBB70H10Z352O."], "Latitude" : [-23.69057,-23.41165,-23.51482]}
df1 = pd.DataFrame(data=ds1)

The dataframe looks like this:

print(df1)

如何检查 pandas 数据框的列是否包含列表中的任何值,并返回该值。

I need to:

  • check whether the column remarks contains any of the countries included in the list called countries
  • if so,create a new column (called country) which contains the name of the matched country.

From the example above, the resulting dataframe would look like this:

如何检查 pandas 数据框的列是否包含列表中的任何值,并返回该值。

Can anyone help me please?

答案1

得分: 1

str.extract() 的威力在处理这类问题时表现得非常出色。

df1['country'] = df1['remarks'].str.extract("(\(" + "|".join(countries) + "\))", expand=False)
print(df1)
                                                 remarks  Latitude  country
0                DOB 21 Mar 1974; POB Baghdad, Iraq. -23.69057     Iraq
1  DOB 26 Mar 1969; POB Tunis, FakeCountry; Itali... -23.41165      NaN #我在这里替换了国家以展示一个测试案例
2  DOB 10 Jun 1970; POB Tunis, Tunisia; nationali... -23.51482  Tunisia
英文:

The power of str.extract() really shines for problems like these

df1['country'] = df1['remarks'].str.extract(("(" + "|".join(countries) +")"), expand=False)
print(df1)

                                             remarks  Latitude  country
0                DOB 21 Mar 1974; POB Baghdad, Iraq. -23.69057     Iraq
1  DOB 26 Mar 1969; POB Tunis, FakeCountry; Itali... -23.41165      NaN #I replaced the Country here to show a test case
2  DOB 10 Jun 1970; POB Tunis, Tunisia; nationali... -23.51482  Tunisia

huangapple
  • 本文由 发表于 2023年6月29日 02:43:41
  • 转载请务必保留本文链接:https://go.coder-hub.com/76575902.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定