2023-07-17 17:58:22

Extracting a two-letter state abbreviation from a string column into a new column

Question


I need to extract the two-letter state abbreviation from the column full_address:

 id  position                                            name  score  \
0   1        19               PJ Fresh (224 Daniel Payne Drive)    NaN   
1   2         9                  J' ti`'z Smoothie-N-Coffee Bar    NaN   
2   3         6  Philly Fresh Cheesesteaks (541-B Graymont Ave)    NaN   
3   4        17         Papa Murphy's (1580 Montgomery Highway)    NaN   
4   5       162                Nelson Brothers Cafe (17th St N)    4.7   
   ratings                                          category price_range  \
0      NaN                     Burgers, American, Sandwiches           $   
1      NaN  Coffee and Tea, Breakfast and Brunch, Bubble Tea         NaN   
2      NaN        American, Cheesesteak, Sandwiches, Alcohol           $   
3      NaN                                             Pizza           $   
4     22.0         Breakfast and Brunch, Burgers, Sandwiches         NaN   
                                        full_address zip_code        lat  \
0      224 Daniel Payne Drive, Birmingham, AL, 35207    35207  33.562365   
1  1521 Pinson Valley Parkway, Birmingham, AL, 35217    35217  33.583640   
2          541-B Graymont Ave, Birmingham, AL, 35204    35204  33.509800   
3         1580 Montgomery Highway, Hoover, AL, 35226    35226  33.404439   
4               314 17th St N, Birmingham, AL, 35203    35203  33.514730

states = ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
          'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
          'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
          'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
          'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']

df['state'] = df['full_address'].apply(lambda x: x if x in states else 'N/A')

But I get N/A in the column state:

  state
0   N/A
1   N/A
2   N/A
3   N/A
4   N/A
5   N/A
6   N/A
7   N/A
8   N/A
9   N/A

How do I get the correct values of the state abbreviation in the column state?

I'm aiming for:
  state
0    AL
1    AL
2    AL
3    AL
4    AL
5    AL

Answer 1

Score: 0


You could just do this, no need for regex:

df["State"] = df["full_address"].str.split(", ").str[-2]

Or this:

df["full_address"].str.extract(r"(?<=, )(\w{2})(?=, )")
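
For context, the lambda in the question tests whether the whole address string is an element of states, which is never true, so every row falls through to 'N/A'. A minimal sketch of a per-row function that keeps the question's apply-based shape is given here. It takes the second-to-last comma-separated field and checks it against the list from the question. The names extract_state and STATES are hypothetical, and the sketch assumes every address follows the "street, city, state, zip" layout shown in the sample rows.

```python
# Valid two-letter codes copied from the question (50 states plus DC).
STATES = {
    'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
    'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
    'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
    'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
    'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY',
}


def extract_state(address, valid=STATES, default="N/A"):
    """Return the second-to-last comma-separated field if it is a known code.

    Hypothetical helper: assumes a "street, city, state, zip" layout.
    """
    parts = [part.strip() for part in address.split(",")]
    if len(parts) >= 2 and parts[-2].upper() in valid:
        return parts[-2].upper()
    return default


# Two addresses from the question, plus one that does not fit the layout.
addresses = [
    "224 Daniel Payne Drive, Birmingham, AL, 35207",
    "1580 Montgomery Highway, Hoover, AL, 35226",
    "no state here",
]
states = [extract_state(a) for a in addresses]
print(states)  # ['AL', 'AL', 'N/A']
```

With a DataFrame like the one in the question, this could be applied as df["state"] = df["full_address"].apply(extract_state). Validating against the list costs little and flags rows whose layout differs, which a bare split would silently mislabel.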

Answer 2

Score: 0


Given that the full address looks like free-form text, I would steer clear of trying to parse it, because irregular inputs are sure to cause some errors.

I would use something like https://pypi.org/project/uszipcode/ and leverage the zip code field to determine the state abbreviation instead.
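
A rough sketch of the zip-based idea follows. The helper state_from_zip and the table SAMPLE_ZIP_TO_STATE are hypothetical. The table only holds the five zip codes visible in the question's sample rows and stands in for a real lookup. With uszipcode, the lookup argument might instead wrap its SearchEngine().by_zipcode(...) call and read the state attribute of the result, but that wiring is an assumption to check against the package documentation.

```python
# Stand-in lookup table built only from the sample rows in the question.
# A real lookup (for example one backed by uszipcode) would replace this.
SAMPLE_ZIP_TO_STATE = {
    "35207": "AL",
    "35217": "AL",
    "35204": "AL",
    "35226": "AL",
    "35203": "AL",
}


def state_from_zip(zip_code, lookup=SAMPLE_ZIP_TO_STATE.get, default="N/A"):
    """Normalize a zip code to five digits and map it to a state via `lookup`.

    `lookup` is any callable that takes a five-digit string and returns a
    state code, or None when the zip code is unknown.
    """
    # Keep the first five characters (drops a ZIP+4 suffix or a ".0" left by
    # a float conversion) and restore leading zeros lost by an int conversion.
    digits = str(zip_code).strip()[:5].zfill(5)
    if not digits.isdigit():
        return default
    return lookup(digits) or default


samples = ["35207", 35226, "35203-1234", "99999", "abc"]
print([state_from_zip(z) for z in samples])
# ['AL', 'AL', 'AL', 'N/A', 'N/A']
```

On the question's data this could be applied as df["state"] = df["zip_code"].apply(state_from_zip). Passing the lookup in as a parameter keeps the normalization step testable without installing or downloading a zip code database.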
