Extracting a two-letter state abbreviation from a string column into a new column

Question

I need to extract the two-letter state abbreviation from the column full_address:

       id  position  name                                            score  \
    0  1   19        PJ Fresh (224 Daniel Payne Drive)               NaN
    1  2   9         J' ti`'z Smoothie-N-Coffee Bar                  NaN
    2  3   6         Philly Fresh Cheesesteaks (541-B Graymont Ave)  NaN
    3  4   17        Papa Murphy's (1580 Montgomery Highway)         NaN
    4  5   162       Nelson Brothers Cafe (17th St N)                4.7

       ratings  category                                          price_range  \
    0  NaN      Burgers, American, Sandwiches                     $
    1  NaN      Coffee and Tea, Breakfast and Brunch, Bubble Tea  NaN
    2  NaN      American, Cheesesteak, Sandwiches, Alcohol        $
    3  NaN      Pizza                                             $
    4  22.0     Breakfast and Brunch, Burgers, Sandwiches         NaN

       full_address                                       zip_code  lat  \
    0  224 Daniel Payne Drive, Birmingham, AL, 35207      35207     33.562365
    1  1521 Pinson Valley Parkway, Birmingham, AL, 35217  35217     33.583640
    2  541-B Graymont Ave, Birmingham, AL, 35204          35204     33.509800
    3  1580 Montgomery Highway, Hoover, AL, 35226         35226     33.404439
    4  314 17th St N, Birmingham, AL, 35203               35203     33.514730

This is what I tried:

    states = ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
              'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
              'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
              'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
              'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']

    df['state'] = df['full_address'].apply(lambda x: x if x in states else 'N/A')

But I get N/A for every row in the column state:

       state
    0  N/A
    1  N/A
    2  N/A
    3  N/A
    4  N/A
    5  N/A
    6  N/A
    7  N/A
    8  N/A
    9  N/A

How do I get the correct values of the state abbreviation in the column state?

I'm aiming for:

       state
    0  AL
    1  AL
    2  AL
    3  AL
    4  AL
    5  AL

Answer 1

Score: 0

You could just do this, no need for regex:

    df["State"] = df["full_address"].str.split(", ").str[-2]

Or this:

    df["full_address"].str.extract(r"(?<=, )(\w{2})(?=, )")
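
For context, the attempt in the question returns N/A everywhere because `x in states` tests whether the entire address string is an element of the list, which is never true. The following is a minimal sketch, not a definitive implementation, that combines the split approach with a check against the `states` list from the question. The third sample row is made up to show what happens with malformed input, and the column name `state` follows the question rather than this answer.

```python
import pandas as pd

# Valid abbreviations, copied from the question.
states = ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
          'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
          'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
          'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
          'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']

# Two addresses taken from the question, plus one hypothetical malformed row.
df = pd.DataFrame({
    "full_address": [
        "224 Daniel Payne Drive, Birmingham, AL, 35207",
        "1580 Montgomery Highway, Hoover, AL, 35226",
        "no commas here",  # made-up bad input, not from the question
    ]
})

# Take the second-to-last comma-separated field and trim stray whitespace.
# Rows with too few fields yield NaN here.
candidate = df["full_address"].str.split(",").str[-2].str.strip()

# Keep the value only if it is a known abbreviation; otherwise fall back to "N/A".
df["state"] = candidate.where(candidate.isin(states), "N/A")

print(df["state"].tolist())  # ['AL', 'AL', 'N/A']
```

Validating against the list guards against addresses whose second-to-last field is not a state, at the cost of one extra `isin` pass.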

Answer 2

Score: 0

Given that the full address looks like free-form text, I would steer clear of parsing it, because irregular inputs are sure to cause some errors.

I would use something like https://pypi.org/project/uszipcode/ and leverage the zip code field to determine the state abbreviation instead.
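
One way this idea might look in practice is sketched below. The helper `add_state_from_zip` and the `sample_lookup` stand-in are hypothetical names introduced for illustration. The commented-out `uszipcode` lines assume its `SearchEngine().by_zipcode(...)` call and the `state` attribute on the result, and they are not exercised here. The sketch also assumes `zip_code` holds integers or strings without missing values.

```python
import pandas as pd


def add_state_from_zip(df, lookup, zip_col="zip_code"):
    """Return a copy of df with a `state` column derived from the ZIP code.

    `lookup` is any callable that maps a 5-digit ZIP string to a state
    abbreviation, or to None when the ZIP is unknown.
    """
    # Normalise to 5-character strings so leading zeros survive (e.g. 02134).
    zips = df[zip_col].astype(str).str.zfill(5)
    return df.assign(state=zips.map(lookup))


# With uszipcode installed, a lookup could plausibly be built like this
# (assumption, untested in this sketch):
#   from uszipcode import SearchEngine
#   search = SearchEngine()
#   lookup = lambda z: getattr(search.by_zipcode(z), "state", None)

# Stand-in lookup for demonstration only; both pairs come from the
# question's sample rows.
sample_lookup = {"35207": "AL", "35226": "AL"}.get

# 99999 is a made-up value that the stand-in lookup does not know.
df = pd.DataFrame({"zip_code": [35207, 35226, 99999]})
result = add_state_from_zip(df, sample_lookup)
print(result["state"].tolist())
```

Passing the lookup in as a parameter keeps the DataFrame logic independent of whichever ZIP database is used, so the same helper works with a real search engine or a small test mapping.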

huangapple
  • Posted on July 17, 2023, 17:58:22
  • Article link: https://go.coder-hub.com/76703329.html