Read in data from web (xml format) but then need to separate fields.

Question

I am reading in a dataset from this link:

    url = "https://www.bis.doc.gov/dpl/dpl.txt"

This is how I read it in (reading it directly as a CSV gives a Forbidden error, hence the use of requests):

    import requests

    test_URL = url

    def get_data(link):
        # Send a browser-like User-Agent header with the request
        hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
        req = requests.get(link, headers=hdr)
        # Raw response body as bytes
        content = req.content
        return content

    data = get_data(test_URL)

The data I have read in look like this:

    print(data)
    b'"Name"\t"Street_Address"\t"City"\t"State"\t"Country"\t"Postal_Code"\t"Effective_Date"\t"Expiration_Date"\t"Standard_Order"\t"Last_Update"\t"Action"\t"FR_Citation"\n" I. ASH"\t"UPON THE DATE OF THE ORDER INCARCERATED AT USM NO: 26265-177, FCI SEAGOVILLE, 2113 NORTH HIGHWAY 175"\t"SEAGOVILLE"\t"TX"\t"US"\t"75159"\t"06/19/2003"\t"06/29/2056"\t"Y"\t"2007-01-31"\t"FEDERAL REGISTER NOTICE UPDATED"\t"68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72 F.R. 4236 1/30/07"\n"AARON ABRAHAM VILLA"\t"3415 RIVERA AVENUE"\t"EL PASO"\t"TX"\t""\t"79905"\t"08/24/2022"\t"01/14/2026"\t"Y"\t"2022-08-29"\t"F.R. NOTICE ADDED"\t"87 F.R. 52741 8/29/2022"\n"ABDIEL PADRON MADRID"\t"INMATE NUMBER: 42167-480, FCI LA TUNA FEDERAL CORRECTIONAL INSTITUTION, P.O. BOX 3000"\t"ANTHONY"\t"NM"\t""\t"88201"\t"02/10/2022"\t"06/17/2030"\t"Y"\t"2022-02-22"\t"F.R. NOTICE ADDED"\t"87 F.R. 9030 2/17/2022"\n"ABDUL MAJID SAIDI"\t"2948 PEASE DRIVE, APT. 201"\t"ROCKY RIVER"\t"OH"\t""\t"44116"\t"10/30/2020"\t"03/13/2026"\t"Y"\t"2020-11-05"\t"F.R. NOTICE ADDED"\t"85 F.R. 70581 11/5/2020"\n"ABDULAH AL NASSER"\t"605 TRAIL LAKE DRIVE"\t"RICHARDSON"\t"TX"\t"US"\t"75081"\t"03/04/2002"\t"06/29/2056"\t"Y"\t"2006-07-11"\t"50 YEAR DENIAL"\t"67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02 71 F.R. 38843 7/10/06"\n"ABDULAH AL NASSER"\t"908 AUDELIA ROAD, SUIE 200, PMB #245"\t"RICHARDSON"\t"TX"\t"US"\t"75081"\t"03/04/2002"\t"06/29/2056"\t"Y"\t"2006-07-11"\t"50 YEAR DENIAL"\t"67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02"\n"ABDULAMIR MAHDI"\t"20 HUNTINGWOOD DRIVE"\t"SCARBOROUGH, ONTARIO"\t""\t"CA"\t"M1W1A2"\t"10/03/2003"\t"10/03/2023"\t"N"\t"2003-10-06"\t"NON STANDARD DENIAL"\t"68 F.R. 57406 10/3/03"\n"ABDULLAH AL NASSER"\t"UPON THE DATE OF THE ORDER INCARCERATED AT USM NO: 26265-177, FCI SEAGOVILLE, 2113 NORTH HIGHWAY 175"\t"SEAGOVILLE"\t"TX"\t"US"\t"75159"\t"06/19/2003"\t"06/29/2056"\t"Y"\t"2007-01-31"\t"FEDERAL REGISTER NOTICE UPDATED"\t"68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72 F.R. 4236 1/30/07"\n"ABEL HERNANDEZ, JR."\t"120 SAINT JOHN DRIVE"\t"PHARR"\t"TX"\t""\t"78577"\t"04/30/2021"\t"08/29/2029"\t"Y"\t"2021-05-05"\t"F.R. NOTICE ADDED"\t"86 F.R. 23920 5/5/2021"\n"ABU AL-JUD"\t"INMATE NUMBER: 87450-083, FCI VICTORVILLE MEDIUM II FEDERAL CORRECTIONAL INSTITUTION, P.O. BOX 3850"\t"ADELANTO"\t"CA"\t""\t"92301"\t"03/31/2017"\t"06/13/2026"\t"Y"\t"2017-04-06"\t"FR NOTICE ADDED"\t"82 F.R. 16788, 16789 4/6/2017"\n"ADAM AL HERZ"\t"INMATE NUMBER: 13991-029, FMC ROCHESTER, P.O. BOX 4000"\t"ROCHESTER"\t"MN"\t""\t"55903"\t"08/13/2019"\t"10/13/2026"\t"Y"\t"2019-08-22"\t"FR NOTICE ADDED"\t"84 F.R. 43787 8/22/2019"\n"ADRIANA GABRIELA GUAJARDO-CAVAZOS"\t"CALLE MANUEL OTIZ #49, MATAMOROS, TAMAULIPAS"\t"MEXICO"\t""\t"MX"\t"87394"\t"05/08/2023"\t"11/12/2027"\t"Y"\t"2023-05-12"\t"ADDITION, F.R. NOTICE ADDED "\t"88 F.R. 30721 5/12/2023"\n"ADT ANALOG AND DIGITAL TECHNIK"\t"8019 NIEDERSEEON, HOUSE

Does anyone know how to convert the data above into a clean, well-formatted pandas DataFrame?

Answer 1

Score: 1

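You can use io.StringIO and pandas.read_csv. Decode the bytes to text, wrap the text in a buffer, and tell read_csv that the fields are tab-separated:

    import io
    import pandas as pd

    # data is the bytes object returned by get_data() in the question
    df = pd.read_csv(io.StringIO(data.decode('utf-8')), sep='\t')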
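To try the parsing step without hitting the server, here is a minimal sketch on a hand-made sample. The `sample` bytes are an assumption: they keep only five of the twelve columns and two records from the question's output, in the same tab-separated, double-quoted shape. The sketch also shows `io.BytesIO` as an alternative that skips the explicit decode; both routes should give the same frame.

```python
import io

import pandas as pd

# Abbreviated, hand-made sample in the same shape as the question's bytes:
# tab-separated, every field double-quoted, first line is the header.
sample = (
    b'"Name"\t"City"\t"State"\t"Country"\t"Postal_Code"\n'
    b'"AARON ABRAHAM VILLA"\t"EL PASO"\t"TX"\t""\t"79905"\n'
    b'"ABDULAMIR MAHDI"\t"SCARBOROUGH, ONTARIO"\t""\t"CA"\t"M1W1A2"\n'
)

# Route 1: decode to text first, then wrap the text in a text buffer.
df_text = pd.read_csv(io.StringIO(sample.decode("utf-8")), sep="\t")

# Route 2: hand pandas the raw bytes through a binary buffer.
df_bytes = pd.read_csv(io.BytesIO(sample), sep="\t")

print(df_text.equals(df_bytes))
print(df_text)
```

Empty quoted fields such as `""` come back as NaN, which matches the NaN entries in the Country column of the output further down.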

Note that you can also skip requests and pass the HTTP headers through read_csv's storage_options (for HTTP(S) URLs, pandas forwards these key-value pairs to urllib.request.Request as header options):

    import pandas as pd

    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
    url = "https://www.bis.doc.gov/dpl/dpl.txt"
    df = pd.read_csv(url, sep='\t', storage_options=hdr)

Output:

    Name Street_Address City State Country Postal_Code Effective_Date Expiration_Date Standard_Order Last_Update \
    0 I. ASH UPON THE DATE OF THE ORDER INCARCERATED AT USM... SEAGOVILLE TX US 75159 06/19/2003 06/29/2056 Y 2007-01-31
    1 AARON ABRAHAM VILLA 3415 RIVERA AVENUE EL PASO TX NaN 79905 08/24/2022 01/14/2026 Y 2022-08-29
    2 ABDIEL PADRON MADRID INMATE NUMBER: 42167-480, FCI LA TUNA FEDERAL ... ANTHONY NM NaN 88201 02/10/2022 06/17/2030 Y 2022-02-22
    3 ABDUL MAJID SAIDI 2948 PEASE DRIVE, APT. 201 ROCKY RIVER OH NaN 44116 10/30/2020 03/13/2026 Y 2020-11-05
    4 ABDULAH AL NASSER 605 TRAIL LAKE DRIVE RICHARDSON TX US 75081 03/04/2002 06/29/2056 Y 2006-07-11
    .. ... ... ... ... ... ... ... ... ... ...
    652 YURI I. MONTGOMERY 2912 10TH PLACE WEST SEATTLE WA US 98119 12/21/2010 12/21/2040 Y 2011-01-04
    653 ZHIFU LIN INMATE NUMBER: 08295-087, CI MOSHANNON VALLEY,... PHILIPSBURG PA US 16866 12/22/2014 11/15/2023 Y 2014-12-22
    654 ZHONGDA JIN 1895 DOBBIN DRIVE, SUITE B SAN JOSE CA US 95133 07/31/2001 07/31/2026 Y 2001-08-01
    655 ZIMO SHENG 3975 N. CRAMER STREET, UNIT 204 MILWAUKEE WI NaN 53211 03/16/2020 12/13/2028 Y 2020-03-20
    656 ZIMO SHENG JINXIUYUAN 17-403, CHANGSHU JIANGSU NaN CN 215500 03/16/2020 12/13/2028 Y 2020-03-20

    Action FR_Citation
    0 FEDERAL REGISTER NOTICE UPDATED 68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72...
    1 F.R. NOTICE ADDED 87 F.R. 52741 8/29/2022
    2 F.R. NOTICE ADDED 87 F.R. 9030 2/17/2022
    3 F.R. NOTICE ADDED 85 F.R. 70581 11/5/2020
    4 50 YEAR DENIAL 67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02 71 ...
    .. ... ...
    652 NEW & FR NOTICE 75 F.R. 82464 12/30/10
    653 FR NOTICE ADDED 79 F.R. 78394 12/30/14
    654 NEW 66 F.R. 40971 8/6/01
    655 FR NOTICE ADDED 85 F.R. 16054 3/20/2020
    656 FR NOTICE ADDED 85 F.R. 16054 3/20/2020

    [657 rows x 12 columns]
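If "well formatted" also means sensible column types, one possible follow-up is to read every field as text and then convert the date columns explicitly. This is only a sketch on a hand-made sample: the column names and the two date formats (`MM/DD/YYYY` and `YYYY-MM-DD`) are copied from the rows shown in the question, and it is an assumption that every row of the full file follows them.

```python
import io

import pandas as pd

# Hand-made sample with a subset of the columns from the question.
sample = (
    b'"Name"\t"Postal_Code"\t"Effective_Date"\t"Expiration_Date"\t"Last_Update"\n'
    b'"AARON ABRAHAM VILLA"\t"79905"\t"08/24/2022"\t"01/14/2026"\t"2022-08-29"\n'
    b'"ABDIEL PADRON MADRID"\t"88201"\t"02/10/2022"\t"06/17/2030"\t"2022-02-22"\n'
)

# dtype=str keeps every field as text, so an all-numeric postal code
# column is not turned into integers.
df = pd.read_csv(io.BytesIO(sample), sep="\t", dtype=str)

# Convert the date columns explicitly (formats assumed from the sample rows).
for col in ("Effective_Date", "Expiration_Date"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")
df["Last_Update"] = pd.to_datetime(df["Last_Update"], format="%Y-%m-%d")

print(df.dtypes)
```

Passing an explicit `format` makes a row that deviates from the assumed layout raise an error instead of being parsed silently in some other way.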

huangapple
  • Posted on June 27, 2023, 20:12:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76564739.html