Read in data from web (xml format) but then need to separate fields
Question
I am reading in a dataset from this link:
url = "https://www.bis.doc.gov/dpl/dpl.txt"
This is how I read it in. Reading it directly as a CSV gives a Forbidden error, hence the use of requests:
import requests

test_URL = url

def get_data(link):
    # Send a browser-like user agent; the server rejects the default one.
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
    req = requests.get(link, headers=hdr)
    content = req.content  # raw response body as bytes
    return content

data = get_data(test_URL)
The data I have read in look like this:
print(data)
b'"Name"\t"Street_Address"\t"City"\t"State"\t"Country"\t"Postal_Code"\t"Effective_Date"\t"Expiration_Date"\t"Standard_Order"\t"Last_Update"\t"Action"\t"FR_Citation"\n" I. ASH"\t"UPON THE DATE OF THE ORDER INCARCERATED AT USM NO: 26265-177, FCI SEAGOVILLE, 2113 NORTH HIGHWAY 175"\t"SEAGOVILLE"\t"TX"\t"US"\t"75159"\t"06/19/2003"\t"06/29/2056"\t"Y"\t"2007-01-31"\t"FEDERAL REGISTER NOTICE UPDATED"\t"68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72 F.R. 4236 1/30/07"\n"AARON ABRAHAM VILLA"\t"3415 RIVERA AVENUE"\t"EL PASO"\t"TX"\t""\t"79905"\t"08/24/2022"\t"01/14/2026"\t"Y"\t"2022-08-29"\t"F.R. NOTICE ADDED"\t"87 F.R. 52741 8/29/2022"\n"ABDIEL PADRON MADRID"\t"INMATE NUMBER: 42167-480, FCI LA TUNA FEDERAL CORRECTIONAL INSTITUTION, P.O. BOX 3000"\t"ANTHONY"\t"NM"\t""\t"88201"\t"02/10/2022"\t"06/17/2030"\t"Y"\t"2022-02-22"\t"F.R. NOTICE ADDED"\t"87 F.R. 9030 2/17/2022"\n"ABDUL MAJID SAIDI"\t"2948 PEASE DRIVE, APT. 201"\t"ROCKY RIVER"\t"OH"\t""\t"44116"\t"10/30/2020"\t"03/13/2026"\t"Y"\t"2020-11-05"\t"F.R. NOTICE ADDED"\t"85 F.R. 70581 11/5/2020"\n"ABDULAH AL NASSER"\t"605 TRAIL LAKE DRIVE"\t"RICHARDSON"\t"TX"\t"US"\t"75081"\t"03/04/2002"\t"06/29/2056"\t"Y"\t"2006-07-11"\t"50 YEAR DENIAL"\t"67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02 71 F.R. 38843 7/10/06"\n"ABDULAH AL NASSER"\t"908 AUDELIA ROAD, SUIE 200, PMB #245"\t"RICHARDSON"\t"TX"\t"US"\t"75081"\t"03/04/2002"\t"06/29/2056"\t"Y"\t"2006-07-11"\t"50 YEAR DENIAL"\t"67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02"\n"ABDULAMIR MAHDI"\t"20 HUNTINGWOOD DRIVE"\t"SCARBOROUGH, ONTARIO"\t""\t"CA"\t"M1W1A2"\t"10/03/2003"\t"10/03/2023"\t"N"\t"2003-10-06"\t"NON STANDARD DENIAL"\t"68 F.R. 57406 10/3/03"\n"ABDULLAH AL NASSER"\t"UPON THE DATE OF THE ORDER INCARCERATED AT USM NO: 26265-177, FCI SEAGOVILLE, 2113 NORTH HIGHWAY 175"\t"SEAGOVILLE"\t"TX"\t"US"\t"75159"\t"06/19/2003"\t"06/29/2056"\t"Y"\t"2007-01-31"\t"FEDERAL REGISTER NOTICE UPDATED"\t"68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72 F.R. 
4236 1/30/07"\n"ABEL HERNANDEZ, JR."\t"120 SAINT JOHN DRIVE"\t"PHARR"\t"TX"\t""\t"78577"\t"04/30/2021"\t"08/29/2029"\t"Y"\t"2021-05-05"\t"F.R. NOTICE ADDED"\t"86 F.R. 23920 5/5/2021"\n"ABU AL-JUD"\t"INMATE NUMBER: 87450-083, FCI VICTORVILLE MEDIUM II FEDERAL CORRECTIONAL INSTITUTION, P.O. BOX 3850"\t"ADELANTO"\t"CA"\t""\t"92301"\t"03/31/2017"\t"06/13/2026"\t"Y"\t"2017-04-06"\t"FR NOTICE ADDED"\t"82 F.R. 16788, 16789 4/6/2017"\n"ADAM AL HERZ"\t"INMATE NUMBER: 13991-029, FMC ROCHESTER, P.O. BOX 4000"\t"ROCHESTER"\t"MN"\t""\t"55903"\t"08/13/2019"\t"10/13/2026"\t"Y"\t"2019-08-22"\t"FR NOTICE ADDED"\t"84 F.R. 43787 8/22/2019"\n"ADRIANA GABRIELA GUAJARDO-CAVAZOS"\t"CALLE MANUEL OTIZ #49, MATAMOROS, TAMAULIPAS"\t"MEXICO"\t""\t"MX"\t"87394"\t"05/08/2023"\t"11/12/2027"\t"Y"\t"2023-05-12"\t"ADDITION, F.R. NOTICE ADDED "\t"88 F.R. 30721 5/12/2023"\n"ADT ANALOG AND DIGITAL TECHNIK"\t"8019 NIEDERSEEON, HOUSE
Does anyone know how to convert the data above into a nice looking, well formatted pandas dataframe?
Answer 1
Score: 1
You can use io.StringIO and pandas.read_csv:
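import io
import pandas as pd

# data is the bytes object returned by get_data(); the fields are tab-separated
df = pd.read_csv(io.StringIO(data.decode('utf-8')), sep='\t')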
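To try the same call without touching the network, here is a minimal sketch that parses two records copied from the printed response in the question. The names sample and df_sample are placeholders, and the dtype=str argument and the date formats are assumptions read off those sample rows, not something the original answer uses.

```python
import io

import pandas as pd

# Header plus two records copied from the printed response in the question.
sample = (
    b'"Name"\t"Street_Address"\t"City"\t"State"\t"Country"\t"Postal_Code"'
    b'\t"Effective_Date"\t"Expiration_Date"\t"Standard_Order"\t"Last_Update"'
    b'\t"Action"\t"FR_Citation"\n'
    b'"AARON ABRAHAM VILLA"\t"3415 RIVERA AVENUE"\t"EL PASO"\t"TX"\t""\t"79905"'
    b'\t"08/24/2022"\t"01/14/2026"\t"Y"\t"2022-08-29"\t"F.R. NOTICE ADDED"'
    b'\t"87 F.R. 52741 8/29/2022"\n'
    b'"ABDULAMIR MAHDI"\t"20 HUNTINGWOOD DRIVE"\t"SCARBOROUGH, ONTARIO"\t""\t"CA"'
    b'\t"M1W1A2"\t"10/03/2003"\t"10/03/2023"\t"N"\t"2003-10-06"'
    b'\t"NON STANDARD DENIAL"\t"68 F.R. 57406 10/3/03"\n'
)

# dtype=str keeps postal codes as text; empty fields are still read as NaN.
df_sample = pd.read_csv(io.StringIO(sample.decode("utf-8")), sep="\t", dtype=str)

# Convert the date columns explicitly (formats assumed from the sample rows).
for col in ("Effective_Date", "Expiration_Date"):
    df_sample[col] = pd.to_datetime(df_sample[col], format="%m/%d/%Y")
df_sample["Last_Update"] = pd.to_datetime(df_sample["Last_Update"], format="%Y-%m-%d")

print(df_sample.dtypes)
```

Reading everything as text first and converting columns one by one makes it obvious which columns were changed; whether that is preferable to letting read_csv infer the types depends on how the frame is used afterwards.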
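Note that you can also skip requests and pass the headers through read_csv's storage_options. For HTTP(S) URLs, pandas forwards these key-value pairs to urllib.request.Request as header options: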
import pandas as pd

hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
url = "https://www.bis.doc.gov/dpl/dpl.txt"
# storage_options carries the HTTP headers, so no separate download step is needed
df = pd.read_csv(url, sep='\t', storage_options=hdr)
Output:
Name Street_Address City State Country Postal_Code Effective_Date Expiration_Date Standard_Order Last_Update \
0 I. ASH UPON THE DATE OF THE ORDER INCARCERATED AT USM... SEAGOVILLE TX US 75159 06/19/2003 06/29/2056 Y 2007-01-31
1 AARON ABRAHAM VILLA 3415 RIVERA AVENUE EL PASO TX NaN 79905 08/24/2022 01/14/2026 Y 2022-08-29
2 ABDIEL PADRON MADRID INMATE NUMBER: 42167-480, FCI LA TUNA FEDERAL ... ANTHONY NM NaN 88201 02/10/2022 06/17/2030 Y 2022-02-22
3 ABDUL MAJID SAIDI 2948 PEASE DRIVE, APT. 201 ROCKY RIVER OH NaN 44116 10/30/2020 03/13/2026 Y 2020-11-05
4 ABDULAH AL NASSER 605 TRAIL LAKE DRIVE RICHARDSON TX US 75081 03/04/2002 06/29/2056 Y 2006-07-11
.. ... ... ... ... ... ... ... ... ... ...
652 YURI I. MONTGOMERY 2912 10TH PLACE WEST SEATTLE WA US 98119 12/21/2010 12/21/2040 Y 2011-01-04
653 ZHIFU LIN INMATE NUMBER: 08295-087, CI MOSHANNON VALLEY,... PHILIPSBURG PA US 16866 12/22/2014 11/15/2023 Y 2014-12-22
654 ZHONGDA JIN 1895 DOBBIN DRIVE, SUITE B SAN JOSE CA US 95133 07/31/2001 07/31/2026 Y 2001-08-01
655 ZIMO SHENG 3975 N. CRAMER STREET, UNIT 204 MILWAUKEE WI NaN 53211 03/16/2020 12/13/2028 Y 2020-03-20
656 ZIMO SHENG JINXIUYUAN 17-403, CHANGSHU JIANGSU NaN CN 215500 03/16/2020 12/13/2028 Y 2020-03-20
Action FR_Citation
0 FEDERAL REGISTER NOTICE UPDATED 68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72...
1 F.R. NOTICE ADDED 87 F.R. 52741 8/29/2022
2 F.R. NOTICE ADDED 87 F.R. 9030 2/17/2022
3 F.R. NOTICE ADDED 85 F.R. 70581 11/5/2020
4 50 YEAR DENIAL 67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02 71 ...
.. ... ...
652 NEW & FR NOTICE 75 F.R. 82464 12/30/10
653 FR NOTICE ADDED 79 F.R. 78394 12/30/14
654 NEW 66 F.R. 40971 8/6/01
655 FR NOTICE ADDED 85 F.R. 16054 3/20/2020
656 FR NOTICE ADDED 85 F.R. 16054 3/20/2020
[657 rows x 12 columns]
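For readers who want to see what handing headers to the request means in the storage_options variant, the sketch below does the equivalent by hand with the standard library. It is an illustration under assumptions, not pandas' internal code: build_request and read_tsv are hypothetical helper names, and the shortened user agent is a placeholder that the server may or may not accept.

```python
import io
import urllib.request

import pandas as pd

URL = "https://www.bis.doc.gov/dpl/dpl.txt"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder value (assumption)


def build_request(url: str, headers: dict) -> urllib.request.Request:
    """Create the request object with custom headers, without sending it."""
    return urllib.request.Request(url, headers=headers)


def read_tsv(url: str, headers: dict) -> pd.DataFrame:
    """Fetch the file with the given headers and parse it as tab-separated text."""
    with urllib.request.urlopen(build_request(url, headers)) as resp:
        return pd.read_csv(io.BytesIO(resp.read()), sep="\t")


req = build_request(URL, HEADERS)
# Request stores header names capitalized, hence the "User-agent" spelling here.
print(req.get_header("User-agent"))

# df = read_tsv(URL, HEADERS)  # uncomment to actually download the file
```

The download call is left commented out so the snippet runs offline; the one-line storage_options form in the answer remains the shorter option when it works.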