Read in data from web (xml format) but then need to separate fields.

Question

I am reading in a dataset from this link:

url = "https://www.bis.doc.gov/dpl/dpl.txt"

This is how I read it in (reading the URL directly as a CSV gives a Forbidden error, hence the use of requests):

import requests

test_URL = url

def get_data(link):
    # Send a browser-like User-Agent header with the request.
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}

    req = requests.get(link, headers=hdr)
    content = req.content  # raw response body as bytes

    return content

data = get_data(test_URL)

The data I have read in look like this:

print(data)
b'"Name"\t"Street_Address"\t"City"\t"State"\t"Country"\t"Postal_Code"\t"Effective_Date"\t"Expiration_Date"\t"Standard_Order"\t"Last_Update"\t"Action"\t"FR_Citation"\n" I. ASH"\t"UPON THE DATE OF THE ORDER INCARCERATED AT USM NO: 26265-177, FCI SEAGOVILLE, 2113 NORTH HIGHWAY 175"\t"SEAGOVILLE"\t"TX"\t"US"\t"75159"\t"06/19/2003"\t"06/29/2056"\t"Y"\t"2007-01-31"\t"FEDERAL REGISTER NOTICE UPDATED"\t"68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72 F.R. 4236 1/30/07"\n"AARON ABRAHAM VILLA"\t"3415 RIVERA AVENUE"\t"EL PASO"\t"TX"\t""\t"79905"\t"08/24/2022"\t"01/14/2026"\t"Y"\t"2022-08-29"\t"F.R. NOTICE ADDED"\t"87 F.R. 52741 8/29/2022"\n"ABDIEL PADRON MADRID"\t"INMATE NUMBER: 42167-480, FCI LA TUNA FEDERAL CORRECTIONAL INSTITUTION, P.O. BOX 3000"\t"ANTHONY"\t"NM"\t""\t"88201"\t"02/10/2022"\t"06/17/2030"\t"Y"\t"2022-02-22"\t"F.R. NOTICE ADDED"\t"87 F.R. 9030 2/17/2022"\n"ABDUL MAJID SAIDI"\t"2948 PEASE DRIVE, APT. 201"\t"ROCKY RIVER"\t"OH"\t""\t"44116"\t"10/30/2020"\t"03/13/2026"\t"Y"\t"2020-11-05"\t"F.R. NOTICE ADDED"\t"85 F.R. 70581 11/5/2020"\n"ABDULAH AL NASSER"\t"605 TRAIL LAKE DRIVE"\t"RICHARDSON"\t"TX"\t"US"\t"75081"\t"03/04/2002"\t"06/29/2056"\t"Y"\t"2006-07-11"\t"50 YEAR DENIAL"\t"67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02 71 F.R. 38843 7/10/06"\n"ABDULAH AL NASSER"\t"908 AUDELIA ROAD, SUIE 200, PMB #245"\t"RICHARDSON"\t"TX"\t"US"\t"75081"\t"03/04/2002"\t"06/29/2056"\t"Y"\t"2006-07-11"\t"50 YEAR DENIAL"\t"67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02"\n"ABDULAMIR MAHDI"\t"20 HUNTINGWOOD DRIVE"\t"SCARBOROUGH, ONTARIO"\t""\t"CA"\t"M1W1A2"\t"10/03/2003"\t"10/03/2023"\t"N"\t"2003-10-06"\t"NON STANDARD DENIAL"\t"68 F.R. 57406 10/3/03"\n"ABDULLAH AL NASSER"\t"UPON THE DATE OF THE ORDER INCARCERATED AT USM NO: 26265-177, FCI SEAGOVILLE, 2113 NORTH HIGHWAY 175"\t"SEAGOVILLE"\t"TX"\t"US"\t"75159"\t"06/19/2003"\t"06/29/2056"\t"Y"\t"2007-01-31"\t"FEDERAL REGISTER NOTICE UPDATED"\t"68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72 F.R. 
4236 1/30/07"\n"ABEL HERNANDEZ, JR."\t"120 SAINT JOHN DRIVE"\t"PHARR"\t"TX"\t""\t"78577"\t"04/30/2021"\t"08/29/2029"\t"Y"\t"2021-05-05"\t"F.R. NOTICE ADDED"\t"86 F.R. 23920 5/5/2021"\n"ABU AL-JUD"\t"INMATE NUMBER: 87450-083, FCI VICTORVILLE MEDIUM II FEDERAL CORRECTIONAL INSTITUTION, P.O. BOX 3850"\t"ADELANTO"\t"CA"\t""\t"92301"\t"03/31/2017"\t"06/13/2026"\t"Y"\t"2017-04-06"\t"FR NOTICE ADDED"\t"82 F.R. 16788, 16789 4/6/2017"\n"ADAM AL HERZ"\t"INMATE NUMBER: 13991-029, FMC ROCHESTER, P.O. BOX 4000"\t"ROCHESTER"\t"MN"\t""\t"55903"\t"08/13/2019"\t"10/13/2026"\t"Y"\t"2019-08-22"\t"FR NOTICE ADDED"\t"84 F.R. 43787 8/22/2019"\n"ADRIANA GABRIELA GUAJARDO-CAVAZOS"\t"CALLE MANUEL OTIZ #49, MATAMOROS, TAMAULIPAS"\t"MEXICO"\t""\t"MX"\t"87394"\t"05/08/2023"\t"11/12/2027"\t"Y"\t"2023-05-12"\t"ADDITION, F.R. NOTICE ADDED "\t"88 F.R. 30721 5/12/2023"\n"ADT ANALOG AND DIGITAL TECHNIK"\t"8019 NIEDERSEEON, HOUSE

Does anyone know how to convert the data above into a well-formatted pandas DataFrame, with one column per field?

Answer 1

Score: 1

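The response is tab-separated text, so you can decode the bytes, wrap them in io.StringIO, and parse them with pandas.read_csv:

import io
import pandas as pd

df = pd.read_csv(io.StringIO(data.decode('utf-8')), sep='\t')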
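To try the parsing step without touching the network, here is a minimal sketch that uses two records copied from the question's output. The name `sample` is only a stand-in for the bytes returned by get_data(), and the columns are cut down to six for brevity. It also shows that read_csv accepts a binary file-like object, so the explicit decode is optional.

```python
import io

import pandas as pd

# Hypothetical stand-in for the bytes returned by get_data(): a header line
# plus two records copied from the question, reduced to six columns.
sample = (
    b'"Name"\t"Street_Address"\t"City"\t"State"\t"Country"\t"Postal_Code"\n'
    b'"AARON ABRAHAM VILLA"\t"3415 RIVERA AVENUE"\t"EL PASO"\t"TX"\t""\t"79905"\n'
    b'"ABDULAMIR MAHDI"\t"20 HUNTINGWOOD DRIVE"\t"SCARBOROUGH, ONTARIO"\t""\t"CA"\t"M1W1A2"\n'
)

# Decode to text and wrap it in a file-like object, as in the answer.
df_text = pd.read_csv(io.StringIO(sample.decode("utf-8")), sep="\t")

# Alternative: hand the raw bytes to read_csv through io.BytesIO.
df_bytes = pd.read_csv(io.BytesIO(sample), sep="\t")

print(df_text)
print(df_text.equals(df_bytes))
```

Empty quoted fields such as the missing Country come back as NaN, which matches the output shown for the full file.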

Note that you can also skip requests and pass the headers through read_csv's storage_options; for HTTP(S) URLs, pandas forwards these key-value pairs as request headers:

hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
url = "https://www.bis.doc.gov/dpl/dpl.txt"

df = pd.read_csv(url, sep='\t', storage_options=hdr)

Output:

                     Name                                     Street_Address         City State Country Postal_Code Effective_Date Expiration_Date Standard_Order Last_Update  \
0                  I. ASH  UPON THE DATE OF THE ORDER INCARCERATED AT USM...   SEAGOVILLE    TX      US       75159     06/19/2003      06/29/2056              Y  2007-01-31   
1     AARON ABRAHAM VILLA                                 3415 RIVERA AVENUE      EL PASO    TX     NaN       79905     08/24/2022      01/14/2026              Y  2022-08-29   
2    ABDIEL PADRON MADRID  INMATE NUMBER: 42167-480, FCI LA TUNA FEDERAL ...      ANTHONY    NM     NaN       88201     02/10/2022      06/17/2030              Y  2022-02-22   
3       ABDUL MAJID SAIDI                         2948 PEASE DRIVE, APT. 201  ROCKY RIVER    OH     NaN       44116     10/30/2020      03/13/2026              Y  2020-11-05   
4       ABDULAH AL NASSER                               605 TRAIL LAKE DRIVE   RICHARDSON    TX      US       75081     03/04/2002      06/29/2056              Y  2006-07-11   
..                    ...                                                ...          ...   ...     ...         ...            ...             ...            ...         ...   
652    YURI I. MONTGOMERY                               2912 10TH PLACE WEST      SEATTLE    WA      US       98119     12/21/2010      12/21/2040              Y  2011-01-04   
653             ZHIFU LIN  INMATE NUMBER: 08295-087, CI MOSHANNON VALLEY,...  PHILIPSBURG    PA      US       16866     12/22/2014      11/15/2023              Y  2014-12-22   
654           ZHONGDA JIN                         1895 DOBBIN DRIVE, SUITE B     SAN JOSE    CA      US       95133     07/31/2001      07/31/2026              Y  2001-08-01   
655            ZIMO SHENG                    3975 N. CRAMER STREET, UNIT 204    MILWAUKEE    WI     NaN       53211     03/16/2020      12/13/2028              Y  2020-03-20   
656            ZIMO SHENG                        JINXIUYUAN 17-403, CHANGSHU      JIANGSU   NaN      CN      215500     03/16/2020      12/13/2028              Y  2020-03-20   

                              Action                                        FR_Citation  
0    FEDERAL REGISTER NOTICE UPDATED  68 F.R. 38290 6/27/03 71 F.R. 38843 7/10/06 72...  
1                  F.R. NOTICE ADDED                            87 F.R. 52741 8/29/2022  
2                  F.R. NOTICE ADDED                             87 F.R. 9030 2/17/2022  
3                  F.R. NOTICE ADDED                            85 F.R. 70581 11/5/2020  
4                     50 YEAR DENIAL  67 F.R. 56530 9/4/02 67 F.R. 10890 3/11/02 71 ...  
..                               ...                                                ...  
652                  NEW & FR NOTICE                             75 F.R. 82464 12/30/10  
653                  FR NOTICE ADDED                             79 F.R. 78394 12/30/14  
654                              NEW                               66 F.R. 40971 8/6/01  
655                  FR NOTICE ADDED                            85 F.R. 16054 3/20/2020  
656                  FR NOTICE ADDED                            85 F.R. 16054 3/20/2020  

[657 rows x 12 columns]
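If the goal is a cleaner frame rather than just separated fields, one possible follow-up is to keep postal codes as text and convert the date columns. The sketch below is an assumption about what "well formatted" means here, not part of the original answer. The `sample` bytes are a hypothetical stand-in for `data`, built from two records in the output above, and the date formats are inferred from the values shown there.

```python
import io

import pandas as pd

# Hypothetical stand-in for `data`: two records from the output above,
# reduced to the columns this sketch touches.
sample = (
    b'"Name"\t"Postal_Code"\t"Effective_Date"\t"Expiration_Date"\t"Last_Update"\n'
    b'"AARON ABRAHAM VILLA"\t"79905"\t"08/24/2022"\t"01/14/2026"\t"2022-08-29"\n'
    b'"ZIMO SHENG"\t"215500"\t"03/16/2020"\t"12/13/2028"\t"2020-03-20"\n'
)

# dtype=str reads every column as text, so postal codes are not parsed as numbers.
df = pd.read_csv(io.BytesIO(sample), sep="\t", dtype=str)

# Effective_Date and Expiration_Date appear as MM/DD/YYYY in the sample.
for col in ["Effective_Date", "Expiration_Date"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

# Last_Update appears as YYYY-MM-DD in the sample.
df["Last_Update"] = pd.to_datetime(df["Last_Update"], format="%Y-%m-%d")

print(df.dtypes)
```

If the full file contains dates that do not match these formats, passing errors="coerce" to pd.to_datetime turns them into NaT instead of raising.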

huangapple
  • Posted on June 27, 2023, 20:12:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76564739.html