June 26, 2023, 15:46:16

Unreadable text with list of encodings in pandas

Question

I am trying to read in this dataset:

path = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1165013/UK_Sanctions_List.ods"

I am using the following code (there are quite a few threads and suggested solutions for this around, but this one seemed the most reasonable):

import pandas as pd

encoding_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737'
                 , 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862'
                 , 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950'
                 , 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254'
                 , 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr'
                 , 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2'
                 , 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2'
                 , 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9'
                 , 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab'
                 , 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2'
                 , 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32'
                 , 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

for encoding in encoding_list:
    worked = True
    try:
        df = pd.read_csv(path, encoding=encoding, nrows=5)
        print(df)
    except:
        worked = False
    if worked:
        print(encoding, ':\n', df.head())

But when I print the dataframe the results look unreadable, like this:

ËäjñEÎÜ'g
«sQğøÆÿŞmÿ´;Ğ´³µÇÇm®©sbH«iw...  ¿`Ò­ìş#mxOnBXvFî&ÆªPÊz1á3uoj_g
¢x>æi7¸}Z«¤õÔ3ÎílW|ùÍx¡c;PÓ©kê+_ëÍª...                                NaN

                                                          qJ|HfÆzÖ¤c[¨ÿ`ÉŞ` *ª
b¾?]ÔüR~
¾GÌOmxÜ?=vì¢¦Í`                                                     NaN
D¾Å¢
Æ·äÎQ´
ûò£^×%óÒ·$]qÓ´În[l'ß                                        NaN
                                                                                &.
ËäjñEÎ"'g
«sQ}øÆÿ@mÿ´;!´³µ¢¢m®©sbH«iw...  ¿ýÒ­ì¦ÖmxOnBXvFî&ÆªPÊz1á3uoj_g
^x>æi7¸ðZ«€õÔ3ÎílW]ùÍx¡c;PÓ©kê+_ëÍª...                                NaN

                                                          qJ]HfÆz#€cÇ¨ÿýÉ@ý *ª
b¾?ÐÔ\Rö
¾GÌOmx"?=vì^þÍý                                                     NaN
D¾Å^
Æ·äÎQ´
ûò£¬×%óÒ·ÝÐqÓ´ÎnÇl'ß                                        NaN
cp1140 :
                                                                                 &.
ËäjñEÎ"'g
«sQ}øÆÿ@mÿ´;!´³µ¢¢m®©sbH«iw...  ¿ýÒ­ì¦ÖmxOnBXvFî&ÆªPÊz1á3uoj_g
^x>æi7¸ðZ«€õÔ3ÎílW]ùÍx¡c;PÓ©kê+_ëÍª...                                NaN

                                                          qJ]HfÆz#€cÇ¨ÿýÉ@ý *ª
b¾?ÐÔ\Rö
¾GÌOmx"?=vì^þÍý                                                     NaN
D¾Å^
Æ·äÎQ´
ûò£¬×%óÒ·ÝÐqÓ´ÎnÇl'ß                                        NaN

Does anybody know how I can read it in by any chance?

Answer 1

Score: 1

This is not a CSV file, but rather an ODS (Open Document Spreadsheet) file.
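
An ODS file is a ZIP container rather than plain text, which would explain why no text encoding produces readable output. As a rough sketch (the function name looks_like_ods is made up for illustration, and the check relies on the "mimetype" entry that OpenDocument packages normally carry), the file type could be verified before choosing a reader:

```python
import io
import zipfile

# Media type that an OpenDocument spreadsheet declares in its "mimetype" entry.
ODS_MIMETYPE = "application/vnd.oasis.opendocument.spreadsheet"


def looks_like_ods(data: bytes) -> bool:
    """Return True if the bytes are a ZIP archive declaring the ODS mimetype.

    Hypothetical helper for illustration; it does not validate the
    spreadsheet content itself.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False  # plain text such as CSV ends up here
    with zipfile.ZipFile(buffer) as archive:
        if "mimetype" not in archive.namelist():
            return False
        declared = archive.read("mimetype").decode("ascii", errors="replace")
        return declared.strip() == ODS_MIMETYPE
```

For a downloaded copy, something like looks_like_ods(open("UK_Sanctions_List.ods", "rb").read()) should tell you whether pd.read_csv is the wrong tool for the file.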

You should use pandas.read_excel (make sure the odfpy module is installed):

# pip install odfpy
import pandas as pd

df = pd.read_excel("UK_Sanctions_List_2.ods", skiprows=2)

Note: reading the file is quite slow, so be patient. The original file did not work for me, but opening it in LibreOffice and saving it again did the trick. Another option is to open the file in LibreOffice and export it to CSV from there.
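
The CSV export could probably also be scripted instead of done by hand. The following is only a sketch: it assumes LibreOffice is installed and that its command-line binary is reachable as "soffice" (on some systems it is called "libreoffice"), and both function names are made up for illustration. As far as I know, a CSV export only covers one sheet, and the leading rows that skiprows=2 skips above would still be present in the result.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(ods_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that exports a spreadsheet to CSV."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "csv",   # target format
        "--outdir", str(out_dir),
        str(ods_path),
    ]


def convert_ods_to_csv(ods_path, out_dir="."):
    """Run the conversion and return the path where the CSV is expected."""
    # Assumption: the binary is named "soffice" and is on PATH.
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice ('soffice') was not found on PATH")
    subprocess.run(build_convert_command(ods_path, out_dir), check=True)
    return Path(out_dir) / (Path(ods_path).stem + ".csv")
```

After a successful conversion, pd.read_csv on the returned path should be much faster than parsing the ODS file on every run.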

Output (first 5 rows):

  Last Updated Unique ID  OFSI Group ID UN Reference Number                                      Name 6  Name 1  Name 2  Name 3  Name 4  Name 5  ... IMO number  Current owner/operator (s)  \
0   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
1   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
2   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
3   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
4   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   

   Previous owner/operator (s) Current believed flag of ship  Previous flags  Type of ship Tonnage of ship Length of ship Year Built Hull identification number (HIN)  
0                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
1                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
2                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
3                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
4                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN
