Unreadable text with a list of encodings in pandas


Question


I am trying to read in this dataset:

    path = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1165013/UK_Sanctions_List.ods"

I am using the code below (I have seen quite a few threads and suggested solutions for this, but this one seemed the most reasonable):

    import pandas as pd

    encoding_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737',
                     'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862',
                     'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950',
                     'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254',
                     'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
                     'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
                     'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2',
                     'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9',
                     'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab',
                     'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2',
                     'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32',
                     'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

    # Try every encoding and print the first rows for each one that does not raise.
    for encoding in encoding_list:
        worked = True
        try:
            df = pd.read_csv(path, encoding=encoding, nrows=5)
            print(df)
        except Exception:
            worked = False
        if worked:
            print(encoding, ':\n', df.head())

But when I print the dataframe the results look unreadable, like this:

    ËäjñEÎÜ'g
    «sžQğøÆÿŞ“mÿ´;Ğ´³µÇ“‰Çm®“©sbH«iw... &#191`Ò­ìş#mxOnBXvFî‡&ƪPÊz1á3uoj_g
    ¢x>æi7¸}Z«”¤õÔ3ÎílW|ùÍxˆ”¡‚c;P‹Ó©Škê+_ëͪ... NaN
    qJ|HfÆzÖ¤c[¨‡ÿ`ɊŞ` *ª
    b¾?]ÔüR~ž
    ¾GÌOmxÜ?=v좦Í` NaN
    Dž¾ŒÅ¢
    Æ·äÎQ´
    ûò£^×%óÒ·$]qӘ´&#206n[l'ß NaN
    &.œŽ
    ËäjñEÎ"'g
    «sžQ}øÆÿ@“mÿ´;!´³µ¢“‰¢m®“©sbH«iw... ¿›ýÒ­ì¦ÖmxOnBXvFî‡&ƪPÊz1á3uoj_g
    ^x>æi7¸ðZ«”€õÔ3ÎílW]ùÍxˆ”&#161c;P‹Ó&#169kê+_ëͪ... NaN
    qJ]HfÆz#€cǍ¨‡ÿýɊ@ý *ª
    b¾?ÐÔ\R&#246
    ¾GÌOmx"?=vì^þÍý NaN
    Dž¾ŒÅ^
    Æ·äÎQ´
    ûò£¬×%óÒ·ÝÐqӘ´&#206nÇl'ß NaN
    cp1140 :
    &.œŽ
    ËäjñEÎ"'g
    «sžQ}øÆÿ@“mÿ´;!´³µ¢“‰¢m®“©sbH«iw... ¿›ýÒ­ì¦ÖmxOnBXvFî‡&ƪPÊz1á3uoj_g
    ^x>æi7¸ðZ«”€õÔ3ÎílW]ùÍxˆ”&#161c;P‹Ó&#169kê+_ëͪ... NaN
    qJ]HfÆz#€cǍ¨‡ÿýɊ@ý *ª
    b¾?ÐÔ\R&#246
    ¾GÌOmx"?=vì^þÍý NaN
    Dž¾ŒÅ^
    Æ·äÎQ´
    ûò£¬×%óÒ·ÝÐqӘ´&#206nÇl'ß NaN

Does anybody know how I can read it in?

Answer 1

Score: 1


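This is not a CSV file, but rather an ODS (OpenDocument Spreadsheet) file. An ODS file is a ZIP archive of XML documents, so no choice of text encoding will make read_csv produce readable output.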
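One way to confirm this before trying any encodings is to look inside the file itself. The following is a minimal sketch, not part of the original answer: the helper names `sniff_opendocument` and `make_fake_ods` are made up for illustration, and the sample bytes are a tiny stand-in archive rather than the real download. It relies on the fact that OpenDocument files are ZIP containers carrying a `mimetype` entry.

```python
import io
import zipfile

ODS_MIMETYPE = "application/vnd.oasis.opendocument.spreadsheet"


def sniff_opendocument(data: bytes):
    """Return the MIME type declared inside an OpenDocument archive, or None."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return None  # not a ZIP container, so not an OpenDocument file
    with zipfile.ZipFile(buffer) as archive:
        if "mimetype" not in archive.namelist():
            return None  # a ZIP archive, but without the OpenDocument marker
        return archive.read("mimetype").decode("ascii").strip()


def make_fake_ods() -> bytes:
    """Build a tiny stand-in archive for the demo (hypothetical, not a valid spreadsheet)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", ODS_MIMETYPE)
        archive.writestr("content.xml", "<office:document-content/>")
    return buffer.getvalue()


sample = make_fake_ods()
print(sample[:2])                         # ZIP data starts with the bytes b'PK'
print(sniff_opendocument(sample))         # the declared OpenDocument MIME type
print(sniff_opendocument(b"a,b\n1,2\n"))  # plain CSV text is not a ZIP archive

# latin_1 maps every possible byte to a character, so decoding binary data
# with it never raises. "No exception" does not mean "right encoding".
decoded = sample.decode("latin_1")
print(len(decoded) == len(sample))
```

With a real download, the same check would run on the bytes read from the saved file. The last lines also show why an encoding loop like the one in the question reports success: an encoding that maps every byte decodes compressed data into noise without raising.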

You should use pandas.read_excel (make sure the odfpy module is installed):

    # pip install odfpy
    import pandas as pd

    # skiprows=2 skips the first two rows of the sheet
    df = pd.read_excel("UK_Sanctions_List_2.ods", skiprows=2)

NB: reading the file is quite slow, so be patient. The original file did not work for me, but opening it in LibreOffice and saving it again did the trick. Another option is to open the file in LibreOffice and export it to CSV from there.
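The CSV route can also be scripted instead of done by hand. This is a rough sketch under a few assumptions: LibreOffice is installed and its command-line binary is on the PATH as `soffice` or `libreoffice`, and the helper names `build_convert_command` and `convert_ods_to_csv` are hypothetical. As far as I know, a headless CSV conversion exports only the first sheet by default, so check the result if the workbook has several sheets.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(ods_path, out_dir, binary="soffice"):
    """Return the LibreOffice command line that converts one file to CSV."""
    return [
        binary,
        "--headless",            # run without opening a window
        "--convert-to", "csv",   # target format
        "--outdir", str(out_dir),
        str(ods_path),
    ]


def convert_ods_to_csv(ods_path, out_dir="."):
    """Convert an ODS file to CSV and return the expected output path."""
    # Assumption: the binary is reachable under one of these two names.
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice command-line binary not found on PATH")
    subprocess.run(build_convert_command(ods_path, out_dir, binary), check=True)
    # LibreOffice names the output after the input file, with a .csv suffix.
    return Path(out_dir) / (Path(ods_path).stem + ".csv")
```

Calling `convert_ods_to_csv("UK_Sanctions_List.ods")` should leave a `UK_Sanctions_List.csv` next to it, which `pd.read_csv` can then load. Whether the same `skiprows=2` applies to the exported CSV is worth verifying on the actual file.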

Output (first 5 rows):

       Last Updated  Unique ID  OFSI Group ID  UN Reference Number                                      Name 6  Name 1  Name 2  Name 3  Name 4  Name 5  ...  IMO number  Current owner/operator (s)  \
    0    2022-01-12    AFG0001          12703              TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...         NaN                         NaN
    1    2022-01-12    AFG0001          12703              TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...         NaN                         NaN
    2    2022-01-12    AFG0001          12703              TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...         NaN                         NaN
    3    2022-01-12    AFG0001          12703              TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...         NaN                         NaN
    4    2022-01-12    AFG0001          12703              TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...         NaN                         NaN

       Previous owner/operator (s)  Current believed flag of ship  Previous flags  Type of ship  Tonnage of ship  Length of ship  Year Built  Hull identification number (HIN)
    0                          NaN                            NaN             NaN           NaN              NaN             NaN         NaN                               NaN
    1                          NaN                            NaN             NaN           NaN              NaN             NaN         NaN                               NaN
    2                          NaN                            NaN             NaN           NaN              NaN             NaN         NaN                               NaN
    3                          NaN                            NaN             NaN           NaN              NaN             NaN         NaN                               NaN
    4                          NaN                            NaN             NaN           NaN              NaN             NaN         NaN                               NaN

huangapple
  • Published on June 26, 2023 at 15:46:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76554567.html