Json/weird column transformation JSON/奇怪的列转换

huangapple go评论101阅读模式
英文:

Json/weird column transformation

问题

我从Mongo数据库中收到了一些数据。这个表包含多个列,其中一些列的格式非常奇怪。

以下是该列/系列的一行示例:

  1. '[{idEvento.$oid=63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]'

根据我有限的知识,这似乎不是一个 JSON。我不知道如何将每个“事件”(由 {} 项组成)转换成列表。

然后,如何基于每个事件的内容查询/筛选数据?我应该将事件展开为新行并将其作为字符串进行查询吗?

英文:

I am getting some data incoming from a mongo database. Such table contains several columns and some os such columns are composed of a very strange format.

Example of one line of the column/series

  1. '[{idEvento.$oid=63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]'

This does not, at least for my shot knowledge, is a Json. I am struggling how to transform each "event" (composed of a {} item) into a list.

And after that, how could I query/filter data based on the containing of each event? Should I pd.explode the events into new lines and query as strings?

答案1

得分: 2

以下是您要翻译的代码部分:

  1. import re
  2. import json
  3. import pandas as pd
  4. s = "[{idEvento.$oid=63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]"
  5. s = re.sub(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)", r'"":""', s)
  6. data = json.loads(s)
  7. df = pd.DataFrame(data)
  8. print(df)
  1. def fn(x):
  2. x = re.sub(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)", r'"":""', x)
  3. return json.loads(x)
  4. out = df["col1"].apply(fn).explode().apply(pd.Series)
  5. print(out)

希望这对您有所帮助。如果您需要任何进一步的帮助,请随时告诉我。

英文:

You can try to "convert" the string to proper Json (using re) and then use standard json.loads (Regex101 demo):

  1. import re
  2. import json
  3. import pandas as pd
  4. s = "[{idEvento.$oid=63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]"
  5. s = re.sub(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)", r'"\g<1>":"\g<2>"', s)
  6. data = json.loads(s)
  7. df = pd.DataFrame(data)
  8. print(df)

Prints:

  1. idEvento.$oid dataHoraEvento.$date codigoTipoEvento mesAnoReferenciaContabilizacao _class
  2. 0 63ffaec3cdc01e6352729bad 1677690003377 1 032023 NaN
  3. 1 63ffb5c8cdc01e6352729bae 1677691800676 3 032023 NaN
  4. 2 6405cc8711c78c20369b4033 1678090851560 8 032023 NaN
  5. 3 6422b4c97e45dd75abb4f831 1679985307560 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil
  6. 4 6422b4c97e45dd75abb4f832 1679985309584 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil

NOTE: This works for this example, but the pattern might need to be adjusted for real case.


EDIT: To apply to a dataframe:

Consider following dataframe:

  1. df = pd.DataFrame(
  2. {
  3. "col1": [
  4. "[{idEvento.$oid=01_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]",
  5. "[{idEvento.$oid=02_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]",
  6. "[{idEvento.$oid=03_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]",
  7. ]
  8. }
  9. )
  10. col1
  11. 0 [{idEvento.$oid=01_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]
  12. 1 [{idEvento.$oid=02_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]
  13. 2 [{idEvento.$oid=03_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]

Then:

  1. def fn(x):
  2. x = re.sub(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)", r'"\g<1>":"\g<2>"', x)
  3. return json.loads(x)
  4. out = df["col1"].apply(fn).explode().apply(pd.Series)
  5. print(out)

Prints:

  1. idEvento.$oid dataHoraEvento.$date codigoTipoEvento mesAnoReferenciaContabilizacao _class
  2. 0 01_63ffaec3cdc01e6352729bad 1677690003377 1 032023 NaN
  3. 0 63ffb5c8cdc01e6352729bae 1677691800676 3 032023 NaN
  4. 0 6405cc8711c78c20369b4033 1678090851560 8 032023 NaN
  5. 0 6422b4c97e45dd75abb4f831 1679985307560 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil
  6. 0 6422b4c97e45dd75abb4f832 1679985309584 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil
  7. 1 02_63ffaec3cdc01e6352729bad 1677690003377 1 032023 NaN
  8. 1 63ffb5c8cdc01e6352729bae 1677691800676 3 032023 NaN
  9. 1 6405cc8711c78c20369b4033 1678090851560 8 032023 NaN
  10. 1 6422b4c97e45dd75abb4f831 1679985307560 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil
  11. 1 6422b4c97e45dd75abb4f832 1679985309584 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil
  12. 2 03_63ffaec3cdc01e6352729bad 1677690003377 1 032023 NaN
  13. 2 63ffb5c8cdc01e6352729bae 1677691800676 3 032023 NaN
  14. 2 6405cc8711c78c20369b4033 1678090851560 8 032023 NaN
  15. 2 6422b4c97e45dd75abb4f831 1679985307560 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil
  16. 2 6422b4c97e45dd75abb4f832 1679985309584 6 032023 br.com.bb.rcp.model.vantagens.HistoricoContabil

huangapple
  • 本文由 发表于 2023年8月4日 02:58:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/76830940.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定