Parse the dictionary values in CSV

Question

The CSV file contains:

id,type,attributes
1,xx,{'data': { 'attributes': {'aggregations': [{'space': 'sum','time': 'sum'}],'created_at': '2020-03-25T09:48:37.463835Z','include_percentiles': true,'metric_type': 'count','modified_at': '2020-03-25T09:48:37.463835Z','tags': ['app','datacenter']},'id': 'test.metric.latency','type': 'manage_tags'}}

How can I parse the attributes from the CSV file into a pandas DataFrame?

Expected output:

id type space created_at                   include_percentiles metric_type tags
1  xx   sum   2020-03-25T09:48:37.463835Z  true                count       app
1  xx   sum   2020-03-25T09:48:37.463835Z  true                count       datacenter

Answer 1

Score: 0

The challenge here is that the file is not valid CSV, so a standard CSV parser cannot read it directly.

The fields are comma-delimited, but the 'attributes' value is unquoted and contains commas that are not field separators.

The 'attributes' value is neither a valid string representation of a Python dictionary (because of the lowercase true) nor valid JSON (because of the single quotes).
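A quick way to check this claim is to try both parsers on a small fragment. The sketch below uses a shortened, hypothetical `raw` string and two helper functions whose names are chosen only for illustration.

```python
import json
from ast import literal_eval

# Hypothetical fragment in the same style as the 'attributes' value.
raw = "{'include_percentiles': true, 'tags': ['app', 'datacenter']}"


def parses_as_json(text):
    """Return True if text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def parses_as_python_literal(text):
    """Return True if text is a valid Python literal."""
    try:
        literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


print(parses_as_json(raw))            # single quotes are not valid JSON
print(parses_as_python_literal(raw))  # the bare name true is not a literal
print(parses_as_python_literal(raw.replace('true', 'True')))
```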

If the data are meant to represent a Python dictionary, then the only issue (with the data as shown) is the value true. We can overcome that by changing it to True. Let's also assume that a value of false might appear, so we'll handle that too.
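One caveat with a plain string replacement is that it also rewrites the words true and false when they occur inside quoted string values. A possible alternative, sketched here under the assumption that the value is otherwise a valid Python literal, is to parse the text into an AST, swap only the bare names true and false for real booleans, and then hand the tree to `literal_eval`. The names `_BOOL_NAMES`, `_BoolNameFixer`, and `parse_attributes` are made up for this example.

```python
import ast

# Bare names to convert; only the two values discussed above are handled.
_BOOL_NAMES = {'true': True, 'false': False}


class _BoolNameFixer(ast.NodeTransformer):
    """Replace the bare names true/false with Python boolean constants."""

    def visit_Name(self, node):
        if node.id in _BOOL_NAMES:
            return ast.copy_location(ast.Constant(value=_BOOL_NAMES[node.id]), node)
        return node


def parse_attributes(text):
    """Parse a dict-like string, leaving quoted strings untouched."""
    tree = ast.parse(text.strip(), mode='eval')
    tree = _BoolNameFixer().visit(tree)
    # literal_eval accepts an already-parsed expression node.
    return ast.literal_eval(tree)


result = parse_attributes("{'flag': true, 'note': 'a true story', 'off': false}")
print(result)
```

Because `literal_eval` still does the final evaluation, anything that is not a plain literal (function calls, other names) is rejected rather than executed.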

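from ast import literal_eval

from pandas import DataFrame

alldata = []

with open('/Volumes/G-Drive/foo.csv') as data:
    next(data)  # skip column names
    for line in data:
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first two commas only; the remainder is the attributes value.
        _id, _type, ds = line.rstrip('\n').split(',', 2)
        # Convert lowercase booleans so literal_eval accepts them.
        # Note: this also rewrites 'true'/'false' inside quoted string values.
        ds = ds.replace('true', 'True').replace('false', 'False')
        attrs = literal_eval(ds)['data']['attributes']
        rd = {
            'id': _id,
            'type': _type,
            'space': attrs['aggregations'][0]['space'],
            'created_at': attrs['created_at'],
            'include_percentiles': attrs['include_percentiles'],
            'metric_type': attrs['metric_type']
        }
        # One output row per tag. Copy the dict each time; appending the same
        # dict object repeatedly would leave every row with the last tag.
        for tag in attrs['tags']:
            alldata.append({**rd, 'tags': tag})

print(DataFrame(alldata))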

Output:

  id type space                   created_at  include_percentiles metric_type        tags
0  1   xx   sum  2020-03-25T09:48:37.463835Z                 True       count         app
1  1   xx   sum  2020-03-25T09:48:37.463835Z                 True       count  datacenter

huangapple
  • Posted on May 11, 2023 at 11:05:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76223887.html