Parse the dictionary values in CSV
Question
The CSV file contains:
id,type,attributes
1,xx,{'data': { 'attributes': {'aggregations': [{'space': 'sum','time': 'sum'}],'created_at': '2020-03-25T09:48:37.463835Z','include_percentiles': true,'metric_type': 'count','modified_at': '2020-03-25T09:48:37.463835Z','tags': ['app','datacenter']},'id': 'test.metric.latency','type': 'manage_tags'}}
How can I parse the attributes from the CSV file into a pandas DataFrame?
Expected output:
id type space created_at include_percentiles metric_type tags
1 xx sum 2020-03-25T09:48:37.463835Z true count app
1 xx sum 2020-03-25T09:48:37.463835Z true count datacenter
Answer 1
Score: 0
The challenge here is that the CSV data are not in a format that a conventional CSV parser can handle.
The data are essentially comma-delimited, but the 'attributes' value is unquoted and contains commas that are not field separators.
The 'attributes' value is neither a valid Python dictionary literal (it contains the lowercase value true) nor valid JSON (its strings are single-quoted).
If the data are meant to represent a Python dictionary, then the only issue (with the data as shown) is the value true. We can overcome that by changing it to True. Let's also assume that a value of false might appear, so we'll handle that too.
from ast import literal_eval

from pandas import DataFrame

alldata = []

with open('/Volumes/G-Drive/foo.csv') as data:
    next(data)  # skip column names
    for line in data:
        # Split on the first two commas only; the rest is the attributes value
        _id, _type, ds = line.split(',', 2)
        # Turn true/false into Python literals so literal_eval accepts them
        ds = ds.replace('true', 'True').replace('false', 'False')
        attrs = literal_eval(ds)['data']['attributes']
        rd = {
            'id': _id,
            'type': _type,
            'space': attrs['aggregations'][0]['space'],
            'created_at': attrs['created_at'],
            'include_percentiles': attrs['include_percentiles'],
            'metric_type': attrs['metric_type']
        }
        # One row per tag; build a new dict each time so rows do not share one object
        for tag in attrs['tags']:
            alldata.append({**rd, 'tags': tag})

print(DataFrame(alldata))
Output:
id type space created_at include_percentiles metric_type tags
0 1 xx sum 2020-03-25T09:48:37.463835Z True count app
1 1 xx sum 2020-03-25T09:48:37.463835Z True count datacenter
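One caveat: a plain string replace of true and false also rewrites those words when they appear inside a string value. The following is a hedged sketch of a stricter variant, not part of the original answer. It parses the text with the standard ast module and converts only the bare names true and false before handing the tree to literal_eval. The names BooleanNames and parse_attributes are hypothetical helpers introduced for illustration.

```python
import ast


class BooleanNames(ast.NodeTransformer):
    """Replace the bare names true/false with Python boolean constants."""

    _VALUES = {'true': True, 'false': False}

    def visit_Name(self, node):
        if node.id in self._VALUES:
            return ast.copy_location(ast.Constant(self._VALUES[node.id]), node)
        return node  # any other name is left alone and will be rejected later


def parse_attributes(text):
    # strip() removes the trailing newline and any leading spaces,
    # which would otherwise be treated as unexpected indentation
    tree = ast.parse(text.strip(), mode='eval')
    tree = BooleanNames().visit(tree)
    # literal_eval accepts an Expression node and still refuses anything
    # that is not a plain literal structure
    return ast.literal_eval(tree)


sample = "{'include_percentiles': true, 'note': 'true story', 'flags': [false]}\n"
print(parse_attributes(sample))
```

In the loop above, the replace-and-literal_eval line could then become parse_attributes(ds)['data']['attributes'], assuming the rest of each record is otherwise a valid Python literal.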