2023-05-11 11:05:08

Parse the dictionary values in CSV

Question

The CSV file contains:

id,type,attributes
1,xx,{'data': { 'attributes': {'aggregations': [{'space': 'sum','time': 'sum'}],'created_at': '2020-03-25T09:48:37.463835Z','include_percentiles': true,'metric_type': 'count','modified_at': '2020-03-25T09:48:37.463835Z','tags': ['app','datacenter']},'id': 'test.metric.latency','type': 'manage_tags'}}

How can I parse the attributes column from this CSV file into a pandas DataFrame?

Expected output:

id type space created_at                   include_percentiles metric_type tags
1  xx   sum   2020-03-25T09:48:37.463835Z  true                count       app
1  xx   sum   2020-03-25T09:48:37.463835Z  true                count       datacenter

Answer 1

Score: 0

The challenge here is that the CSV data are not in a format that a conventional CSV parser can handle.

The rows are essentially comma-delimited, but the 'attributes' value is unquoted and contains commas that are not field separators.
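To make the problem concrete, here is a small sketch using only the standard library. The SAMPLE string is the header and data row from the question, held in memory rather than read from a file (an assumption made for illustration). It compares what a standard CSV reader produces with what a split limited to the first two commas produces.

```python
import csv
import io

# Header plus the single data row from the question, kept in memory for illustration.
SAMPLE = (
    "id,type,attributes\n"
    "1,xx,{'data': { 'attributes': {'aggregations': [{'space': 'sum','time': 'sum'}],"
    "'created_at': '2020-03-25T09:48:37.463835Z','include_percentiles': true,"
    "'metric_type': 'count','modified_at': '2020-03-25T09:48:37.463835Z',"
    "'tags': ['app','datacenter']},'id': 'test.metric.latency','type': 'manage_tags'}}\n"
)

# A standard CSV reader treats every comma as a separator, because the
# attributes value is not wrapped in the reader's quote character (").
header, record = list(csv.reader(io.StringIO(SAMPLE)))
naive_field_count = len(record)

# Splitting on the first two commas only keeps the attributes value in one piece.
data_line = SAMPLE.splitlines()[1]
fields = data_line.split(",", 2)

print(len(header), naive_field_count, len(fields))
```

The header has three columns, but the reader returns many more fields for the data row, while the limited split returns exactly three.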

As shown, the 'attributes' value is neither valid JSON (it uses single-quoted strings) nor a valid string representation of a Python dictionary (it contains a bare true).

If the data are meant to represent a Python dictionary, then the only problem with the sample shown is the value true. We can work around that by changing it to True. Let's also assume that a value of false might appear, so we'll handle that too.
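The following sketch checks both claims on a shortened value and shows the conversion working. The raw string and the helper names parses and to_python_literal are made up for illustration. The helper uses a word-boundary regular expression instead of a plain string replace, which is an assumption on my part about what is safer: it leaves words that merely contain "true" alone, although a quoted string value that is exactly 'true' would still be altered.

```python
import ast
import json
import re

# A shortened attributes value in the same style as the question's data.
raw = "{'include_percentiles': true, 'tags': ['app', 'datacenter']}"


def parses(func, text):
    """Return True if func(text) succeeds, False if it rejects the text."""
    try:
        func(text)
        return True
    except (ValueError, SyntaxError):
        return False


def to_python_literal(text):
    """Rewrite bare true/false as Python's True/False (whole words only)."""
    text = re.sub(r"\btrue\b", "True", text)
    return re.sub(r"\bfalse\b", "False", text)


is_json = parses(json.loads, raw)                  # single quotes are not valid JSON
is_python_literal = parses(ast.literal_eval, raw)  # bare true is not a Python literal
fixed = ast.literal_eval(to_python_literal(raw))

print(is_json, is_python_literal, fixed)
```

A plain .replace('true', 'True') would also rewrite a tag such as 'construed'; the word-boundary version does not.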

from ast import literal_eval

from pandas import DataFrame

alldata = []
with open('/Volumes/G-Drive/foo.csv') as data:
    next(data)  # skip column names
    for line in data:
        # split on the first two commas only; the remainder is the attributes value
        _id, _type, ds = line.strip().split(',', 2)
        # turn JSON-style booleans into Python literals so literal_eval accepts them
        ds = ds.replace('true', 'True').replace('false', 'False')
        attrs = literal_eval(ds)['data']['attributes']
        rd = {
            'id': _id,
            'type': _type,
            'space': attrs['aggregations'][0]['space'],
            'created_at': attrs['created_at'],
            'include_percentiles': attrs['include_percentiles'],
            'metric_type': attrs['metric_type']
        }
        for tag in attrs['tags']:
            # build a fresh dict per tag; appending rd itself would leave
            # every row for this line showing the last tag
            alldata.append({**rd, 'tags': tag})
print(DataFrame(alldata))

Output:

  id type space                   created_at  include_percentiles metric_type        tags
0  1   xx   sum  2020-03-25T09:48:37.463835Z                 True       count         app
1  1   xx   sum  2020-03-25T09:48:37.463835Z                 True       count  datacenter
