May 25, 2023, 18:41:51

Parse JSON string within dataframe and insert extracted information into another column

Question

You can use the following code to add the table, questions, answers, answer_type, and other outputs as new columns in the original df_sample dataframe:

import json
import pandas as pd

# Assumes the original df_sample dataframe already exists

# Create empty object-dtype columns to hold the new data (each cell may store a list)
df_sample['table_output'] = None
df_sample['questions_output'] = None
df_sample['answers_output'] = None
df_sample['answer_type_output'] = None

for index, row in df_sample.iterrows():
    table_json = row['table']
    paragraphs_json = row['paragraphs']
    questions_json = row['questions']
    table = json.loads(json.dumps(table_json)).get("table")
    paragraphs = [json.loads(json.dumps(x)).get("text") for x in paragraphs_json]
    questions = [json.loads(json.dumps(x)).get("question") for x in questions_json]
    answer = [json.loads(json.dumps(x)).get("answer") for x in questions_json]
    answer_type = [json.loads(json.dumps(x)).get("answer_type") for x in questions_json]

    # Store the extracted data in the corresponding columns
    df_sample.at[index, 'table_output'] = table
    df_sample.at[index, 'questions_output'] = questions
    df_sample.at[index, 'answers_output'] = answer
    df_sample.at[index, 'answer_type_output'] = answer_type

# Print the result
print(df_sample[['table_output', 'questions_output', 'answers_output', 'answer_type_output']])

This adds new columns to the df_sample dataframe that hold the extracted table, questions, answers, and answer_type data.
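As a rough alternative that avoids iterrows(), the same columns can be built with Series.apply. The sketch below assumes each cell already holds Python dicts and lists (as the .get calls above imply), so the json round-trip is skipped. The one-row df_sample and the pluck helper are made up for illustration and are not part of the original post.

```python
import pandas as pd

# Hypothetical one-row stand-in for df_sample, trimmed from the sample data
df_sample = pd.DataFrame(
    {
        "table": [{"uid": "t1", "table": "[[, Useful Life, 2019, 2018]]"}],
        "questions": [
            [
                {"question": "What was the amount in 2018?", "answer": "[225,167]", "answer_type": "span"},
                {"question": "How many years exceeded the threshold?", "answer": "1", "answer_type": "count"},
            ]
        ],
    }
)


def pluck(records, key):
    """Return the value stored under `key` in each dict of `records` (hypothetical helper)."""
    return [record.get(key) for record in records]


# Assumption: cells are already dicts/lists, so .get can be called on them directly
df_sample["table_output"] = df_sample["table"].apply(lambda cell: cell.get("table"))
df_sample["questions_output"] = df_sample["questions"].apply(pluck, key="question")
df_sample["answers_output"] = df_sample["questions"].apply(pluck, key="answer")
df_sample["answer_type_output"] = df_sample["questions"].apply(pluck, key="answer_type")

print(df_sample[["table_output", "questions_output", "answers_output", "answer_type_output"]])
```

Each new column keeps one list per original row, so the number of rows in df_sample does not change.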

I am trying to extract information from each cell in a row of a dataframe and add it as new columns.

import json
import pandas as pd
df_nested = pd.read_json('train.json')
df_sample = df_nested.sample(n=50, random_state=0)
display(df_sample)

for index, row in df_sample.iterrows():
    table_json = row['table']
    paragraphs_json = row['paragraphs']
    questions_json = row['questions']
    table = json.loads(json.dumps(table_json)).get("table")
    #print(table)
    paragraphs = [json.loads(json.dumps(x)).get("text") for x in paragraphs_json]
    #print(paragraphs)
    questions = [json.loads(json.dumps(x)).get("question") for x in questions_json]
    answer = [json.loads(json.dumps(x)).get("answer") for x in questions_json]
    answer_type = [json.loads(json.dumps(x)).get("answer_type") for x in questions_json]
    program = [json.loads(json.dumps(x)).get("derivation") for x in questions_json]
    print(program)

The dataframe looks like this (one row shown, with each column's cell on its own labeled line):

table:
{"uid": "bf2c6a2f-0b76-4bba-8d3c-2ee02d1b7d73", "table": "[[, , December 31,,], [, Useful Life, 2019, 2018], [Computer equipment and software, 3 – 5 years, $57,474, $52,055], [Furniture and fixtures, 7 years, 6,096, 4,367], [Leasehold improvements, 2 – 6 years, 22,800, 9,987], [Renovation in progress, n/a, 8, 1,984], [Build-to-suit property, 25 years, —, 51,058], [Total property and equipment, gross, , 86,378, 119,451], [Less: accumulated depreciation and amortization, , (49,852), (42,197)], [Total property and equipment, net, , $36,526, $77,254]]"}

paragraphs:
[{"uid": "07e28145-95d5-4f9f-b313-ac8c3b4a869f", "text": "Accounts Receivable", "order": "1"}, {"uid": "b41652f7-0e68-4cf6-9723-fec443b1e604", "text": "The following is a summary of Accounts receivable (in thousands):", "order": "2"}]

questions:
[{"rel_paragraphs": "2", "answer_from": "table-text", "question": "Which years does the table provide information for the company's Accounts receivable?", "scale": "", "answer_type": "multi-span", "req_comparison": "false", "order": "1", "uid": "53041a93-1d06-48fd-a478-6f690b8da302", "answer": "[2019, 2018]", "derivation": ""}, {"rel_paragraphs": "2", "answer_from": "table-text", "question": "What was the amount of accounts receivable in 2018?", "scale": "thousand", "answer_type": "span", "req_comparison": "false", "order": "2", "uid": "a196a61c-43b0-43f5-bb4b-b059a1103c54", "answer": "[225,167]", "derivation": ""}, {"rel_paragraphs": "2", "answer_from": "table-text", "question": "What was the allowance for product returns in 2019?", "scale": "thousand", "answer_type": "span", "req_comparison": "false", "order": "3", "uid": "c8656e5e-2bb7-4f03-ae73-0d04492155c0", "answer": "[(25,897)]", "derivation": ""}, {"rel_paragraphs": "2", "answer_from": "table-text", "question": "How many years did the net accounts receivable exceed $200,000 thousand?", "scale": "", "answer_type": "count", "req_comparison": "false", "order": "4", "uid": "fdf08d3d-d570-4c21-9b3e-a3c86e164665", "answer": "1", "derivation": "2018"}, {"rel_paragraphs": "2", "answer_from": "table-text", "question": "What was the change in the Allowance for doubtful accounts between 2018 and 2019?", "scale": "thousand", "answer_type": "arithmetic", "req_comparison": "false", "order": "5", "uid": "6ecb2062-daca-4e1e-900e-2b99b2fce929", "answer": "424", "derivation": "-1,054-(-1,478)"}, {"rel_paragraphs": "[]", "answer_from": "table", "question": "What was the percentage change in the Allowance for product returns between 2018 and 2019?", "scale": "percent", "answer_type": "arithmetic", "req_comparison": "false", "order": "6", "uid": "f2c1edad-622d-4959-8cd5-a7f2bd2d7bb1", "answer": "129.87", "derivation": "(-25,897+11,266)/-11,266"}]

The code above is not efficient. How do I add the outputs from the df_sample.iterrows() loop (table, questions, answers, answer_type, etc.) as new columns in my original df_sample dataframe?

Answer 1

Score: 1

With the dataframe you provided:

import pandas as pd

df = pd.DataFrame(
    {
        "table": [
            {
                "uid": "bf2c6a2f-0b76-4bba-8d3c-2ee02d1b7d73",
                "table": "[[, , December 31,,], [, Useful Life, 2019, 2018], [Computer equipment and software, 3 – 5 years, $57,474, $52,055], [Furniture and fixtures, 7 years, 6,096, 4,367], [Leasehold improvements, 2 – 6 years, 22,800, 9,987], [Renovation in progress, n/a, 8, 1,984], [Build-to-suit property, 25 years, —, 51,058], [Total property and equipment, gross, , 86,378, 119,451], [Less: accumulated depreciation and amortization, , (49,852), (42,197)], [Total property and equipment, net, , $36,526, $77,254]]",
            }
        ],
        "paragraphs": [
            [
                {
                    "uid": "07e28145-95d5-4f9f-b313-ac8c3b4a869f",
                    "text": "Accounts Receivable",
                    "order": "1",
                },
                {
                    "uid": "b41652f7-0e68-4cf6-9723-fec443b1e604",
                    "text": "The following is a summary of Accounts receivable (in thousands):",
                    "order": "2",
                },
            ]
        ],
        "questions": [
            [
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "Which years does the table provide information for the company's Accounts receivable?",
                    "scale": "",
                    "answer_type": "multi-span",
                    "req_comparison": "false",
                    "order": "1",
                    "uid": "53041a93-1d06-48fd-a478-6f690b8da302",
                    "answer": "[2019, 2018]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the amount of accounts receivable in 2018?",
                    "scale": "thousand",
                    "answer_type": "span",
                    "req_comparison": "false",
                    "order": "2",
                    "uid": "a196a61c-43b0-43f5-bb4b-b059a1103c54",
                    "answer": "[225,167]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the allowance for product returns in 2019?",
                    "scale": "thousand",
                    "answer_type": "span",
                    "req_comparison": "false",
                    "order": "3",
                    "uid": "c8656e5e-2bb7-4f03-ae73-0d04492155c0",
                    "answer": "[(25,897)]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "How many years did the net accounts receivable exceed $200,000 thousand?",
                    "scale": "",
                    "answer_type": "count",
                    "req_comparison": "false",
                    "order": "4",
                    "uid": "fdf08d3d-d570-4c21-9b3e-a3c86e164665",
                    "answer": "1",
                    "derivation": "2018",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the change in the Allowance for doubtful accounts between 2018 and 2019?",
                    "scale": "thousand",
                    "answer_type": "arithmetic",
                    "req_comparison": "false",
                    "order": "5",
                    "uid": "6ecb2062-daca-4e1e-900e-2b99b2fce929",
                    "answer": "424",
                    "derivation": "-1,054-(-1,478)",
                },
                {
                    "rel_paragraphs": "[]",
                    "answer_from": "table",
                    "question": "What was the percentage change in the Allowance for product returns between 2018 and 2019?",
                    "scale": "percent",
                    "answer_type": "arithmetic",
                    "req_comparison": "false",
                    "order": "6",
                    "uid": "f2c1edad-622d-4959-8cd5-a7f2bd2d7bb1",
                    "answer": "129.87",
                    "derivation": "(-25,897+11,266)/-11,266",
                },
            ]
        ],
    }
)

Here is one way to do it, using Python's built-in isinstance function and the "walrus" operator (:=), together with pandas explode, json_normalize, and concat:

for col in df.columns:
    # Deal with columns containing lists of json
    if df[col].apply(lambda x: isinstance(x, list)).all():
        df = df.explode(col, ignore_index=True)
    # Deal with json
    if not (new_cols := pd.json_normalize(df[col])).empty:
        df = pd.concat([df.drop(columns=col), new_cols], axis=1).drop(columns="uid")

Then:

print(df)
# Output

                       table                      text order rel_paragraphs   
0   [[, , December 31,,],...       Accounts Receivable     1            [2]  \
1   [[, , December 31,,],...       Accounts Receivable     1            [2]   
2   [[, , December 31,,],...       Accounts Receivable     1            [2]   
3   [[, , December 31,,],...       Accounts Receivable     1            [2]   
4   [[, , December 31,,],...       Accounts Receivable     1            [2]   
5   [[, , December 31,,],...       Accounts Receivable     1             []   
6   [[, , December 31,,],...  The following is a su...     2            [2]   
7   [[, , December 31,,],...  The following is a su...     2            [2]   
8   [[, , December 31,,],...  The following is a su...     2            [2]   
9   [[, , December 31,,],...  The following is a su...     2            [2]   
10  [[, , December 31,,],...  The following is a su...     2            [2]   
11  [[, , December 31,,],...  The following is a su...     2             []   

   answer_from                  question     scale answer_type req_comparison   
0   table-text  Which years does the ...            multi-span          false  \
1   table-text  What was the amount o...  thousand        span          false   
2   table-text  What was the allowanc...  thousand        span          false   
3   table-text  How many years did th...                 count          false   
4   table-text  What was the change i...  thousand  arithmetic          false   
5        table  What was the percenta...   percent  arithmetic          false   
6   table-text  Which years does the ...            multi-span          false   
7   table-text  What was the amount o...  thousand        span          false   
8   table-text  What was the allowanc...  thousand        span          false   
9   table-text  How many years did th...                 count          false   
10  table-text  What was the change i...  thousand  arithmetic          false   
11       table  What was the percenta...   percent  arithmetic          false   

   order        answer                derivation  
0      1  [2019, 2018]
1      2     [225,167]
2      3    [(25,897)]
3      4             1                      2018  
4      5           424           -1,054-(-1,478)  
5      6        129.87  (-25,897+11,266)/-11,266  
6      1  [2019, 2018]
7      2     [225,167]
8      3    [(25,897)]
9      4             1                      2018  
10     5           424           -1,054-(-1,478)  
11     6        129.87  (-25,897+11,266)/-11,266
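
Note that the output above has 12 rows because both list columns are exploded, so each of the 2 paragraphs is paired with each of the 6 questions. If one row per question is enough, a possible variation is to explode only the questions column. The following sketch uses a made-up, trimmed record (2 paragraphs, 3 questions); the names per_question and both are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical trimmed record: 2 paragraphs and 3 questions
df = pd.DataFrame(
    {
        "paragraphs": [
            [
                {"text": "Accounts Receivable", "order": "1"},
                {"text": "Summary of accounts receivable", "order": "2"},
            ]
        ],
        "questions": [
            [
                {"question": "q1", "answer": "a1", "order": "1"},
                {"question": "q2", "answer": "a2", "order": "2"},
                {"question": "q3", "answer": "a3", "order": "3"},
            ]
        ],
    }
)

# Explode only the questions column: one row per question dict
per_question = df.explode("questions", ignore_index=True)
# Expand each question dict into its own columns
question_cols = pd.json_normalize(per_question["questions"].tolist())
per_question = pd.concat(
    [per_question.drop(columns="questions"), question_cols], axis=1
)
print(per_question.shape)  # (3, 4)

# For comparison, exploding both list columns pairs every paragraph with every question
both = df.explode("paragraphs", ignore_index=True).explode(
    "questions", ignore_index=True
)
print(len(both))  # 6
```

In this variation the paragraphs column still holds the full list of paragraph dicts on every row, which may or may not be what you want.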
