How to get values from a JSON file efficiently with Python?

Question

I'm trying to retrieve values from different layers of a JSON file. Right now I'm doing it in a rather clumsy way: nested for loops that pull the values out of one dictionary inside another. I want to get every "title" and "question" and put them in a list or a pandas DataFrame. How can I retrieve these values in a simpler way? And how should JSON files be handled efficiently in general?
Thanks a lot to anyone who answers the question :)

Here's a piece of the JSON:

{
    "contact": "xxx",
    "version": 1.0,
    "data": [
        {
            "title": "anges-musiciens-(national-gallery)",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "answers": [{
                                    "text": "La Vierge aux rochers"
                                }
                            ],
                            
                            "question": "Que concerne principalement les documents ?"
                        }
                    ]
                }
            ]
        }
    ]
}

Here's my current code:

titles = []
questions = []

for i in data["data"]:
    titles.append(i["title"])

    for p in i["paragraphs"]:
        for q in p["qas"]:
            questions.append(q["question"])
    
print(titles)
print(questions)

Answer 1

Score: 0

If the structure is regular (i.e. always the same hierarchy patterns and no missing keys when a dictionary is present), then you can obtain your results with nested list comprehensions:

titles = [d["title"] for d in data["data"]]
questions = [q["question"] for d in data.get("data", [])
                           for p in d.get("paragraphs", [])
                           for q in p.get("qas", [])]
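
Since the question is about a JSON file, here is a minimal end-to-end sketch, assuming the file has the regular structure shown in the question. The function name `load_titles_and_questions` and the file name `sample.json` are made up for illustration; `json.load` is the standard-library call that turns the file into nested dicts and lists.

```python
import json
import os
import tempfile


def load_titles_and_questions(path):
    """Parse the JSON file at `path` and collect all titles and questions."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)  # the whole document becomes dicts and lists
    titles = [d["title"] for d in data["data"]]
    questions = [q["question"]
                 for d in data["data"]
                 for p in d["paragraphs"]
                 for q in p["qas"]]
    return titles, questions


# Demo: write a trimmed copy of the question's sample to a temporary file,
# then read it back. "sample.json" is a hypothetical file name.
sample = {
    "data": [
        {
            "title": "anges-musiciens-(national-gallery)",
            "paragraphs": [
                {"qas": [{"question": "Que concerne principalement les documents ?"}]}
            ],
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh)
    titles, questions = load_titles_and_questions(path)

print(titles)
print(questions)
```

In real use only the function is needed; the temporary file is there so the sketch runs on its own.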

If the structure is not regular, you need to keep track of the entries still to be visited as you go deeper into the structure. You can do this with a list used as a queue:

titles    = []
questions = []
more      = [*data.items()]  # start with key/values of first level dictionary
while more:
    key,value = more.pop(0)            # get next key/value pair to process
    if isinstance(value,list):         # if value is a list
        more.extend(enumerate(value))  # add key/values using indexes as keys
    elif isinstance(value,dict):       # if value is a dictionary
        more.extend(value.items())     # add more key/values from its items
    elif key == "title":               # for "title" key, add to titles list
        titles.append(value)
    elif key == "question":            # same for "question" keys
        questions.append(value)

Output:

print(titles)
['anges-musiciens-(national-gallery)']

print(questions)
['Que concerne principalement les documents ?']
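
One possible refinement, offered as a sketch rather than a fix: `list.pop(0)` shifts every remaining element, so on large files a `collections.deque` is usually a better queue. The function name `collect` and its `wanted` parameter are my own naming, and the sample data is a made-up two-entry structure; the sketch assumes the top level of the JSON is a dictionary.

```python
from collections import deque


def collect(data, wanted=("title", "question")):
    """Breadth-first walk over nested dicts/lists, gathering values of wanted keys."""
    found = {key: [] for key in wanted}
    queue = deque(data.items())  # start with the top-level key/value pairs
    while queue:
        key, value = queue.popleft()  # O(1), unlike list.pop(0)
        if isinstance(value, list):
            queue.extend(enumerate(value))  # list indexes stand in as keys
        elif isinstance(value, dict):
            queue.extend(value.items())
        elif key in found:  # a scalar stored under a key we care about
            found[key].append(value)
    return found


# Hypothetical sample with an irregular second entry (no questions at all).
sample = {
    "data": [
        {"title": "t1",
         "paragraphs": [{"qas": [{"question": "q1"}, {"question": "q2"}]}]},
        {"title": "t2", "paragraphs": []},
    ]
}

result = collect(sample)
print(result["title"])
print(result["question"])
```

Because the walk is breadth-first, values come out level by level, not in document order within each branch.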

Answer 2

Score: 0

If you want a DataFrame instead, pandas.json_normalize can flatten the nested records:

import pandas as pd

data = {
    "contact": "xxx",
    "version": 1.0,
    "data": [
        {
            "title": "anges-musiciens-(national-gallery)",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "answers": [{
                                    "text": "La Vierge aux rochers"
                                }
                            ],
                            
                            "question": "Que concerne principalement les documents ?"
                        }
                    ]
                }
            ]
        }
    ]
}

df = pd.json_normalize(data['data'], ['paragraphs', 'qas'], 'title')[['title', 'question']]
print(df)

                                title                                     question
0  anges-musiciens-(national-gallery)  Que concerne principalement les documents ?
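
For readers without pandas at hand, here is a rough plain-Python sketch of the rows the call above produces: one record per question, each carrying the title of its parent entry. The helper name `flatten_rows` and the sample values are made up for illustration, and the sketch assumes every entry has "title", "paragraphs" and "qas" keys.

```python
def flatten_rows(data):
    """Return one {"title", "question"} record per question."""
    return [
        {"title": entry["title"], "question": qa["question"]}
        for entry in data["data"]
        for paragraph in entry["paragraphs"]
        for qa in paragraph["qas"]
    ]


# Hypothetical sample: two entries, the first with two questions.
data = {
    "data": [
        {"title": "title-a",
         "paragraphs": [{"qas": [{"question": "question 1"},
                                 {"question": "question 2"}]}]},
        {"title": "title-b",
         "paragraphs": [{"qas": [{"question": "question 3"}]}]},
    ]
}

rows = flatten_rows(data)
for row in rows:
    print(row)
```

A list of records like this can also be handed to `pandas.DataFrame(rows)` later if a DataFrame is wanted after all.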

Answer 3

Score: -1

You can use recursion to perform a depth-first search on the nested structure:

def extract_fields(json_data, fields_of_interest=(), extracted=None):
    """Collect the values of every key listed in fields_of_interest."""
    if extracted is None:
        extracted = {}  # created per call to avoid a shared mutable default
    if isinstance(json_data, dict):
        for field, value in json_data.items():
            if field in fields_of_interest:
                extracted.setdefault(field, []).append(value)
            elif isinstance(value, (dict, list)):  # descend into containers only
                extract_fields(value, fields_of_interest, extracted)
    elif isinstance(json_data, list):
        for x in json_data:
            extract_fields(x, fields_of_interest, extracted)
    return extracted

j = {'title': 'abc',
     'deep': {'question': 'zyx',
              'deeper': [{'title': 'def',
                          'question': 'wvu',
                          'nothing': 'hahaha'},
                         {'even deeper': [{'title': 'ghi',
                                           'question': 'tsr',
                                           'answer': 42},
                                          {'not a title': "ceci n'est pas une pipe"}]}]}}

extracted = extract_fields(j, ('title', 'question'))

print(extracted)
# {'title': ['abc', 'def', 'ghi'], 'question': ['zyx', 'wvu', 'tsr']}
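
A generator-based variation on the same depth-first idea, as a sketch only: it yields (key, value) pairs lazily instead of building a dictionary, which can help when only the first few matches are needed. The name `iter_values` is hypothetical. Note one deliberate difference from the function above: this version keeps descending into a matched value, so a "title" nested inside another "title" would also be found.

```python
def iter_values(node, wanted):
    """Yield (key, value) for every wanted key, depth-first, in document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                yield key, value
            # Recurse even after a match; scalars simply yield nothing.
            yield from iter_values(value, wanted)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item, wanted)


# Same shape as the test data in the answer above.
j = {'title': 'abc',
     'deep': {'question': 'zyx',
              'deeper': [{'title': 'def',
                          'question': 'wvu',
                          'nothing': 'hahaha'},
                         {'even deeper': [{'title': 'ghi',
                                           'question': 'tsr',
                                           'answer': 42},
                                          {'not a title': "ceci n'est pas une pipe"}]}]}}

pairs = list(iter_values(j, {'title', 'question'}))
print(pairs)
```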

huangapple
  • Posted on March 7, 2023, 04:12:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75655424.html