How to properly import this type of XML into a dataframe?

Question

On this page (http://www.t3db.ca/downloads) you can download the 'All Toxin Records (with Toxin-Target Mechanisms of Action and References)' file to see its content.

I want to load the data in this file into a dataframe in Python using pandas. The problem is that the file is apparently not valid XML: multiple lines start with <?xml version="1.0" encoding="UTF-8"?>, and this is not allowed.
I tried removing the duplicates, but it still doesn't work.

Can you please tell me what corrections I should make so that it works with the pandas.read_xml() method?

Answer 1

Score: 1

It looks like this one .xml file might contain multiple .xml files. My suggestion would be to split each of these out into its own file and then parse them separately. You could do this programmatically by splitting the file at every <?xml version="1.0" encoding="UTF-8"?> and then writing every chunk to its own file.
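
The splitting step described above can be sketched with the standard library alone. This is a minimal illustration, not a tested solution for the real download: the helper name `split_xml_documents` and the two-record `sample` string are hypothetical, and the sketch assumes each chunk between declarations is a well-formed document with a single root element.

```python
import re
import xml.etree.ElementTree as ET

# Matches any XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(r'<\?xml[^>]*\?>')


def split_xml_documents(text):
    """Split a string of concatenated XML documents at each XML declaration."""
    chunks = XML_DECL.split(text)
    # Drop whitespace-only chunks, such as the empty one before the first declaration.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Hypothetical two-document sample standing in for the downloaded file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<compound><common_name>Arsenic</common_name></compound>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<compound><common_name>Lead</common_name></compound>\n'
)

documents = split_xml_documents(sample)
# Each chunk should now parse on its own.
names = [ET.fromstring(doc).findtext('common_name') for doc in documents]
print(names)
```

If a chunk fails to parse, `ET.fromstring` raises `ParseError`, which is a quick way to find out whether the declarations were the only problem in the file.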

Answer 2

Score: 1


It looks like your XML results from the concatenation of multiple XML files.

Here is how I would process it.

import pandas as pd
import xmlplain

XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'


# read and clean the input file
print('splitting input file', end='...')
with open('toxins.xml', 'r', encoding='utf-8') as f:
    text = f.read()
# it looks like multiple XML files were concatenated in a single file
# let's undo this by splitting at every XML declaration
blocks = [block for block in text.split(XML_DECLARATION) if block.strip()]  # drop empty chunks
print('OK')


# split in separate files
print('writing each entry in a separate file', end='...')
for n, block in enumerate(blocks):
    with open(f'block_{n}.xml', 'w', encoding='utf-8') as f:
        f.write(XML_DECLARATION + '\n' + block)
print(f'OK. Written {len(blocks)} files')


# make a DataFrame out of all the XML files
print('making the DataFrame', end='...')
frames = []
for file_number in range(len(blocks)):  # range(n) would skip the last file
    with open(f'block_{file_number}.xml', 'r', encoding='utf-8') as f:
        js = xmlplain.xml_to_obj(f, strip_space=True, fold_dict=True)
        frames.append(pd.json_normalize(js))
df = pd.concat(frames)
print('OK')


display(df)  # works in a Jupyter notebook; use print(df) in a plain script

You can avoid writing back the single files if you don't need them. I do prefer having a separate file for each entry.
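
As a rough sketch of the variant that skips the intermediate files, the declarations can be removed and the remaining records wrapped in a single root element, which gives one well-formed document to parse in memory. The helper names, the `records` root tag and the `sample` data are all hypothetical; the `compound`, `common_name` and `description` element names are taken from the other answer, and only the text of direct children is collected here, so nested elements would come out as empty strings.

```python
import re
import xml.etree.ElementTree as ET

# Matches any XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(r'<\?xml[^>]*\?>')


def wrap_in_single_root(text, root_tag='records'):
    """Remove every XML declaration and wrap what is left in one root element."""
    body = XML_DECL.sub('', text)
    return f'<{root_tag}>{body}</{root_tag}>'


def compounds_to_rows(wrapped_xml, record_tag='compound'):
    """Turn each record element into a dict of its direct children's text."""
    root = ET.fromstring(wrapped_xml)
    rows = []
    for record in root.iter(record_tag):
        rows.append({child.tag: (child.text or '').strip() for child in record})
    return rows


# Hypothetical sample standing in for the concatenated download.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<compound><common_name>Arsenic</common_name>'
    '<description>A metalloid.</description></compound>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<compound><common_name>Lead</common_name>'
    '<description>A heavy metal.</description></compound>\n'
)

rows = compounds_to_rows(wrap_in_single_root(sample))
print(rows)
```

The list of dicts can be passed straight to `pd.DataFrame(rows)`. The wrapped string could presumably also be given to `pandas.read_xml` through `io.StringIO` with a suitable `xpath`, but that has not been tried against the real file.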

Answer 3

Score: 0

Here is an example of how you can parse this file using beautifulsoup and create a sample dataframe:

import warnings
import pandas as pd
from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning

# silence the warning bs4 emits when XML is fed to an HTML parser
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)

with open('toxins.xml', 'r') as f_in:
    xml_text = f_in.read()

# the lenient HTML parser does not choke on the repeated XML declarations
soup = BeautifulSoup(xml_text, 'html.parser')

# collect one (name, description, categories) tuple per <compound>
out = []
for c in soup.select('compound'):
    common_name = c.find('common_name').text
    desc = c.find('description').text
    cats = [cat.text for cat in c.select('categories > category')]
    out.append((common_name, desc, cats))

df = pd.DataFrame(out, columns=['Common Name', 'Description', 'Categories'])
print(df.head(10))

Prints:

               Common Name  Description  Categories
0                  Arsenic  Arsenic(As) is a ubiquitous metalloid found in several forms in food and the environment, such as the soil, air and water. Physiologically, it exists as an ion in the body. The predominant form is inorganic arsenic in drinking water, which is both highly toxic and carcinogenic and rapidly bioavailable. Arsenic is currently one of the most important environmental global contaminants and toxicants, particularly in the developing countries. For decades, very large populations have been and are currently still exposed to inorganic Arsenic through geogenically contaminated drinking water. An increased incidence of disease mediated by this toxicant is the consequence of long-term exposure. In human's chronic ingestion of inorganic arsenic (> 500 mg/L As) has been associated with cardiovascular, nervous, hepatic and renal diseases and diabetes mellitus as well as cancer of the skin, bladder, lung, liver and prostate. Contrary to the earlier view that methylated compounds are innocuous, the methylated metabolites are now recognized to be both toxic and carcinogenic, possibly due to genotoxicity, inhibition of antioxidative enzyme functions, or other mechanisms. Arsenic inhibits indirectly sulfhydryl containing enzymes and interferes with cellular metabolism. Effects involve such phenomena as cytotoxicity, genotoxicity and inhibition of enzymes with antioxidant function. These are all related to nutritional factors directly or indirectly. Nutritional studies both in experimental and epidemiological studies provide convincing evidence that nutritional intervention, including chemoprevention, offers a pragmatic approach to mitigate the health effects of arsenic exposure, particularly cancer, in the relatively resource-poor developing countries. 
Nutritional intervention, especially with micronutrients, many of which are antioxidants and share the same pathway with Arsenic , appears a host defence against the health effects of arsenic contamination in developing countries and should be embraced as it is pragmatic and inexpensive. (A7664, A7665).                                [Cigarette Toxin, Pesticide, Household Toxin, Pollutant, Airborne Pollutant, Food Toxin, Natural Toxin]
1                     Lead  Lead is a soft and malleable heavy and post-transition metal. Metallic lead has a bluish-white color after being freshly cut, but it soon tarnishes to a dull grayish color when exposed to air. It is the heaviest non-radioactive elemen and has the highest atomic number of all of the stable elements. Lead is used in building construction, lead-acid batteries, bullets and shot, weights, as part of solders, pewters, fusible alloys, and as a radiation shield. It readily forms many lead salts and organo-lead compounds. Lead is one of the oldest known and most widely studied occupational and environmental toxins. Despite intensive study, there is still vigorous debate about the toxic effects of lead, both from low level exposure in the general population owing to environmental pollution and historic use of lead in paint and plumbing and from exposure in the occupational setting. The majority of industries historically associated with high lead exposure have made dramatic advances in their control of occupational exposure. However, cases of unacceptably high exposure and even of frank lead poisoning are still seen, predominantly in the demolition and tank cleaning industries. Nevertheless, in most industries blood lead levels have declined below levels at which signs or symptoms are seen and the current focus of attention is on the subclinical effects of exposure. The significance of some of these effects for the overt health of the workers is often the subject of debate. Inevitably there is pressure to reduce lead exposure in the general population and in working environments, but any legislation must be based on a genuine scientific evaluation of the available evidence. Physiologically, it exists as an ion in the body. Inorganic lead is undoubtedly one of the oldest occupational toxins and evidence of lead poisoning can be found dating back to Roman times. 
As industrial lead production started at least 5000 years ago, it is likely that outbreaks of lead poisoning occurred from this time. These episodes of poisoning were not limited to lead workers. The general population could be significantly exposed owing to poorly glazed ceramic ware, the use of lead solder in the food canning industry, high levels of lead in drinking water, the use of lead compounds in paint and cosmetics and by deposition on crops and dust from industrial and motor vehicle sources. It was an important cause of morbidity and mortality during the Industrial Revolution and effective formal control of lead workers did not occur until the pioneering occupational health work of Ronald Lane in 1949. At very high blood lead levels, lead is a powerful abortifacient. At lower levels, it has been associated with miscarriages and low birth weights of infants. Predominantly to protect the developing fetus, legislation for lead workers often includes lower exposure criteria for women of reproductive capacity. Studies have shown a slowing of sensory motor reaction time in male lead workers and some disturbance of cognitive function in workers with blood lead levels >40 ug/100 ml. Peripheral motor neuropathy is seen as a result of chronic high-level lead exposure, but there is conflicting, although on the whole convincing, evidence of a reduction in peripheral nerve conduction velocity at lower blood lead levels. The threshold has been suggested to be as low as 30 ug/100 ml, although other studies have not seen effects below a blood lead level of 70 ug/100 ml. Several large epidemiological studies of lead workers have found inconclusive evidence of an association between lead exposure and the incidence of cancer. However, based on closer analysis, the increase did not appear to be related to lead exposure. 
There was also a small but significant increase in the incidence of lung cancer, but this could have been the result of confounding from cigarette smoking or concurrent arsenic exposure. There is some evidence in humans that there is an association between low-level lead exposure and blood pressure, but the results are inconsistent. Lead appears to reduce the resistance and increase the mortality of experimental animals. It apparently impairs antibody production and decreases immunoglobulin plaque forming cells. There is some evidence for suggesting that workers with blood lead levels between 20 and 85 ug/100 ml may have an increased susceptibility to colds, but a study of lead workers with blood lead levels less than 50 ug/100 ml showed no significant immunological changes. Although it is widely accepted that personal hygiene is the most important determinant of an individual's blood lead level, recent interesting information has shown that certain genetic polymorphisms may also have an impact. The use of most of lead containing chemicals is declining with the gradual demise of the

<details>
<summary>English:</summary>

Here is an example of how you can parse this file using `beautifulsoup` and create a sample dataframe:

```py
import warnings
import pandas as pd
from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning

# silence the warning bs4 emits when XML is fed to an HTML parser
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)

with open('toxins.xml', 'r') as f_in:
    xml_text = f_in.read()

# the lenient HTML parser does not choke on the repeated XML declarations
soup = BeautifulSoup(xml_text, 'html.parser')

# collect one (name, description, categories) tuple per <compound>
out = []
for c in soup.select('compound'):
    common_name = c.find('common_name').text
    desc = c.find('description').text
    cats = [cat.text for cat in c.select('categories > category')]
    out.append((common_name, desc, cats))

df = pd.DataFrame(out, columns=['Common Name', 'Description', 'Categories'])
print(df.head(10))

Prints:

               Common Name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Description                                                                                                                             Categories
0                  Arsenic                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Arsenic(As) is a ubiquitous metalloid found in several forms in food and the environment, such as the soil, air and water. Physiologically, it exists as an ion in the body. The predominant form is inorganic arsenic in drinking water, which is both highly toxic and carcinogenic and rapidly bioavailable. Arsenic is currently one of the most important environmental global contaminants and toxicants, particularly in the developing countries. For decades, very large populations have been and are currently still exposed to inorganic Arsenic through geogenically contaminated drinking water. 
An increased incidence of disease mediated by this toxicant is the consequence of long-term exposure. In human's chronic ingestion of inorganic arsenic (> 500 mg/L As) has been associated with cardiovascular, nervous, hepatic and renal diseases and diabetes mellitus as well as cancer of the skin, bladder, lung, liver and prostate. Contrary to the earlier view that methylated compounds are innocuous, the methylated metabolites are now recognized to be both toxic and carcinogenic, possibly due to genotoxicity, inhibition of antioxidative enzyme functions, or other mechanisms. Arsenic inhibits indirectly sulfhydryl containing enzymes and interferes with cellular metabolism. Effects involve such phenomena as cytotoxicity, genotoxicity and inhibition of enzymes with antioxidant function. These are all related to nutritional factors directly or indirectly. Nutritional studies both in experimental and epidemiological studies provide convincing evidence that nutritional intervention, including chemoprevention, offers a pragmatic approach to mitigate the health effects of arsenic exposure, particularly cancer, in the relatively resource-poor developing countries. Nutritional intervention, especially with micronutrients, many of which are antioxidants and share the same pathway with Arsenic , appears a host defence against the health effects of arsenic contamination in developing countries and should be embraced as it is pragmatic and inexpensive. (A7664, A7665).                                [Cigarette Toxin, Pesticide, Household Toxin, Pollutant, Airborne Pollutant, Food Toxin, Natural Toxin]
1                     Lead  Lead is a soft and malleable heavy and post-transition metal. Metallic lead has a bluish-white color after being freshly cut, but it soon tarnishes to a dull grayish color when exposed to air. It is the heaviest non-radioactive elemen and has the highest atomic number of all of the stable elements. Lead is used in building construction, lead-acid batteries, bullets and shot, weights, as part of solders, pewters, fusible alloys, and as a radiation shield. It readily forms many lead salts and organo-lead compounds. Lead is one of the oldest known and most widely studied occupational and environmental toxins. Despite intensive study, there is still vigorous debate about the toxic effects of lead, both from low level exposure in the general population owing to environmental pollution and historic use of lead in paint and plumbing and from exposure in the occupational setting. The majority of industries historically associated with high lead exposure have made dramatic advances in their control of occupational exposure. However, cases of unacceptably high exposure and even of frank lead poisoning are still seen, predominantly in the demolition and tank cleaning industries. Nevertheless, in most industries blood lead levels have declined below levels at which signs or symptoms are seen and the current focus of attention is on the subclinical effects of exposure. The significance of some of these effects for the overt health of the workers is often the subject of debate. Inevitably there is pressure to reduce lead exposure in the general population and in working environments, but any legislation must be based on a genuine scientific evaluation of the available evidence. Physiologically, it exists as an ion in the body. Inorganic lead is undoubtedly one of the oldest occupational toxins and evidence of lead poisoning can be found dating back to Roman times. 
As industrial lead production started at least 5000 years ago, it is likely that outbreaks of lead poisoning occurred from this time. These episodes of poisoning were not limited to lead workers. The general population could be significantly exposed owing to poorly glazed ceramic ware, the use of lead solder in the food canning industry, high levels of lead in drinking water, the use of lead compounds in paint and cosmetics and by deposition on crops and dust from industrial and motor vehicle sources. It was an important cause of morbidity and mortality during the Industrial Revolution and effective formal control of lead workers did not occur until the pioneering occupational health work of Ronald Lane in 1949. At very high blood lead levels, lead is a powerful abortifacient. At lower levels, it has been associated with miscarriages and low birth weights of infants. Predominantly to protect the developing fetus, legislation for lead workers often includes lower exposure criteria for women of reproductive capacity. Studies have shown a slowing of sensory motor reaction time in male lead workers and some disturbance of cognitive function in workers with blood lead levels >40 ug/100 ml. Peripheral motor neuropathy is seen as a result of chronic high-level lead exposure, but there is conflicting, although on the whole convincing, evidence of a reduction in peripheral nerve conduction velocity at lower blood lead levels. The threshold has been suggested to be as low as 30 ug/100 ml, although other studies have not seen effects below a blood lead level of 70 ug/100 ml. Several large epidemiological studies of lead workers have found inconclusive evidence of an association between lead exposure and the incidence of cancer. However, based on closer analysis, the increase did not appear to be related to lead exposure. 
There was also a small but significant increase in the incidence of lung cancer, but this could have been the result of confounding from cigarette smoking or concurrent arsenic exposure. There is some evidence in humans that there is an association between low-level lead exposure and blood pressure, but the results are inconsistent. Lead appears to reduce the resistance and increase the mortality of experimental animals. It apparently impairs antibody production and decreases immunoglobulin plaque forming cells. There is some evidence for suggesting that workers with blood lead levels between 20 and 85 ug/100 ml may have an increased susceptibility to colds, but a study of lead workers with blood lead levels less than 50 ug/100 ml showed no significant immunological changes. Although it is widely accepted that personal hygiene is the most important determinant of an individual's blood lead level, recent interesting information has shown that certain genetic polymorphisms may also have an impact. The use of most of lead containing chemicals is declining with the gradual demise of the use of lead in gasoline (petrol), but lead naphthenates and lead stearates are still used in stabilizers for plastics and as lead 'soaps'. In fact, the only compound now produced for gasoline/fuel usage is tetraethyl lead. Exposure is only seen during the production, transportation and blending of this substance into gasoline/fuel/petrol and in workers involved in cleaning storage tanks that have contained leaded gasoline (or petrol). It is in this final group, the tank cleaners, where the highest potential morbidity and mortality may be seen. (A7666).               [Cigarette Toxin, Household Toxin, Industrial/Workplace Toxin, Pollutant, Airborne Pollutant, Food Toxin, Natural Toxin]
2                  Mercury                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Mercury is a metal that is a liquid at room temperature. Mercury has a long and interesting history deriving from its use in medicine and industry, with the resultant toxicity produced. In high enough doses, all forms of mercury can produce toxicity. 
The most devastating tragedies related to mercury toxicity in recent history include Minamata Bay and Niagata, Japan in the 1950s, and Iraq in the 1970s. More recent mercury toxicity issues include the extreme toxicity of the dimethylmercury compound noted in 1998, the possible toxicity related to dental amalgams, and the disproved relationship between vaccines and autism related to the presence of the mercury-containing preservative, thimerosal. Hair has been used in many studies as a bioindicator of mercury exposure for human populations. At the time of hair formation, mercury from the blood capillaries penetrates into the hair follicles. As hair grows approximately 1 cm each month, mercury exposure over time is recapitulated in hair strands. Mercury levels in hair closest to the scalp reflect the most recent exposure, while those farthest from the scalp are representative of previous blood concentrations. Sequential analyses of hair mercury have been useful for identifying seasonal variations over time in hair mercury content, which may be the result of seasonal differences in bioavailability of fish and differential consumption of piscivorous and herbivorous fish species. Knowledge of the relation between fish-eating practices and hair mercury levels is particularly important for adequate mitigation strategies. Physiologically, it exists as an ion in the body. Methyl mercury is well absorbed, and because the biological half-life is long, the body burden in humans may reach high levels. People who frequently eat contaminated seafood can acquire mercury concentrations that are potentially dangerous to the fetus in pregnant women. The dose-response relationships have been extensively studied, and the safe levels of exposure have tended to decline. Individual methyl mercury exposure is usually determined by analysis of mercury in blood and hair. 
Whilst the clinical features of acute mercury poisoning have been well described, chronic low dose exposure to mercury remains poorly characterised and its potential role in various chronic disease states remains controversial. Low molecular weight thiols, i.e. sulfhydryl containing molecules such as cysteine, are emerging as important factors in the transport and distribution of mercury throughout the body due to the phenomenon of Molecular Mimicry and its role in the molecular transport of mercury. Chelation agents such as the dithiols sodium 2,3-dimercaptopropanesulfate (DMPS) and meso-2,3-dimercaptosuccinic acid (DMSA) are the treatments of choice for mercury toxicity. Alpha-lipoic acid (ALA), a disulfide, and its metabolite dihydrolipoic acid (DHLA), a dithiol, have also been shown to have chelation properties when used in an appropriate manner. Whilst N-acetyl-cysteine (NAC) and glutathione (GSH) have been recommended in the treatment of mercury toxicity in the past, an examination of available evidence suggests these agents may in fact be counterproductive. Zinc and selenium have also been shown to exert protective effects against mercury toxicity, most likely mediated by induction of the metal binding proteins metallothionein and selenoprotein-P. Evidence suggests however that the co-administration of selenium and dithiol chelation agents during treatment may also be counter-productive. Finally, the issue of diagnostic testing for chronic, historical or low dose mercury poisoning is considered including an analysis of the influence of ligand interactions and nutritional factors upon the accuracy of chelation challenge tests. (A7, A7667, A7668).  [Household Toxin, Industrial/Workplace Toxin, Pollutant, Airborne Pollutant, Food Toxin, Natural Toxin]
3  Vinyl chloride  Vinyl chloride is a man-made organic compound, formed when other substances such as trichloroethane, trichloroethylene, and tetrachloroethylene are broken down. In its monomer form it is acutely hazardous, thus it is primarily used for the production of polymers. At room temperature it is a flammable, colorless gas with a sweet odor, but it is easily condensed and usually stored as a liquid. It is one ingredient of cigarette.(L3)  [Cigarette Toxin, Household Toxin, Industrial/Workplace Toxin, Pollutant, Airborne Pollutant, Synthetic Toxin]
...

huangapple
  • Published on July 6, 2023 at 21:31:02
  • When reposting, please keep this link: https://go.coder-hub.com/76629389.html