Python parse then put in a dataframe

Question

I have a file with data like this:

    ------------------------------
    ------------------------------
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    ------
    ++++++
    %%RequestHandler
    DATA1 = 123456
    ERROR1 = 500
    DATA2 = 56789
    ERROR2 = 505
    Count = 4
    ---

I would like to create a dataframe like this:

    DATA1   ERROR1
    123456  500
    56789   505

Answer 1

Score: 2

Here is the code you want. You can use regular expressions to extract the desired data from the raw structured text file:

    import re
    import pandas as pd

    # Read the file
    with open("file.txt", "r") as file:
        content = file.read()

    # Use regular expressions to extract the values
    data = re.findall(r"DATA\d+\s*=\s*(\d+)", content)
    error = re.findall(r"ERROR\d+\s*=\s*(\d+)", content)

    # Create a dataframe
    df = pd.DataFrame({"DATA": data, "ERROR": error})
    print(df)

Example:

    import re
    import pandas as pd

    content = '''
    ------------------------------
    ------------------------------
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    ------
    ++++++
    %%RequestHandler
    DATA1 = 123456
    ERROR1 = 500
    DATA2 = 56789
    ERROR2 = 505
    Count = 4
    ---
    '''

    data = re.findall(r"DATA\d+\s*=\s*(\d+)", content)
    error = re.findall(r"ERROR\d+\s*=\s*(\d+)", content)
    df = pd.DataFrame({"DATA": data, "ERROR": error})
    print(df)

Output:

         DATA ERROR
    0  123456   500
    1   56789   505
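
One thing to keep in mind with the approach above: `re.findall` returns strings, and the two lists are matched up purely by position. The following is only a sketch of one possible variation, assuming every value is an integer and that each `DATAn` is meant to pair with the `ERRORn` that shares its suffix. The names `pattern` and `rows` are illustrative and not part of the original answer.

```python
import re
import pandas as pd

# Shortened sample input (assumption: same key/value layout as the question).
content = """%%RequestHandler
DATA1 = 123456
ERROR1 = 500
DATA2 = 56789
ERROR2 = 505
Count = 4
"""

# Capture the key name, its numeric suffix, and the value in a single pass.
pattern = re.compile(r"^\s*(DATA|ERROR)(\d+)\s*=\s*(\d+)", flags=re.M)

# Group values by suffix so DATA1 stays with ERROR1, DATA2 with ERROR2, etc.
rows = {}
for name, suffix, value in pattern.findall(content):
    rows.setdefault(int(suffix), {})[name] = int(value)

# One row per suffix, with a fixed column order.
df = (
    pd.DataFrame.from_dict(rows, orient="index")
    .sort_index()
    .reindex(columns=["DATA", "ERROR"])
)
print(df)
```

Here the index is the numeric suffix (1, 2) rather than a default 0-based range, and both columns hold integers, so they can be used in arithmetic or comparisons directly.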

Answer 2

Score: 2

Another regex approach with pivot:

    import re
    import pandas as pd

    # `text` is the input shown below (or use file.read())
    # \s* allows optional leading whitespace before each key
    out = (pd.DataFrame(re.findall(r'^\s*(\w+)(\d+) = (\d+)', text, flags=re.M))
             .pivot(index=1, columns=0, values=2)
             .rename_axis(index=None, columns=None)
          )
    print(out)

Output:

         DATA ERROR
    1  123456   500
    2   56789   505

Used input:

    text = '''------------------------------
    ------------------------------
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    <TIME:2020-01-01 01:25:10>
    ------
    ++++++
    %%RequestHandler
    DATA1 = 123456
    ERROR1 = 500
    DATA2 = 56789
    ERROR2 = 505
    Count = 4'''
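
To make the pivot step easier to follow, it can help to name the intermediate columns instead of relying on the positional labels 0, 1 and 2. The sketch below is an illustration under assumptions, not the answerer's code: the column names `name`, `n` and `value` are made up for readability, the key pattern is narrowed to letters followed by digits, and the values are assumed to be integers.

```python
import re
import pandas as pd

# Shortened sample input (assumption: same key/value layout as above).
text = """DATA1 = 123456
ERROR1 = 500
DATA2 = 56789
ERROR2 = 505
Count = 4"""

# Each match is a (key name, numeric suffix, value) tuple;
# "Count = 4" is skipped because its key has no numeric suffix.
matches = re.findall(r"^\s*([A-Za-z]+)(\d+) = (\d+)", text, flags=re.M)

# Long format: one row per key/value line.
long_df = pd.DataFrame(matches, columns=["name", "n", "value"])

# Wide format: one row per suffix, one column per key name.
wide_df = (
    long_df.astype({"n": int, "value": int})
    .pivot(index="n", columns="name", values="value")
    .rename_axis(index=None, columns=None)
)
print(long_df)
print(wide_df)
```

Printing `long_df` before `wide_df` shows the four extracted rows that `pivot` then spreads into the `DATA` and `ERROR` columns.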


huangapple
  • Posted on July 12, 2023 at 20:31:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76670562.html