Python parse then put in a dataframe
Question
I have a file with data like this:
------------------------------
------------------------------
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
------
++++++
%%RequestHandler
DATA1 = 123456
ERROR1 = 500
DATA2 = 56789
ERROR2 = 505
Count = 4
---
I would like to create a dataframe like this:
DATA1 | ERROR1 |
---|---|
123456 | 500 |
56789 | 505 |
Answer 1
Score: 2
You can use regular expressions to extract the desired values from the raw structured text file:
import re
import pandas as pd
# Read the file
with open("file.txt", "r") as file:
    content = file.read()
# Use regular expressions to extract the values
data = re.findall(r"DATA\d+\s*=\s*(\d+)", content)
error = re.findall(r"ERROR\d+\s*=\s*(\d+)", content)
# Create a dataframe
df = pd.DataFrame({"DATA": data, "ERROR": error})
print(df)
Example:
import re
import pandas as pd
content = '''
------------------------------
------------------------------
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
------
++++++
%%RequestHandler
DATA1 = 123456
ERROR1 = 500
DATA2 = 56789
ERROR2 = 505
Count = 4
---
'''
data = re.findall(r"DATA\d+\s*=\s*(\d+)", content)
error = re.findall(r"ERROR\d+\s*=\s*(\d+)", content)
df = pd.DataFrame({"DATA": data, "ERROR": error})
print(df)
Output:
     DATA ERROR
0  123456   500
1   56789   505
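Two things to keep in mind with the approach above: `re.findall` returns strings, and the two lists are paired purely by position. Below is a minimal sketch of one way to guard against that, assuming you want integer columns and would rather fail loudly when the counts differ. The helper names `extract_columns` and `to_dataframe` and the `SAMPLE` string are made up for illustration, they are not part of the original answer.

```python
import re

# Shortened stand-in for the file contents shown in the question.
SAMPLE = """\
%%RequestHandler
DATA1 = 123456
ERROR1 = 500
DATA2 = 56789
ERROR2 = 505
Count = 4
"""


def extract_columns(content):
    """Return {"DATA": [...], "ERROR": [...]} with integer values.

    Hypothetical helper: raises ValueError when the two lists differ in
    length, because pairing them by position would then be unreliable.
    """
    data = [int(v) for v in re.findall(r"DATA\d+\s*=\s*(\d+)", content)]
    error = [int(v) for v in re.findall(r"ERROR\d+\s*=\s*(\d+)", content)]
    if len(data) != len(error):
        raise ValueError(
            f"found {len(data)} DATA values but {len(error)} ERROR values"
        )
    return {"DATA": data, "ERROR": error}


def to_dataframe(content):
    """Build the dataframe from the extracted columns."""
    # Imported here so extract_columns stays usable without pandas installed.
    import pandas as pd

    return pd.DataFrame(extract_columns(content))


print(extract_columns(SAMPLE))
# {'DATA': [123456, 56789], 'ERROR': [500, 505]}
```

Calling `to_dataframe(content)` on the file contents should then give the same two rows as above, with integer rather than string columns.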
Answer 2
Score: 2
Another regex approach, using pivot:
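import re
import pandas as pd
# text holds the file contents (see "Used input"); file.read() works too
# \s* tolerates optional indentation; [A-Za-z]+ keeps multi-digit indices such as DATA10 intact
out = (pd.DataFrame(re.findall(r'^\s*([A-Za-z]+)(\d+)\s*=\s*(\d+)', text, flags=re.M))
         .pivot(index=1, columns=0, values=2)
         .rename_axis(index=None, columns=None)
      )
print(out)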
Output:
     DATA ERROR
1  123456   500
2   56789   505
Used input:
text = '''------------------------------
------------------------------
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
<TIME:2020-01-01 01:25:10>
------
++++++
%%RequestHandler
DATA1 = 123456
ERROR1 = 500
DATA2 = 56789
ERROR2 = 505
Count = 4'''
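If the pivot step feels opaque, here is a rough plain-Python sketch of the same long-to-wide idea: each match is a (name, index, value) triple, and triples that share an index end up in the same row. The function name `long_to_wide` and the shortened `SAMPLE` string are assumptions for illustration only. Unlike the pandas version, this sketch converts values to integers and simply omits a key when one half of a pair is missing.

```python
import re
from collections import defaultdict

# Shortened stand-in for the input used above.
SAMPLE = """\
<TIME:2020-01-01 01:25:10>
%%RequestHandler
DATA1 = 123456
ERROR1 = 500
DATA2 = 56789
ERROR2 = 505
Count = 4
"""


def long_to_wide(text):
    """Group (name, index, value) triples into one dict per numeric index."""
    triples = re.findall(
        r"^\s*([A-Za-z]+)(\d+)\s*=\s*(\d+)", text, flags=re.M
    )
    rows = defaultdict(dict)
    for name, idx, value in triples:
        # The numeric suffix plays the role of the pivot index,
        # the name plays the role of the pivot column.
        rows[int(idx)][name] = int(value)
    return {idx: rows[idx] for idx in sorted(rows)}


print(long_to_wide(SAMPLE))
# {1: {'DATA': 123456, 'ERROR': 500}, 2: {'DATA': 56789, 'ERROR': 505}}
```

The `Count = 4` line is skipped because the pattern requires a numeric suffix on the name.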