Parsing data with an unusual format in Python
# Question
# Initialize variables to store data
headers = []
values = []
# Flag to identify whether we are reading headers or values
reading_headers = True
# Loop over lines and extract variables of interest
with open('data.txt', 'r') as f:
    for line in f:
        columns = line.split()
        if not columns:
            continue
        # An 'f' line starts the headers; a 'vals' line starts a block of values
        if columns[0] == 'f':
            reading_headers = True
            continue
        if columns[0] == 'vals':
            reading_headers = False
            continue
        if reading_headers:
            headers.extend(columns)
        else:
            values.extend(columns)
# Each 'vals' block holds one (value, error) pair per header
block_size = 2 * len(headers)
for i, header in enumerate(headers):
    row = [header]
    # Pick this header's pair out of every block
    for start in range(0, len(values), block_size):
        row.extend(values[start + 2 * i:start + 2 * i + 2])
    print("\t".join(row))
This code reads the file and separates the headers and values into two lists, using the `f` and `vals` lines as section markers. It then prints one row per header with its value and error from each `vals` block, which gives the five columns requested.
I have an ASCII file with a somewhat odd format: the headers are laid out in rows, followed by the corresponding values and their associated errors, also laid out in rows. A snippet of the file is shown below.
f 7523
10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 10011
10012 10013 10014 10015 10016 10017 10018 10019 10020 10021 10022
vals
0.00000E+00 0.0000 7.65079E-12 0.7071 0.00000E+00 0.0000 3.87977E-14 0.5513
0.00000E+00 0.0000 0.00000E+00 0.0000 0.00000E+00 0.0000 0.00000E+00 0.0000
0.00000E+00 0.0000 1.92698E-14 0.7071 4.47277E-14 0.7071 0.00000E+00 0.0000
0.00000E+00 0.0000 0.00000E+00 0.0000 0.00000E+00 0.0000 1.10023E-11 0.7053
1.74410E-11 0.5005 0.00000E+00 0.0000 1.69181E-13 0.7071 0.00000E+00 0.0000
0.00000E+00 0.0000 0.00000E+00 0.0000
vals
1.34313E-06 0.0104 1.32163E-06 0.0102 1.30039E-06 0.0105 1.37575E-06 0.0106
1.30792E-06 0.0102 1.28351E-06 0.0104 1.32164E-06 0.0102 1.32969E-06 0.0104
1.31707E-06 0.0104 1.27281E-06 0.0103 1.28429E-06 0.0106 1.27027E-06 0.0105
1.29623E-06 0.0105 1.32037E-06 0.0101 1.28948E-06 0.0105 1.33163E-06 0.0106
1.36073E-06 0.0102 1.35462E-06 0.0102 1.38641E-06 0.0102 1.33099E-06 0.0102
1.35307E-06 0.0100 1.33882E-06 0.0105
I want to parse this file so that the data is organized into five separate columns, something like:
10001 0.00000E+00 0.0000 1.34313E-06 0.0104
10002 7.65079E-12 0.7071 1.32163E-06 0.0102
10003 0.00000E+00 0.0000 1.30039E-06 0.0105
....
10022 0.00000E+00 0.0000 1.33882E-06 0.0105
I am unsure how to cycle through this data to produce the above format, beyond this simple start:
# Open file
f = open('data.txt', 'r')
# Loop over lines and extract variables of interest
for line in f:
    line = line.strip()
    columns = line.split()
# Answer 1
**Score**: 1
You can try to use `re` to parse the file into a dataframe:
```py
import re
import pandas as pd

text = """\
f 7523
10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 10011
10012 10013 10014 10015 10016 10017 10018 10019 10020 10021 10022
vals
0.00000E+00 0.0000 7.65079E-12 0.7071 0.00000E+00 0.0000 3.87977E-14 0.5513
0.00000E+00 0.0000 0.00000E+00 0.0000 0.00000E+00 0.0000 0.00000E+00 0.0000
0.00000E+00 0.0000 1.92698E-14 0.7071 4.47277E-14 0.7071 0.00000E+00 0.0000
0.00000E+00 0.0000 0.00000E+00 0.0000 0.00000E+00 0.0000 1.10023E-11 0.7053
1.74410E-11 0.5005 0.00000E+00 0.0000 1.69181E-13 0.7071 0.00000E+00 0.0000
0.00000E+00 0.0000 0.00000E+00 0.0000
vals
1.34313E-06 0.0104 1.32163E-06 0.0102 1.30039E-06 0.0105 1.37575E-06 0.0106
1.30792E-06 0.0102 1.28351E-06 0.0104 1.32164E-06 0.0102 1.32969E-06 0.0104
1.31707E-06 0.0104 1.27281E-06 0.0103 1.28429E-06 0.0106 1.27027E-06 0.0105
1.29623E-06 0.0105 1.32037E-06 0.0101 1.28948E-06 0.0105 1.33163E-06 0.0106
1.36073E-06 0.0102 1.35462E-06 0.0102 1.38641E-06 0.0102 1.33099E-06 0.0102
1.35307E-06 0.0100 1.33882E-06 0.0105
"""

# Split the text at the marker lines ("f 7523" and "vals"); every remaining
# chunk is one block of whitespace-separated tokens.
blocks = [
    chunk.split()
    for chunk in re.split(r"^(?:f|vals)\b.*$", text, flags=re.M)
    if chunk.strip()
]

# The first block holds the row labels; each later block holds (value, error) pairs.
idx = blocks[0]
df = pd.concat(
    [
        pd.DataFrame([b[i : i + 2] for i in range(0, len(b), 2)], index=idx)
        for b in blocks[1:]
    ],
    axis=1,
)
df.columns = range(len(df.columns))
print(df)
```
Prints:
```
                 0       1            2       3
10001  0.00000E+00  0.0000  1.34313E-06  0.0104
10002  7.65079E-12  0.7071  1.32163E-06  0.0102
10003  0.00000E+00  0.0000  1.30039E-06  0.0105
10004  3.87977E-14  0.5513  1.37575E-06  0.0106
10005  0.00000E+00  0.0000  1.30792E-06  0.0102
10006  0.00000E+00  0.0000  1.28351E-06  0.0104
10007  0.00000E+00  0.0000  1.32164E-06  0.0102
10008  0.00000E+00  0.0000  1.32969E-06  0.0104
10009  0.00000E+00  0.0000  1.31707E-06  0.0104
10010  1.92698E-14  0.7071  1.27281E-06  0.0103
10011  4.47277E-14  0.7071  1.28429E-06  0.0106
10012  0.00000E+00  0.0000  1.27027E-06  0.0105
10013  0.00000E+00  0.0000  1.29623E-06  0.0105
10014  0.00000E+00  0.0000  1.32037E-06  0.0101
10015  0.00000E+00  0.0000  1.28948E-06  0.0105
10016  1.10023E-11  0.7053  1.33163E-06  0.0106
10017  1.74410E-11  0.5005  1.36073E-06  0.0102
10018  0.00000E+00  0.0000  1.35462E-06  0.0102
10019  1.69181E-13  0.7071  1.38641E-06  0.0102
10020  0.00000E+00  0.0000  1.33099E-06  0.0102
10021  0.00000E+00  0.0000  1.35307E-06  0.0100
10022  0.00000E+00  0.0000  1.33882E-06  0.0105
```
# Answer 2
**Score**: 1
from pprint import pp
with open('data.txt') as f:
    data = f.read()
# replace newlines with spaces so tokens on adjacent lines stay separate, then split on 'vals'
rows, val1, val2 = data.replace('\n', ' ').split('vals')
# drop 'f' and '7523', then convert the remaining row labels to ints
rows = list(map(int, rows.split()[2:]))
# split each block into a flat list of tokens
val1 = val1.split()
val2 = val2.split()
# list comprehension to gather [value, error] pairs from vals
# use list(map(float, valX[i:i+2])) inside if you want actual floats
val1 = [val1[i:i+2] for i in range(0, len(val1), 2)]
val2 = [val2[i:i+2] for i in range(0, len(val2), 2)]
# zip the three lists and unpack the pairs into one row each
result = [[a, *b, *c] for a, b, c in zip(rows, val1, val2)]
pp(result)
[[10001, '0.00000E+00', '0.0000', '1.34313E-06', '0.0104'],
[10002, '7.65079E-12', '0.7071', '1.32163E-06', '0.0102'],
[10003, '0.00000E+00', '0.0000', '1.30039E-06', '0.0105'],
[10004, '3.87977E-14', '0.5513', '1.37575E-06', '0.0106'],
[10005, '0.00000E+00', '0.0000', '1.30792E-06', '0.0102'],
[10006, '0.00000E+00', '0.0000', '1.28351E-06', '0.0104'],
[10007, '0.00000E+00', '0.0000', '1.32164E-06', '0.0102'],
[10008, '0.00000E+00', '0.0000', '1.32969E-06', '0.0104'],
[10009, '0.00000E+00', '0.0000', '1.31707E-06', '0.0104'],
[10010, '1.92698E-14', '0.7071', '1.27281E-06', '0.0103'],
[10011, '4.47277E-14', '0.7071', '1.28429E-06', '0.0106'],
[10012, '0.00000E+00', '0.0000', '1.27027E-06', '0.0105'],
[10013, '0.00000E+00', '0.0000', '1.29623E-06', '0.0105'],
[10014, '0.00000E+00', '0.0000', '1.32037E-06', '0.0101'],
[10015, '0.00000E+00', '0.0000', '1.28948E-06', '0.0105'],
[10016, '1.10023E-11', '0.7053', '1.33163E-06', '0.0106'],
[10017, '1.74410E-11', '0.5005', '1.36073E-06', '0.0102'],
[10018, '0.00000E+00', '0.0000', '1.35462E-06', '0.0102'],
[10019, '1.69181E-13', '0.7071', '1.38641E-06', '0.0102'],
[10020, '0.00000E+00', '0.0000', '1.33099E-06', '0.0102'],
[10021, '0.00000E+00', '0.0000', '1.35307E-06', '0.0100'],
[10022, '0.00000E+00', '0.0000', '1.33882E-06', '0.0105']]
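If actual floats are wanted rather than strings, the conversion mentioned in the comment can be folded into the pairing step. The following is only a minimal sketch: it runs on a shortened in-memory sample instead of `data.txt`, and the helper name `pairs` is hypothetical, introduced here for illustration.

```python
# Shortened sample in the same layout as the file in the question
data = """f 7523
10001 10002
vals
0.00000E+00 0.0000 7.65079E-12 0.7071
vals
1.34313E-06 0.0104 1.32163E-06 0.0102
"""

# str.split() with no argument treats newlines as whitespace,
# so no newline replacement is needed before splitting on 'vals'
rows, val1, val2 = data.split("vals")
rows = [int(r) for r in rows.split()[2:]]  # drop 'f' and '7523'


def pairs(block):
    """Turn 'v e v e ...' into [[v, e], ...] with real floats (hypothetical helper)."""
    nums = [float(tok) for tok in block.split()]
    return [nums[i:i + 2] for i in range(0, len(nums), 2)]


result = [[a, *b, *c] for a, b, c in zip(rows, pairs(val1), pairs(val2))]
print(result)
```

Each row is then one integer label followed by four floats, which is convenient if the values are going to be compared or plotted.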
# Answer 3
**Score**: 0
Read the file line by line.
If the current line starts with `f`, you have started a new header section. (It's not clear what you want to do with the `7523` value that is on the same line as the `f` in your example.)
Otherwise, if the current line starts with `vals`, you have started a new values section.
Otherwise, you're in the middle of a header or values section that has already started. Read the values from the line into a list and store them appropriately.
When the file is exhausted, print all the headers and values.
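The steps above could be sketched roughly as follows. This is one possible reading of the description, not a definitive implementation: the function name `parse_sections` is hypothetical, the `7523` token is simply kept aside because its purpose is unclear, and a shortened list of lines stands in for the real file.

```python
def parse_sections(lines):
    """Return (tag, headers, blocks) from an iterable of text lines.

    tag     -- tokens that follow 'f' on its line (e.g. ['7523'])
    headers -- all header tokens, in file order
    blocks  -- one list of (value, error) pairs per 'vals' section
    """
    tag, headers, blocks = [], [], []
    target = None  # list currently receiving tokens
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # ignore blank lines
        if tokens[0] == "f":
            tag = tokens[1:]  # kept aside, since its purpose is unclear
            target = headers
        elif tokens[0] == "vals":
            target = []
            blocks.append(target)
        elif target is not None:
            target.extend(tokens)
    # Group each block's flat token list into (value, error) pairs
    paired = [list(zip(b[0::2], b[1::2])) for b in blocks]
    return tag, headers, paired


# Shortened sample; with a real file, pass the open file object instead
sample = [
    "f 7523",
    "10001 10002",
    "10003",
    "vals",
    "0.00000E+00 0.0000 7.65079E-12 0.7071",
    "0.00000E+00 0.0000",
    "vals",
    "1.34313E-06 0.0104 1.32163E-06 0.0102",
    "1.30039E-06 0.0105",
]

tag, headers, blocks = parse_sections(sample)
# One row per header: the header, then its pair from every block
rows = [
    [header] + [field for pair in pairs for field in pair]
    for header, *pairs in zip(headers, *blocks)
]
for row in rows:
    print("\t".join(row))
```

Because the function only needs an iterable of lines, it can be called as `parse_sections(f)` inside a `with open('data.txt') as f:` block, and it is not tied to exactly two `vals` sections.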