Parsing data with an unusual format in Python

Question


I have an ASCII file with a somewhat odd layout: the headers are laid out in rows, followed by the corresponding values and their associated errors, also laid out in rows. A snippet of the file is shown below.

```
f     7523
  10001  10002  10003  10004  10005  10006  10007  10008  10009  10010  10011
  10012  10013  10014  10015  10016  10017  10018  10019  10020  10021  10022
vals
  0.00000E+00 0.0000  7.65079E-12 0.7071  0.00000E+00 0.0000  3.87977E-14 0.5513
  0.00000E+00 0.0000  0.00000E+00 0.0000  0.00000E+00 0.0000  0.00000E+00 0.0000
  0.00000E+00 0.0000  1.92698E-14 0.7071  4.47277E-14 0.7071  0.00000E+00 0.0000
  0.00000E+00 0.0000  0.00000E+00 0.0000  0.00000E+00 0.0000  1.10023E-11 0.7053
  1.74410E-11 0.5005  0.00000E+00 0.0000  1.69181E-13 0.7071  0.00000E+00 0.0000
  0.00000E+00 0.0000  0.00000E+00 0.0000
vals
  1.34313E-06 0.0104  1.32163E-06 0.0102  1.30039E-06 0.0105  1.37575E-06 0.0106
  1.30792E-06 0.0102  1.28351E-06 0.0104  1.32164E-06 0.0102  1.32969E-06 0.0104
  1.31707E-06 0.0104  1.27281E-06 0.0103  1.28429E-06 0.0106  1.27027E-06 0.0105
  1.29623E-06 0.0105  1.32037E-06 0.0101  1.28948E-06 0.0105  1.33163E-06 0.0106
  1.36073E-06 0.0102  1.35462E-06 0.0102  1.38641E-06 0.0102  1.33099E-06 0.0102
  1.35307E-06 0.0100  1.33882E-06 0.0105
```

I want to parse this file so that the data ends up in five columns, something like this:

```
10001      0.00000E+00 0.0000   1.34313E-06 0.0104
10002      7.65079E-12 0.7071   1.32163E-06 0.0102
10003      0.00000E+00 0.0000   1.30039E-06 0.0105
....
10022      0.00000E+00 0.0000   1.33882E-06 0.0105
```

However, I am unsure how to loop over this data to get the format above, beyond this simple start:

```py
# Open file
f = open('data.txt', 'r')

# Loop over lines and extract variables of interest
for line in f:
    line = line.strip()
    columns = line.split()
```


# Answer 1
**Score**: 1

You can try to use `re` to parse the file into a dataframe:

```py
import re
import pandas as pd

text = """\
f     7523
  10001  10002  10003  10004  10005  10006  10007  10008  10009  10010  10011
  10012  10013  10014  10015  10016  10017  10018  10019  10020  10021  10022
vals
  0.00000E+00 0.0000  7.65079E-12 0.7071  0.00000E+00 0.0000  3.87977E-14 0.5513
  0.00000E+00 0.0000  0.00000E+00 0.0000  0.00000E+00 0.0000  0.00000E+00 0.0000
  0.00000E+00 0.0000  1.92698E-14 0.7071  4.47277E-14 0.7071  0.00000E+00 0.0000
  0.00000E+00 0.0000  0.00000E+00 0.0000  0.00000E+00 0.0000  1.10023E-11 0.7053
  1.74410E-11 0.5005  0.00000E+00 0.0000  1.69181E-13 0.7071  0.00000E+00 0.0000
  0.00000E+00 0.0000  0.00000E+00 0.0000
vals
  1.34313E-06 0.0104  1.32163E-06 0.0102  1.30039E-06 0.0105  1.37575E-06 0.0106
  1.30792E-06 0.0102  1.28351E-06 0.0104  1.32164E-06 0.0102  1.32969E-06 0.0104
  1.31707E-06 0.0104  1.27281E-06 0.0103  1.28429E-06 0.0106  1.27027E-06 0.0105
  1.29623E-06 0.0105  1.32037E-06 0.0101  1.28948E-06 0.0105  1.33163E-06 0.0106
  1.36073E-06 0.0102  1.35462E-06 0.0102  1.38641E-06 0.0102  1.33099E-06 0.0102
  1.35307E-06 0.0100  1.33882E-06 0.0105
"""

# Each match is one indented block: it starts at an indented line and runs
# up to the next line that begins in column 0 (or the end of the text).
# Splitting on two spaces separates the entries; split() then separates
# each value from its error.
data = [
    [c.split() for c in group.split("  ")]
    for _, group in re.findall(r"(^\s+)(.*?)(?=^\S|\Z)", text, flags=re.S | re.M)
]

# The first block holds the ids, which become the index.
idx = [h for c in data[0] for h in c]

# One two-column frame (value, error) per "vals" block, placed side by side.
df = pd.concat(
    [pd.DataFrame(d, index=idx) for d in data[1:]], axis=1
)
df.columns = range(len(df.columns))

print(df)
```

Prints:

```
                 0       1            2       3
10001  0.00000E+00  0.0000  1.34313E-06  0.0104
10002  7.65079E-12  0.7071  1.32163E-06  0.0102
10003  0.00000E+00  0.0000  1.30039E-06  0.0105
10004  3.87977E-14  0.5513  1.37575E-06  0.0106
10005  0.00000E+00  0.0000  1.30792E-06  0.0102
10006  0.00000E+00  0.0000  1.28351E-06  0.0104
10007  0.00000E+00  0.0000  1.32164E-06  0.0102
10008  0.00000E+00  0.0000  1.32969E-06  0.0104
10009  0.00000E+00  0.0000  1.31707E-06  0.0104
10010  1.92698E-14  0.7071  1.27281E-06  0.0103
10011  4.47277E-14  0.7071  1.28429E-06  0.0106
10012  0.00000E+00  0.0000  1.27027E-06  0.0105
10013  0.00000E+00  0.0000  1.29623E-06  0.0105
10014  0.00000E+00  0.0000  1.32037E-06  0.0101
10015  0.00000E+00  0.0000  1.28948E-06  0.0105
10016  1.10023E-11  0.7053  1.33163E-06  0.0106
10017  1.74410E-11  0.5005  1.36073E-06  0.0102
10018  0.00000E+00  0.0000  1.35462E-06  0.0102
10019  1.69181E-13  0.7071  1.38641E-06  0.0102
10020  0.00000E+00  0.0000  1.33099E-06  0.0102
10021  0.00000E+00  0.0000  1.35307E-06  0.0100
10022  0.00000E+00  0.0000  1.33882E-06  0.0105
```
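
To see what the regular expression in this answer picks out, here is a small sketch that uses only the standard library. The `sample` string is a hypothetical, shortened stand-in for the real file, and the pattern is a one-group variant of the one above, so `re.findall` returns the block text directly.

```py
import re

# Hypothetical miniature of the file layout: marker lines start in column 0,
# data lines are indented.
sample = (
    "f     7523\n"
    "  10001  10002\n"
    "  10003\n"
    "vals\n"
    "  0.0E+00 0.0000  7.6E-12 0.7071\n"
    "  1.0E-11 0.7053\n"
)

# Each match is one indented block: it starts at an indented line and runs
# lazily up to the next line that begins with a non-space character (or the
# end of the text).
blocks = re.findall(r"^\s+(.*?)(?=^\S|\Z)", sample, flags=re.S | re.M)

# Splitting on two spaces separates the entries; a further split() separates
# value from error and drops the trailing newline.
parsed = [[entry.split() for entry in block.split("  ")] for block in blocks]

print(parsed[0])
print(parsed[1])
```

Note that this approach relies on the entries being separated by exactly two spaces and on the data lines being indented, as they are in the snippet from the question.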

# Answer 2
**Score**: 1

```py
from pprint import pp

with open('data.txt') as f:
    data = f.read()

# drop the newlines, then split into the header block and the two value blocks
rows, val1, val2 = data.replace('\n', '').split('vals')

# split into lists
# drop 'f     7523' and convert rows to ints
rows = list(map(int, rows.split()[2:]))
val1 = val1.split()
val2 = val2.split()

# list comprehension to gather (value, error) pairs from vals
# use list(map(float, valX[i:i+2])) inside if you want actual floats
val1 = [val1[i:i+2] for i in range(0, len(val1), 2)]
val2 = [val2[i:i+2] for i in range(0, len(val2), 2)]

# zip and unpack the pairs
result = [[a, *b, *c] for a, b, c in zip(rows, val1, val2)]

pp(result)
```

Prints:

```
[[10001, '0.00000E+00', '0.0000', '1.34313E-06', '0.0104'],
 [10002, '7.65079E-12', '0.7071', '1.32163E-06', '0.0102'],
 [10003, '0.00000E+00', '0.0000', '1.30039E-06', '0.0105'],
 [10004, '3.87977E-14', '0.5513', '1.37575E-06', '0.0106'],
 [10005, '0.00000E+00', '0.0000', '1.30792E-06', '0.0102'],
 [10006, '0.00000E+00', '0.0000', '1.28351E-06', '0.0104'],
 [10007, '0.00000E+00', '0.0000', '1.32164E-06', '0.0102'],
 [10008, '0.00000E+00', '0.0000', '1.32969E-06', '0.0104'],
 [10009, '0.00000E+00', '0.0000', '1.31707E-06', '0.0104'],
 [10010, '1.92698E-14', '0.7071', '1.27281E-06', '0.0103'],
 [10011, '4.47277E-14', '0.7071', '1.28429E-06', '0.0106'],
 [10012, '0.00000E+00', '0.0000', '1.27027E-06', '0.0105'],
 [10013, '0.00000E+00', '0.0000', '1.29623E-06', '0.0105'],
 [10014, '0.00000E+00', '0.0000', '1.32037E-06', '0.0101'],
 [10015, '0.00000E+00', '0.0000', '1.28948E-06', '0.0105'],
 [10016, '1.10023E-11', '0.7053', '1.33163E-06', '0.0106'],
 [10017, '1.74410E-11', '0.5005', '1.36073E-06', '0.0102'],
 [10018, '0.00000E+00', '0.0000', '1.35462E-06', '0.0102'],
 [10019, '1.69181E-13', '0.7071', '1.38641E-06', '0.0102'],
 [10020, '0.00000E+00', '0.0000', '1.33099E-06', '0.0102'],
 [10021, '0.00000E+00', '0.0000', '1.35307E-06', '0.0100'],
 [10022, '0.00000E+00', '0.0000', '1.33882E-06', '0.0105']]
```
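
The code comment in this answer mentions converting the values to actual floats. A minimal sketch of that variant follows, assuming the same layout with exactly two `vals` blocks. It parses an in-memory string (a hypothetical, shortened sample) instead of `data.txt` so that it runs on its own, and `pairs` is a helper name introduced here only for illustration.

```py
# Hypothetical shortened sample with the same layout as data.txt.
data = (
    "f     7523\n"
    "  10001  10002  10003\n"
    "vals\n"
    "  0.00000E+00 0.0000  7.65079E-12 0.7071\n"
    "  0.00000E+00 0.0000\n"
    "vals\n"
    "  1.34313E-06 0.0104  1.32163E-06 0.0102\n"
    "  1.30039E-06 0.0105\n"
)


def pairs(block):
    """Turn a whitespace-separated block into (value, error) float tuples."""
    numbers = [float(token) for token in block.split()]
    return list(zip(numbers[0::2], numbers[1::2]))


# Replace newlines with a space (rather than nothing) so tokens cannot fuse
# even if a line has no leading indentation.
rows, val1, val2 = data.replace("\n", " ").split("vals")
rows = [int(token) for token in rows.split()[2:]]  # drop 'f' and '7523'

result = [[row, *a, *b] for row, a, b in zip(rows, pairs(val1), pairs(val2))]
for record in result:
    print(record)
```

With floats in place, the records can be formatted or fed into numerical code directly. The three-way unpacking still fails with a `ValueError` if the file holds a different number of `vals` blocks.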

# Answer 3
**Score**: 0

Read the file line by line.

If the current line starts with `f`, you have started a new header section. (It's not clear what you want to do with the `7523` value that is on the same line as the `f` in your example.)

Otherwise, if the current line starts with `vals`, you have started a new values section.

Otherwise you're in the middle of a header or values section that has already started. Read the values from the line into a list and store them appropriately.

When the file is exhausted, print all the headers and values.
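
The steps above can be sketched as a small line-by-line parser. This is one possible reading of the description, not a definitive implementation: the function names `parse_sections` and `to_rows` are made up for illustration, the `7523` on the `f` line is simply ignored, and each `vals` section is assumed to hold one value/error pair per header. The sample is a hypothetical shortened version of the file, read through `io.StringIO` so the example runs on its own.

```py
import io


def parse_sections(lines):
    """Collect the header ids and one flat token list per 'vals' section."""
    headers = []
    sections = []      # one flat token list per 'vals' block
    current = None     # the list that data lines are currently appended to
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue                   # ignore blank lines
        if tokens[0] == "f":
            current = headers          # a header section starts (7523 is ignored)
        elif tokens[0] == "vals":
            current = []               # a new values section starts
            sections.append(current)
        elif current is not None:
            current.extend(tokens)     # continuation of the open section
    return headers, sections


def to_rows(headers, sections):
    """Build one row per header: id, then (value, error) from each section."""
    rows = []
    for i, header in enumerate(headers):
        row = [header]
        for tokens in sections:
            row.extend(tokens[2 * i:2 * i + 2])
        rows.append(row)
    return rows


# Hypothetical shortened sample with the same layout as the real file.
sample = io.StringIO(
    "f     7523\n"
    "  10001  10002\n"
    "  10003\n"
    "vals\n"
    "  0.00000E+00 0.0000  7.65079E-12 0.7071\n"
    "  0.00000E+00 0.0000\n"
    "vals\n"
    "  1.34313E-06 0.0104  1.32163E-06 0.0102\n"
    "  1.30039E-06 0.0105\n"
)

headers, sections = parse_sections(sample)
rows = to_rows(headers, sections)
for row in rows:
    print("\t".join(row))
```

For the real file, passing the open file object works the same way, for example `with open('data.txt') as f: headers, sections = parse_sections(f)`. Unlike splitting on a fixed number of `vals` markers, this version handles any number of values sections.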

huangapple
  • Published on June 29, 2023, 01:49:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76575609.html