How to convert an output file into an array
Question
This might be a trivial question, but I can't seem to find a good solution.
I have the output of a program in the format "output.file". It looks like this:
3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+02 5.2270e+01 1.7820e+02 -9.6401e+01 -3.8095e+01 1.5210e+02 -5.4532e+01 2.6628e+01 -1.0989e+01 -8.1933e+01 -6.6642e-01 1.8158e+01 2.2515e+01 -5.9261e+00 6.8567e+00 7.2896e+00 1.2575e+01 -1.1400e+01 1.7467e+01 4.1609e+00 -6.0523e+00 -1.8691e+01 3.5305e+01 4.0516e+00 2.9715e+00 1.0701e+01 -1.3146e+01 -1.1101e+00
1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01 1.5947e+02 4.8274e+01 1.3447e+02 -1.5060e+02 -7.6796e+01 1.3185e+02 -5.3275e+01 2.5539e+01 -6.5738e+01 -6.6355e+01 4.8942e+01 -1.3249e+01 6.7675e+01 -1.2348e+01 -4.3005e+01 2.1516e+02 -2.3099e+01 -8.0767e+00 2.2402e+01 -5.9237e+01 4.4889e+00 -1.2909e+02 4.5721e+01 -9.9285e+01 5.9332e+01 -5.7431e+01 -3.6852e+01 -1.7430e+02
3c18FH_A.pdb A 5 285 1.2576e+02 6.3883e+00 1.3145e+01 8.2794e+01 -5.0494e+01 5.9305e+01 1.4713e+01 6.8420e+00 6.6720e+01 5.1087e+00 -1.7846e+01 7.4458e+00 -1.9514e+00 7.8637e+00 -2.9961e+00 -7.0192e+00 9.0216e-02 -7.2202e+00 1.4839e+01 -4.0826e+00 1.3694e+01 -2.8499e+00 4.2015e+00 -6.8598e-01 5.8514e+00 -7.3843e+00 5.2737e-02 -4.9425e-03 2.9360e+00 4.7973e+00 6.2879e+00
...
The output has over 6000 rows (one row per pdb file), and I am trying to convert it into an array of shape [6000, 35], so that every row contains the data of one file (in the example above, the three files "3cp0FH_A.pdb", "1xhdFH_A.pdb" and "3c18FH_A.pdb") and every column holds one data point of that file (except for the first 4 columns). Taking the example above, the first row of the array would look like this:
[3cp0FH_A.pdb, A, 1, 62, 7.5635e+01, 8.9632e+01, 1.9255e+00, 1.9154e+02, 5.2270e+01, 1.7820e+02, -9.6401e+01, -3.8095e+01, 1.5210e+02, etc.]
I already figured out how to read output.file into a list where every entry is one row of the file. I was even able to separate the values with commas. So if I type:
>>> list[0]
I'd get:
'3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01,1.9255e+00,1.9154e+02,5.2270e+01,1.7820e+02,-9.6401e+01,-3.8095e+01,1.5210e+02,-5.4532e+01,2.6628e+01,-1.0989e+01,-8.1933e+01,-6.6642e-01,1.8158e+01,2.2515e+01,-5.9261e+00,6.8567e+00,7.2896e+00,1.2575e+01,-1.1400e+01,1.7467e+01,4.1609e+00,-6.0523e+00,-1.8691e+01,3.5305e+01,4.0516e+00,2.9715e+00,1.0701e+01,-1.3146e+01,-1.1101e+00\n'
What I can't figure out is how to convert this list into an array so that each comma-separated string/value is in its own column.
Answer 1
Score: 1
Right now each entry in your list is a single string, but what you actually want is for each entry to be a list containing all the data points of that row. To do that you can do the following:
    for i in range(len(input_list)):
        new_row = input_list[i].split(',')
        # Optionally, convert the numbers from column 4 onward to floats
        new_row[4:] = [float(v) for v in new_row[4:]]
        input_list[i] = new_row
This modifies your list in place, replacing each string with its list of fields. It is also a pure Python solution that does not involve NumPy (though it should give you some ideas on how to get to a NumPy solution if desired).
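One possible way to carry the same idea over to NumPy is sketched below. It is only an illustration under a few assumptions: the two sample rows are shortened to four data values each, and the names `rows`, `labels` and `values` are made up for this example. Keeping the text columns and the float columns in two separate arrays avoids forcing everything into a single string-typed array.

```python
import numpy as np

# Two sample rows in the comma-separated form shown in the question,
# shortened to four data values each for readability.
input_list = [
    "3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01,1.9255e+00,1.9154e+02\n",
    "1xhdFH_A.pdb,A,3,169,1.0565e+02,-9.1260e+01,-9.3580e+01,1.5947e+02\n",
]

# Split every line once; strip() removes the trailing newline first.
rows = [line.strip().split(",") for line in input_list]

# Columns 0-3 (file name, chain, two integers) are kept as strings here.
labels = np.array([row[:4] for row in rows])

# Columns 4 onward are the float data points.
values = np.array([[float(v) for v in row[4:]] for row in rows])

print(labels.shape)  # (2, 4)
print(values.shape)  # (2, 4)
```

With the full file, `values` would have one row per pdb file and 31 columns, and `labels[i]` would describe the same file as `values[i]`.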
Answer 2
Score: 1
Copy-n-paste your sample:
In [26]: txt = """3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+0
...
...: """
Simplest load:
In [27]: np.genfromtxt(txt.splitlines())
Out[27]:
array([[ nan, nan, 1.0000e+00, 6.2000e+01, 7.5635e+01,
8.9632e+01, 1.9255e+00, 1.9154e+02, 5.2270e+01, 1.7820e+02,
-9.6401e+01, -3.8095e+01, 1.5210e+02, -5.4532e+01, 2.6628e+01,
-1.0989e+01, -8.1933e+01, -6.6642e-01, 1.8158e+01, 2.2515e+01,
-5.9261e+00, 6.8567e+00, 7.2896e+00, 1.2575e+01, -1.1400e+01,
1.7467e+01, 4.1609e+00, -6.0523e+00, -1.8691e+01, 3.5305e+01,
4.0516e+00, 2.9715e+00, 1.0701e+01, -1.3146e+01, -1.1101e+00],
...])
In [28]: _.shape
Out[28]: (3, 35)
The default load format is float, so the initial 2 columns are rendered as nan. loadtxt would throw an error for those entries.
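The difference between the two loaders can be illustrated with a small sketch. It assumes two sample rows shortened to three data values each; the names `lines`, `with_nan`, `numeric` and `loadtxt_failed` are made up for this example. Passing `usecols` to `loadtxt` so that it skips the two text columns is one way to avoid the error.

```python
import numpy as np

# Two sample rows in the original whitespace-separated layout,
# shortened to three data values each for readability.
lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01",
]

# genfromtxt: fields that cannot be parsed as float become nan.
with_nan = np.genfromtxt(lines)

# loadtxt: the same input raises an error because of the text fields.
try:
    np.loadtxt(lines)
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# Skipping the two text columns lets loadtxt parse the rest as float.
n_cols = len(lines[0].split())
numeric = np.loadtxt(lines, usecols=list(range(2, n_cols)))

print(with_nan.shape)  # (2, 7)
print(numeric.shape)   # (2, 5)
```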
You could separate out the integer column with:
In [32]: Out[27][:,2]
Out[32]: array([1., 3., 5.])
and the numeric columns (everything after the two text columns) with:
In [33]: Out[27][:,2:].shape
Out[33]: (3, 33)
With usecols you could load the label columns separately:
In [35]: np.genfromtxt(txt.splitlines(), dtype=None, usecols=[0,1,2], encoding=None)
Out[35]:
array([('3cp0FH_A.pdb', 'A', 1), ('1xhdFH_A.pdb', 'A', 3),
('3c18FH_A.pdb', 'A', 5)],
dtype=[('f0', '<U12'), ('f1', '<U1'), ('f2', '<i8')])
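Putting the pieces together, a minimal sketch might load the four leading columns as a structured array and the remaining columns as a plain float array, using two `genfromtxt` calls with different `usecols`. The sample rows are shortened to three data values each, and the names `lines`, `labels` and `values` are made up for this example; with a real file, the file name could be passed in place of `lines`.

```python
import numpy as np

# Two sample rows in the original whitespace-separated layout,
# shortened to three data values each for readability.
lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01",
]
n_cols = len(lines[0].split())

# Label columns: dtype=None lets genfromtxt pick a type per column,
# which yields a structured array with fields f0, f1, f2, f3.
labels = np.genfromtxt(lines, dtype=None, usecols=[0, 1, 2, 3], encoding=None)

# Data columns: the default float dtype gives a plain 2-D array.
values = np.genfromtxt(lines, usecols=list(range(4, n_cols)))

print(labels["f0"])   # file names
print(values.shape)   # (2, 3)
```

Row `i` of `labels` and row `i` of `values` refer to the same pdb file, so the two arrays can be indexed together.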