How to convert an output file into an array

Question

This might be a trivial question, but I can't seem to find a good solution.

I have the output of a program in the format "output.file". It looks like this:

3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+02 5.2270e+01 1.7820e+02 -9.6401e+01 -3.8095e+01 1.5210e+02 -5.4532e+01 2.6628e+01 -1.0989e+01 -8.1933e+01 -6.6642e-01 1.8158e+01 2.2515e+01 -5.9261e+00 6.8567e+00 7.2896e+00 1.2575e+01 -1.1400e+01 1.7467e+01 4.1609e+00 -6.0523e+00 -1.8691e+01 3.5305e+01 4.0516e+00 2.9715e+00 1.0701e+01 -1.3146e+01 -1.1101e+00
1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01 1.5947e+02 4.8274e+01 1.3447e+02 -1.5060e+02 -7.6796e+01 1.3185e+02 -5.3275e+01 2.5539e+01 -6.5738e+01 -6.6355e+01 4.8942e+01 -1.3249e+01 6.7675e+01 -1.2348e+01 -4.3005e+01 2.1516e+02 -2.3099e+01 -8.0767e+00 2.2402e+01 -5.9237e+01 4.4889e+00 -1.2909e+02 4.5721e+01 -9.9285e+01 5.9332e+01 -5.7431e+01 -3.6852e+01 -1.7430e+02
3c18FH_A.pdb A 5 285 1.2576e+02 6.3883e+00 1.3145e+01 8.2794e+01 -5.0494e+01 5.9305e+01 1.4713e+01 6.8420e+00 6.6720e+01 5.1087e+00 -1.7846e+01 7.4458e+00 -1.9514e+00 7.8637e+00 -2.9961e+00 -7.0192e+00 9.0216e-02 -7.2202e+00 1.4839e+01 -4.0826e+00 1.3694e+01 -2.8499e+00 4.2015e+00 -6.8598e-01 5.8514e+00 -7.3843e+00 5.2737e-02 -4.9425e-03 2.9360e+00 4.7973e+00 6.2879e+00
...

The output has over 6000 rows (one row for each pdb file), and I am trying to convert it into an array of shape [6000,35], so that every row contains the data of one file (in the example above, the three files "3cp0FH_A.pdb", "1xhdFH_A.pdb" and "3c18FH_A.pdb") and every column, apart from the first 4, holds one data point of that file. Taking the example above, the first row of the array would look like this:

[3cp0FH_A.pdb, A, 1, 62, 7.5635e+01, 8.9632e+01, 1.9255e+00, 1.9154e+02, 5.2270e+01, 1.7820e+02, -9.6401e+01, -3.8095e+01, 1.5210e+02, etc.]

I already figured out how to read output.file into a list where every entry is one row of the file. I was even able to separate the values with commas. So if I type in:

>>> list[0]

I'd get:

'3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01,1.9255e+00,1.9154e+02,5.2270e+01,1.7820e+02,-9.6401e+01,-3.8095e+01,1.5210e+02,-5.4532e+01,2.6628e+01,-1.0989e+01,-8.1933e+01,-6.6642e-01,1.8158e+01,2.2515e+01,-5.9261e+00,6.8567e+00,7.2896e+00,1.2575e+01,-1.1400e+01,1.7467e+01,4.1609e+00,-6.0523e+00,-1.8691e+01,3.5305e+01,4.0516e+00,2.9715e+00,1.0701e+01,-1.3146e+01,-1.1101e+00\n'

What I can't figure out is how to convert this list into an array so that each string/value separated by a comma is in its own column.

Answer 1

Score: 1

Right now the elements of your list are strings, and what you actually want is for each of them to be a list containing all the data points of that row. To do that you can do the following:

for i in range(len(input_list)):
    new_row = input_list[i].split(',')
    # Optionally, translate the numbers from column 4 on to floats
    new_row[4:] = [float(v) for v in new_row[4:]]
    input_list[i] = new_row

This modifies your list in place, replacing each string with the corresponding list of values. It is also a pure-Python solution that does not involve NumPy (though it should give you some ideas on how to get to a NumPy solution if desired).
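If the comma-joining step is not needed, the same idea can be applied directly to the raw whitespace-separated lines. The following is only a minimal sketch and not part of the original answer: `parse_line` and `sample_lines` are hypothetical names, the sample rows are truncated to three data values each, and it assumes that columns 2 and 3 are always integers, as they are in the sample.

```python
def parse_line(line):
    """Split one whitespace-separated record into typed fields."""
    fields = line.split()  # split() with no argument also drops the trailing newline
    name, chain = fields[0], fields[1]
    # Assumption: columns 2 and 3 hold integers, the remaining columns hold floats.
    ints = [int(v) for v in fields[2:4]]
    floats = [float(v) for v in fields[4:]]
    return [name, chain, *ints, *floats]


# Truncated stand-in for the lines read from output.file.
sample_lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00\n",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01\n",
]

# One list per record; blank lines are skipped.
rows = [parse_line(line) for line in sample_lines if line.strip()]
```

With the real file, `sample_lines` could be replaced by the open file object, since iterating over a file yields one line at a time.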

Answer 2

Score: 1

Copy-n-paste your sample:

In [26]: txt = """3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+0
 ...
    ...: """

simplest load:

In [27]: np.genfromtxt(txt.splitlines())                                        
Out[27]: 
array([[        nan,         nan,  1.0000e+00,  6.2000e+01,  7.5635e+01,
         8.9632e+01,  1.9255e+00,  1.9154e+02,  5.2270e+01,  1.7820e+02,
        -9.6401e+01, -3.8095e+01,  1.5210e+02, -5.4532e+01,  2.6628e+01,
        -1.0989e+01, -8.1933e+01, -6.6642e-01,  1.8158e+01,  2.2515e+01,
        -5.9261e+00,  6.8567e+00,  7.2896e+00,  1.2575e+01, -1.1400e+01,
         1.7467e+01,  4.1609e+00, -6.0523e+00, -1.8691e+01,  3.5305e+01,
         4.0516e+00,  2.9715e+00,  1.0701e+01, -1.3146e+01, -1.1101e+00],
...])
In [28]: _.shape                                                                
Out[28]: (3, 35)

The default dtype is float, so the first 2 columns, which hold strings, are rendered as nan. loadtxt would raise an error for those entries instead.
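One way to avoid both the nan columns and the loadtxt error is to skip the two string columns with usecols. This is a hedged sketch rather than part of the original answer: the sample text is truncated to seven columns, `n_cols` is a hypothetical name for the column count (35 in the full file), and `io.StringIO` stands in for the real file.

```python
import io

import numpy as np

# Truncated stand-in for output.file (the real rows have 35 columns).
txt = """3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00
1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01
"""

n_cols = 7  # assumption for this sample; use 35 for the full file

# Read only columns 2 onward, so loadtxt never sees the string columns.
data = np.loadtxt(io.StringIO(txt), usecols=tuple(range(2, n_cols)))
print(data.shape)
```

The result is a plain float array, so the two integer columns come back as floats, just as in the genfromtxt output above.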

You could pull out the first integer column (index 2) with:

In [32]: Out[27][:,2]                                                           
Out[32]: array([1., 3., 5.])

and all of the numeric columns (everything after the two nan columns) with:

In [33]: Out[27][:,2:].shape                                                    
Out[33]: (3, 33)

With usecols you could load the label columns separately:

In [35]: np.genfromtxt(txt.splitlines(), dtype=None, usecols=[0,1,2], encoding=None)                                                                   
Out[35]: 
array([('3cp0FH_A.pdb', 'A', 1), ('1xhdFH_A.pdb', 'A', 3),
       ('3c18FH_A.pdb', 'A', 5)],
      dtype=[('f0', '<U12'), ('f1', '<U1'), ('f2', '<i8')])

huangapple
  • Posted on January 7, 2020, 01:27:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59616429.html