Posted January 7, 2020, 01:27:16

How to convert an output file into an array

Question

This might be a trivial question, but I can't seem to find a good solution.

I have the output of a program in the format "output.file". It looks like this:

3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+02 5.2270e+01 1.7820e+02 -9.6401e+01 -3.8095e+01 1.5210e+02 -5.4532e+01 2.6628e+01 -1.0989e+01 -8.1933e+01 -6.6642e-01 1.8158e+01 2.2515e+01 -5.9261e+00 6.8567e+00 7.2896e+00 1.2575e+01 -1.1400e+01 1.7467e+01 4.1609e+00 -6.0523e+00 -1.8691e+01 3.5305e+01 4.0516e+00 2.9715e+00 1.0701e+01 -1.3146e+01 -1.1101e+00
1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01 1.5947e+02 4.8274e+01 1.3447e+02 -1.5060e+02 -7.6796e+01 1.3185e+02 -5.3275e+01 2.5539e+01 -6.5738e+01 -6.6355e+01 4.8942e+01 -1.3249e+01 6.7675e+01 -1.2348e+01 -4.3005e+01 2.1516e+02 -2.3099e+01 -8.0767e+00 2.2402e+01 -5.9237e+01 4.4889e+00 -1.2909e+02 4.5721e+01 -9.9285e+01 5.9332e+01 -5.7431e+01 -3.6852e+01 -1.7430e+02
3c18FH_A.pdb A 5 285 1.2576e+02 6.3883e+00 1.3145e+01 8.2794e+01 -5.0494e+01 5.9305e+01 1.4713e+01 6.8420e+00 6.6720e+01 5.1087e+00 -1.7846e+01 7.4458e+00 -1.9514e+00 7.8637e+00 -2.9961e+00 -7.0192e+00 9.0216e-02 -7.2202e+00 1.4839e+01 -4.0826e+00 1.3694e+01 -2.8499e+00 4.2015e+00 -6.8598e-01 5.8514e+00 -7.3843e+00 5.2737e-02 -4.9425e-03 2.9360e+00 4.7973e+00 6.2879e+00
...

The output has over 6000 rows (one row per pdb file), and I am trying to convert it into an array of shape [6000, 35], so that every row holds the data of one file (in the example above, the three files "3cp0FH_A.pdb", "1xhdFH_A.pdb" and "3c18FH_A.pdb") and every column holds one data point of that file (except the first 4 columns). Taking the example above, the first row of the array would look like this:

[3cp0FH_A.pdb, A, 1, 62, 7.5635e+01, 8.9632e+01, 1.9255e+00, 1.9154e+02, 5.2270e+01, 1.7820e+02, -9.6401e+01, -3.8095e+01, 1.5210e+02, etc.]

I already figured out how to read output.file into a list where every entry is one row of the file. I was even able to separate the values by commas. So if I type:

>>> list[0]

I get:

'3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01,1.9255e+00,1.9154e+02,5.2270e+01,1.7820e+02,-9.6401e+01,-3.8095e+01,1.5210e+02,-5.4532e+01,2.6628e+01,-1.0989e+01,-8.1933e+01,-6.6642e-01,1.8158e+01,2.2515e+01,-5.9261e+00,6.8567e+00,7.2896e+00,1.2575e+01,-1.1400e+01,1.7467e+01,4.1609e+00,-6.0523e+00,-1.8691e+01,3.5305e+01,4.0516e+00,2.9715e+00,1.0701e+01,-1.3146e+01,-1.1101e+00\n'

What I can't figure out is how to convert this list into an array so that each comma-separated string/value is in its own column.

Answer 1

Score: 1

Right now each entry of your list is a single string, whereas what you actually want is for each entry to be a list containing all the data points of that row. To get there, you can do the following:

for i in range(len(input_list)):
    # Drop the trailing newline, then split the row into its fields
    new_row = input_list[i].strip().split(',')
    # Optionally, convert the values from column 4 onward to floats
    new_row[4:] = [float(v) for v in new_row[4:]]
    input_list[i] = new_row

This modifies your list in place, replacing each string with the corresponding list of fields. It is also a pure Python solution that does not involve NumPy (though it should give you some ideas on how to get to a NumPy solution if desired).
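One possible way to carry the pure Python approach over to NumPy is sketched below. It is only an illustration: `input_list` is filled with two shortened, hypothetical rows shaped like the question's list entries, and the names `labels`, `counts` and `values` are made up for this example. Because a NumPy array has a single dtype, the text columns are kept apart from the numeric ones.

```python
import numpy as np

# Hypothetical sample: two comma-separated rows shaped like the question's
# list entries, shortened to three float values each for brevity.
input_list = [
    "3cp0FH_A.pdb,A,1,62,7.5635e+01,8.9632e+01,1.9255e+00\n",
    "1xhdFH_A.pdb,A,3,169,1.0565e+02,-9.1260e+01,-9.3580e+01\n",
]

# Drop the trailing newline and split every row into its fields.
rows = [line.strip().split(",") for line in input_list]

# Build one array per kind of column so that numbers stay numbers.
labels = np.array([row[:2] for row in rows])                     # text columns 0-1
counts = np.array([[int(v) for v in row[2:4]] for row in rows])  # integer columns 2-3
values = np.array([[float(v) for v in row[4:]] for row in rows]) # float columns 4+

print(labels.shape, counts.shape, values.shape)
```

With the full file, `values` would have one row per pdb file and one column per float data point, while `labels` and `counts` keep the first four columns available under the same row index.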

Answer 2

Score: 1

Copy-n-paste your sample:

In [26]: txt = """3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00 1.9154e+0
    ...
    ...: """

The simplest load:

In [27]: np.genfromtxt(txt.splitlines())                                        
Out[27]: 
array([[        nan,         nan,  1.0000e+00,  6.2000e+01,  7.5635e+01,
         8.9632e+01,  1.9255e+00,  1.9154e+02,  5.2270e+01,  1.7820e+02,
        -9.6401e+01, -3.8095e+01,  1.5210e+02, -5.4532e+01,  2.6628e+01,
        -1.0989e+01, -8.1933e+01, -6.6642e-01,  1.8158e+01,  2.2515e+01,
        -5.9261e+00,  6.8567e+00,  7.2896e+00,  1.2575e+01, -1.1400e+01,
         1.7467e+01,  4.1609e+00, -6.0523e+00, -1.8691e+01,  3.5305e+01,
         4.0516e+00,  2.9715e+00,  1.0701e+01, -1.3146e+01, -1.1101e+00],
...])
In [28]: _.shape                                                                
Out[28]: (3, 35)

genfromtxt parses every field as a float by default, so the initial 2 columns (the two text columns) are rendered as nan. loadtxt would raise an error for those entries instead.
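The difference between the two loaders can be seen in a small sketch like the following. The `lines` list is a hypothetical, shortened stand-in for the sample (three float values per row), and skipping the text columns through `usecols` is one option among several, not necessarily what the answer's author had in mind.

```python
import numpy as np

# Hypothetical, shortened stand-in for the whitespace-separated sample.
lines = [
    "3cp0FH_A.pdb A 1 62 7.5635e+01 8.9632e+01 1.9255e+00",
    "1xhdFH_A.pdb A 3 169 1.0565e+02 -9.1260e+01 -9.3580e+01",
]

# loadtxt refuses the text columns because they cannot be parsed as floats.
try:
    np.loadtxt(lines)
    loadtxt_failed = False
except ValueError as exc:
    loadtxt_failed = True
    print("loadtxt failed:", exc)

# Skipping the two text columns lets loadtxt read the rest.
numeric = np.loadtxt(lines, usecols=list(range(2, 7)))
print(numeric.shape)
```

For the full file the upper bound of the range would be 35 instead of 7, since each row has 35 fields.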

You could select the first integer column with:

In [32]: Out[27][:,2]                                                           
Out[32]: array([1., 3., 5.])

and all numeric columns (the two integer columns followed by the float data) with:

In [33]: Out[27][:,2:].shape                                                    
Out[33]: (3, 33)

With usecols you could load the label columns separately:

In [35]: np.genfromtxt(txt.splitlines(), dtype=None, usecols=[0,1,2], encoding=None)
Out[35]:
array([('3cp0FH_A.pdb', 'A', 1), ('1xhdFH_A.pdb', 'A', 3),
       ('3c18FH_A.pdb', 'A', 5)],
      dtype=[('f0', '<U12'), ('f1', '<U1'), ('f2', '<i8')])
