2023年5月25日 09:03:04go评论82阅读模式

英文:

Find nearest range higher and lower inside another range

问题

我理解了你的问题，你想要提高你的awk脚本的速度。然而，由于这个问题涉及到具体的代码优化和性能调整，这可能需要更深入地了解你的数据和环境。你可以考虑以下几个方向来提高脚本的性能：

减少文件操作次数：你的脚本中涉及了多次文件读写操作，特别是在awk '{print > "split/"$4".txt"}' non_coding.txt这一步。你可以考虑减少这些操作，尽可能在内存中处理数据。
减少循环次数：如果可能的话，尽量减少循环的次数。这可能需要对脚本的逻辑进行一些调整。
使用更高效的数据结构：在处理大量数据时，选择合适的数据结构会对性能产生重要影响。如果可能的话，尝试使用更高效的数据结构。
并行处理：如果你的机器有多个处理器核心，可以考虑使用并行处理技术，将任务分配给多个核心来加快处理速度。
避免不必要的计算：在脚本中避免重复计算相同的值，尽可能缓存结果。
使用更快的算法：在某些情况下，使用不同的算法可能会显著提高性能。

这些是一些可能提高脚本性能的一般性建议，但具体实施还需要根据你的数据和环境来调整。如果你在实施这些优化时遇到了具体的问题，也可以随时向我求助。

英文:

I have rewritten this question many times, as while writing it out, I have come up with some ideas and have managed to get a very slow, but working solution. It seems writing out what one needs to do tends to help with the problem. However I am still having some issues with speed and would very much like some insight/input/comments/help. Anyway, here is what I am trying to do:

I have two files of genome coordinate data. File1 contains non-coding coordinates and file2 contains coding coordinates. Both contain five fields. The first field contains a uniqID. The second and third fields contain the start and stop coordinates respectively. The fourth field contains the chromosome ID. There can sometimes be many lines with the same chromosome ID (as there are multiple non-coding and coding ranges per chromosome). The fifth field contains the length/span of the coordinates.

For each set of coordinates in file2 I wish to find the closest (below and above, but currently just working on the "below" first and will reuse code to deal with the "above") set of coordinates in file1 matching the same ID that also has a length/span large enough to hold the coordinates from file2.

I then want to offset the coordinates by a random number that can have a min value of 1 through to a max value of the difference in size of the two coordinate span lengths.

To explain some more, here are some lines for each file.

File1 - non-coding

chr_17_0_3614        0      3614   chr_17 3614
chr_17_16177_17009   16177  17009  chr_17 832
chr_20_0_103982      0      103982 chr_20 103982
chr_20_105978_107383 105978 107383 chr_20 1405
chr_20_107502_108744 107502 108744 chr_20 1242
chr_20_108897_110439 108897 110439 chr_20 1542

File2 - coding

chr_17_3614_4570     3614   4570   chr_17 956
chr_17_7901_8585     7901   8585   chr_17 684
chr_17_10351_10698   10351  10698  chr_17 347
chr_20_104882_104971 104882 104971 chr_20 89
chr_20_105763_105978 105763 105978 chr_20 215
chr_20_107383_107502 107383 107502 chr_20 119

What I think I need to do is:

Split non-coding file into multiple files based on chromosome ID.
Run a loop that takes in each line of coding.txt and assigns variables to each field.
Check if coding 'start' value is higher or equal to non-coding 'end' and the 'span' of coding coordinates is less than the 'span' of the non-coding.
If true, then set variables for holding needed values for printing.
Once the coding 'start' value is less than the non-coding 'end' value we can stop looking and print out the necessary values (after some modification for randomness) and then move onto the next line.

I have written some awk that looks like it works, but it is in need of some speed increases. Here is some output from the above I get with this script:

OffsetBelow_chr_17_3614_4570     1621   2577   chr_17
OffsetBelow_chr_17_7901_8585     293    977    chr_17
OffsetBelow_chr_17_10351_10698   3010   3357   chr_17
OffsetBelow_chr_20_104882_104971 77516  77605  chr_20
OffsetBelow_chr_20_105763_105978 11841  12056  chr_20
OffsetBelow_chr_20_107383_107502 106549 106668 chr_20

And here is the script I wrote:

#!/usr/bin/bash
# split non-coding into individual chromosome files
awk &#39;{print &gt; &quot;split/&quot;$4&quot;.txt&quot;}&#39; non_coding.txt
# find nearest coords
while read -r uniqId start stop chrID span; do
 awk -v rseed=&quot;$RANDOM&quot; -v qStart=&quot;$start&quot; -v qStop=&quot;$stop&quot; -v uId=&quot;$uniqId&quot; -v span=&quot;$span&quot; &#39;BEGIN {
  belowEnd = 0; belowStart = 0; availSpan = 0; belowUniqId = &quot;none&quot;; srand(rseed)
  }
  {if (qStart &lt; $3) {next} else if (qStart &gt;= $3 &amp;&amp; span &lt;= $5)
  {belowEnd = $3; belowStart = $3-span; availSpan = $5; belowUniqId = $1;
  min=1; max=$5-span; offset=int(min+rand()*(max-min+1))}} END {
  print &quot;OffsetBelow_&quot;uId, belowStart-offset, belowEnd-offset, $4}&#39; split/&quot;$chrID&quot;.txt
done &lt; coding.txt

So, it works, but is super slow as I'm running over the same file many, many times (coding.txt is about 100,000 lines long), and I'm just not sure where to go from here to make it work better. Also, I need to now make this work for the nearest span 'above' (which I'm pretty sure should be mostly a jostling of variables and operators), and maybe somehow that can work in with making it quicker... Anyway, some comments/thoughts/ideas to make this process run quicker would be much appreciated. Thanks for your time and brain power!

答案1

得分: 1

所以，它能够运行，但速度非常慢，因为我多次运行相同的文件（coding.txt大约有100,000行），我只是不确定如何才能让它更好地运行。

通过将非编码区域分割成每个染色体文件，您可以节省一些时间，但这里的根本问题是您为每个编码区域执行单独的遍历。更清晰和更快的方法是通过一次联合遍历原始文件来提取所有编码区域所需的信息。

有几种安排细节的方法，但我可能会选择以下方式之一：

向每个文件的每一行添加一个字段，以指示它是编码区域还是非编码区域，然后
将这两个文件一起排序为一个文件，按（染色体ID，起始坐标）排序
当您逐行读取该文件时，跟踪上次看到的非编码区域
当您看到编码区域时，您知道要么
- 没有前置的非编码区域（如果尚未跟踪非编码区域，或者如果跟踪的区域属于不同的染色体），或者
- 直接在此编码区域之前的非编码区域是您当前正在跟踪的区域

稍微注意一下，您还可以在同一次遍历中从中提取每个编码区域的最接近的后续非编码区域，但您也可以通过反转排序顺序并执行与之前完全相同的操作来获得后续区域。

部分示例：

#!/bin/bash
sort -k4,4 -k2,2n \
  <(sed '/^\s*$/d; s/$/ N/' non_coding.txt) \
  <(sed '/^\s*$/d; s/$/ C/' coding.txt) |
  while read -r uid start stop chr length kind; do
    case $kind in
      C) # 处理编码区域
        ;;
      N) # 处理非编码区域
        ;;
      *) # 报告错误
        ;;
    esac
  done

sort -k4,4 -k2,2n 表示逐行排序（命令的标准输入），使用跨越字段4到4的主键和跨越字段2到2的次要数值键。结果写入命令的标准输出。

<(sed '/^\s*$/d; s/$/ N/' non_coding.txt) 是bash的扩展，将 sed '/^\s*$/d; s/$/ N/' non_coding.txt 的输出文件的名称扩展为可以读取的文件名称。sed 命令从 non_coding.txt 读取输入，删除空白行 (/^\s*$/d)，并在所有其他行末尾附加一个空格和 'N'，然后将其写入stdout。

<(sed '/^\s*$/d; s/$/ C/' coding.txt) 与前面的类似。

read -r uid start stop chr length kind 读取一行，将其分成shell单词，并按顺序分配给指定的shell变量。特别是，这将区域类型指示符放入 $kind，在循环内部使用它来选择适当的行为。

...至此，我将其余的细节留给您。

英文:

> So, it works, but is super slow as I'm running over the same file many, many times (coding.txt is about 100,000 lines long), and I'm just not sure where to go from here to make it work better.

You do save yourself some time by splitting the non-coding regions into per-chromosome files, but the fundamental problem here is that you are performing a separate pass for each coding region. It would be cleaner and much faster to make a single, joint pass through both of the original files to extract the wanted information for all the coding regions in one go.

There are several ways to arrange the details, but I'd probably go with something along these lines:

Add a field to each line of each of the files to indicate whether it is a coding or non-coding region, then
sort the two files together into a single one, ordered by (chromosome id, start coordinate)
As you read through that file line by line, track the last-seen non-coding region
When you see a coding region, you know that either
- there is no preceding non-coding region (if there is not yet a non-coding region tracked, or if the tracked one belongs to a different chromosome), OR
- the non-coding region immediately preceding this coding region is the one you currently have tracked

With a bit more attention you could also extract each coding region's nearest subsequent non-coding region from the same pass, but you could also just get the subsequent regions by reversing the sort order and doing exactly the same thing as before.

Partial example:

#!/bin/bash
sort -k4,4 -k2,2n \
  &lt;(sed &#39;/^\s*$/d; s/$/ N/&#39; non_coding.txt) \
  &lt;(sed &#39;/^\s*$/d; s/$/ C/&#39; coding.txt) |
  while read -r uid start stop chr length kind; do
    case $kind in
      C) # handle a coding region
        ;;
      N) # handle a non-coding region
        ;;
      *) # report an error
        ;;
    esac
  done

The sort -k4,4 -k2,2n expresses a line-by-line sort (of the command's standard input), using a primary key spanning fields 4 through 4, and a secondary, numeric key spanning fields 2 through 2. Results are written to the command's standard output.

The <(sed '/^\s*$/d; s/$/ N/' non_coding.txt) is a bash-ism that expands to the name of a file from which the output of sed '/^\s*$/d; s/$/ N/' non_coding.txt can be read. The sed command reads input from non_coding.txt, drops empty / blank lines (/^\s*$/d), and appends a space and an 'N' to all other lines (s/$/ N/), which are written to stdout.

The <(sed '/^\s*$/d; s/$/ C/' coding.txt) is analogous to the preceding.

The read -r uid start stop chr length kind reads one line, splits it into shell words, and assigns those, in order, to the specified shell variables. In particular, this puts the region-type indicator into $kind, which is used inside the loop to choose the appropriate behavior.

... and that's where I leave the rest of the details to you.

答案2

得分: 1

以下是翻译好的内容：

考虑到John和Ed的帮助，我认为我已经成功提出了一个解决方案。

首先，将两个文件合并成一个文件，在第六个字段中添加编码/非编码的标识符，并按字段4然后字段2排序。然后运行一个awk脚本。这非常快速。再次感谢John和Ed在方向上的帮助。

看起来它的工作方式与我想要的一样，但我需要进行更多的测试才能完全确定。无论如何，对于那些感兴趣的人以及其他偶然发现这个问题的人，这就是我想出来的东西。如果有人发现任何问题，请告诉我。

英文:

Taking into account John and Ed's help, I think I've managed to come up with a solution.

First, add the two files together into one, add an identifier of coding/non-coding as a sixth field and sort on field 4 then 2. Then run an awk script. This is super fast. So thanks again for your help in direction John and Ed.

It seems to work as I want, but I'll need to some more testing to be fully sure. Anyway, for those who are interested and anyone else who stumbles upon this, this is what I came up with. If anyone sees any issues with this, let me know.

#!/bin/bash
awk &#39;
BEGIN {OFS=&quot;\t&quot;}
{
    # Store each line in an array indexed by record id
    lines[$1] = $0;
    
    # Store record ids, start, ref, length, and type in separate arrays
    ids[NR] = $1;
    starts[$1] = $2;
    refs[$1] = $4;
    lengths[$1] = $5;
    types[$1] = $6;
}
END {
    srand();
    for(i = 1; i &lt;= NR; i++) { 
        # Only consider coding regions
        if(types[ids[i]] == &quot;C&quot;) {
            
            # Find nearest below non-coding region
            for(j = i - 1; j &gt; 0; j--) {
                if(types[ids[j]] == &quot;N&quot; &amp;&amp; refs[ids[i]] == refs[ids[j]]) {
                    OffNeg = ids[j];
                    break;
                }
            }
            # Find nearest above non-coding region
            for(j = i + 1; j &lt;= NR; j++) {
                if(types[ids[j]] == &quot;N&quot; &amp;&amp; refs[ids[i]] == refs[ids[j]]) {
                    OffPos = ids[j];
                    break;
                }
            }
            # Check that we found both above and below regions
            if(OffPos &amp;&amp; OffNeg) {
                # Determine length of coding region
                len = lengths[ids[i]];
                # Make sure both non-coding regions are long enough
                if(lengths[OffPos] &gt;= len &amp;&amp; lengths[OffNeg] &gt;= len) {
                    # Calculate random start positions within each non-coding region
                    # that allow for a segment of equal length to the coding region
                    OffPosStart = starts[OffPos] + int(rand() * (lengths[OffPos] - len + 1));
                    OffNegStart = starts[OffNeg] + int(rand() * (lengths[OffNeg] - len + 1));
                    # Output the new record ids with the calculated start and end positions
                    print &quot;OffNeg_&quot; ids[i], OffNegStart, OffNegStart+len, refs[ids[i]]
                    print &quot;OffPos_&quot; ids[i], OffPosStart, OffPosStart+len, refs[ids[i]]
                }
            }
            # Reset OffPos and OffNeg for the next coding region
            OffPos = &quot;&quot;;
            OffNeg = &quot;&quot;;
        }
    }
}&#39; file.txt

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

寻找另一个范围内的最近较高和较低范围

问题

答案1

答案2

如何在Java中运行我自己的个性化bash shell？

数组在if语句中使用时不扩展。

如何覆盖 Helm 值的YAML文件？

WebAssembly / Go（tinygo）函数执行时间非常慢

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。