How to find and delete range of values from an ENORMOUS dataset in R?

Question

I've been trying to solve this problem for two days and I'm tearing my hair out.
I have a dataset with nearly 15 million points. I have a few days of data points that are artifacts that I need to remove from the dataset.
I know the syntax for deleting rows that I need to delete from my dataset:
DataNoArtifacts <- Data[-(5039761:5041201), ]
This code has worked for me in the past and continues to work for me. My problem is FINDING the actual values that I need to delete so that I can get the row numbers, or range of row numbers, to put in the code.
When I try to filter the dataset to find the exact date and minutes I need to filter out, I can easily find the times. However, filtering the dataset to get them assigns them new row numbers. I need to be able to filter the data and see the original row numbers but cannot.
So I tried to solve this by scrolling through my 15,000,000 row dataset to find the rows manually since that seems to be the only option. The problem is, if I scroll more than one click or so at a time, the dataset will jump up/down a few thousand rows, which makes it near impossible for me to find the row number for any specific day, MUCH less the exact hour and minute I need to find. If I finally get within the range of a couple of weeks from the data point I need to find and delete, the only way to ensure I'll be able to find the datapoint I need is to click one click at a time... through 2 weeks or so of data that is broken up by the minute.
I have some days with data I need to delete with a huge range (example: 12/11/2003 has a few ranges of many hours at a time with equipment error that I need to delete), and some of the days with data I need to delete have only a few randomly interspersed minutes within the entire day that are problematic (example: 3/10/2014 with constant wind speeds between 0 and 3 m/s, with about 40 random blips in the data of the 1,440 minutes throughout the day that are anomalies like 20 m/s).
Long story short: I know how to delete the values I need to delete. But for the life of me R will not cooperate to help me find the rows.

Unfortunately, the data points I need to delete are not exclusively days/hours/minutes above or below a certain value that I can filter out. It's the times AROUND specific points that indicate artifacts that I need to delete.

Answer 1

Score: 1

In base R, row names are preserved when subsetting:

dd <- data.frame(x = LETTERS[1:3])
print(dd[dd$x != "B", , drop = FALSE])
  x
1 A
3 C
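
Building on the fact that base R keeps the original row positions, here is a minimal sketch of how which() could report the original row numbers for a condition on date and value. The column names timestamp and wind_speed and the threshold of 10 m/s are assumptions made for illustration; they are not taken from the question's data.

```r
# Hypothetical minute-resolution data for one day; column names are assumptions.
Data <- data.frame(
  timestamp  = seq(as.POSIXct("2014-03-10 00:00", tz = "UTC"),
                   by = "1 min", length.out = 1440),
  wind_speed = 2
)
Data$wind_speed[c(100, 101, 900)] <- 20   # inject three artificial spikes

# which() returns positions in the ORIGINAL data frame, not renumbered ones.
bad_rows <- which(
  Data$timestamp >= as.POSIXct("2014-03-10 00:00", tz = "UTC") &
  Data$timestamp <  as.POSIXct("2014-03-11 00:00", tz = "UTC") &
  Data$wind_speed > 10
)
bad_rows

# Drop the rows with a logical mask. Data[-bad_rows, ] would return zero rows
# if bad_rows happened to be empty, so the mask is the safer form.
keep <- !(seq_len(nrow(Data)) %in% bad_rows)
DataNoArtifacts <- Data[keep, ]
```

The same pattern works for a plain time range: drop the wind_speed condition and keep only the two timestamp comparisons.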

You are presumably working with tidyverse, which, as you have pointed out, works differently: tibbles do not keep row names, so the filtered result is renumbered from 1.

library(tidyverse)
dd <- as_tibble(dd)
filter(dd, x != "B")
# A tibble: 2 × 1
  x    
  <chr>
1 A    
2 C    

The easy solution to this, if you have enough memory to handle it (15 million integer indices at 4 bytes each is only about 60 MB), is to add your own row column:

dd |> mutate(row = seq(n()), .before = 1) |> filter(x != "B")
# A tibble: 2 × 2
    row x    
  <int> <chr>
1     1 A    
2     3 C    
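
The question also mentions needing to delete the times around specific points. Once the original row numbers are known, one possible approach is to widen each flagged row into a window of neighbouring rows before dropping them. This is only a sketch in base R (the indexing also works on a tibble); the flagged rows, the pad of 5 rows, and the column names are hypothetical.

```r
# Hypothetical data: one reading per row, with an explicit row column.
n <- 1000
Data <- data.frame(row = seq_len(n), wind_speed = 2)

flagged <- c(200L, 600L)   # rows identified as artifacts (assumed)
pad     <- 5L              # rows to drop on each side of a flagged row (assumed)

# Expand each flagged row i into the window i - pad, ..., i + pad.
window <- unlist(lapply(flagged, function(i) seq(i - pad, i + pad)))
# Keep only indices inside the data and remove duplicates from overlapping windows.
window <- unique(window[window >= 1 & window <= n])

# Drop the windows with a logical mask, which is safe even if window is empty.
keep <- !(seq_len(n) %in% window)
DataNoArtifacts <- Data[keep, ]
```

With one row per minute, pad corresponds to the number of minutes removed before and after each artifact. If the data has gaps, a window defined on the timestamp itself would be more reliable than one defined on row positions.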

huangapple
  • Posted on July 28, 2023, 01:24:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76782144.html