July 28, 2023, 01:24:22

How to find and delete range of values from an ENORMOUS dataset in R?

Question

I've been trying to solve this problem for two days and I'm tearing my hair out.
I have a dataset with nearly 15 million points. I have a few days of data points that are artifacts that I need to remove from the dataset.
I know the syntax for deleting rows that I need to delete from my dataset:
DataNoArtifacts <- Data[-(5039761:5041201), ]
This code has worked for me in the past and continues to work for me. My problem is FINDING the actual values that I need to delete so that I can get the row numbers, or range of row numbers, to put in the code.
When I try to filter the dataset to find the exact date and minutes I need to filter out, I can easily find the times. However, filtering the dataset to get them assigns them new row numbers. I need to be able to filter the data and see the original row numbers but cannot.
So I tried to solve this by scrolling through my 15,000,000 row dataset to find the rows manually since that seems to be the only option. The problem is, if I scroll more than one click or so at a time, the dataset will jump up/down a few thousand rows, which makes it near impossible for me to find the row number for any specific day, MUCH less the exact hour and minute I need to find. If I finally get within the range of a couple of weeks from the data point I need to find and delete, the only way to ensure I'll be able to find the datapoint I need is to click one click at a time... through 2 weeks or so of data that is broken up by the minute.
I have some days with data I need to delete with a huge range (example: 12/11/2003 has a few ranges of many hours at a time with equipment error that I need to delete), and some of the days with data I need to delete have only a few randomly interspersed minutes within the entire day that are problematic (example: 3/10/2014 with constant wind speeds between 0 and 3 m/s, with about 40 random blips in the data of the 1,440 minutes throughout the day that are anomalies like 20 m/s).
Long story short: I know how to delete the values I need to delete. But for the life of me R will not cooperate to help me find the rows.

Unfortunately, the data points I need to delete are not simply days/hours/minutes above or below a certain value that I could filter on. It's the times AROUND the specific points that indicate artifacts that I need to delete.

Answer 1

Score: 1

In base R, row names are preserved when subsetting:

dd <- data.frame(x = LETTERS[1:3])
print(dd[dd$x != "B", , drop = FALSE])

  x
1 A
3 C
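
Since base R keeps the original positions, the row numbers for a time window can also be obtained directly with which(), without scrolling. The following is only a sketch under assumptions: the column names `timestamp` and `wind`, the generated data, and the chosen window are all hypothetical stand-ins for the real dataset.

```r
# Hypothetical minute-resolution data; `timestamp` and `wind` are assumed names.
Data <- data.frame(
  timestamp = seq(as.POSIXct("2014-03-10 00:00", tz = "UTC"),
                  by = "1 min", length.out = 1440),
  wind = 2
)
Data$wind[c(100, 101, 900)] <- 20  # inject three artificial spikes

# Logical mask for the window to remove (01:30 through 01:45, inclusive).
is_bad <- Data$timestamp >= as.POSIXct("2014-03-10 01:30", tz = "UTC") &
          Data$timestamp <= as.POSIXct("2014-03-10 01:45", tz = "UTC")

# which() reports positions in the ORIGINAL data frame.
bad_rows <- which(is_bad)
range(bad_rows)

# Dropping by logical mask avoids the pitfall that Data[-integer(0), ]
# returns zero rows when nothing matches.
DataNoArtifacts <- Data[!is_bad, , drop = FALSE]
```

If the row numbers themselves are wanted for the question's `Data[-(a:b), ]` form, `range(bad_rows)` gives the two endpoints.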

You are presumably working with tidyverse, which, as you have pointed out, works differently: tibbles do not keep row names, so the filtered result is renumbered from 1.

library(tidyverse)
dd <- as_tibble(dd)
filter(dd, x != "B")

# A tibble: 2 × 1
  x
  <chr>
1 A
2 C

The easy solution to this, if you have enough memory to handle it (15 million integer indices take only about 60 MB), is to add your own row column:

dd |> mutate(row = seq(n()), .before = 1) |> filter(x != "B")

# A tibble: 2 × 2
    row x
  <int> <chr>
1     1 A
2     3 C
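
The question also mentions needing to remove the times around specific points, so here is one way the row column might be used for that. This is a sketch under assumptions, not a definitive method: the column names `timestamp` and `wind`, the spike threshold of 10, and the padding of 2 rows are all made up for illustration.

```r
library(dplyr)

# Hypothetical minute-resolution data; `timestamp` and `wind` are assumed names.
Data <- tibble(
  timestamp = seq(as.POSIXct("2014-03-10 00:00", tz = "UTC"),
                  by = "1 min", length.out = 1440),
  wind = 2
)
Data$wind[c(100, 900)] <- 20  # inject two artificial spikes

# Store the original row numbers before any filtering.
Data <- Data |> mutate(row = row_number(), .before = 1)

# Original row numbers of the spikes (threshold is an illustrative choice).
spikes <- Data |> filter(wind > 10) |> pull(row)

# Pad each spike by `pad` rows on either side, then clip to valid rows.
pad <- 2
drop_rows <- unique(unlist(lapply(spikes, function(i) (i - pad):(i + pad))))
drop_rows <- drop_rows[drop_rows >= 1 & drop_rows <= nrow(Data)]

# Keep everything whose original row number is not marked for removal.
DataNoArtifacts <- Data |> filter(!row %in% drop_rows)
```

Filtering on `row` rather than using negative indices keeps the original numbering visible in the result, so further rounds of cleaning can still refer to the same row numbers.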
