R: Improving the Speed of Pairwise Calculations

Question


I am working with the R programming language.

Suppose I have the following two data frames:

set.seed(123)

df_1 <- data.frame(
  name_1 = c("john", "david", "alex", "kevin", "trevor", "xavier", "tom", "michael", "troy", "kelly", "chris", "henry", "taylor", "ryan", "peter"),
  lon = rnorm(15, mean = -74.0060, sd = 0.01),
  lat = rnorm(15, mean = 40.7128, sd = 0.01)
)

df_2 <- data.frame(
  name_2 = c("matthew", "tyler", "sebastian", "julie", "anna", "tim", "david", "nigel", "sarah", "steph", "sylvia", "boris", "theo", "malcolm"),
  lon = rnorm(14, mean = -74.0060, sd = 0.01),
  lat = rnorm(14, mean = 40.7128, sd = 0.01)
)

My Problem: For each person in df_1, I am trying to find the 5 closest people (haversine distance) to this person in df_2 and record various distance statistics (e.g. mean, median, max, min, standard deviation).

Here is my own attempt at solving this problem.

First I defined a function that calculates the distance between each pair of points:

## PART 1
library(geosphere)

haversine_distance <- function(lon1, lat1, lon2, lat2) {
  distHaversine(c(lon1, lat1), c(lon2, lat2))
}
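
For reference, distHaversine uses an Earth radius of 6378137 m by default, so this helper returns distances in metres. A quick call on the first pair of points illustrates this:

# Distance in metres between the first person in df_1 and the first in df_2
haversine_distance(df_1$lon[1], df_1$lat[1], df_2$lon[1], df_2$lat[1])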

Then, I used a loop to calculate all comparisons:

## PART 2
# Create a matrix to store results
distances <- matrix(nrow = nrow(df_1), ncol = nrow(df_2))

# calculate the distances
for (i in 1:nrow(df_1)) {
  for (j in 1:nrow(df_2)) {
    distances[i, j] <- haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
  }
}

# Create final
final <- data.frame(
  name_1 = rep(df_1$name_1, each = nrow(df_2)),
  lon_1 = rep(df_1$lon, each = nrow(df_2)),
  lat_1 = rep(df_1$lat, each = nrow(df_2)),
  name_2 = rep(df_2$name_2, nrow(df_1)),
  lon_2 = rep(df_2$lon, nrow(df_1)),
  lat_2 = rep(df_2$lat, nrow(df_1)),
  distance = c(t(distances))  # transpose needed: rep(..., each =) enumerates df_2 fastest, while c(distances) runs down df_1 first
)
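
(As a quick spot-check of the rep() bookkeeping: with the transpose above, the second row of final pairs the first person in df_1 with the second person in df_2, so the following comparison should be TRUE.)

# Spot-check that the distance column lines up with the name columns
final$distance[2] == haversine_distance(df_1$lon[1], df_1$lat[1], df_2$lon[2], df_2$lat[2])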

Finally, I kept the 5 minimum distances per person:

## PART 3
# Keep only the 5 nearest rows for each unique value of final$name_1
final <- final[order(final$name_1, final$distance), ]
final <- final[ave(final$distance, final$name_1, FUN = seq_along) <= 5, ]


# Calculate summary statistics for each unique person in final$name_1
final_summary <- aggregate(distance ~ name_1,
                           data = final,
                           FUN = function(x) c(min = min(x),
                                               max = max(x),
                                               mean = mean(x),
                                               median = median(x),
                                               sd = sd(x)))
final_summary <- do.call(data.frame, final_summary)
names(final_summary)[-1] <- c("min_distance", "max_distance", "mean_distance", "median_distance", "sd_distance")


final_summary$closest_people <- tapply(final$name_2,
                                       final$name_1,
                                       FUN = function(x) paste(sort(x), collapse = ", "))


# break closest_people column into multiple columns
n <- 5
closest_people_split <- strsplit(final_summary$closest_people, ", ")
final_summary[paste0("closest_", seq_len(n))] <- do.call(rbind, closest_people_split)

My Question: The above code seems to work, but I am interested in improving the speed of this code (i.e. PART 2) when df_1 and df_2 become very large. As such, I am looking into options involving parallel computing, using functionalities such as doParallel, parLapply, SNOW, etc.

As I am not overly familiar with this, I tried an option with mclapply from the parallel package (https://www.rdocumentation.org/packages/parallel/versions/3.4.1/topics/mclapply):

library(parallel)

distances <- matrix(nrow = nrow(df_1), ncol = nrow(df_2))

# calculate the distances
distances <- mclapply(1:nrow(df_1), function(i) {
  sapply(1:nrow(df_2), function(j) {
    haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
  })
})

The code seems to have run, but I am not sure whether what I have done is correct, and whether it is actually improving the speed of this code.
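
(One way to check both points is to bind the per-row parallel results into a matrix and compare them with the serial matrix from PART 2, and to wrap each version in system.time(). A minimal sketch, assuming the PART 2 matrix has been kept as distances_serial; note that mclapply only forks on Unix-alikes, so mc.cores must stay at 1 on Windows.)

# Bind the per-row results into a matrix and compare with the serial loop
distances_par <- do.call(rbind, mclapply(1:nrow(df_1), function(i) {
  sapply(1:nrow(df_2), function(j) {
    haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
  })
}, mc.cores = 2))
all.equal(c(distances_par), c(distances_serial))  # should be TRUE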

Can someone please show me how to do this? Is my parallel attempt correct?

Thanks!

Answer 1

Score: 4


Although still quadratic, the following:

haversine_distance <- function(lon1, lat1, lon2, lat2) {
  distHaversine(c(lon1, lat1), c(lon2, lat2))
}

# calculate the distances
for (i in 1:nrow(df_1)) {
  for (j in 1:nrow(df_2)) {
    distances[i, j] <- haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
  }
}

can be rewritten in vectorised form like so:

haversine_distance <- function(lon1, lat1, lon2, lat2) {
  distHaversine(cbind(lon1, lat1), cbind(lon2, lat2))
}

# calculate the distances
distances <- expand.grid(i = seq_len(nrow(df_1)), j = seq_len(nrow(df_2)))
# Do not compare to self
distances <- distances[distances$i != distances$j, ]
distances$dist <- haversine_distance(
  df_1$lon[distances$i], df_1$lat[distances$i],
  df_2$lon[distances$j], df_2$lat[distances$j]
)

(Note that the i != j filter only makes sense if row i of df_1 and row i of df_2 describe the same person; with two unrelated data frames, as in the question, that line should simply be dropped.)

this will be a lot faster because the code is now vectorised (the example below shows a factor 100 speedup on my machine):

df_1 <- df_1[sample(nrow(df_1), 5E2, replace=TRUE), ]
df_2 <- df_2[sample(nrow(df_2), 5E2, replace=TRUE), ]

system.time({
  ## PART 2
  # Create a matrix to store results
  distances <- matrix(nrow = nrow(df_1), ncol = nrow(df_2))
  # calculate the distances
  for (i in 1:nrow(df_1)) {
    for (j in 1:nrow(df_2)) {
      distances[i, j] <- haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
    }
  }
})

##   user  system elapsed 
## 17.427   0.015  17.479  

system.time({
  # calculate the distances
  distances <- expand.grid(i = seq_len(nrow(df_1)), j = seq_len(nrow(df_2)))
  # Do not compare to self
  distances <- distances[distances$i != distances$j, ]
  distances$dist <- haversine_distance(
    df_1$lon[distances$i], df_1$lat[distances$i],
    df_2$lon[distances$j], df_2$lat[distances$j]
  )
})
##   user  system elapsed 
##  0.058   0.024   0.082 

The long form of distances also makes it easier to do the next step (selecting the top 5, using the original solution from the OP):

distances <- distances[order(distances$i, distances$dist), ]
distances <- distances[ave(distances$dist, distances$i, FUN = seq_along) <= 5, ]
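
From here, the per-person summary statistics can be computed directly on this long format. A minimal sketch mirroring the aggregate() call from the question; the summ name is mine:

# Summary statistics per df_1 row, computed on the long top-5 table
summ <- aggregate(dist ~ i, data = distances,
                  FUN = function(x) c(min = min(x), max = max(x), mean = mean(x),
                                      median = median(x), sd = sd(x)))
summ <- do.call(data.frame, summ)   # flatten the matrix column into plain columns
summ$name_1 <- df_1$name_1[summ$i]  # map the row index back to a name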

Parallel version

On my system, with the example dataset above (500 records each), this is slower than running on a single CPU, because of the time needed to start the computation and copy the data to the nodes.

# calculate the distances
distances <- expand.grid(i = seq_len(nrow(df_1)), j = seq_len(nrow(df_2)))
# Do not compare to self
distances <- distances[distances$i != distances$j, ]
# Start the cluster
cl <- parallel::makeCluster(4)
# Load libraries on cluster nodes
parallel::clusterEvalQ(cl, library(geosphere))
# Split into groups
distances$node <- floor(seq(0, 4-0.0001, length.out = nrow(distances)))
distances <- split(distances, distances$node)
# Run the computation on the nodes:
distances <- parallel::parLapply(cl, distances, function(distances, df_1, df_2, haversine_distance) {
  distances$dist <- haversine_distance(
    df_1$lon[distances$i], df_1$lat[distances$i],
    df_2$lon[distances$j], df_2$lat[distances$j]
  )
  distances
}, df_1, df_2, haversine_distance = haversine_distance)
# Combine the results
distances <- do.call(rbind, distances)
# Close the cluster
parallel::stopCluster(cl)
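
As a side note, the node column and the 4-0.0001 trick can be replaced by parallel::splitIndices, which chunks a sequence into roughly equal groups. A small sketch of that variant, applied to the long table before it is split:

# Chunk the row indices of the long table instead of computing a node id
idx <- parallel::splitIndices(nrow(distances), 4)
parts <- lapply(idx, function(ix) distances[ix, ])
# 'parts' can then be passed to parallel::parLapply() just like the split() result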

Answer 2

Score: 1


You could use apply; note that x <- sort(...)[1:5] assigns the five smallest distances inside the first call to min(), so the remaining statistics reuse x:

cbind(df_1[1], 
      t(apply(df_1[-1], 1, \(y)
         c(min = min(x <- sort(geosphere::distHaversine(y, df_2[-1]))[1:5]),
           max = max(x),
           mean = mean(x),
           median = median(x),
           sd = sd(x)))))

    name_1       min       max      mean    median        sd
1     john  423.1875 1948.9521 1106.4374 1052.8789 674.69139
2    david  602.9369  941.3102  752.1558  715.3872 159.37550
3     alex 1765.7750 2428.5429 2013.7843 1828.6055 294.37805
4    kevin  638.9259  834.5504  715.5252  644.2898 102.23793
5   trevor  520.1834  650.9167  609.4363  631.9494  52.96026
6   xavier  972.9730 1767.1953 1369.5604 1396.8569 371.03190
7      tom  243.6729  530.4778  426.2490  447.8639 110.26649
8  michael  581.9209 1504.5642 1057.1773 1012.5247 378.81712
9     troy  549.4500 1035.0599  782.8799  828.5550 220.72034
10   kelly  491.6430 1130.9239  717.7716  658.7015 248.96974
11   chris 1389.1659 2106.7084 1644.0448 1455.8430 316.31565
12   henry  394.8684  894.5358  647.1996  670.9220 236.69562
13  taylor  170.5171  746.6206  470.0857  439.8022 227.39141
14    ryan  342.8375 1243.7473  970.0721 1052.6759 367.08513
15   peter  195.4891 1455.0204  834.2543  830.2758 539.69009

Answer 3

Score: 1


Regarding speed, I think your approach with mclapply is fine and should indeed help speed up your calculations (although you don't really need to initialize an empty matrix beforehand). In case you want to know more about parallelization in general, maybe also check this answer.

I've seen other questions mentioning the Haversine distance in the past, and you can squeeze out more performance if you optimize the calculations, either in R or C++ (and I link again to the proxy package, which can potentially facilitate your work).

One thing left to optimize is space (memory) utilization. If you're interested in k nearest neighbors and your data frames become really big, it can make a big difference to store only the distances you need instead of the whole distance matrix. Since I had already done some groundwork in the answer I linked above, I adapted the Rcpp code to your specific use case, although I didn't add a lot of edge-case handling (e.g. specifying more desired neighbors than there are rows in df_2):

// [[Rcpp::depends(RcppParallel,RcppThread)]]

#include <algorithm> // lower_bound
#include <cstddef> // size_t
#include <math.h> // sin, cos, sqrt, atan2, pow
#include <limits>
#include <vector>

#include <RcppThread.h>
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace std;
using namespace Rcpp;
using namespace RcppParallel;

class HaversineCalculator : public Worker
{
public:
  HaversineCalculator(const DataFrame& df_1,
                      const DataFrame& df_2,
                      const int k,
                      const NumericVector& distances)
    : lon1_(as<NumericVector>(df_1["lon"]))
    , lat1_(as<NumericVector>(df_1["lat"]))
    , lon2_(as<NumericVector>(df_2["lon"]))
    , lat2_(as<NumericVector>(df_2["lat"]))
    , k_(k)
    , dist_(distances)
    , cos_lat1_(df_1.nrows())
    , cos_lat2_(df_2.nrows())
    , neighbors_(distances.length(), numeric_limits<size_t>::max())
  {
    // terms for distance calculation
    for (size_t i = 0; i < cos_lat1_.size(); ++i) {
      cos_lat1_[i] = cos(lat1_[i] * to_rad);
    }
    for (size_t i = 0; i < cos_lat2_.size(); ++i) {
      cos_lat2_[i] = cos(lat2_[i] * to_rad);
    }
  }

  vector<size_t> neighbors_;

  void operator()(size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) { // iterate over our chunk of df_1
      if (RcppThread::isInterrupted()) return;

      auto const distances_begin = dist_.begin() + i * k_;
      auto const distances_end = distances_begin + k_;
      auto const neighbors_begin = neighbors_.begin() + i * k_;

      for (size_t j = 0; j < lon2_.size(); ++j) { // iterate over all df_2 entries
        // haversine distance
        double d_lon = (lon2_[j] - lon1_[i]) * to_rad;
        double d_lat = (lat2_[j] - lat1_[i]) * to_rad;
        double d_hav = pow(sin(d_lat / 2), 2) + cos_lat1_[i] * cos_lat2_[j] * pow(sin(d_lon / 2), 2);
        if (d_hav > 1) d_hav = 1;
        d_hav = 2 * atan2(sqrt(d_hav), sqrt(1 - d_hav)) * 6378137;

        auto dist_index = lower_bound(distances_begin, distances_end, d_hav); // std
        if (dist_index < distances_end) {
          for (auto k = distances_end - 1; k > dist_index; --k) {
            // shift potentially valid neighbors and their distances
            auto offset = k - distances_begin;
            *(neighbors_begin + offset) = *(neighbors_begin + offset - 1);
            *k = *(k - 1);
          }

          *dist_index = d_hav;
          *(neighbors_begin + (dist_index - distances_begin)) = j;
        }
      }
    }
  }

private:
  static double to_rad;

  const RVector<double> lon1_;
  const RVector<double> lat1_;
  const RVector<double> lon2_;
  const RVector<double> lat2_;

  const int k_;
  RVector<double> dist_;

  vector<double> cos_lat1_;
  vector<double> cos_lat2_;
};

double HaversineCalculator::to_rad = 3.1415926535897 / 180;

// [[Rcpp::export]]
DataFrame haversine_nn(const DataFrame& df_1, const DataFrame& df_2, const int k) {
  NumericVector distances(k * df_1.nrows(), numeric_limits<double>::max());
  HaversineCalculator hc(df_1, df_2, k, distances);

  // you could play around with operations_per_call
  // see also:
  // - https://rcppcore.github.io/RcppParallel/#grain_size
  // - https://stackoverflow.com/a/14878734/5793905
  unsigned int operations_per_call = 100;
  unsigned int grain = operations_per_call / df_2.nrows() + (operations_per_call % df_2.nrows() != 0);
  Rcout << "Processing " << grain << " row(s) from df_1 on each thread call.\n";
  parallelFor(0, df_1.nrows(), hc, grain);
  RcppThread::checkUserInterrupt();

  CharacterVector names_1(k * df_1.nrows());
  CharacterVector neighbors(k * df_1.nrows());

  CharacterVector name_1 = df_1["name_1"];
  CharacterVector name_2 = df_2["name_2"];

  for (size_t i = 0; i < df_1.nrows(); ++i) {
    for (size_t j = 0; j < k; ++j) {
      size_t offset = (i * k) + j;
      names_1[offset] = name_1[i];

      size_t name_2_index = hc.neighbors_[offset];
      if (name_2_index < df_2.nrows()) {
        neighbors[offset] = name_2[name_2_index];
      } else {
        neighbors[offset] = "logic_error";
      }
    }
  }

  return DataFrame::create(_["name_1"] = names_1, _["name_2"] = neighbors, _["distance"] = distances);
}

I saved that in haversine_nn.cpp and then ran:

library(Rcpp)
library(RcppParallel)
library(RcppThread)

sourceCpp("haversine_nn.cpp")
haversine_nn(df_1, df_2, 5L)
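
To double-check the compiled results on the toy data, one can compare them against a plain geosphere reference. A minimal sketch; ref, ref_top5 and res are my names:

# Reference distances via geosphere, then the 5 smallest per df_1 row
library(geosphere)
ref <- distm(df_1[, c("lon", "lat")], df_2[, c("lon", "lat")], fun = distHaversine)
ref_top5 <- t(apply(ref, 1, function(r) sort(r)[1:5]))

res <- haversine_nn(df_1, df_2, 5L)
# each df_1 row occupies a consecutive block of 5 sorted distances in res
all.equal(ref_top5[1, ], res$distance[1:5], tolerance = 1e-6)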

I believe somewhere in your code you have an issue detecting the top 5 neighbors; at least my results seem to differ, and even looking at your distances matrix from part 2, there seems to be something off.

And some explicit notes about the C++ code:

Note that the distance column of the returned data frame is allocated beforehand and used directly to avoid concatenating things at the end, using offsets to figure out which indices are in the valid range depending on the row from df_1 that is being processed.

There's no synchronization because each chunk is completely disjoint from the others, so if the logic is correct, each thread will only write to parts of the vector that are never going to be accessed by the other threads.

Be careful with the handling of size_t because it's unsigned, so even comparing variables of that type to negative values can give you weird results (it certainly happened to me).

The grain and operations_per_call have a similar goal as mc.preschedule in mclapply, although they provide a bit finer control. It's not easy to know what's optimal; you'd probably have to profile with your own data, ideally different data sets.

And for completeness, this still has df_1.nrows() * df_2.nrows() runtime complexity; I don't think there's a way to avoid that, since you have to calculate all distances even if you only care about a subset of them. It just has lower space complexity.

Answer 4

Score: -1


This is an answer I already gave, with absolutely no success, here. You may adapt it for your purpose.

For very heavy data, I would use something like this (but I don't know how this solution behaves compared to the other solutions provided):

rm(list=ls())
gc()

# packages needed
for (package in c('data.table', 'stringr', 'stringdist')) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package, dependencies = TRUE)
    library(package, character.only = TRUE)
  }
}

# Data
tbl.data <- data.table(name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital",
                                "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital",
                                "CANON PVT. LTD.",
                                "SANDSTONE COOLING LTD",
                                "Greg", "ANTILA,THOMAS"))

# A string with just letters
tbl.data[, string_ := str_replace_all(tolower(str_replace_all(name, "[^[:graph:]]", "")), "[[:punct:][:digit:]]", "")]

# Create a table for comparisons ----
################################ -
tbl.a <- tbl.data[, .(string_1 = string_, names1 = name, one = 1, id1 = .I)]
tbl.b <- tbl.data[, .(string_2 = string_, names2 = name, one = 1, id2 = .I)]
# avoid unneeded lines:
tbl.a <- unique(tbl.a)
tbl.b <- unique(tbl.b)
# the table of all couples:
tbl.dedoubl <- merge(tbl.a, tbl.b, by = "one", allow.cartesian = TRUE)
# avoid unneeded comparisons:
tbl.dedoubl <- tbl.dedoubl[id1 < id2]

# Distance calculations: ----
################################ -
tbl.dedoubl[, distance := stringdist(string_1, string_2, method = "lv")]

# Lines suspected to be doubles: ----
################################### -
tbl.dedoubl[distance <= 2]
