Mapping an RDD list to a function of two arguments

Question
I have a function that compares images from the same folder against one another, with an output of a similarity prediction.
The function runs fine in plain Python, but I want to leverage the power of PySpark parallelisation.
Here I use Spark by simply parallelizing the list, i.e. turning it into an RDD.
img_list = sc.parallelize(os.listdir(folder_dir))
f_img_list = img_list.filter(lambda f: f.endswith('.jpg') or f.endswith('.png'))
Defining the function:
def compare_images(x1, x2):
    # Preprocess the images into arrays
    img_array1 = preprocess_image2(x1)
    img_array2 = preprocess_image2(x2)
    pred = compare(img_array1, img_array2)
    return pred
At this point I want to apply operations on the RDD, with the requirement that an image in the folder should not be compared against itself.
My attempt is to use map, but I'm unsure how to do that. Below is my attempt, but it assumes only one argument:
prediction = f_img_list.map(compare_images)
prediction.collect()
I'm also aware that my attempt does not enforce the requirement that an image should not be compared against itself; assistance with that would also be appreciated.
Answer 1

Score: 0
You could create a list of pairs of distinct image filenames, parallelize that list, and modify your compare_images function to take a single argument (a pair) instead of two.
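For illustration, itertools.combinations already yields each unordered pair exactly once and never pairs a filename with itself (the filenames below are made up):

>>> import itertools
>>> list(itertools.combinations(['a.jpg', 'b.png', 'c.jpg'], 2))
[('a.jpg', 'b.png'), ('a.jpg', 'c.jpg'), ('b.png', 'c.jpg')]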
Edit: let's use the RDD's filter method to keep only the files that end with '.jpg' or '.png'.
import os
import itertools

from pyspark import SparkContext

sc = SparkContext()

# Stubs for your own functions; replace with the real implementations.
def preprocess_image2(image_path):
    pass

def compare(img_array1, img_array2):
    pass

folder_dir = '/path/to/images'  # set this to your image folder

# Parallelize the directory listing and keep only the image files.
# Note: os.listdir returns bare filenames; join them with folder_dir
# if preprocess_image2 expects full paths.
img_list = sc.parallelize(os.listdir(folder_dir))
f_img_list = img_list.filter(lambda f: f.endswith('.jpg') or f.endswith('.png'))

# Collect the filenames to the driver and build every unordered pair;
# itertools.combinations never pairs a file with itself.
f_img_list_local = f_img_list.collect()
image_pairs = list(itertools.combinations(f_img_list_local, 2))

# Parallelize the pairs so the comparisons run on the cluster.
image_pairs_rdd = sc.parallelize(image_pairs)

def compare_images(image_pair):
    x1, x2 = image_pair
    img_array1 = preprocess_image2(x1)
    img_array2 = preprocess_image2(x2)
    pred = compare(img_array1, img_array2)
    return pred

predictions = image_pairs_rdd.map(compare_images)
results = predictions.collect()
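If the folder contains many images, collecting every filename to the driver may not scale. A minimal sketch of an alternative, assuming filenames within the folder are unique (which os.listdir guarantees for a single directory), builds the pairs on the cluster with the RDD's cartesian method instead:

# Build every ordered pair of filenames on the cluster.
pairs_rdd = f_img_list.cartesian(f_img_list)

# Keep each unordered pair once; the strict '<' also drops self-pairs.
distinct_pairs_rdd = pairs_rdd.filter(lambda pair: pair[0] < pair[1])

predictions = distinct_pairs_rdd.map(compare_images)
results = predictions.collect()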