Mapping an RDD list to a function of two arguments

Question

I have a function that compares images from the same folder against each other and outputs a similarity prediction.
The function runs fine in plain Python, but I want to leverage PySpark's parallelization.

Here, I use Spark simply to parallelize the list, i.e. to turn it into an RDD.

img_list = sc.parallelize(os.listdir(folder_dir))
f_img_list = img_list.filter(lambda f: f.endswith('.jpg') or f.endswith('.png'))

Defining the function:

def compare_images(x1, x2):
  # Preprocess images
  img_array1 = preprocess_image2(x1)
  img_array2 = preprocess_image2(x2)

  pred = compare(img_array1, img_array2)
  return pred

At this point I want to apply the function to the RDD, with the requirement that an image in the folder should not be compared against itself.

My attempt is to use "map", but I'm unsure how to do that. Below is my attempt, but it assumes a function of only one argument:

prediction = f_img_list.map(compare_images)
prediction.collect()

I'm also aware that my attempt does not enforce the requirement that an image should not be compared against itself; help with that would also be appreciated.

Answer 1

Score: 0

You could create a list of pairs of distinct image filenames and then parallelize that list. You would also need to modify your compare_images function to take a single argument (the pair) instead of two.
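
As a rough illustration of the pairing idea in plain Python, without Spark: the file names and the describe_pair helper below are made up for this sketch and are not part of the original code. It shows that itertools.combinations never pairs an item with itself, and that a function handed to map has to unpack the pair itself.

```python
import itertools

# Hypothetical file names, used only for illustration.
file_names = ["a.jpg", "b.png", "c.jpg"]

# combinations(..., 2) yields each unordered pair once
# and never pairs an item with itself.
image_pairs = list(itertools.combinations(file_names, 2))

def describe_pair(image_pair):
    # map passes one element at a time, so the tuple is unpacked here.
    x1, x2 = image_pair
    return f"{x1} vs {x2}"

descriptions = list(map(describe_pair, image_pairs))
print(descriptions)
# prints ['a.jpg vs b.png', 'a.jpg vs c.jpg', 'b.png vs c.jpg']
```

For n files this produces n * (n - 1) / 2 pairs, so the list of pairs grows quadratically with the number of images.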

Edit: let's use the RDD's filter method to keep only the files that end with '.jpg' or '.png'.

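import itertools
import os

from pyspark import SparkContext

# Reuse the active SparkContext if one already exists (e.g. in a shell or notebook).
sc = SparkContext.getOrCreate()

def preprocess_image2(image_path):
    pass  # placeholder for the preprocessing function from the question

def compare(img_array1, img_array2):
    pass  # placeholder for the comparison function from the question

# folder_dir is the image folder from the question and must be defined beforehand.
# Note that os.listdir returns bare file names, not full paths.
img_list = sc.parallelize(os.listdir(folder_dir))
f_img_list = img_list.filter(lambda f: f.endswith(('.jpg', '.png')))

# Collect the filtered names on the driver and build every unordered pair of
# distinct files, so that no image is paired with itself.
f_img_list_local = f_img_list.collect()
image_pairs = list(itertools.combinations(f_img_list_local, 2))
image_pairs_rdd = sc.parallelize(image_pairs)

def compare_images(image_pair):
    # map passes each element as a single argument, so unpack the pair here.
    x1, x2 = image_pair
    img_array1 = preprocess_image2(x1)
    img_array2 = preprocess_image2(x2)
    pred = compare(img_array1, img_array2)
    return pred

predictions = image_pairs_rdd.map(compare_images)
results = predictions.collect()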
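
If collecting the file names back to the driver is undesirable, one possible variation (not part of the original answer) is to build the pairs inside Spark with the RDD's cartesian and filter methods. The helper names is_ordered_pair and distinct_pairs below are hypothetical, and the sketch assumes the RDD elements are unique, comparable values such as the file names of a single folder.

```python
def is_ordered_pair(pair):
    # Keep (a, b) only when a sorts before b. This drops self-pairs (a, a)
    # and keeps just one of the mirrored pairs (a, b) and (b, a).
    a, b = pair
    return a < b

def distinct_pairs(names_rdd):
    # cartesian() forms every (a, b) combination of the RDD with itself;
    # the filter then removes self-pairs and mirrored duplicates.
    return names_rdd.cartesian(names_rdd).filter(is_ordered_pair)
```

With the names used above, this could be called as distinct_pairs(f_img_list).map(compare_images).collect(). The trade-off is that cartesian generates all n * n combinations before filtering, but the work stays distributed instead of passing through the driver.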

huangapple
  • Published on May 10, 2023, 20:06:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76218209.html