Mapping an RDD list to a function of two arguments

Question
I have a function that compares images from the same folder against one another, with an output of a similarity prediction.
The function runs fine in plain Python, but I want to leverage the power of PySpark parallelisation.
Here I use Spark by simply parallelizing the list, i.e. turning it into an RDD.
img_list = sc.parallelize(os.listdir(folder_dir))
f_img_list = img_list.filter(lambda f: f.endswith('.jpg') or f.endswith('.png'))
Defining the function:
def compare_images(x1, x2):
    # Preprocess the images into arrays
    img_array1 = preprocess_image2(x1)
    img_array2 = preprocess_image2(x2)
    pred = compare(img_array1, img_array2)
    return pred
At this point I want to apply operations on the RDD, with the requirement that an image in the folder should not be compared against itself.
My attempt is to use map, but I'm unsure how to do that. Below is my attempt, but it assumes only one argument:
prediction = f_img_list.map(compare_images)
prediction.collect()
I'm also aware that my attempt does not enforce the requirement that an image should not be compared against itself; assistance with that would also be appreciated.
Answer 1

Score: 0
You could create a list of pairs of distinct image filenames, parallelize that list, and modify your compare_images function to take a single argument (a pair) instead of two.
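For illustration, itertools.combinations already yields each unordered pair exactly once and never pairs a filename with itself (the filenames below are made up):

>>> import itertools
>>> list(itertools.combinations(['a.jpg', 'b.png', 'c.jpg'], 2))
[('a.jpg', 'b.png'), ('a.jpg', 'c.jpg'), ('b.png', 'c.jpg')]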
Edit: let's use the RDD's filter method to keep only the files that end with '.jpg' or '.png'.
import os
import itertools

from pyspark import SparkContext

sc = SparkContext()

# Stubs for your own functions; replace with the real implementations.
def preprocess_image2(image_path):
    pass

def compare(img_array1, img_array2):
    pass

folder_dir = '/path/to/images'  # set this to your image folder

# Parallelize the directory listing and keep only the image files.
# Note: os.listdir returns bare filenames; join them with folder_dir
# if preprocess_image2 expects full paths.
img_list = sc.parallelize(os.listdir(folder_dir))
f_img_list = img_list.filter(lambda f: f.endswith('.jpg') or f.endswith('.png'))

# Collect the filenames to the driver and build every unordered pair;
# itertools.combinations never pairs a file with itself.
f_img_list_local = f_img_list.collect()
image_pairs = list(itertools.combinations(f_img_list_local, 2))

# Parallelize the pairs so the comparisons run on the cluster.
image_pairs_rdd = sc.parallelize(image_pairs)

def compare_images(image_pair):
    x1, x2 = image_pair
    img_array1 = preprocess_image2(x1)
    img_array2 = preprocess_image2(x2)
    pred = compare(img_array1, img_array2)
    return pred

predictions = image_pairs_rdd.map(compare_images)
results = predictions.collect()
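If the folder contains many images, collecting every filename to the driver may not scale. A minimal sketch of an alternative, assuming filenames within the folder are unique (which os.listdir guarantees for a single directory), builds the pairs on the cluster with the RDD's cartesian method instead:

# Build every ordered pair of filenames on the cluster.
pairs_rdd = f_img_list.cartesian(f_img_list)

# Keep each unordered pair once; the strict '<' also drops self-pairs.
distinct_pairs_rdd = pairs_rdd.filter(lambda pair: pair[0] < pair[1])

predictions = distinct_pairs_rdd.map(compare_images)
results = predictions.collect()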