Feature extraction process using too much memory and causing a crash. What can I do?
Question
I am using a Hugging Face transformer model to extract image features that I later use for similarity search. This currently fails: after processing around 200 images, memory usage grows so large that the system crashes. What am I doing wrong, and what can I change to fix it?
Here is my feature extraction class:
    import numpy as np
    from transformers import AutoProcessor, AutoModelForZeroShotImageClassification, AutoTokenizer, TFCLIPModel

    def expand_greyscale_image_channels(grey_pil_image):
        # repeat the single grey channel three times to get an (H, W, 3) array
        grey_image_arr = np.array(grey_pil_image)
        grey_image_arr = np.expand_dims(grey_image_arr, -1)
        grey_image_arr_3_channel = grey_image_arr.repeat(3, axis=-1)
        return grey_image_arr_3_channel

    def get_color_image(img):
        img = img.resize((224, 224))
        img = img.convert('RGB')
        return img

    def get_greyscale_image(img):
        img = img.resize((224, 224))
        img = img.convert('L')
        img = expand_greyscale_image_channels(img)
        return img

    class FeatureExtractor:
        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            pass

        def __init__(self, processor=None, model=None, tokenizer=None, text_model=None):
            self.processor = processor
            self.model = model
            self.tokenizer = tokenizer
            self.text_model = text_model

        # note: on instances, this method is shadowed by the attribute set in __init__
        def model(self):
            return self.model

        # note: on instances, this method is shadowed by the attribute set in __init__
        def processor(self):
            return self.processor

        def extract_features(self, img, grey=False):
            """
            Extract a deep feature from an input image
            Args:
                img: from PIL.Image.open(path) or tensorflow.keras.preprocessing.image.load_img(path)
            Returns:
                feature (np.ndarray): deep feature with the shape=(4096, )
            """
            try:
                if grey:
                    img = get_greyscale_image(img)
                else:
                    img = get_color_image(img)
                inputs = self.processor(images=img, return_tensors="pt")
                image_features = self.model.get_image_features(**inputs)
                # Normalize
                image_features /= image_features.norm(dim=-1, keepdim=True)
                # Use tensor.detach().numpy() instead.
                return image_features.detach().numpy()
            except Exception as e:
                print(e)

        def extract_text_features(self, text):
            try:
                inputs = self.tokenizer([text], padding=True, return_tensors="tf")
                text_features = self.text_model.get_text_features(**inputs)
                text_features = text_features / np.linalg.norm(text_features)
                return text_features.numpy()
            except Exception as e:
                print(e)
Here is the function that I run in a loop over each image URL:
    import gc
    import os

    fe = FeatureExtractor(processor, model, tokenizer, text_model)

    def get_features_for_image(image_meta):
        id = image_meta["id"]
        image_url = image_meta["image_url"]
        # get features for image
        try:
            # open image from url
            image = get_pil_image_from_url(image_url)
            # resize image
            image = image.resize((224, 224))
            # if file not in features folder
            # extract features
            if not os.path.exists("features/" + id + ".npy"):
                # with FeatureExtractor(processor, model, tokenizer, text_model) as fe:
                image_features = fe.extract_features(image)
                np.save("features/" + id + ".npy", image_features)
                del image_features
            del image
            gc.collect()
            # write features to the json file
            # save features under file features/id.npy
            return True
        except Exception as e:
            print("Error extracting features for image ", id, " error: ", e)
Where is the memory leak? How can I fix it?
Here is a graph of the CPU usage. Each individual image is processed fine, but as the total number of images processed increases, so does the CPU usage. Even if the model uses a lot of memory, shouldn't that memory be released once the features for each image have been extracted?
Answer 1
Score: 2
Figured it out. The problem is the line image_features.detach().numpy() in extract_features.
As far as I can tell, the detach() call creates a separate copy of the array. It is not needed here, and it created the leak.
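For readers hitting similar memory growth, a related approach that is often suggested for PyTorch inference is to run the forward pass under torch.no_grad(), so that no autograd graph is recorded for the output in the first place. In PyTorch, Tensor.detach() returns a tensor that shares storage with the original rather than copying it, so the sketch below avoids the call instead of relying on it. This is a minimal sketch and not the asker's confirmed fix: stand_in_model is a hypothetical single linear layer standing in for the CLIP image encoder, extract_features_no_grad is an illustrative name, and whether this removes the crash in the setup above is an assumption. With the real model, the call inside the block would be self.model.get_image_features(**inputs).

```python
import numpy as np
import torch


def extract_features_no_grad(model: torch.nn.Module, pixel_values: torch.Tensor) -> np.ndarray:
    """Return L2-normalized features as a NumPy array without recording an autograd graph."""
    model.eval()  # switch off training-only behavior such as dropout
    with torch.no_grad():  # operations inside this block are not tracked by autograd
        features = model(pixel_values)
        features = features / features.norm(dim=-1, keepdim=True)
    # The result does not require grad, so numpy() works without detach().
    return features.numpy()


# Hypothetical stand-in for the image encoder (assumption, not the real CLIP model).
stand_in_model = torch.nn.Linear(8, 4)
batch = torch.randn(2, 8)

# Without no_grad, the output is attached to an autograd graph.
tracked = stand_in_model(batch)
print(tracked.requires_grad)

# With no_grad, the output is a plain array of unit-length rows.
features = extract_features_no_grad(stand_in_model, batch)
print(features.shape)
print(np.linalg.norm(features, axis=-1))
```

The design choice here is to prevent gradient tracking up front instead of cutting the tensor out of the graph afterwards, which also means the in-place normalization from the question is replaced by an ordinary division.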