Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics within a latent space. This formulation sidesteps explicit kernel estimation and accommodates a wide range of motion patterns. We implement the algorithm within a diffusion-based inverse problem framework. Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios. This work paves the way for utilizing powerful video diffusion models to address single-image deblurring challenges.
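The temporal-averaging view of motion blur can be made concrete with a small sketch: a blurred observation is approximated as the mean of the sharp frames captured during the exposure window. This toy example (our own illustration, not the paper's code) shows a moving square leaving a streak:

```python
import numpy as np

def synthesize_blur(frames: np.ndarray) -> np.ndarray:
    """Temporal-averaging blur model: the blurred image is the mean of
    the sharp frames over the exposure window. frames: [T, H, W, C]."""
    return frames.mean(axis=0)

# A bright 8x8 square moving 2 pixels per frame leaves a streak.
T, H, W = 9, 32, 32
frames = np.zeros((T, H, W, 1))
for t in range(T):
    frames[t, 12:20, 4 + 2 * t : 12 + 2 * t, 0] = 1.0

blurred = synthesize_blur(frames)
```

Pixels near the center of the streak are covered by several frames and so receive intermediate intensities, which is exactly the smearing that the inverse problem must undo.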
Overview of the VDM-MD method: In the core iteration, the estimated 3D sharp video resides in the latent space, represented by green boxes. It is generated and refined by the pre-trained VDM, which comprises several STDiT blocks, as shown in the structural diagram on the right. The latent video is then decoded and compared with the blurry image through the degradation model, indicated by red boxes. The discrepancy between them is used to correct and refine the video. Upon completion, the latent video is decoded back to the visual space.
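The correction described in the caption can be sketched as a data-consistency gradient step, as in diffusion-based inverse-problem solvers: decode the latent video, apply the temporal-averaging degradation, and push the latent toward agreement with the observed blurry image. The names `decode`, `decode_adjoint`, and `degrade` below, and the toy nearest-neighbor "decoder", are illustrative stand-ins for the VAE decoder and degradation model, not the paper's actual components:

```python
import numpy as np

def decode(z: np.ndarray) -> np.ndarray:
    # Toy latent decoder: nearest-neighbor 2x upsampling, [T, h, w] -> [T, 2h, 2w].
    return np.repeat(np.repeat(z, 2, axis=1), 2, axis=2)

def decode_adjoint(x: np.ndarray) -> np.ndarray:
    # Adjoint of `decode`: 2x2 sum pooling.
    T, H, W = x.shape
    return x.reshape(T, H // 2, 2, W // 2, 2).sum(axis=(2, 4))

def degrade(video: np.ndarray) -> np.ndarray:
    # Degradation model: temporal averaging of the decoded frames.
    return video.mean(axis=0)

def correction_step(z, y, step=0.05):
    """One gradient step on ||degrade(decode(z)) - y||^2, pulling the
    latent video toward consistency with the blurry observation y."""
    T = z.shape[0]
    residual = degrade(decode(z)) - y                     # [H, W]
    full = np.broadcast_to(residual / T, decode(z).shape)  # same grad each frame
    grad = 2.0 * decode_adjoint(full)
    loss = float((residual ** 2).sum())
    return z - step * grad, loss

T, h, w = 5, 4, 4
z = np.zeros((T, h, w))
y = np.full((2 * h, 2 * w), 0.5)
z1, loss0 = correction_step(z, y)
z2, loss1 = correction_step(z1, y)
```

In the actual method this step alternates with VDM denoising iterations; here the "decoder" is linear so a plain gradient step provably shrinks the residual.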
To analyze our algorithm’s performance without requiring an extensive, large-scale transformer, we used the CLEVRER dataset as a “toy world.” CLEVRER features relatively simple objects obeying basic physics, with minimal motion between consecutive frames. Each video clip thus approximates a high-frame-rate recording.
To evaluate our method on real-world data, we used the BAIR robot pushing dataset, which consists of 90K short video clips recorded by a real camera.
In some cases, motion is reversed in time due to the inherent ambiguity of single-image blur.
We compared our algorithm to four state-of-the-art single-image deblurring methods: MPRNet, MTRNN, Restormer, and Restormer enhanced by ID-Blau (denoted "Res+ID-Blau"), each producing only a single deblurred image. Because there is no exact single-frame ground truth for each blurred observation, we took the 5th (middle) frame of our recovered sequence for quantitative evaluation, then measured PSNR and SSIM against the corresponding middle ground-truth frame.
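The middle-frame protocol can be sketched as follows. The 9-frame sequence length and the `psnr` helper are our own illustrative assumptions (the paper reports both PSNR and SSIM; only PSNR is shown here for brevity):

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Fake 9-frame recovered sequence; frame index 4 is the 5th (middle) frame.
recovered = np.random.default_rng(0).random((9, 64, 64, 3))
gt_middle = recovered[4] + 0.01  # synthetic "ground truth" offset by 0.01

score = psnr(recovered[4], gt_middle)  # MSE = 1e-4, so score = 40.0 dB
```

Since the blurred image is consistent with many plausible sharp sequences, scoring only the middle frame avoids penalizing temporal ambiguity (e.g. the time-reversal cases noted above).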
@misc{pang2025imagemotionblurremoval,
  title={Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models},
  author={Wang Pang and Zhihao Zhan and Xiang Zhu and Yechao Bai},
  year={2025},
  eprint={2501.12604},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2501.12604},
}