Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics within a latent space. This formulation sidesteps explicit kernel estimation and accommodates a wide range of motion patterns. We implement the algorithm within a diffusion-based inverse problem framework. Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios. This work paves the way for utilizing powerful video diffusion models to address single-image deblurring challenges.
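The temporal-averaging view of motion blur can be made concrete with a small sketch: a blurred observation is approximated as the mean of the sharp frames captured during the exposure window. This toy example (our own illustration, not the paper's code) shows a moving square leaving a streak:

```python
import numpy as np

def synthesize_blur(frames: np.ndarray) -> np.ndarray:
    """Temporal-averaging blur model: the blurred image is the mean of
    the sharp frames over the exposure window. frames: [T, H, W, C]."""
    return frames.mean(axis=0)

# A bright 8x8 square moving 2 pixels per frame leaves a streak.
T, H, W = 9, 32, 32
frames = np.zeros((T, H, W, 1))
for t in range(T):
    frames[t, 12:20, 4 + 2 * t : 12 + 2 * t, 0] = 1.0

blurred = synthesize_blur(frames)
```

Pixels near the center of the streak are covered by several frames and so receive intermediate intensities, which is exactly the smearing that the inverse problem must undo.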
Overview of the VDM-MD method: In the core iteration, the estimated 3D sharp video resides in the latent space, represented by green boxes. It is generated and refined by the pre-trained VDM, which comprises several STDiT blocks, as shown in the structural diagram on the right. The latent video is then decoded and compared with the blurry image through the degradation model, indicated by red boxes. The discrepancy between them is used to correct and refine the video. Upon completion, the latent video is decoded back to the visual space.
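The correction described in the caption can be sketched as a data-consistency gradient step, as in diffusion-based inverse-problem solvers: decode the latent video, apply the temporal-averaging degradation, and push the latent toward agreement with the observed blurry image. The names `decode`, `decode_adjoint`, and `degrade` below, and the toy nearest-neighbor "decoder", are illustrative stand-ins for the VAE decoder and degradation model, not the paper's actual components:

```python
import numpy as np

def decode(z: np.ndarray) -> np.ndarray:
    # Toy latent decoder: nearest-neighbor 2x upsampling, [T, h, w] -> [T, 2h, 2w].
    return np.repeat(np.repeat(z, 2, axis=1), 2, axis=2)

def decode_adjoint(x: np.ndarray) -> np.ndarray:
    # Adjoint of `decode`: 2x2 sum pooling.
    T, H, W = x.shape
    return x.reshape(T, H // 2, 2, W // 2, 2).sum(axis=(2, 4))

def degrade(video: np.ndarray) -> np.ndarray:
    # Degradation model: temporal averaging of the decoded frames.
    return video.mean(axis=0)

def correction_step(z, y, step=0.05):
    """One gradient step on ||degrade(decode(z)) - y||^2, pulling the
    latent video toward consistency with the blurry observation y."""
    T = z.shape[0]
    residual = degrade(decode(z)) - y                     # [H, W]
    full = np.broadcast_to(residual / T, decode(z).shape)  # same grad each frame
    grad = 2.0 * decode_adjoint(full)
    loss = float((residual ** 2).sum())
    return z - step * grad, loss

T, h, w = 5, 4, 4
z = np.zeros((T, h, w))
y = np.full((2 * h, 2 * w), 0.5)
z1, loss0 = correction_step(z, y)
z2, loss1 = correction_step(z1, y)
```

In the actual method this step alternates with VDM denoising iterations; here the "decoder" is linear so a plain gradient step provably shrinks the residual.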
To analyze our algorithm’s performance without requiring an extensive, large-scale transformer, we used the CLEVRER dataset as a “toy world.” CLEVRER features relatively simple objects obeying basic physics, with minimal motion between consecutive frames. Each video clip thus approximates a high-frame-rate recording.
To evaluate our method on real-world data, we used the BAIR robot pushing dataset, which consists of 90K short video clips recorded by a real camera.
In some cases, motion is reversed in time due to the inherent ambiguity of single-image blur.
We compared our algorithm to four state-of-the-art single-image deblurring methods: MPRNet, MTRNN, Restormer, and Restormer enhanced by ID-Blau (denoted "Res+ID-Blau"), each producing only a single deblurred image. Because there is no exact single-frame ground truth for each blurred observation, we took the 5th (middle) frame of our recovered sequence for quantitative evaluation, then measured PSNR and SSIM against the corresponding middle ground-truth frame.
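The middle-frame protocol can be sketched as follows. The 9-frame sequence length and the `psnr` helper are our own illustrative assumptions (the paper reports both PSNR and SSIM; only PSNR is shown here for brevity):

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Fake 9-frame recovered sequence; frame index 4 is the 5th (middle) frame.
recovered = np.random.default_rng(0).random((9, 64, 64, 3))
gt_middle = recovered[4] + 0.01  # synthetic "ground truth" offset by 0.01

score = psnr(recovered[4], gt_middle)  # MSE = 1e-4, so score = 40.0 dB
```

Since the blurred image is consistent with many plausible sharp sequences, scoring only the middle frame avoids penalizing temporal ambiguity (e.g. the time-reversal cases noted above).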
@misc{pang2025imagemotionblurremoval,
  title={Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models},
  author={Wang Pang and Zhihao Zhan and Xiang Zhu and Yechao Bai},
  year={2025},
  eprint={2501.12604},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2501.12604},
}