JointDiT

AVSync15 Demos

For each of the 15 categories in the AVSync15 test set, we selected one example video to show the results of different generation methods.
For the best experience, use earphones, raise the volume, and zoom in on the videos.

Baby babbling crying.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Cap gun shooting.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Chicken crowing.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Dog barking.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Frog croaking.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Hammering.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Lions roaring.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Machine gun shooting.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Playing cello.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Playing trombone.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Playing trumpet.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Playing violin fiddle.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Sharpen knife.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Striking bowling.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Toilet flushing.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours

Greatesthits Demos

We selected 5 cases from the Greatesthits test set to show the results of different generation methods.
For the best experience, use earphones, raise the volume, and zoom in on the videos.

Case 1.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Case 2.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Case 3.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Case 4.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Case 5.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours

Landscape Demos

For each of the 9 categories in the Landscape test set, we selected one example video to show the results of different generation methods.
For the best experience, use earphones, raise the volume, and zoom in on the videos.

Explosion.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Fire crackling.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Raining.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Splashing water.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Squishing water.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Thunder.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Underwater bubbling.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Waterfall burbling.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours
Wind noise.
I2V+I2A I2V2A I2A2V I2VA I2VA
Input Image SVD+AudioLDM-v SVD+SeeingHearing AudioLDM-v+AVSyncD CoDi Ours

Comparison

The cases below further compare Vanilla CFG with Enhanced Joint-CFG. They show that the latter generates more varied visuals (case 1, the forging hammer), maintains image clarity (case 2, the rooster's head movement), creates more coherent scenes (case 3, the undisturbed bowling pins), and produces better sound quality (case 4, the purer sound).

Case 1.
Vanilla CFG Joint CFG
Case 2.
Vanilla CFG Joint CFG
Case 3.
Vanilla CFG Joint CFG
Case 4.
Vanilla CFG Joint CFG