Demonstration results of multi-modal instruction. The first row shows the visual stimulus, whereas the second row shows our intermediate reconstructions. The manipulation results via the instruction ...