A family of unified multimodal models (Janus, JanusFlow, Janus-Pro) with advanced capabilities in both multimodal understanding and text-to-image generation.
Supports both image understanding and image generation within a single autoregressive framework built on a unified Transformer architecture.
Outperforms leading models such as DALL-E 3 and Stable Diffusion on generation benchmarks (GenEval score of 0.80 vs. 0.67 for DALL-E 3).
Offers 1B and 7B parameter variants under the MIT license, hosted on Hugging Face and GitHub for rapid deployment.
Processes images at 384×384 resolution, pairing the SigLIP-L vision encoder with MLP adapters.
Keeps the 7B-parameter design comparatively lightweight, reducing computational resource consumption and deployment cost.
Trained on expanded datasets with a more stable, optimized training strategy to improve output accuracy.
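As a rough sanity check on the context budget, the numbers above can be combined in a back-of-envelope calculation. This sketch assumes SigLIP-L uses 16×16 pixel patches (as in the common siglip-large-patch16-384 configuration; the patch size is an assumption, not stated in this document):

```python
# Back-of-envelope vision-token budget for a 384x384 input.
IMAGE_SIZE = 384   # input resolution (stated above)
PATCH_SIZE = 16    # assumed SigLIP-L patch size
SEQ_LEN = 4096     # model sequence length (from the table below)

patches_per_side = IMAGE_SIZE // PATCH_SIZE   # 24 patches per side
vision_tokens = patches_per_side ** 2         # 576 vision tokens per image
text_budget = SEQ_LEN - vision_tokens         # tokens remaining for text

print(vision_tokens, text_budget)  # 576 3520
```

Under this assumption a single image consumes 576 of the 4096 sequence positions, leaving ample room for the text portion of a prompt.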
| Model | Sequence Length | Download |
|---|---|---|
| Janus-1.3B | 4096 | |
| JanusFlow-1.3B | 4096 | |
| Janus-Pro-1B | 4096 | |
| Janus-Pro-7B | 4096 | |
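Since the checkpoints are hosted on Hugging Face, they can be fetched with the standard `huggingface_hub` client. This is a minimal sketch; the repository ids are assumptions based on the `deepseek-ai/<model-name>` naming convention and should be checked against the actual download links:

```python
# Hedged sketch: fetching a Janus checkpoint from Hugging Face.
# The repo ids below are assumed from the deepseek-ai org naming convention.
JANUS_REPOS = {
    "Janus-1.3B": "deepseek-ai/Janus-1.3B",
    "JanusFlow-1.3B": "deepseek-ai/JanusFlow-1.3B",
    "Janus-Pro-1B": "deepseek-ai/Janus-Pro-1B",
    "Janus-Pro-7B": "deepseek-ai/Janus-Pro-7B",
}

def fetch(name: str, cache_dir=None) -> str:
    """Download the named checkpoint and return its local snapshot path."""
    # Deferred import: huggingface_hub is an optional, heavyweight dependency.
    from huggingface_hub import snapshot_download
    return snapshot_download(JANUS_REPOS[name], cache_dir=cache_dir)
```

For example, `fetch("Janus-Pro-7B")` would download the full model snapshot into the local Hugging Face cache and return its path.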