
AI News

What are Optical Character Recognition (OCR) Models? Top Open-Source OCR Models

Optical Character Recognition (OCR) is the process of turning images that contain text—such as scanned pages, receipts, or photographs—into machine-readable text. What began as brittle rule-based systems has evolved into a rich ecosystem of neural architectures and vision-language models capable of reading complex, multilingual, and handwritten documents. How OCR Works: Every OCR system tackles three…
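As a quick, hedged illustration of OCR in practice, the sketch below runs the open-source Tesseract engine through pytesseract; the image path and language code are placeholders.

```python
# Minimal OCR sketch using pytesseract (a Python wrapper around the open-source
# Tesseract engine); the image path and language code are placeholders.
from PIL import Image
import pytesseract

image = Image.open("scanned_receipt.png")               # scanned page, receipt, or photo
text = pytesseract.image_to_string(image, lang="eng")   # image in, machine-readable text out
print(text)
```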


AI and the Brain: How DINOv3 Models Reveal Insights into Human Visual Processing

Understanding how the brain builds internal representations of the visual world is one of the most fascinating challenges in neuroscience. Over the past decade, deep learning has reshaped computer vision, producing neural networks that not only perform at human-level accuracy on recognition tasks but also seem to process information in ways that resemble our…


Qwen Team Introduces Qwen-Image-Edit: The Image Editing Version of Qwen-Image with Advanced Capabilities for Semantic and Appearance Editing

In the domain of multimodal AI, instruction-based image editing models are transforming how users interact with visual content. Just released in August 2025 by Alibaba’s Qwen Team, Qwen-Image-Edit builds on the 20B-parameter Qwen-Image foundation to deliver advanced editing capabilities. This model excels in semantic editing (e.g., style transfer and novel view synthesis) and appearance editing…
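As a rough, hedged sketch of how such an instruction-based editor is typically driven, the snippet below assumes the model is published as a Hugging Face diffusers pipeline under the repo id Qwen/Qwen-Image-Edit; the call signature shown is an assumption, not confirmed by this excerpt.

```python
# Hypothetical instruction-based editing sketch via diffusers; the repo id and
# the image/prompt call signature below are assumptions, not a confirmed API.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image-Edit",
                                          torch_dtype=torch.bfloat16).to("cuda")

source = Image.open("product_photo.png").convert("RGB")
edited = pipe(image=source,
              prompt="keep the product, replace the background with a beach at sunset").images[0]
edited.save("edited.png")
```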


VL-Cogito: Advancing Multimodal Reasoning with Progressive Curriculum Reinforcement Learning

Multimodal reasoning, where models integrate and interpret information from multiple sources such as text, images, and diagrams, is a frontier challenge in AI. VL-Cogito is a state-of-the-art Multimodal Large Language Model (MLLM) proposed by DAMO Academy (Alibaba Group) and partners. It introduces a robust reinforcement learning pipeline that fundamentally upgrades the reasoning skills of large models…
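The excerpt does not spell out the pipeline, but the core idea of a progressive curriculum (drawing training samples from progressively harder difficulty buckets as optimization proceeds) can be sketched generically; everything below is a hypothetical illustration, not VL-Cogito's actual code.

```python
# Generic progressive-curriculum schedule (hypothetical illustration, not
# VL-Cogito's actual pipeline): harder buckets are unlocked as training advances.
import random

def curriculum_batches(samples_by_difficulty, total_steps, batch_size=8):
    """samples_by_difficulty: list of sample lists, ordered easy -> hard."""
    stages = len(samples_by_difficulty)
    for step in range(total_steps):
        # unlock one additional difficulty bucket per equal slice of training
        unlocked = min(stages, 1 + step * stages // total_steps)
        pool = [s for bucket in samples_by_difficulty[:unlocked] for s in bucket]
        yield random.sample(pool, min(batch_size, len(pool)))

# each yielded batch would feed one RL update step (e.g., a PPO/GRPO-style trainer)
```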


NASA Releases Galileo: The Open-Source Multimodal Model Advancing Earth Observation and Remote Sensing

Galileo is an open-source, highly multimodal foundation model developed to process, analyze, and understand diverse Earth observation (EO) data streams—including optical, radar, elevation, climate, and auxiliary maps—at scale. Galileo was developed with support from researchers at McGill University, NASA Harvest, Ai2, Carleton University, the University of British Columbia, the Vector Institute, and Arizona State University…


VLM2Vec-V2: A Unified Computer Vision Framework for Multimodal Embedding Learning Across Images, Videos, and Visual Documents

Embedding models act as bridges between different data modalities by encoding diverse multimodal information into a shared dense representation space. These models have advanced considerably in recent years, driven by progress in large foundation models. However, existing multimodal embedding models are trained on datasets such as MMEB and M-BEIR, with most focus only…
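For a concrete (if simplified) picture of a shared dense representation space, the sketch below embeds an image and two captions with CLIP from Hugging Face transformers as a stand-in model; VLM2Vec-V2's own loading details are not covered in this excerpt.

```python
# Shared image-text embedding space sketch using CLIP as a stand-in model
# (VLM2Vec-V2's loading details are not given in the excerpt above).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chart.png")
texts = ["a bar chart of quarterly revenue", "a photo of a cat"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# cosine similarity between modalities in the shared space
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)
```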


How Radial Attention Cuts Costs in Video Diffusion by 4.4× Without Sacrificing Quality

Introduction to Video Diffusion Models and Computational Challenges: Diffusion models have made impressive progress in generating high-quality, coherent videos, building on their success in image synthesis. However, handling the extra temporal dimension in videos significantly increases computational demands, especially since self-attention scales poorly with sequence length. This makes it difficult to train or run these…
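To make the scaling issue concrete: with frames flattened into one spatio-temporal token sequence of length n, full self-attention forms an n × n score matrix, so cost grows quadratically with the number of frames. A back-of-the-envelope sketch (generic, not Radial Attention's actual sparsity pattern):

```python
# Back-of-the-envelope cost of full self-attention over video tokens
# (generic illustration; not Radial Attention's actual mechanism).
def attention_pairs(frames, tokens_per_frame=1024):
    n = frames * tokens_per_frame   # flattened spatio-temporal sequence length
    return n * n                    # pairwise score matrix is O(n^2)

for frames in (16, 32, 64):
    print(f"{frames:>3} frames -> {attention_pairs(frames):.2e} attention pairs")
# doubling the frame count roughly quadruples attention cost, which is why
# sub-quadratic attention variants matter for video diffusion
```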


ByteDance Researchers Introduce VGR: A Novel Reasoning Multimodal Large Language Model (MLLM) with Enhanced Fine-Grained Visual Perception Capabilities

Why Multimodal Reasoning Matters for Vision-Language Tasks: Multimodal reasoning enables models to make informed decisions and answer questions by combining both visual and textual information. This type of reasoning plays a central role in interpreting charts, answering image-based questions, and understanding complex visual documents. The goal is to make machines capable of using vision as…


Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities

LLMs have made significant strides in language-related tasks such as conversational AI, reasoning, and code generation. However, human communication extends beyond text, often incorporating visual elements to enhance understanding. To create a truly versatile AI, models need the ability to process and generate text and visual information simultaneously. Training such unified vision-language models from scratch…
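A generic sketch of the "frozen language model plus trainable vision pathway" recipe the excerpt alludes to (module names and the fusion strategy below are hypothetical, not X-Fusion's actual architecture):

```python
# Generic frozen-LLM-plus-vision sketch (hypothetical; not X-Fusion's design):
# the language model stays frozen, only the vision projector trains.
import torch
import torch.nn as nn

class VisionAugmentedLM(nn.Module):
    def __init__(self, llm: nn.Module, vision_encoder: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.llm = llm
        self.vision_encoder = vision_encoder
        self.projector = nn.Linear(vision_dim, llm_dim)  # trainable bridge
        for p in self.llm.parameters():                  # freeze text abilities
            p.requires_grad = False

    def forward(self, pixel_values, text_embeds):
        vision_tokens = self.projector(self.vision_encoder(pixel_values))
        fused = torch.cat([vision_tokens, text_embeds], dim=1)  # prepend vision
        return self.llm(inputs_embeds=fused)  # assumes an HF-style embeds input
```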
