Computer Vision Category - MarkTechPost
https://www.marktechpost.com/category/technology/artificial-intelligence/computer-vision/


MINT-1T: An Open-Source Trillion Token Multimodal Interleaved Dataset and a Key Component for Training Large Multimodal Models (LMMs)
https://www.marktechpost.com/2024/06/19/mint-1t-an-open-source-trillion-token-multimodal-interleaved-dataset-and-a-key-component-for-training-large-multimodal-models-lmms/

Large open-source pre-training datasets are important for the research community in exploring data engineering and developing transparent, open-source models. However, frontier labs have shifted toward training large multimodal models (LMMs) that require large datasets containing both images and text. The capabilities of these frontier models are advancing quickly, creating a large gap between the multimodal training data available for closed and open-source models. Current open-source multimodal datasets are smaller and less diverse than their text-only counterparts, making it challenging to develop strong open-source LMMs and widening the performance gap between open and closed-source models.

The related work discussed in this paper spans three areas: multimodal interleaved data, large open-source pre-training datasets, and LMMs. Multimodal interleaved datasets were first presented in Flamingo and CM3, and the first open-source versions were Multimodal-C4 and OBELICS; recent works like Chameleon and MM1 have scaled OBELICS to train state-of-the-art multimodal models. Large open-source pre-training datasets are the backbone of open-source research and are essential for training strong open-source multimodal models. Finally, in LMM research, the goal is to pre-train language models on large-scale multimodal interleaved and image-text datasets, an approach introduced by Flamingo and adopted by open-source models like OpenFlamingo, Idefics, and Emu.

Researchers from the University of Washington, Salesforce Research, Stanford University, the University of Texas at Austin, and the University of California, Berkeley have proposed Multimodal INTerleaved (MINT-1T). MINT-1T is currently the largest and most diverse open-source multimodal interleaved dataset, containing one trillion text tokens and three billion images collected from sources such as HTML, PDFs, and ArXiv. LMMs trained on MINT-1T benefit from a 10x improvement in scale and can potentially outperform models trained on the best existing open-source dataset, OBELICS, which contains 115 billion text tokens and 353M images sourced only from HTML.

MINT-1T builds its large open-source dataset by collecting mixed documents from diverse sources, including PDFs and ArXiv papers; the final dataset contains 965B HTML document tokens, 51B PDF tokens, and 10B ArXiv tokens. For text-quality filtering, the pipeline avoids model-based heuristics, which keeps the process scalable, following practices from text-only datasets. It eliminates non-English documents using fastText's language identification model with a confidence threshold of 0.65, removes documents whose URLs contain NSFW substrings to avoid pornographic and undesirable content, and applies text filtering methods from RefinedWeb to drop documents with excessive duplicate n-grams.
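To make the filtering steps concrete, here is a minimal sketch of what such a pass might look like, assuming a locally available fastText language-ID model file (commonly distributed as `lid.176.bin`), an illustrative NSFW substring list, and an assumed duplicate n-gram cutoff; only the 0.65 language-confidence threshold comes from the article, the rest are placeholders rather than MINT-1T's actual pipeline.

```python
import fasttext  # pip install fasttext

# Placeholders -- illustrative, not the MINT-1T release artifacts.
LANG_ID_MODEL = "lid.176.bin"        # fastText language-identification model
NSFW_SUBSTRINGS = ("porn", "xxx")    # illustrative substring list
LANG_CONF_THRESHOLD = 0.65           # threshold stated in the article
MAX_DUP_NGRAM_FRACTION = 0.3         # assumed cutoff for duplicate n-grams

lang_model = fasttext.load_model(LANG_ID_MODEL)

def dup_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams that are repeats (a crude RefinedWeb-style signal)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def keep_document(text: str, url: str) -> bool:
    # 1) Keep English documents only, with a 0.65 confidence threshold.
    labels, probs = lang_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < LANG_CONF_THRESHOLD:
        return False
    # 2) Drop documents whose URL contains NSFW substrings.
    if any(s in url.lower() for s in NSFW_SUBSTRINGS):
        return False
    # 3) Drop documents with excessive duplicate n-grams.
    return dup_ngram_fraction(text) <= MAX_DUP_NGRAM_FRACTION
```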

To evaluate in-context learning, models are prompted with 1 to 15 examples, with a single trial per shot count for each evaluation benchmark. The results show that the model trained on the full MINT-1T performs better than the model trained on its HTML subset at all shot counts. MINT-1T models also perform similarly to OBELICS-trained models from 1 to 10 shots but outperform them beyond 10 shots. When evaluating performance on MMMU per domain, MINT-1T outperforms OBELICS and the HTML baseline of MINT-1T in every domain except Business, with the largest gains in Science and Technology, reflecting the high representation of these domains in ArXiv and PDF documents.

In this paper, researchers have introduced MINT-1T, the first open-source trillion-token multimodal interleaved dataset and an important component for training large multimodal models. It is a valuable resource for the research community to conduct open science on multimodal interleaved datasets. MINT-1T surpasses the previous largest open-source dataset in this domain, OBELICS, which contains 115 billion text tokens and 353M images sourced only from HTML. Future work includes training models on larger subsets of MINT-1T and developing multimodal document filtering methods to enhance data quality.


Apple Releases 4M-21: A Very Effective Multimodal AI Model that Solves Tens of Tasks and Modalities
https://www.marktechpost.com/2024/06/18/apple-releases-4m-21-a-very-effective-multimodal-ai-model-that-solves-tens-of-tasks-and-modalities/

Large language models (LLMs) have made significant strides in handling multiple modalities and tasks, but they still need to improve their ability to process diverse inputs and perform a wide range of tasks effectively. The primary challenge lies in developing a single neural network capable of handling a broad spectrum of tasks and modalities while maintaining high performance across all domains. Current models, such as 4M and UnifiedIO, show promise but are constrained by the limited number of modalities and tasks they are trained on. This limitation hinders their practical application in scenarios requiring truly versatile and adaptable AI systems.

Recent attempts to solve multitask learning challenges in vision have evolved from combining dense vision tasks to integrating numerous tasks into unified multimodal models. Methods like Gato, OFA, Pix2Seq, UnifiedIO, and 4M transform various modalities into discrete tokens and train Transformers using sequence or masked modeling objectives. Some approaches enable a wide range of tasks through co-training on disjoint datasets, while others, like 4M, use pseudo labeling for any-to-any modality prediction on aligned datasets. Masked modeling has proven effective in learning cross-modal representations, crucial for multimodal learning, and enables generative applications when combined with tokenization.

Researchers from Apple and the Swiss Federal Institute of Technology Lausanne (EPFL) build their method upon the multimodal masking pre-training scheme, significantly expanding its capabilities by training on a diverse set of modalities. The approach incorporates over 20 modalities, including SAM segments, 3D human poses, Canny edges, color palettes, and various metadata and embeddings. By using modality-specific discrete tokenizers, the method encodes diverse inputs into a unified format, enabling the training of a single model on multiple modalities without performance degradation. This unified approach expands existing capabilities across several key axes, including increased modality support, improved diversity in data types, effective tokenization techniques, and scaled model size. The resulting model demonstrates new possibilities for multimodal interaction, such as cross-modal retrieval and highly steerable generation across all training modalities.

This method adopts the 4M pre-training scheme, expanding it to handle a diverse set of modalities. It transforms all modalities into sequences of discrete tokens using modality-specific tokenizers. The training objective involves predicting one subset of tokens from another, using random selections from all modalities as inputs and targets. It utilizes pseudo-labeling to create a large pre-training dataset with multiple aligned modalities. The method incorporates a wide range of modalities, including RGB, geometric, semantic, edges, feature maps, metadata, and text. Tokenization plays a crucial role in unifying the representation space across these diverse modalities. This unification enables training with a single pre-training objective, improves training stability, allows full parameter sharing, and eliminates the need for task-specific components. Three main types of tokenizers are employed: ViT-based tokenizers for image-like modalities, MLP tokenizers for human poses and global embeddings, and a WordPiece tokenizer for text and other structured data. This comprehensive tokenization approach allows the model to handle a wide array of modalities efficiently, reducing computational complexity and enabling generative tasks across multiple domains.
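A toy sketch of this input/target token sampling over already-tokenized modalities is shown below; the modality names, token IDs, and budgets are made up for illustration, and this is a sketch of the idea rather than Apple's implementation.

```python
import random

def sample_inputs_and_targets(modality_tokens: dict, input_budget: int, target_budget: int):
    """Randomly split the tokens of all modalities into an encoder-input set
    and a decoder-target set, in the spirit of 4M-style masked multimodal
    pre-training (a sketch only)."""
    # Pool (modality, position, token_id) triples from every modality.
    pool = [(m, i, t) for m, toks in modality_tokens.items()
            for i, t in enumerate(toks)]
    random.shuffle(pool)
    inputs = pool[:input_budget]
    targets = pool[input_budget:input_budget + target_budget]
    return inputs, targets

# Toy example: three already-tokenized modalities of one training sample.
sample = {
    "rgb":     [101, 102, 103, 104],
    "depth":   [201, 202, 203],
    "caption": [301, 302],
}
enc_in, dec_targets = sample_inputs_and_targets(sample, input_budget=4, target_budget=3)
print(enc_in)        # tokens the encoder sees
print(dec_targets)   # tokens the decoder must predict
```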

The 4M-21 model demonstrates a wide range of capabilities, including steerable multimodal generation, multimodal retrieval, and strong out-of-the-box performance across various vision tasks. It can predict any training modality by iteratively decoding tokens, enabling fine-grained and multimodal generation with improved text understanding. The model performs multimodal retrievals by predicting global embeddings from any input modality, allowing for versatile retrieval capabilities. In out-of-the-box evaluations, 4M-21 achieves competitive performance on tasks such as surface normal estimation, depth estimation, semantic segmentation, instance segmentation, 3D human pose estimation, and image retrieval. It often matches or outperforms specialist models and pseudo-labelers while being a single model for all tasks. The 4M-21 XL variant, in particular, demonstrates strong performance across multiple modalities without sacrificing capability in any single domain.

Researchers examine the scaling characteristics of pre-training any-to-any models on a large set of modalities, comparing three model sizes (B, L, and XL) and evaluating both unimodal (RGB) and multimodal (RGB + Depth) transfer learning scenarios. In unimodal transfers, 4M-21 maintains performance on tasks similar to the original seven modalities while showing improved results on complex tasks like 3D object detection. The model demonstrates better performance with increased size, indicating promising scaling trends. For multimodal transfers, 4M-21 effectively utilizes optional depth inputs, significantly outperforming baselines. The study reveals that training on a broader set of modalities does not compromise performance on familiar tasks and can enhance capabilities on new ones, especially as model size increases.

This research demonstrates the successful training of an any-to-any model on a diverse set of 21 modalities and tasks. This achievement is made possible by employing modality-specific tokenizers to map all modalities to discrete sets of tokens, coupled with a multimodal masked training objective. The model scales to three billion parameters across multiple datasets without compromising performance compared to more specialized models. The resulting unified model exhibits strong out-of-the-box capabilities and opens new avenues for multimodal interaction, generation, and retrieval. However, the study acknowledges certain limitations and areas for future work. These include the need to further explore transfer and emergent capabilities, which remain largely untapped compared to language models. 


NYU Researchers Propose Inter- & Intra-Modality Modeling (I2M2) for Multi-Modal Learning, Capturing both Inter-Modality and Intra-Modality Dependencies
https://www.marktechpost.com/2024/06/18/nyu-researchers-propose-inter-intra-modality-modeling-i2m2-for-multi-modal-learning-capturing-both-inter-modality-and-intra-modality-dependencies/

In supervised multi-modal learning, data from several modalities is mapped to a target label using information about how the modalities relate to one another. Many fields care about this problem: autonomous vehicles, healthcare, robotics, and more. Although multi-modal learning is a fundamental paradigm in machine learning, its efficacy differs depending on the task at hand. In some situations, a multi-modal learner outperforms a uni-modal learner; in other cases, it may be no better than a single uni-modal learner or an ensemble of uni-modal learners. These conflicting findings highlight the need for a guiding framework to clarify the reasons behind the performance gaps between multi-modal models and to lay out a standard procedure for developing models that better use multi-modal data.

Researchers from New York University, Genentech, and CIFAR set out to resolve these inconsistencies by introducing a more principled approach to multi-modal learning and by identifying the underlying variables that cause the performance gaps. Taking a probabilistic perspective, they propose a data-generating mechanism and use it to analyze the supervised multi-modal learning problem.

The proposed generative model includes a selection variable that induces the interdependence between the modalities and the label; because the observed data are produced under this mechanism, the variable is always set to one. The strength of this selection mechanism differs across datasets. When the selection effect is strong, inter-modality dependencies (dependencies that arise between the modalities and the label) are amplified. In contrast, when the selection effect is modest, intra-modality dependencies—dependencies between each individual modality and the label—become increasingly important.

The proposed paradigm assumes that the label is the primary source of modality-specific data and specifies the connection between the label, the selection process, and the various modalities. From one use case to the next, the extent to which the output relies on individual modalities versus the interactions between them varies. A multi-modal system therefore has to model both inter- and intra-modality dependencies, since their relative strength with respect to the prediction target is rarely known in advance. The team accomplished this by building a classifier for each modality to capture intra-modality dependencies and a separate classifier over the interactions across modalities to capture inter-modality dependencies, then merging their outputs, as sketched below.
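A minimal PyTorch sketch of this combination, assuming two feature-vector modalities and simple linear heads; the dimensions, the fusion head, and the summation of logits are illustrative choices, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class I2M2Style(nn.Module):
    """Sketch: combine per-modality heads (intra-modality) with a head over
    the concatenated modalities (inter-modality) by summing their logits."""
    def __init__(self, dim_x1: int, dim_x2: int, n_classes: int):
        super().__init__()
        self.intra_x1 = nn.Linear(dim_x1, n_classes)     # modality 1 alone
        self.intra_x2 = nn.Linear(dim_x2, n_classes)     # modality 2 alone
        self.inter = nn.Sequential(                       # cross-modal interactions
            nn.Linear(dim_x1 + dim_x2, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        return (self.intra_x1(x1)
                + self.intra_x2(x2)
                + self.inter(torch.cat([x1, x2], dim=-1)))

model = I2M2Style(dim_x1=32, dim_x2=16, n_classes=3)
logits = model(torch.randn(4, 32), torch.randn(4, 16))   # (batch, n_classes)
```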

The I2M2 method is derived from the multi-modal generative model, a widely used formulation in multi-modal learning. Under the proposed framework, prior research on multi-modal learning falls into two groups. The first group, inter-modality modeling, relies heavily on detecting relationships between modalities to predict the target; despite its theoretical capacity to capture connections both between and within modalities, it often fails in practice because assumptions about the data-generating process go unfulfilled. The second group, intra-modality modeling, captures the interactions between modalities only through the label, which limits its effectiveness.

Contrary to the goal of multi-modal learning, such methods fail to exploit the interdependence of the modalities for prediction. Inter-modality methods work well when the modalities share substantial information about the label, whereas intra-modality methods work well when cross-modality information is scarce or nonexistent.

The proposed I2M2 architecture overcomes this drawback because it does not require knowing in advance how strong these dependencies are: by explicitly modeling interdependence both across and within modalities, it adapts to different settings while remaining effective. The results demonstrate that I2M2 outperforms both intra- and inter-modality approaches, validating the researchers' claims on various datasets. Examples include healthcare tasks such as automatic diagnosis from knee MRI scans and mortality and ICD-9 code prediction on the MIMIC-III dataset, and findings on vision-and-language tasks like NLVR2 and VQA further support I2M2's effectiveness.

The evaluation shows that dependencies differ in strength between datasets: the fastMRI dataset benefits more from intra-modality dependencies, the NLVR2 dataset relies more on inter-modality dependencies, and the AV-MNIST, MIMIC-III, and VQA datasets draw on both. Across all of these settings, I2M2 delivers solid performance regardless of which type of dependency dominates.


Pixel Transformer: Challenging Locality Bias in Vision Models
https://www.marktechpost.com/2024/06/17/pixel-transformer-challenging-locality-bias-in-vision-models/

The deep learning revolution in computer vision has shifted from manually crafted features to data-driven approaches, highlighting the potential of reducing feature biases. This paradigm shift aims to create more versatile systems that excel across various vision tasks. While the Transformer architecture has demonstrated effectiveness across different data modalities, it still retains some inductive biases. Vision Transformer (ViT) reduces spatial hierarchy but maintains translation equivariance and locality through patch projection and position embeddings. The challenge lies in eliminating these remaining inductive biases to further improve model performance and versatility.

Previous attempts to address locality in vision architectures have been limited. Most modern vision architectures, including those aimed at simplifying inductive biases, still maintain locality in their design. Even pre-deep learning visual features like SIFT and HOG used local descriptors. Efforts to remove locality in ConvNets, such as replacing spatial convolutional filters with 1×1 filters, resulted in performance degradation. Other approaches like iGPT and Perceiver explored pixel-level processing but faced efficiency challenges or fell short in performance compared to simpler methods.

Researchers from FAIR, Meta AI and the University of Amsterdam challenge the conventional belief that locality is a fundamental inductive bias for vision tasks. They find that by treating individual pixels as tokens for the Transformer and using learned position embeddings, removing locality inductive biases leads to better performance than conventional approaches like ViT. They name this approach “Pixel Transformer” (PiT) and demonstrate its effectiveness across various tasks, including supervised classification, self-supervised learning, and image generation with diffusion models. Interestingly, PiT outperforms baselines equipped with locality inductive biases. However, the researchers acknowledge that while locality may not be necessary, it is still useful for practical considerations like computational efficiency. This study delivers a compelling message that locality is not an indispensable inductive bias for model design.

PiT closely follows the standard Transformer encoder architecture, processing an unordered set of pixels from the input image with learnable position embeddings. The input sequence is mapped to a sequence of representations through multiple layers of Self-Attention and MLP blocks. Each pixel is projected into a high-dimensional vector via a linear projection layer, and a learnable [cls] token is appended to the sequence. Content-agnostic position embeddings are learned for each position. This design removes the locality inductive bias and makes PiT permutation equivariant at the pixel level.
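The following toy module sketches that input preparation, assuming small images and an arbitrary embedding width; it illustrates the idea rather than reproducing the paper's code.

```python
import torch
import torch.nn as nn

class PixelTokenizer(nn.Module):
    """Sketch of PiT-style input prep: each pixel (3 channels) is linearly
    projected to a d-dimensional token, a learnable [cls] token is appended,
    and content-agnostic position embeddings are learned per index."""
    def __init__(self, height: int, width: int, dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(3, dim)                        # pixel -> token
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # learnable [cls]
        self.pos = nn.Parameter(torch.zeros(1, height * width + 1, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> an unordered pixel sequence (B, H*W, 3)
        b = images.shape[0]
        pixels = images.flatten(2).transpose(1, 2)
        tokens = self.proj(pixels)
        tokens = torch.cat([self.cls.expand(b, -1, -1), tokens], dim=1)
        return tokens + self.pos        # fed to a standard Transformer encoder

tok = PixelTokenizer(height=28, width=28)
seq = tok(torch.randn(2, 3, 28, 28))    # (2, 785, 192)
```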

In empirical evaluations, PiT demonstrates competitive performance across various tasks. For image generation using diffusion models, PiT-L outperforms the baseline DiT-L/2 on multiple metrics, including FID, sFID, and IS, and PiT's effectiveness generalizes well across different tasks, architectures, and operating representations. On CIFAR-100 with 32×32 inputs, PiT also substantially outperforms ViT. The researchers found that self-supervised pre-training with MAE improves PiT's accuracy compared to training from scratch, and with pre-training the gap between ViT and PiT grows when moving from Tiny to Small models, suggesting that PiT can potentially scale better than ViT.

While PiT demonstrates that Transformers can directly work with individual pixels as tokens, practical limitations remain due to computational complexity. Nonetheless, this exploration challenges the notion that locality is fundamental for vision models and suggests that patchification is primarily a useful heuristic trading efficiency for accuracy. This finding opens new avenues for designing next-generation models in computer vision and beyond, potentially leading to more versatile and scalable architectures that rely less on manually inducted priors and more on data-driven, learnable alternatives.


Sketchpad: An AI Framework that Gives Multimodal Language Models (LMs) a Visual Sketchpad and Tools to Draw on the Sketchpad
https://www.marktechpost.com/2024/06/17/sketchpad-an-ai-framework-that-gives-multimodal-language-models-lms-a-visual-sketchpad-and-tools-to-draw-on-the-sketchpad/

One of the main challenges in current multimodal language models (LMs) is their inability to utilize visual aids for reasoning processes. Unlike humans, who draw and sketch to facilitate problem-solving and reasoning, LMs rely solely on text for intermediate reasoning steps. This limitation significantly impacts their performance in tasks requiring spatial understanding and visual reasoning, such as geometry, visual perception, and complex math problems. Addressing this challenge is crucial for advancing AI research, as it would enable LMs to mimic human-like reasoning more closely and improve their applicability in real-world scenarios.

Current methods to enhance LMs’ visual reasoning capabilities include text-to-image models and various multimodal tool-use paradigms. These methods allow LMs to generate visual content from text descriptions, aiming to facilitate better reasoning. However, they fall short in several aspects. Text-to-image models, for instance, do not enable dynamic interaction with the visual content created, which is essential for tasks requiring iterative reasoning. Additionally, existing methods often have high computational complexity, making them unsuitable for real-time applications. They also lack the flexibility to incorporate specialist vision models during the reasoning process, limiting their ability to handle diverse and complex visual tasks effectively.

A team of researchers from the University of Washington, the Allen Institute for AI, and the University of Pennsylvania propose SKETCHPAD, a novel framework that equips multimodal LMs with a visual sketchpad and the tools necessary for dynamic sketching. This approach addresses the limitations of existing methods by allowing LMs to draw lines, boxes, and marks, facilitating reasoning processes closer to human sketching. SKETCHPAD can integrate specialist vision models, such as object detection and segmentation models, to enhance visual perception and reasoning further. This innovative approach enables LMs to generate and interact with visual artifacts during reasoning, significantly improving their performance on various tasks. By providing a scaffold for sketch-based reasoning, SKETCHPAD represents a significant contribution to the field, offering a more efficient and accurate solution compared to existing methods.

The proposed method operates by synthesizing programs that generate visual sketches as intermediate reasoning steps. It uses common Python packages like Matplotlib and NetworkX for mathematical tasks and integrates specialist vision models for computer vision tasks. For instance, in geometry problems, SKETCHPAD enables the LM to draw auxiliary lines on diagrams to aid problem-solving. In tasks involving mathematical functions, it allows the LM to plot functions and analyze their properties visually. The framework requires no fine-tuning or training, making it readily applicable to existing multimodal LMs. SKETCHPAD's ability to use specialist models for tasks like object detection and segmentation further enhances its visual reasoning capabilities.
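For illustration, the snippet below shows the kind of short Matplotlib program a SKETCHPAD-style agent might synthesize and execute to produce a visual artifact before continuing its reasoning; the function, filename, and plotting choices are hypothetical.

```python
# The kind of short program a SKETCHPAD-style agent might emit and execute
# to create a sketch it can then look at (illustrative only).
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 400)
y = x**3 - 3 * x               # hypothetical function under discussion

fig, ax = plt.subplots()
ax.plot(x, y)
ax.axhline(0, linewidth=0.8)   # mark the x-axis to eyeball the roots
ax.set_title("Sketch for reasoning about roots and extrema")
fig.savefig("sketch.png")      # the image is fed back to the multimodal LM
```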

The researchers present extensive experiments demonstrating SKETCHPAD's effectiveness across a wide range of tasks, including geometry, graph algorithms, and complex visual reasoning. Key performance metrics such as accuracy, precision, and recall improve significantly with SKETCHPAD: on math tasks it achieves an average gain of 12.7%, and on vision tasks an average gain of 8.6%. In the paper's comparison of the proposed approach against existing baselines, SKETCHPAD raises accuracy on geometry problems from 37.5% to 45.8% with GPT-4 Turbo, and the improvement is statistically significant.


In conclusion, the proposed method presents SKETCHPAD, a novel framework that significantly enhances the reasoning capabilities of multimodal LMs by integrating visual sketching tools. The proposed solution overcomes the critical limitations of existing methods, offering a more efficient and accurate approach to visual reasoning. The results demonstrate substantial performance gains across various tasks, indicating SKETCHPAD’s potential impact on the field of AI research by enabling more human-like multimodal intelligence.


TiTok: An Innovative AI Method for Tokenizing Images into 1D Latent Sequences
https://www.marktechpost.com/2024/06/14/titok-an-innovative-ai-method-for-tokenizing-images-into-1d-latent-sequences/

In recent years, image generation has made significant progress due to advancements in both transformers and diffusion models. Similar to trends in generative language models, many modern image generation models now use standard image tokenizers and de-tokenizers. Despite showing great success in image generation, image tokenizers encounter fundamental limitations due to the way they are designed. These tokenizers are based on the assumption that the latent space should retain a 2D structure to maintain a direct mapping for locations between the latent tokens and image patches. 

This paper discusses three existing lines of work in image processing and understanding. First, image tokenization has been a fundamental approach since the early days of deep learning, using autoencoders to compress high-dimensional images into low-dimensional latent representations and decode them back. Second, tokenization for image understanding supports tasks such as image classification, object detection, segmentation, and multimodal large language models (MLLMs). Last is image generation, where methods have evolved from sampling variational autoencoders (VAEs) to generative adversarial networks (GANs), diffusion models, and autoregressive models.

Researchers from Technical University Munich and ByteDance have proposed an innovative approach that tokenizes images into 1D latent sequences, named Transformer-based 1-Dimensional Tokenizer (TiTok). TiTok consists of a Vision Transformer (ViT) encoder, a ViT decoder, and a vector quantizer, similar to typical Vector-Quantized (VQ) model designs. During the tokenization phase, the image is divided into patches, which are then flattened and combined into a 1D sequence of latent tokens. After the ViT encoder processes the image features, the resulting latent tokens form the image’s latent representation.
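A toy sketch of the core idea of compressing an image into a fixed number of discrete 1D tokens is given below; the stand-in encoder, codebook size, and image size are assumptions and are greatly simplified compared to TiTok's ViT-based design.

```python
import torch
import torch.nn as nn

class Toy1DTokenizer(nn.Module):
    """Sketch: encode an image into K latent vectors, then vector-quantize
    each against a learned codebook to get K discrete token IDs."""
    def __init__(self, k_tokens: int = 32, dim: int = 64, vocab: int = 4096):
        super().__init__()
        # Stand-in for TiTok's ViT encoder: any map from image to K * dim values.
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, k_tokens * dim))
        self.codebook = nn.Embedding(vocab, dim)
        self.k, self.dim = k_tokens, dim

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        z = self.encoder(images).view(-1, self.k, self.dim)          # (B, K, D)
        # Squared distances to every codebook entry: (B, K, vocab)
        dists = ((z.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        return dists.argmin(dim=-1)                                   # discrete token IDs

tok = Toy1DTokenizer()
ids = tok(torch.randn(2, 3, 64, 64))   # (2, 32) token IDs per image
```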

Beyond reconstruction, TiTok also demonstrates its efficiency in image generation using a typical pipeline. MaskGIT is chosen as the generation framework because of its simplicity and effectiveness: a MaskGIT model can be trained by simply replacing its VQGAN tokenizer with the TiTok tokenizer. The process begins by pre-tokenizing the image into 1D discrete tokens; at each training step a random ratio of the latent tokens is replaced with mask tokens, and a bidirectional transformer then takes this masked sequence as input and predicts the discrete token IDs at the masked positions, as sketched below.
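A minimal sketch of that training-step corruption, assuming a codebook of 4,096 entries with one extra reserved [MASK] id and a uniformly sampled mask ratio (MaskGIT itself uses a scheduled ratio, so treat this as illustrative only).

```python
import torch

MASK_ID = 4096   # assume a 4,096-entry codebook and reserve one extra ID as [MASK]

def mask_latent_tokens(token_ids: torch.Tensor):
    """One training step's corruption: replace a random fraction of the 1D
    latent tokens with [MASK]; the bidirectional transformer is trained to
    recover the original IDs at exactly those positions."""
    ratio = torch.rand(())                             # mask ratio for this step
    mask = torch.rand(token_ids.shape) < ratio         # True where we mask
    corrupted = token_ids.masked_fill(mask, MASK_ID)
    return corrupted, mask

ids = torch.randint(0, 4096, (2, 32))          # (batch, 32 TiTok tokens per image)
corrupted, mask = mask_latent_tokens(ids)
# Training-loss sketch: cross_entropy(transformer(corrupted)[mask], ids[mask])
```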

TiTok provides a much more compact latent representation, making it far more efficient than traditional methods. For example, a 256 × 256 × 3 image can be reduced to just 32 discrete tokens, compared to the 256 or 1024 tokens used by earlier techniques. Using the same generator framework, TiTok achieves a gFID of 1.97 on the ImageNet 256 × 256 benchmark, outperforming the MaskGIT baseline by 4.21 points. TiTok's advantages are even more significant at higher resolutions: on the ImageNet 512 × 512 benchmark, it not only outperforms the leading diffusion model DiT-XL/2 but also reduces the number of image tokens by 64 times, resulting in a generation process that is 410 times faster.

In this paper, researchers have introduced TiTok, an innovative method that tokenizes images into 1D latent sequences and can be used for both reconstructing and generating natural images. The compact formulation represents an image with 8 to 64 times fewer tokens than commonly used 2D tokenizers. Moreover, the compact 1D tokens speed up both training and inference of the generation model while achieving a competitive FID on the ImageNet benchmarks. Future work will focus on more efficient image representation and generation models built on 1D image tokenization.


DeepStack: Enhancing Multimodal Models with Layered Visual Token Integration for Superior High-Resolution Performance
https://www.marktechpost.com/2024/06/11/deepstack-enhancing-multimodal-models-with-layered-visual-token-integration-for-superior-high-resolution-performance/

Most LMMs integrate vision and language by converting images into visual tokens fed as sequences into LLMs. While effective for multimodal understanding, this method significantly increases memory and computation demands, especially with high-resolution photos or videos. Various techniques, like spatial grouping and token compression, aim to reduce the number of visual tokens but often compromise on detailed visual information. Despite these efforts, the fundamental approach remains the same: visual tokens are transformed into a 1D sequence and input into LLMs, inherently increasing processing overhead.

Researchers from Fudan University and Microsoft have developed “DeepStack,” a new architecture for LMMs. Instead of feeding a long sequence of visual tokens into the language model’s first layer, DeepStack distributes these tokens across multiple layers, aligning each group with a corresponding layer. This bottom-to-top approach enhances the model’s ability to process complex visual inputs without increasing computational costs. When applied to the LLaVA-1.5 and LLaVA-Next models, DeepStack shows significant performance gains across various benchmarks, particularly on high-resolution tasks, and can handle more tokens efficiently than traditional methods.

Recent advancements in LLMs like BERT, T5, and GPT have revolutionized natural language processing (NLP) using transformers and pretraining-then-finetuning strategies. These models excel in various tasks, from text generation to question answering. Simultaneously, LMMs like CLIP and Flamingo effectively integrate vision and language by aligning them in a shared semantic space. However, handling high-resolution images and complex visual inputs remains challenging due to high computational costs. The new “DeepStack” approach addresses this by distributing visual tokens across multiple LLMs or Vision Transformers (ViTs) layers, enhancing performance and reducing overhead.

DeepStack enhances LMMs using a dual-stream approach to incorporate fine-grained visual details without increasing context length. It divides image processing into a global view stream for overall information and a high-resolution stream that adds detailed image features across LLM layers. High-resolution tokens are upsampled and dilated, then fed into different LLM layers. This strategy significantly improves the model’s ability to handle complex visual inputs efficiently. Unlike traditional methods that concatenate visual tokens, DeepStack integrates them across layers, maintaining efficiency and enhancing the model’s visual processing capabilities.
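The snippet below sketches the layer-wise injection idea with a generic Transformer encoder: one group of visual features is added residually to the visual positions of the hidden states before each of the first few layers. Dimensions, the number of injected groups, and the placement of visual slots are assumptions, not DeepStack's actual implementation.

```python
import torch
import torch.nn as nn

class DeepStackStyleLM(nn.Module):
    """Sketch: rather than prepending every visual token at layer 0, add one
    group of visual features to the visual positions of the hidden states
    before each of the first few Transformer layers."""
    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_layers))

    def forward(self, hidden, visual_groups, visual_start: int = 0):
        # hidden: (B, T, D); each group in visual_groups: (B, V, D), all
        # aligned to the same V token positions starting at `visual_start`.
        for i, layer in enumerate(self.layers):
            if i < len(visual_groups):
                g = visual_groups[i]
                inject = torch.zeros_like(hidden)
                inject[:, visual_start:visual_start + g.shape[1]] = g
                hidden = hidden + inject       # residual, layer-wise injection
            hidden = layer(hidden)
        return hidden

lm = DeepStackStyleLM()
h = torch.randn(2, 40, 256)                         # visual slots + text tokens
groups = [torch.randn(2, 16, 256) for _ in range(3)]
out = lm(h, groups)                                 # (2, 40, 256)
```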

The experiments on DeepStack demonstrate its efficacy in enhancing multi-modal language models by integrating high-resolution visual tokens. Utilizing a two-stage training process, it leverages the CLIP image encoder to mosaic high-res image patches into whole-image features. During pre-training, the model uses 558k samples from LAION and other datasets, while fine-tuning incorporates 748k samples, adapting LLaVA’s pipeline. DeepStack consistently outperforms baselines like LLaVA on various VQA and multi-modal benchmarks, proving its capability to handle detailed visual information. It excels in text-oriented and zero-shot video QA tasks, confirming that early and strategic layer insertion of visual tokens significantly enhances model performance without extra computational cost.

In conclusion, DeepStack introduces an innovative approach to enhancing LMMs by stacking visual tokens across multiple model layers rather than feeding them all into the first layer. This method reduces computational and memory demands while significantly improving performance on high-resolution tasks. By distributing visual tokens across different layers of the transformer, DeepStack enables more effective interactions between these tokens across layers. This results in substantial gains, outperforming traditional models like LLaVA on various benchmarks. The technique proves particularly advantageous in tasks demanding detailed visual comprehension, paving the way for more efficient and powerful multimodal models.


NVIDIA’s Autoguidance: Improving Image Quality and Variation in Diffusion Models
https://www.marktechpost.com/2024/06/06/nvidias-autoguidance-improving-image-quality-and-variation-in-diffusion-models/

Improving image quality and variation in diffusion models without compromising alignment with given conditions, such as class labels or text prompts, is a significant challenge. Current methods often enhance image quality at the expense of diversity, limiting their applicability in various real-world scenarios such as medical diagnosis and autonomous driving, where both high quality and variability are crucial. Overcoming this challenge can enhance the performance of AI systems in generating realistic and diverse images, pushing the boundaries of current AI capabilities.

The existing method to address this challenge has been classifier-free guidance (CFG), which uses an unconditional model to guide a conditional one. CFG improves prompt alignment and image quality but reduces image variation. This trade-off occurs because the effects of image quality and variation are inherently entangled, making it difficult to control them independently. Furthermore, CFG is limited to conditional generation and suffers from task discrepancy problems, leading to skewed image compositions and oversimplified images. These limitations hinder the method’s performance and restrict its use in generating diverse and high-quality images.

Researchers from NVIDIA propose a novel method called autoguidance, which guides the generation process with a smaller, less-trained version of the main model instead of an unconditional model. This approach addresses the limitations of CFG by decoupling image quality from variation, allowing better control over each. The guiding model maintains the same conditioning as the main model, ensuring consistency in the generated images. The method significantly improves image generation quality and variation, setting new records on benchmarks such as ImageNet-512 and ImageNet-64, and can be applied to both conditional and unconditional models.

The core of the proposed method involves training a smaller version of the main model with reduced capacity and training time. This guiding model is then used to steer the main model during the generation process. The paper details the denoising diffusion process, which generates synthetic images by reversing a stochastic corruption process. The models are evaluated using metrics like Fréchet Inception Distance (FID) and FDDINOv2, showing significant improvements in image generation quality. For instance, using the small model (EDM2-S) on ImageNet-512, autoguidance improves FID from 2.56 to 1.34, outperforming existing methods.
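The guidance rule itself is a simple extrapolation from the guiding model's output toward the main model's output; the sketch below illustrates it with placeholder denoisers (the function signatures are assumptions, not the EDM2 interface).

```python
import torch

def guided_denoise(main_model, guiding_model, x_t, sigma, cond, w: float = 2.0):
    """Autoguidance-style sketch: extrapolate from the smaller, less-trained
    guiding model toward the main model. w = 1 recovers the main model alone;
    w > 1 pushes samples away from the guiding model's degraded predictions.
    Both models receive the same conditioning, unlike classifier-free guidance."""
    d_main = main_model(x_t, sigma, cond)
    d_guide = guiding_model(x_t, sigma, cond)
    return d_guide + w * (d_main - d_guide)

# Toy stand-ins for denoisers (placeholders only):
main = lambda x, s, c: 0.9 * x
guide = lambda x, s, c: 0.7 * x
x = torch.randn(1, 3, 8, 8)
out = guided_denoise(main, guide, x, sigma=torch.tensor(1.0), cond=None, w=2.0)
```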

Extensive quantitative results demonstrate the effectiveness of autoguidance. The proposed method achieves record FIDs of 1.01 at 64×64 and 1.25 at 512×512 resolution using publicly available networks, indicating a significant improvement in image quality without compromising variation. The evaluation compares autoguidance against CFG and other baselines, showing consistently superior performance. For instance, the proposed method achieved an accuracy of 87.5% on the ImageNet dataset, surpassing the previous state-of-the-art by 5%.

In conclusion, the novel method to improve image quality in diffusion models without compromising variation involves using a smaller, less-trained version of the model for guidance. The proposed autoguidance method overcomes the limitations of existing approaches like CFG. This innovative approach achieves state-of-the-art results in benchmark tests, significantly advancing the field of AI research by providing a more efficient and effective solution for generating high-quality and diverse images.


SignLLM: A Multilingual Sign Language Model that can Generate Sign Language Gestures from Input Text
https://www.marktechpost.com/2024/05/30/signllm-a-multilingual-sign-language-model-that-can-generate-sign-language-gestures-from-input-text/

The primary goal of Sign Language Production (SLP) is to create sign avatars that resemble humans using text inputs. The standard procedure for SLP methods based on deep learning involves several steps. First, the text is translated into gloss, a language that represents postures and gestures. This gloss is then used to generate a video that mimics sign language. The resulting video is further processed to create more interesting avatar movies that appear more like real people. Acquiring and processing data in sign language is challenging due to the complexity of these processes. 

Over the past decade, most studies have grappled with the challenges of a German sign language (GSL) dataset called PHOENIX14T and other lesser-known language datasets for Sign Language Production, Recognition, and Translation tasks (SLP, SLR, and SLT). These challenges, which include the lack of standardized tools and the slow progress in research on minority languages, have significantly dampened researchers’ enthusiasm. The complexity of the problem is further underscored by the fact that studies using the American Sign Language (ASL) dataset are still in their infancy.

Thanks to the current mainstream datasets, a lot of progress has been made in the sector. Nevertheless, they fail to tackle the new problems that are appearing:

  1. Pre-existing datasets sometimes include files in complicated forms, such as pictures, scripts, OpenPose skeleton key points, graphs, and perhaps other formats used for preprocessing. Directly trainable actionable data is absent from these forms. 
  2. Annotating glosses by hand is a tedious and time-consuming process. 
  3. After obtaining several sign video datasets from sign language experts, the data is transformed into various forms, making expanding the dataset very difficult.

Researchers from Rutgers University, Australian National University, Data61/CSIRO, Carnegie Mellon University, University of Texas at Dallas, and University of Central Florida present Prompt2Sign, a groundbreaking dataset that tracks the upper body movements of sign language demonstrators on a massive scale. This is a significant step forward in the field of multilingual sign language recognition and generation, as it is the first comprehensive dataset combining eight distinct sign languages and using publicly available online videos and datasets to address the shortcomings of earlier efforts.

To construct the dataset, the researchers begin by standardizing the pose information of video frames (the tool's raw material) into a preset format using OpenPose, a pose-estimation tool. Storing only the key information in this standardized format reduces redundancy and makes training with seq2seq and text2text models easier, as in the sketch below. To keep costs down, they auto-generate prompt words, decreasing the need for human annotation. Lastly, to address the issues with manual preprocessing and data collection, they increase the tools' level of automation, making them efficient and lightweight, which improves data-processing throughput without requiring additional models to be loaded.
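As a rough illustration of that standardization step, the sketch below reads one OpenPose frame JSON (OpenPose's standard `people` / `pose_keypoints_2d` fields) and keeps only upper-body x, y coordinates; the exact joint subset and output layout used by Prompt2Sign are assumptions.

```python
import json
import numpy as np

# Upper-body joint indices in OpenPose's BODY_25 layout (nose, neck, shoulders,
# elbows, wrists) -- the exact subset kept by Prompt2Sign is an assumption.
UPPER_BODY = [0, 1, 2, 3, 4, 5, 6, 7]

def standardize_frame(openpose_json_path: str) -> np.ndarray:
    """Reduce one OpenPose frame JSON to a flat (len(UPPER_BODY) * 2,) array
    of x, y coordinates, dropping confidences to cut redundancy."""
    with open(openpose_json_path) as f:
        frame = json.load(f)
    if not frame["people"]:
        return np.zeros(len(UPPER_BODY) * 2, dtype=np.float32)
    kp = np.array(frame["people"][0]["pose_keypoints_2d"],
                  dtype=np.float32).reshape(-1, 3)      # (25, 3): x, y, conf
    return kp[UPPER_BODY, :2].reshape(-1)               # keep x, y only
```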

The team highlights that existing models need adjustments because the new dataset presents different training obstacles. Because sign languages vary from country to country, several sets of sign language data cannot simply be trained together. Handling additional languages and larger datasets also makes training more complex and time-consuming and makes downloading, storing, and loading data more painful, so faster training techniques are essential. Furthermore, under-researched topics such as multilingual SLP, efficient training, and prompt understanding need investigation, since current model structures cannot handle more languages or more complicated, natural conversational inputs; this raises questions about improving a large model's generalization and its fundamental ability to understand prompts.

To address these issues, the team presented SignLLM, the first large-scale multilingual SLP model built on the Prompt2Sign dataset. Given text or prompts, it generates skeletal poses for eight different sign languages. SignLLM operates in two modes: (i) the Multi-Language Switching Framework (MLSF), which dynamically adds encoder-decoder groups to generate multiple sign languages in tandem, and (ii) the Prompt2LangGloss module, which enables SignLLM to generate using static encoder-decoder pairs.

The team aims to use the new dataset to set a standard for multilingual sign language recognition and generation. Their loss function incorporates a novel module grounded in reinforcement learning (RL) to speed up training on more languages and larger datasets, addressing the prolonged training times these factors cause. Extensive tests and ablation studies show that SignLLM outperforms baseline methods on both the development and test sets across all eight sign languages.

Even though the work makes great strides in automating the capture and processing of sign language data, it still falls short of a comprehensive end-to-end solution. For example, the team notes that to use a private dataset, one must first run OpenPose to extract 2D keypoint JSON files and then update them manually.


MedVersa: A Generalist Learner that Enables Flexible Learning and Tasking for Medical Image Interpretation
https://www.marktechpost.com/2024/05/30/medversa-a-generalist-learner-that-enables-flexible-learning-and-tasking-for-medical-image-interpretation/

Despite the advancement of artificial intelligence in medical science, existing systems have limited application, leaving a gap between task-specific AI solutions and real clinical needs. Researchers from Harvard Medical School, USA; Jawaharlal Institute of Postgraduate Medical Education and Research, India; and Scripps Research Translational Institute, USA, proposed MedVersa to address the challenges that hinder the widespread adoption of medical AI in clinical practice. The key issue is the task-specific design of existing models, which cannot adapt to the diverse and complex needs of healthcare settings. MedVersa, a generalist learner capable of multifaceted medical image interpretation, aims to solve these challenges.

Current medical AI systems are predominantly designed for specific tasks, such as identifying chest pathologies or classifying skin diseases. However, these task-specific approaches limit their adaptability and usability in real-world clinical scenarios. In contrast, MedVersa, the proposed solution, is a generalist learner that leverages a large language model as a learnable orchestrator. The unique architecture of MedVersa enables it to learn from both visual and linguistic supervision, supporting multimodal inputs and real-time task specification. Unlike previous generalist medical AI models that focus solely on natural language supervision, MedVersa integrates vision-centric capabilities, allowing it to perform tasks such as detection and segmentation crucial for medical image interpretation.

MedVersa’s method involves three key components: the multimodal input coordinator, the large language model-based learnable orchestrator, and various learnable vision modules. The multimodal input coordinator processes both visual and textual inputs, while the large language model orchestrates the execution of tasks using language and vision modules. This architecture enables MedVersa to excel in both vision-language tasks, like generating radiology reports, and vision-centric challenges, including detecting anatomical structures and segmenting medical images. For training the model, researchers combined more than 10 publicly available medical datasets for various tasks, such as MIMIC-CXR, Chest ImaGenome, and Medical-Diff-VQA, into one multimodal dataset, MedInterp. 

MedVersa employs advanced multimodal input coordination using distinct vision encoders and an orchestrator optimized for medical tasks. For the 2D and 3D vision encoders, researchers utilized the base version of the Swin Transformer pre-trained on ImageNet and the encoder architecture from the 3D UNet, respectively. They cropped 50–100% of the original images, resized them to 224 × 224 pixels with three channels, and applied various task-specific augmentations. Additionally, the system implements two distinct linear projectors for 2D and 3D data. MedVersa uses the Low-Rank Adaptation (LoRA) strategy to train the orchestrator. LoRA approximates updates to a large weight matrix in a neural network layer with a low-rank matrix decomposition. By setting LoRA's rank and alpha values to 16, the method keeps training efficient while modifying only a fraction of the model parameters.
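A minimal sketch of a LoRA-adapted linear layer with rank and alpha set to 16, as stated; this is a generic illustration of the technique, not MedVersa's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze original weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=16, alpha=16)
y = layer(torch.randn(2, 512))     # only A and B receive gradients
```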

MedVersa outperforms the existing state-of-the-art across multiple tasks, including radiology report generation and chest pathology classification. Its ability to adapt to impromptu task specifications, together with consistent performance on external cohorts, indicates robustness and generalization. In chest pathology classification, MedVersa surpasses DAM with an average F1 score of 0.615 versus DAM's 0.580. For detection, MedVersa beats YOLOv5 on a variety of anatomical structures, achieving better IoU scores on most of them, especially the lung zones. By incorporating vision-centric training alongside vision-language training, the model achieves an average improvement of 4.1% over models trained solely on vision-language data.

In conclusion, the study presents a state-of-the-art generalist medical AI (GMAI) model to support multimodal inputs, outputs, and on-the-fly task specification. By integrating visual and linguistic supervision within its learning processes, MedVersa demonstrates superior performance across a wide range of tasks and modalities. Its adaptability and versatility make it an important resource in medical AI, paving the way for more thorough and efficient AI-assisted clinical decision-making.

