Accurate segmentation is essential for echocardiography-based assessment of cardiovascular diseases (CVDs).
However, the variability among sonographers and the inherent challenges of ultrasound images hinder precise segmentation.
By leveraging the joint representation of image and text modalities, Vision-Language Segmentation Models (VLSMs) can incorporate rich contextual information, potentially aiding in accurate and explainable segmentation.
However, the lack of readily available data in echocardiography hampers the training of VLSMs.
In this study, we explore using synthetic datasets from Semantic Diffusion Models (SDMs) to enhance VLSMs for echocardiography segmentation.
We evaluate results for two popular VLSMs (CLIPSeg and CRIS) using seven different kinds of language prompts derived from several attributes, automatically extracted from echocardiography images, segmentation masks, and their metadata.
Our results show improved metrics and faster convergence when pretraining VLSMs on SDM-generated synthetic images before finetuning on real images.
The code and prompts are released at https://github.com/naamiinepal/synthetic-boost.
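As a rough illustration of how attribute-derived language prompts can be assembled, the sketch below composes progressively richer prompt strings from a few hypothetical attributes (target structure, chamber view, patient metadata, image quality). The attribute names and templates here are illustrative only; the seven prompt templates actually used in the paper are listed in the repository above.

```python
# Minimal sketch (not the paper's exact templates): compose language prompts
# from attributes extracted from an echocardiography image and its metadata.
from typing import Dict, List


def build_prompts(attrs: Dict[str, str]) -> List[str]:
    """Return progressively richer prompts, from an empty prompt to one
    that uses every available attribute."""
    target = attrs["target"]            # e.g. "left ventricle cavity"
    view = attrs.get("view", "")        # e.g. "apical four-chamber view"
    sex = attrs.get("sex", "")          # e.g. "female"
    quality = attrs.get("quality", "")  # e.g. "good image quality"

    return [
        "",                                                       # P0: no prompt
        target,                                                   # P1: target only
        f"{target} of the heart",                                 # P2: + anatomy
        f"{target} in the {view}",                                # P3: + view
        f"{target} in the {view} of a {sex} patient",             # P4: + metadata
        f"{target} in the {view} of a {sex} patient, {quality}",  # P5: all attributes
    ]


if __name__ == "__main__":
    example = {
        "target": "left ventricle cavity",
        "view": "apical four-chamber view",
        "sex": "female",
        "quality": "good image quality",
    }
    for i, prompt in enumerate(build_prompts(example)):
        print(f"P{i}: {prompt!r}")
```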
@inproceedings{adhikari2023synthetic,
  title={Synthetic Boost: Leveraging Synthetic Data for Enhanced Vision-Language Segmentation in Echocardiography},
  author={Adhikari, Rabin and Dhakal, Manish and Thapaliya, Safal and Poudel, Kanchan and Bhandari, Prasiddha and Khanal, Bishesh},
  booktitle={International Workshop on Advances in Simplifying Medical Ultrasound},
  pages={89--99},
  year={2023},
  organization={Springer}
}
arXiv
Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models
Kanchan Poudel, Manish Dhakal, Prasiddha Bhandari, Rabin Adhikari, Safal Thapaliya, and Bishesh Khanal
Medical image segmentation with deep learning is an important and widely studied topic because segmentation enables quantifying the size and shape of target structures, which can help in disease diagnosis, prognosis, surgery planning, and understanding.
Recent advances in foundation Vision-Language Models (VLMs) and their adaptation to segmentation tasks in natural images through Vision-Language Segmentation Models (VLSMs) have opened up a unique opportunity to build potentially powerful segmentation models for medical images that can take helpful information as language prompts, leverage the extensive range of other medical imaging datasets through pooled-dataset training, adapt to new classes, and be robust against out-of-distribution data with human-in-the-loop prompting during inference.
Although transfer learning from natural to medical images has been studied for image-only segmentation models, no studies have analyzed how the joint vision-language representation transfers to medical image segmentation or identified the gaps in leveraging its full potential.
We present the first benchmark study on transfer learning of VLSMs to 2D medical images, using 11 carefully curated existing 2D medical image datasets of diverse modalities and 9 types of language prompts derived from 14 attributes.
Our results indicate that VLSMs trained on natural image-text pairs transfer reasonably well to the medical domain in zero-shot settings for non-radiology photographic modalities when prompted appropriately; when finetuned, they achieve performance comparable to conventional architectures, even on X-ray and ultrasound modalities.
However, the additional benefit of language prompts during finetuning may be limited, with image features playing the more dominant role; nevertheless, VLSMs handle training on pooled datasets of diverse modalities better and are potentially more robust to domain shift than conventional segmentation models.
The code and prompts are released at https://github.com/naamiinepal/medvlsm.
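For a feel of the zero-shot setting described above, here is a minimal sketch using the publicly available CLIPSeg checkpoint on Hugging Face. This is not the benchmark pipeline from the paper (that lives in the repository above); the image path and prompts are placeholders.

```python
# Minimal zero-shot sketch with the public CLIPSeg checkpoint (not the
# paper's benchmark pipeline); image path and prompts are placeholders.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
model.eval()

image = Image.open("chest_xray.png").convert("RGB")    # placeholder image
prompts = ["lungs", "the two lungs in a chest X-ray"]  # placeholder prompts

inputs = processor(
    text=prompts,
    images=[image] * len(prompts),
    padding=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits     # (num_prompts, 352, 352)

masks = torch.sigmoid(logits) > 0.5     # one binary mask per prompt
print(masks.shape, masks.float().mean(dim=(1, 2)))  # foreground fraction per prompt
```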
@article{poudel2023exploring,
  title={Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models},
  author={Poudel, Kanchan and Dhakal, Manish and Bhandari, Prasiddha and Adhikari, Rabin and Thapaliya, Safal and Khanal, Bishesh},
  journal={arXiv preprint arXiv:2308.07706},
  year={2023},
  eprint={2308.07706},
  archiveprefix={arXiv},
  primaryclass={cs.CV},
  doi={https://doi.org/10.48550/arXiv.2308.07706}
}
2022
IEEE
Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet
Manish Dhakal, Arman Chhetri, Aman Kumar Gupta, Prabin Lamichhane, Suraj Pandey, and Subarna Shakya
In 2022 International Conference on Inventive Computation Technologies (ICICT), 2022
This paper presents an end-to-end deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text.
The model was trained and tested on the OpenSLR (audio, text) dataset.
Most audio clips in the dataset have silent gaps at both ends, which are clipped during preprocessing for a more uniform mapping between audio frames and their corresponding texts.
Mel Frequency Cepstral Coefficients (MFCCs) are used as the audio features fed into the model.
A model combining a one-dimensional CNN, a bidirectional LSTM, and ResNet produces the best results on this dataset among all the models trained so far (neural networks with variations of LSTM, GRU, CNN, and ResNet).
The model is trained with the Connectionist Temporal Classification (CTC) loss and uses CTC beam search decoding to predict the most likely sequence of Nepali characters.
A character error rate (CER) of 17.06% is achieved on the test dataset.
The code has been released at https://github.com/manishdhakal/ASR-Nepali-using-CNN-BiLSTM-ResNet.
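For readers unfamiliar with the MFCC + CTC setup, the sketch below shows the general pattern with librosa and PyTorch on a bundled example clip and a dummy transcript. The layer sizes and the tiny CNN + BiLSTM network are illustrative, not the exact CNN-BiLSTM-ResNet configuration from the paper (see the repository above for that).

```python
# Illustrative MFCC + CTC pattern (not the paper's exact architecture);
# hyperparameters, vocabulary size, and the dummy transcript are placeholders.
import librosa
import torch
import torch.nn as nn

# 1. Load audio, trim leading/trailing silence, extract MFCC features.
signal, sr = librosa.load(librosa.example("trumpet"), sr=16000)
signal, _ = librosa.effects.trim(signal)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)        # (13, T)
features = torch.tensor(mfcc.T, dtype=torch.float32)[None]     # (1, T, 13)

# 2. A tiny CNN + BiLSTM acoustic model emitting per-frame character logits.
num_chars = 70  # Nepali character set size + CTC blank (illustrative)

class TinyASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(13, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, num_chars)

    def forward(self, x):                      # x: (N, T, 13)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return self.fc(x).log_softmax(-1)      # (N, T, num_chars)

model = TinyASR()
log_probs = model(features).transpose(0, 1)    # CTC loss expects (T, N, C)

# 3. CTC loss against a dummy target transcript (index 0 is the blank).
targets = torch.randint(1, num_chars, (1, 20))
loss = nn.CTCLoss(blank=0)(
    log_probs,
    targets,
    input_lengths=torch.tensor([log_probs.size(0)]),
    target_lengths=torch.tensor([targets.size(1)]),
)
print(loss.item())
```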
@inproceedings{dhakal2022automatic,
  title={Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet},
  author={Dhakal, Manish and Chhetri, Arman and Gupta, Aman Kumar and Lamichhane, Prabin and Pandey, Suraj and Shakya, Subarna},
  booktitle={2022 International Conference on Inventive Computation Technologies (ICICT)},
  pages={515--521},
  year={2022},
  organization={IEEE},
  doi={https://doi.org/10.1109/ICICT54344.2022.9850832}
}