‘BioGPT’ – Microsoft’s new ChatBot that can perform tasks such as answering questions, extracting relevant data, and generating text relevant to biomedical literature

BioGPT is a type of generative language model, trained on millions of previously published biomedical research articles.

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Of the two main branches of pre-trained language models in the general language domain, BERT (and its variants) and GPT (and its variants), the first has been extensively studied in the biomedical domain, yielding models such as BioBERT and PubMedBERT.

BioGPT relies on deep learning, where artificial neural networks—meant to mimic neurons in the human brain—learn to process increasingly complex data on their own. As a result, the new AI program is a type of “black box” technology, meaning developers do not know how individual components of neural networks work together to create the output.

To assess the accuracy of generative AI models, researchers have developed benchmarks that measure natural language processing (NLP) ability, that is, how well a model understands text and spoken language. Microsoft’s recent paper assessed BioGPT on six NLP benchmarks, reporting that the new model outperformed previous models on most tasks. This includes the well-established PubMedQA benchmark, on which Microsoft reported that BioGPT achieved human parity.

The rise of BioGPT forms part of the wider push towards AI solutions in healthcare and the clinical trials industry. Recently, AI has shown the potential to improve clinical trial patient selection, predict drug development outcomes, and develop digital biomarkers.

The team studied the prompt design and target sequence design when applying BioGPT to downstream tasks and found that target sequences with natural language semantics are better than structured prompts explored in previous works.
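To make the distinction concrete, the short Python sketch below formats the same relation in two ways: as a structured target with marker tokens and as a plain English sentence. Both templates, the function names, and the example triple are illustrative assumptions, not the exact formats used in Microsoft’s paper.

```python
# Illustrative sketch only: both templates are assumptions, not the
# exact target formats used in the BioGPT paper.

def structured_target(head: str, tail: str, relation: str) -> str:
    """Build a structured target that uses special marker tokens."""
    return f"<head> {head} <tail> {tail} <relation> {relation}"


def natural_language_target(head: str, tail: str, relation: str) -> str:
    """Build a target phrased as an ordinary English sentence."""
    return f"the relation between {head} and {tail} is {relation}."


# Hypothetical example triple, chosen only for illustration.
example = ("aspirin", "cyclooxygenase-1", "inhibitor")

print(structured_target(*example))
print(natural_language_target(*example))
```

The second form reads like the text the model saw during pre-training, which is one plausible reason natural-language targets would work better.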

BioGPT is based on GPT-2 and was pre-trained on a corpus of 15 million PubMed abstracts. It performs better than earlier models on most of the six biomedical NLP tasks on which it was evaluated.

In PubMedQA, respondents, whether AI models or human annotators, must answer “yes,” “no,” or “maybe” to a series of biomedical questions based on corresponding abstracts from the database PubMed. For example, one PubMedQA prompt asks, “Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?”

BioGPT-Large, the largest version of the AI program, achieved a record 81% accuracy on PubMedQA, compared to an accuracy of 78% for a single human annotator. Most other NLP programs, including Google’s BERT family of language models, have not surpassed human accuracy.
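An accuracy figure like this is simply the share of questions for which the predicted label matches the reference label. The Python sketch below shows one way such a score could be computed from generated text. The helper names, the sample outputs, and the reference labels are all made up for illustration and are not taken from the PubMedQA dataset or from Microsoft’s evaluation code.

```python
import re
from typing import List, Optional


def extract_label(generated: str) -> Optional[str]:
    """Return the last yes/no/maybe word found in the text, or None."""
    matches = re.findall(r"\b(yes|no|maybe)\b", generated.lower())
    return matches[-1] if matches else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up model outputs and reference labels, for illustration only.
outputs = [
    "the answer to the question given the context is yes.",
    "the answer to the question given the context is no.",
    "the answer to the question given the context is maybe.",
    "the answer to the question given the context is yes.",
]
gold = ["yes", "no", "no", "yes"]

predicted = [extract_label(text) for text in outputs]
print(accuracy(predicted, gold))  # 3 of 4 correct -> 0.75
```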


