‘Falcon’: an LLM similar to ChatGPT

Falcon was built using custom tooling and relies on a unique data pipeline that extracts high-quality content from web data. It was trained with a custom codebase, independent from the works of NVIDIA, Microsoft, or HuggingFace.

A particular focus was put on data quality at scale. LLMs are notoriously sensitive to the quality of their training data, so significant care was taken in building a data pipeline that would both scale to tens of thousands of CPU cores for fast processing, and that would extract high-quality content from the web using extensive filtering and deduplication.

The architecture of Falcon was optimized for performance and efficiency. Combining high-quality data with these optimizations, Falcon significantly outperforms GPT-3 while using only 75% of the training compute budget, and it requires a fifth of the compute at inference time.

Falcon matches the performance of state-of-the-art LLMs from DeepMind, Google, and Anthropic.

Falcon was trained on –

Falcon is a 40-billion-parameter autoregressive decoder-only model trained on 1 trillion tokens. It was trained on 384 GPUs on AWS over the course of two months.
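To put these figures in perspective, here is a rough back-of-the-envelope sketch of the training compute. It uses the common rule of thumb of about 6 FLOPs per parameter per training token, which is an approximation and not a number reported by the Falcon team. The parameter count, token count, and GPU count come from the text; treating "two months" as 60 days is an assumption, and the function names are illustrative only.

```python
# Rough training-compute estimate using the 6 * N * D rule of thumb
# (about 6 FLOPs per parameter per training token). Approximation only.

PARAMS = 40e9   # 40 billion parameters (from the text)
TOKENS = 1e12   # 1 trillion training tokens (from the text)
GPUS = 384      # number of GPUs (from the text)
DAYS = 60       # ASSUMPTION: "two months" taken as 60 days

SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs as 6 * parameters * tokens."""
    return 6.0 * params * tokens


def petaflops_days(flops: float) -> float:
    """Convert total FLOPs to petaFLOP/s-days (1e15 FLOP/s sustained for one day)."""
    return flops / (1e15 * SECONDS_PER_DAY)


def implied_flops_per_gpu(flops: float, gpus: int, days: float) -> float:
    """Average sustained FLOP/s per GPU implied by the totals above."""
    return flops / (gpus * days * SECONDS_PER_DAY)


if __name__ == "__main__":
    total = training_flops(PARAMS, TOKENS)
    print(f"Total training compute: {total:.2e} FLOPs")
    print(f"In petaFLOP/s-days:     {petaflops_days(total):.0f}")
    per_gpu = implied_flops_per_gpu(total, GPUS, DAYS)
    print(f"Implied per-GPU rate:   {per_gpu / 1e12:.0f} TFLOP/s (under the 60-day assumption)")
```

The per-GPU rate is only as good as the 60-day assumption and the 6 * N * D approximation, so it should be read as an order-of-magnitude check rather than a measured throughput.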

Pretraining data was collected from public crawls of the web. Starting from CommonCrawl dumps, significant filtering (to remove machine-generated text and adult content) and deduplication produced a pretraining dataset of nearly five trillion tokens.
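The filter-then-deduplicate idea can be illustrated with a minimal sketch. This is not Falcon's actual pipeline, whose rules are not described here: the word-count threshold, the blocklist, and the exact-match hashing below are all hypothetical stand-ins chosen to keep the example small.

```python
import hashlib
import re
from typing import Iterable, Iterator

MIN_WORDS = 5                # HYPOTHETICAL threshold for "too short to be useful"
BLOCKLIST = {"lorem ipsum"}  # HYPOTHETICAL stand-in for unwanted-content rules


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_filters(text: str) -> bool:
    """Keep a document only if it is long enough and contains no blocklisted phrase."""
    norm = normalize(text)
    if len(norm.split()) < MIN_WORDS:
        return False
    return not any(term in norm for term in BLOCKLIST)


def filter_and_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Yield documents that pass the filters, skipping exact duplicates after normalization."""
    seen = set()
    for doc in docs:
        if not passes_filters(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
        "too short",
        "A second unique document with enough words.",
    ]
    for kept in filter_and_dedup(sample):
        print(kept)
```

Hashing the normalized text only catches exact copies; near-duplicates would need a fuzzy method, which is outside the scope of this sketch.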

To broaden Falcon’s abilities, this dataset was then extended with a few curated sources such as research papers and conversations from social media.

Finally, Falcon’s performance was validated against open-source benchmarks such as EAI Harness, HELM, and BigBench.

Potential Use Cases –

Falcon can generate creative text and solve complex problems.

It can be used in chatbots, customer service operations, virtual assistants, language translation, content generation, and sentiment analysis.

Broad use cases are foreseen for Falcon, although we are most excited about applications that reduce and automate “repetitive” work.

Falcon will help Emirati companies and startups become more efficient, streamlining internal processes and giving employees back time to focus on what matters.

At an individual level, chatbots embedding Falcon will be able to assist users in their daily lives.

Falcon 40B is Open Sourced

Technology Innovation Institute has publicly released the model’s weights for research and commercial use.
For researchers and developers, this makes Falcon 40B and Falcon 7B more accessible, as both are released under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0).