Since the beginning of the year, the ChatGPT media frenzy seems to be sweeping everything in its path. However, in recent weeks, announcements from the open-source community have shown the vitality and inventiveness of these solutions. What is the role of open source in the current landscape of generative AI?
The death of a duel foretold
On one side, Google, the data-processing giant, whose AI research is embodied by its emblematic subsidiary, DeepMind. On the other, OpenAI, backed by Microsoft, which has opened ChatGPT to the public and brought LLMs out of the laboratories where they had so far been confined.
The behemoth against the challenger.
The forces present in the booming landscape of generative, text-based AI seem well identified: Google versus OpenAI/Microsoft in the ring and, warming up in the gym, the challengers Meta, Amazon and Anthropic. The near-exponential growth in ChatGPT registrations is a warning signal for the Mountain View firm, which knows all too well that the first mover often takes the prize. Yet this confrontation, ultimately a rather classic one between data rivals, is turned upside down in a matter of weeks by the emergence of a new player: open source.
In early May, an internal memo by a Google researcher leaks via a public Discord server. In it, he argues that neither Google nor OpenAI is in a position to dominate the market for Large Language Models (LLMs), the technology at the heart of ChatGPT and Bard. According to him, the open-source wave is unstoppable. The prediction may seem exaggerated given the almost limitless financial resources of the tech giants, but it rests on a cascade of open-source announcements that are reshuffling the capabilities and applications of AI. AutoGPT, GPT4All, StableLM, BabyAGI, Alpaca, GPT-J, OpenLLaMA, Vicuna: these innovations, some of them major, appeared not over a few months but over a few weeks. AutoGPT, in particular, accumulated over 40,000 stars on GitHub in 48 hours...
To explain the explosion of open-source offerings, which were virtually non-existent a few months ago, it is necessary to take a step back.
LoRA and Chinchilla
2022: DeepMind announces Chinchilla, an NLP model that performs at least as well as GPT-3 with roughly two and a half times fewer parameters, the trick being to train a smaller network on far more data. The model stays on the shelf, but it lays the theoretical foundation for a slimming program.
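As a rough back-of-the-envelope illustration of the Chinchilla finding (for a fixed compute budget, it pays to train a smaller model on more tokens), here is a tiny Python sketch. The twenty-tokens-per-parameter ratio is the paper's rule of thumb, and the output is only an order-of-magnitude estimate, not DeepMind's actual computation.

```python
# Illustrative only: the Chinchilla rule of thumb of ~20 training tokens per parameter.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a given model size."""
    return n_params * tokens_per_param

for name, n_params in [("GPT-3 (175B params)", 175e9), ("Chinchilla (70B params)", 70e9)]:
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{name}: ~{tokens / 1e12:.1f} trillion tokens to be compute-optimal")
```

Under this rule, a 70-billion-parameter model calls for roughly 1.4 trillion training tokens, close to what Chinchilla was actually fed, whereas a 175-billion-parameter model trained on a few hundred billion tokens is, by the same logic, under-trained.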
Another development: the LoRA (Low-Rank Adaptation) method, which emerged from Microsoft's research labs. Rather than retraining all the parameters of a large model such as GPT-3, LoRA freezes the original weights and trains small, low-rank matrices whose output is added to that of the frozen layers. Only a tiny fraction of the parameters has to be learned, so a large model can be specialized at a fraction of the cost: in effect, AI parameterizing AI.
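To make the principle concrete, here is a minimal PyTorch sketch of a LoRA-style layer. It illustrates the idea rather than reproducing Microsoft's implementation; the class name, rank and scaling values are arbitrary, and real projects would typically rely on a library such as Hugging Face's peft.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a zero update
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen full-rank path + trainable low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Because only `lora_A` and `lora_B` are trainable, the number of parameters to learn is a small fraction of the base layer's, which is what makes adapting a large model affordable.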
The llama escapes
In February 2023, Meta announces LLaMA, its latest language model, open to the scientific community but closed to the general public. Meta's goal is to gather feedback from specialists, both to improve the model's robustness and to study the many biases, fabrications and other falsehoods that afflict LLMs.
A week after its release, the model's weights leak on 4chan and become freely available on torrent servers. The weights (more generally, the parameters) are what carry most of a model's value: they are the learned values that transform the data passing through each node of the network. Whoever holds the weights can modify the model and adapt it to their own objectives. Anyone can now create their own version of Meta's model: the llama has well and truly escaped.
Always Bigger
Until now, the race for power among LLMs has been quantitative: absorb ever more massive datasets and tune the billions of parameters that result. To achieve the best possible performance, you always need more data, more parameters and more computing capacity. Training costs run into the millions of dollars, and OpenAI had to turn to Microsoft: designing an LLM is out of reach without deep pockets.
The results are spectacular for the general public, who are discovering an AI that (at least partly) deserves its name. But the gigantism of LLMs raises new problems. Satisfactory explainability of the AI's responses becomes impossible to obtain: the famous black-box effect of neural networks, taken to the extreme. Without explainability, trust in the results suffers. Ethical and security filters look more like emergency patches than genuine mastery of the AI. Data confidentiality and control issues become more acute. Processing times lengthen, training data runs short, and hallucinations persist. In short, LLMs are overweight.
Quality over Quantity
The open-source community quickly grasped the value of the LLaMA it had "recovered": a model that can be modified, a rationalized parameter count, and modest computational needs for fine-tuning. The combination is radical: a developer or data scientist armed with a moderately powerful laptop can fine-tune LLaMA overnight and run it locally.
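As an illustration of how light this has become, here is a hedged sketch using the Hugging Face transformers and peft libraries. The model identifier, quantization choice and hyperparameters are assumptions made for the example, not a recipe taken from any of the teams mentioned in this article.

```python
# Sketch of parameter-efficient fine-tuning on modest hardware (8-bit loading requires bitsandbytes).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "openlm-research/open_llama_7b"   # an openly licensed LLaMA reproduction, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, load_in_8bit=True, device_map="auto")

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],       # attach LoRA adapters to the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()             # typically well under 1% of the base model's weights
```

From there, a standard supervised fine-tuning loop (for example with the transformers Trainer) only updates the adapter weights, which is why a single consumer GPU and one night of training can be enough.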
The problem of dataset size remains. But researchers discover that data quality takes precedence over quantity: while substantial volumes are still needed to obtain good results, high-quality datasets partly compensate for the immense corpora used by the giants.
In March, a team from Stanford University fine-tunes the 7-billion-parameter version of LLaMA on a dataset of 52,000 instruction/response pairs generated with OpenAI's GPT-3. The resulting model remains modest in size, yet the quality of its responses comes close to that of the model it imitates: welcome to Alpaca.
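To give a concrete idea of what such training data looks like, here is a sketch of an instruction-style record and how it can be flattened into a training prompt. The field names follow the published Alpaca format, but the record content and the exact prompt wording below are invented for illustration.

```python
# One record in the instruction/input/output style used by the Alpaca dataset (content invented here).
record = {
    "instruction": "Summarize the following sentence in five words or fewer.",
    "input": "Open-source language models are improving at a remarkable pace.",
    "output": "Open models improve very quickly.",
}

# Flattening a record into a single text sequence for supervised fine-tuning.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides context.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
training_text = PROMPT_TEMPLATE.format(**record) + record["output"]
print(training_text)
```

Multiply this by 52,000 machine-generated examples and you have, in essence, the recipe: a small, curated instruction set rather than another scrape of the entire web.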
Meta's Revenge
The Google memo notes that, among the tech giants, the only real winner from this "democratization" of LLMs is Meta. It is admittedly a victory the company did not seek, but since LLaMA is the natural parent of most recent open models (Alpaca, Vicuna, GPT4All...), it is becoming a de facto standard. LLaMA is now the technical foundation of most open LLMs, and Meta could keep improving its neural network to establish a kind of de facto platform, along with the marketplace that goes with it... An ecosystem of models optimized by users, lightweight and efficient, would offer a quality-to-cost ratio that seems impossible to catch up with for the GAFA companies, which have to run their training on $100,000-per-hour supercomputers and shoulder exorbitant infrastructure costs. Whether Meta and the laboratory led by Yann LeCun share this vision remains to be seen... In the meantime, the legal vacuum around intellectual property in model weights creates a situation conducive to the rise of open source, and models such as GPT-J already come with licenses that allow commercial use.
It's ultimately a case of the hunter becoming the hunted: while those who produce quality data accuse OpenAI and Google of using their datasets to train their massive models, these models, in turn, are easily cloned at low cost.
The Irresistible Open Source?
The development of inexpensive, efficient open-source models backed by an active community cannot be matched, even by deep-pocketed laboratories. Open source looks like the short- and medium-term future of AI. If we can use open models and train them at low cost, why keep paying OpenAI or Google? Why send data to California when a local, private AI can be integrated and tailored to our own knowledge corpus? The open-source community's adaptations of LLaMA already target use cases such as mobile, and they open the door to the next big step in AI: embedded engines. Think robotics and IoT. It is in this light that the notable silence of another tech giant, Apple, can be read. Apple has both software mastery and control over its hardware, and its M1 chips include parallel computing units well suited to machine learning. It is quite likely that Apple will soon announce embedded AI in which users configure the data they are willing to share, unlocking the full potential of their iPhone or of the long-awaited AR glasses.
Freedom of Choice
The open-source wave, which is shaping a landscape of multiple AIs optimized for users' needs, will undoubtedly be the lever that drives the growth of a truly decentralized ecosystem. However, rather than pitting open source against the major AI players, at Smile we believe the customer should have the choice. As specialists in open-source solutions, we are convinced that the open-source ecosystem will keep diversifying and offering new services and applications to our clients. But we also think that relying on robust, secure infrastructures can make sense when dealing with vast data sources or when building an IT system on proven solutions such as those offered by Microsoft, Google or AWS. These solutions, too, keep evolving and opening up new possibilities (OpenAI, for instance, is seeing a whole ecosystem of plugins emerge).
The diversity of solutions is a sign of a healthy and evolving market, and it's positive to see that open source is already an interesting alternative.
As with any new technology, it can sometimes be challenging to see clearly and understand what it can concretely bring to your business. If you want to explore the possibilities offered by LLMs, the possible technical choices, and the use cases that can be addressed, please don't hesitate to contact us. We'll be delighted to discuss with you the best options available for your business!