Since ChatGPT opened its free trial to users in late November 2022, it has quickly become a hit. Its ability to generate text, essays, jokes, poems, and even code from user input, coupled with an intuitive experience that feels like conversing with a person, has sparked a surge in popularity. According to research by UBS Group, ChatGPT reached 100 million active users by January 2023, making it the fastest-growing consumer application in history.
This wave of generative AI applications, sparked by ChatGPT, immediately attracted the attention of international giants including Microsoft and Google, who integrated related technologies into their products. For example, after Microsoft announced the upgrade of its Bing search engine with ChatGPT technology, its stock price rose by more than 4% and its market value soared by more than $80 billion overnight, demonstrating the significant influence of generative AI.
In the past decade, various industries have actively adopted AI and achieved strong results. However, ChatGPT's use of large language models (LLMs) to power generative AI has delivered unexpectedly good results, heralding the arrival of the AI 2.0 era.
The thresholds for entering the AI 2.0 era
However, to seize the AI 2.0 trend and obtain an LLM foundation model suited to your own domain or application, you need to overcome many hurdles. To begin LLM development, practitioners must first be familiar with distributed training of large-scale models and know how to train a single large model across multiple nodes. This involves Pipeline Parallelism (PP), Tensor Parallelism (TP), and Data Parallelism (DP), three parallelization strategies that are critical in cross-node model training. Because large models and their datasets far exceed the memory of a single GPU, the model must be appropriately partitioned by width (TP) and depth (PP) and the dataset split across replicas (DP), so that the combined memory of many GPUs can hold the model and data for efficient computation. Tuning TP, DP, and PP is therefore one of the keys to large-model training performance. Effective memory management matters just as much: the Zero Redundancy Optimizer (ZeRO) technique shards optimizer states, gradients, and parameters to eliminate redundant copies across data-parallel replicas, while the 1F1B (one-forward-one-backward) pipeline schedule interleaves forward and backward passes to limit the activation memory held at any moment, reduce pipeline idle time, and improve training throughput.
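To make the partitioning arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The 16-bytes-per-parameter figure assumes standard mixed-precision Adam training, and the TP/PP/DP degrees are hypothetical choices for illustration, not the configuration of any particular run:

```python
# Back-of-the-envelope memory estimate for a 3D-parallel training setup.
# Illustrative numbers only: 16 bytes/parameter assumes fp16 weights and
# gradients plus fp32 Adam optimizer states; the TP/PP/DP sizes below
# are hypothetical.

PARAMS = 176e9            # BLOOM-scale model, 176B parameters
BYTES_PER_PARAM = 16      # 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 Adam states)

TP, PP, DP = 8, 12, 4     # tensor-, pipeline-, and data-parallel degrees
world_size = TP * PP * DP # total number of GPUs

# TP and PP actually shard the model; DP replicates it (before ZeRO).
model_state_gb = PARAMS * BYTES_PER_PARAM / 1e9
per_gpu_gb = model_state_gb / (TP * PP)

# ZeRO stage 1 additionally shards the 12-byte optimizer states across DP ranks.
zero1_per_gpu_gb = PARAMS * (4 + 12 / DP) / (TP * PP) / 1e9

print(f"GPUs: {world_size}, model states: {model_state_gb:.0f} GB total")
print(f"Per GPU without ZeRO: {per_gpu_gb:.1f} GB")   # ~29 GB
print(f"Per GPU with ZeRO-1:  {zero1_per_gpu_gb:.1f} GB")  # ~13 GB
```

Under these illustrative settings, a 2.8 TB pool of model states shrinks to roughly 29 GB per GPU, and ZeRO-1 cuts it further, which is why the combination of TP, PP, DP, and ZeRO makes a 176B-parameter model trainable on commodity 40 to 80 GB accelerators.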
Second, a corresponding amount of computing power is needed, because the compute requirements (FLOPs) of large models keep growing. For example, GPT-3 175B requires as much as 3.64 × 10³ petaflop/s-days of computation. And beyond raw compute, a high-performance storage system such as GPFS is necessary for LLM training to run effectively.
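For intuition, the widely used approximation "training FLOPs ≈ 6 × parameters × tokens" roughly reproduces that figure. A short sketch, assuming GPT-3's published 175B parameters and roughly 300B training tokens:

```python
# Rough training-compute estimate using the common 6 * N * D approximation.
# Assumes GPT-3's published scale: N = 175B parameters, D ~ 300B tokens.

N = 175e9                  # parameters
D = 300e9                  # training tokens
flops = 6 * N * D          # ~3.15e23 FLOPs

pf_day = 1e15 * 86400      # one petaflop/s sustained for one day
print(f"{flops:.2e} FLOPs ~ {flops / pf_day:.2e} petaflop/s-days")
# -> ~3.6e3 petaflop/s-days, consistent with the 3.64 x 10^3 figure above
```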
The third hurdle is mastering fine-tuning and prompt-tuning techniques. In-context learning, for example, recasts downstream tasks as prompts fed to the base model, so tasks can be handled without storing separate fine-tuned parameters for each one while improving the model's grasp of the task. This yields generalization capabilities that approach human reasoning patterns, shifting from learning over large labeled datasets to zero-shot or few-shot learning. Prompt tuning for a specific domain or goal involves designing domain-specific prompt strategies that guide the model to generate text matching the desired style and objectives; prompt templates tailored to the usage context speed up the model's learning and the overall training process. While AI models can generate high-quality content, the output may still fall short of user expectations in some cases. Prompt tuning improves the quality of generated content, saves time and cost, increases content diversity, and enhances user interaction, significantly improving the practicality and effectiveness of AI-generated content.
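The sketch below shows what "recasting a task as a prompt" can look like in practice: a few-shot template for sentiment classification. The examples and labels are made up purely for illustration; in a real deployment the demonstrations would come from your own domain data.

```python
# A minimal few-shot prompt template for sentiment classification.
# The demonstrations below are invented for illustration only.

FEW_SHOT_EXAMPLES = [
    ("The battery lasts all day and charges fast.", "positive"),
    ("The screen cracked within a week.", "negative"),
]

def build_prompt(query: str) -> str:
    """Recast a downstream task as a text prompt (in-context learning)."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model completes this line
    return "\n".join(lines)

print(build_prompt("Shipping was slow, but the product works great."))
```

No gradient updates are involved: the two demonstrations alone steer the model toward the desired output format, which is exactly the few-shot behavior described above.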
The fourth hurdle is large-model inference, because deploying and serving LLMs requires an optimized environment. Since LLMs are too large for a single GPU to handle, a multi-GPU inference architecture is needed to meet low-latency requirements. It also calls for kernel-level optimizations that improve per-GPU performance, such as fusion techniques that merge operators vertically and horizontally and reduce memory traffic.
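As one practical illustration of multi-GPU serving, the sketch below uses the Hugging Face transformers library (with the accelerate package installed) to spread a model's layers across all visible GPUs at load time. The checkpoint name is an illustrative stand-in, and this simple layer-wise sharding is a lighter-weight alternative to the tensor-parallel, kernel-fused serving stacks described above:

```python
# Sketch: sharding a checkpoint across available GPUs for inference with
# Hugging Face transformers + accelerate. device_map="auto" places layers
# on devices automatically. The checkpoint below is illustrative; a full
# 176B-parameter model would need a node with several 80 GB GPUs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"  # illustrative; swap in a larger checkpoint as hardware allows

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # spread layers across all visible GPUs
    torch_dtype=torch.float16,  # halve memory relative to fp32
)

inputs = tokenizer("Deep learning is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```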
The final hurdle is to prepare a high-performance system environment, including computing, networking, and storage, all of which must be able to work together to achieve the goal of optimizing the model training environment.
Open-source large language models can help popularize AI 2.0
It is evident that the development threshold for LLMs is extremely high; even for international giants like Microsoft and Google, launching an LLM on their own is no easy task. Moreover, for various business and other reasons, these giants often restrict their clients' access to and use of the complete model.
Fortunately, the BigScience research project, comprising thousands of researchers worldwide, trained BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) over 117 days on the French supercomputer Jean Zay, completing it in July 2022. With 176 billion parameters, BLOOM has a parameter count and architecture similar to GPT-3. Its training dataset contains 1.5 TB of text in 46 natural languages and 13 programming languages, including Spanish, Japanese, German, Chinese, and various Indian and African languages. Its main tasks include text classification, dialogue generation, text generation, translation, knowledge answering (semantic search), and summarization. Users can select a language and ask BLOOM to write recipes, translations, or summaries, or even to write code.
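Because the BLOOM weights are public, trying these tasks takes only a few lines. A sketch using the smallest public checkpoint (bigscience/bloom-560m), which runs on a single GPU or even a CPU; the prompt is illustrative, and larger BLOOM variants expose the same interface:

```python
# Trying BLOOM's multilingual generation with the smallest public
# checkpoint. The prompt is illustrative; quality improves markedly
# with the larger BLOOM variants.

from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompt = "Translate to French: The weather is nice today.\nFrench:"
result = generator(prompt, max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])
```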
It's worth noting that BLOOM is the first "open-source" large language model, giving academia, non-profit organizations, and SMEs the opportunity to use resources that are typically only available to a few major international companies. However, due to the sheer volume of data and the scale of the model, users still face development and maintenance challenges. Furthermore, the lack of training experience and talent makes launching an LLM model even more difficult.
The chief scientist at deep-learning company Lambda Labs has estimated that training the GPT-3 model would cost at least $4.6 million and consume about 355 GPU-years of compute, i.e., 355 years on a single V100 GPU. Therefore, even though BLOOM is open source, most businesses still need the assistance of consulting service providers to help them cross the AI 2.0 threshold.
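That estimate is roughly consistent with simple cloud-pricing arithmetic. A sketch, where the ~$1.50 per V100 GPU-hour rate is an assumption based on typical cloud pricing at the time rather than a figure from the source:

```python
# Rough reconstruction of the Lambda Labs estimate: 355 GPU-years of
# V100 time at an assumed ~$1.50 per GPU-hour (illustrative cloud rate;
# actual pricing varies by provider and commitment).

gpu_years = 355
hours = gpu_years * 365 * 24          # ~3.11 million GPU-hours
cost = hours * 1.50
print(f"{hours:,.0f} GPU-hours -> ${cost / 1e6:.1f}M")   # ~ $4.7M
```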
AI 2.0 advisory services help overcome development hurdles
Because BLOOM has as many as 176 billion parameters, it cannot be trained on any single GPU. Parallelization techniques are needed to partition the model precisely, tune the TP + DP + PP configuration, and distribute training efficiently to accelerate results. Running BLOOM-scale training and inference quickly on a cloud platform requires world-class supercomputing resources such as the AIHPC service provided by TWSC.
Traditional cross-node parallel computing suffers performance degradation as the number of nodes increases. For example, if one node delivers 100 units of computing power, ideal linear scaling says two nodes should deliver 200, but in practice they may deliver only 180, because inter-node communication and data-transfer overhead eats into efficiency.
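A common way to quantify this is scaling efficiency, the ratio of measured to ideal throughput. With the numbers above:

```python
# Scaling efficiency for the example above: two nodes deliver 180 units
# of throughput where perfect linear scaling would give 200.

ideal = 2 * 100
actual = 180
print(f"Scaling efficiency: {actual / ideal:.0%}")  # 90%
```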
TWSC's cross-node parallel computing environment, however, leverages an InfiniBand fabric for efficient collaboration between nodes, and its BLOOM runs achieved near-linear cross-node scaling, a near-ideal high-performance result. This helps users fully exploit the available computing performance, with training time continuing to drop as nodes are added.
Using 105 nodes (840 GPUs), the model was precisely partitioned for massively parallel computation, delivering excellent training results with every GPU card running at close to maximum performance. This demonstrates that the TWSC cloud platform can not only train large BLOOM-class models efficiently but also overcome the challenges of multi-node inference.
Building on these BLOOM results, TWSC has also begun to offer a one-stop integrated service called the "AI 2.0 Large Computing Power Consulting Service". The service provides AI experts, AIHPC technical environment resources, and LLM development services, integrating and optimizing the related toolkits and environments so that customers can launch LLM projects directly with zero risk, accelerate the turning of requirements into usable models and applications, and build large language models of their own. Enterprises can cut enormous investments in time, technology, development risk, hardware, and staffing, saving at least millions of dollars and ensuring every dollar of investment is well spent.
● Learn about the "AI 2.0 Large Computing Power Consulting Service": https://tws.twcc.ai/ai-llm
● Register now for the AIHPC x LLM Large Language Model Showcase on March 17th: https://tws.twcc.ai/2023/02/23/llm2/