MAP-Neo:
Highly Capable and Transparent Bilingual
Large Language Model Series

M-A-P, University of Waterloo, Wuhan AI Research, 01.AI

MAP-Neo shows impressive performance on both base (left) and chat (right) models compared to popular open-weight and recent transparent large language models of similar sizes.

Abstract

Large Language Models (LLMs) have made great strides in recent years, achieving unprecedented performance across different tasks. However, due to commercial interests, the most competitive models such as GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing their training details. Recently, many institutions have open-sourced several strong LLMs such as LLaMA-3, which are comparable to existing closed-source LLMs. However, only the model weights are provided, with most details (e.g., intermediate checkpoints, pre-training corpus, and training code) remaining undisclosed. To improve the transparency of LLMs, the research community has begun to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), for which more details (e.g., the pre-training corpus and training code) are provided. These models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that existing truly open LLMs are still inferior to state-of-the-art LLMs of similar model sizes on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs. Moreover, we open-source all details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, the data cleaning pipeline, checkpoints, and a well-optimized training/evaluation framework. Finally, we hope MAP-Neo will strengthen the open research community and inspire more innovation and creativity to facilitate further improvements of LLMs.

Matrix Data Pile

Statistics of the Matrix Pile Data Distribution: The inner pie chart represents the language distribution, while the outer loop indicates the proportion of meta-categories in the corpus.

  • Compositions. The final composition of the corpus is as follows: 52.55% from Common Crawl, 22.29% from programming code, and the rest from academic papers, books, and other printed materials, as illustrated in the figure above; a short sketch of the implied per-category token counts follows.
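To make the mixture concrete, here is a minimal, hypothetical Python sketch (not the authors' tooling) that converts the reported shares into rough token counts, assuming the 4.5T-token total stated in the abstract and lumping everything outside Common Crawl and code into a single residual bucket.

# Minimal sketch: rough per-category token counts implied by the reported
# Matrix mixture and the 4.5T-token total. The single residual bucket for
# papers/books/other printed materials is an illustrative simplification.
TOTAL_TOKENS = 4.5e12  # 4.5T tokens

mixture = {
    "common_crawl": 0.5255,
    "programming_code": 0.2229,
    "papers_books_other": 1.0 - 0.5255 - 0.2229,  # remaining ~25.16%
}

for category, share in mixture.items():
    print(f"{category:>20}: {share:7.2%}  ~{share * TOTAL_TOKENS / 1e12:.2f}T tokens")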

Document Conversion Pipeline

The document conversion framework is composed of various sub-models for different parts.

  • The Pipeline of Document Conversion. Compared to noisy web content, documents are usually better formatted, focused on concentrated topics, and written with more consistent expressions. They are therefore a gold mine of high-quality corpus material, except that the gold lies deep beneath the digital dirt: such documents are mostly stored as standard PDFs with diverse layouts or as scanned images of inconsistent quality, which makes it challenging to build datasets upon them. We observe two core issues in designing an effective conversion pipeline to extract plain text from documents: i) analyzing layout information and identifying different layout elements, including text, titles, captions, images, tables, and formulas, and ii) recognizing the relationships among these layout components. A minimal sketch of such a two-step pipeline is given below.
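The following Python sketch illustrates the two steps under stated assumptions: detect_layout, recognize_text, and the element types are hypothetical stand-ins for whatever layout-analysis and recognition sub-models a real pipeline plugs in; it is not the authors' framework.

# A minimal sketch of the two-step document conversion described above.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutElement:
    kind: str                                 # "text", "title", "caption", "image", "table", "formula"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1) on the page
    content: str = ""

def detect_layout(page_image) -> List[LayoutElement]:
    """Step i): locate and classify layout elements (hypothetical sub-model call)."""
    raise NotImplementedError

def recognize_text(element: LayoutElement, page_image) -> str:
    """Run an element-specific recognizer (OCR, table parser, formula model, ...)."""
    raise NotImplementedError

def convert_page(page_image) -> str:
    """Step ii): order elements into a plausible reading order and join their text."""
    elements = detect_layout(page_image)
    for el in elements:
        if el.kind != "image":
            el.content = recognize_text(el, page_image)
    # Naive reading order: top-to-bottom, then left-to-right; a real pipeline
    # would model column layout and element relationships explicitly.
    elements.sort(key=lambda el: (el.bbox[1], el.bbox[0]))
    return "\n".join(el.content for el in elements if el.content)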

Pre-training

The data mixture ratios in the MAP-Neo pre-training stage: the left shows the fundamental phase and the right shows the decay phase.

  • The Data Ratio of Pre-training. In the pre-training process, we employ a two-stage strategy to train the MAP-Neo model. The first stage, termed the fundamental phase, trains the model on a vast corpus of generic texts to develop its general text generation capability. Subsequently, during the decay phase, we focus on enhancing the reliability of the model's generated content by incorporating high-quality data and more code data. The distribution of data used across the two phases is depicted in the figure above; note that we increase the volume of code data in the decay phase. Specifically, during the fundamental phase, since Stack V2 was not yet available, we used Stack V1 and repeated the dataset twice to achieve a balanced data ratio. In the decay phase, with the release of Stack V2, we incorporated it as the code component of training. Moreover, we further tune the data distribution, for example by duplicating high-quality sources such as books, judicial decisions, and government reports, to improve the model's performance. A rough sketch of phase-dependent mixing follows.
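As a rough illustration of phase-dependent mixing, the Python sketch below samples training sources with different weights in the two phases; the weights and source names are placeholders chosen for the example, not the ratios shown in the figure.

# Illustrative phase-dependent data mixture: the decay phase upweights code and
# duplicated high-quality sources, mirroring the strategy described above.
import random

PHASE_MIXTURES = {
    # Fundamental phase: mostly generic web text; Stack V1 repeated twice is
    # modeled here simply as a larger code weight.
    "fundamental": {"web": 0.55, "code_stack_v1": 0.22, "papers_books": 0.15, "other": 0.08},
    # Decay phase: Stack V2 replaces V1, and high-quality sources are upweighted.
    "decay": {"web": 0.35, "code_stack_v2": 0.35, "papers_books": 0.20, "high_quality_dup": 0.10},
}

def sample_source(phase: str, rng: random.Random) -> str:
    """Pick which data source the next training document is drawn from."""
    sources, weights = zip(*PHASE_MIXTURES[phase].items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source("fundamental", rng) for _ in range(5)])
print([sample_source("decay", rng) for _ in range(5)])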

Scaling Law of MAP-Neo

The training loss is shown by the blue line, the Chinchilla law prediction in yellow, and the NEO scaling law prediction in green. We fit the Chinchilla law and the NEO scaling law on the 250M, 460M, and 980M models and predict the model behavior on both the training samples and samples from the 7B model.

  • MAP-Neo Scaling Law. We train models with 250M, 460M, and 980M parameters on 1000B tokens of training data. These models are then used to fit the scaling law, which guides the training of a 7.8B-parameter model on 3.07T (3065B) tokens during phase 1. To evaluate the fit of a scaling law, we employ the Huber loss ($\delta = 10^{-3}$) between the actual log-loss and the predicted log-loss, along with the $R^2$ value between the true loss and the predicted loss. Optimization of the scaling law parameters is performed using the L-BFGS algorithm. This procedure is applied consistently across the scaling laws we fit, including the Chinchilla law and the NEO scaling law. By leveraging these methods, we aim to ensure the accuracy and reliability of our scaling law predictions, enabling efficient training of large-scale language models. A minimal sketch of the fitting procedure is shown below.
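The Python sketch below fits a Chinchilla-style form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ by minimizing the Huber loss ($\delta = 10^{-3}$) on log-losses with L-BFGS and reports $R^2$. It is illustrative only, not the authors' code; the NEO law's own functional form (given in the paper) would replace predicted_log_loss.

# Minimal sketch of the fitting machinery described above.
import numpy as np
from scipy.optimize import minimize

def huber(residual, delta=1e-3):
    """Elementwise Huber loss on the residuals."""
    abs_r = np.abs(residual)
    quad = 0.5 * residual ** 2
    lin = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quad, lin)

def predicted_log_loss(params, N, D):
    """Chinchilla-style parametric form, evaluated in log space."""
    log_A, log_B, log_E, alpha, beta = params
    L = np.exp(log_E) + np.exp(log_A) / N ** alpha + np.exp(log_B) / D ** beta
    return np.log(L)

def fit_scaling_law(N, D, loss, delta=1e-3):
    """Fit the parametric law with L-BFGS and report R^2 on the raw losses."""
    target = np.log(loss)
    objective = lambda p: huber(predicted_log_loss(p, N, D) - target, delta).sum()
    x0 = np.array([5.0, 5.0, 0.0, 0.3, 0.3])        # rough initial guess
    result = minimize(objective, x0, method="L-BFGS-B")
    pred = np.exp(predicted_log_loss(result.x, N, D))
    r2 = 1 - np.sum((loss - pred) ** 2) / np.sum((loss - loss.mean()) ** 2)
    return result.x, r2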

Generalization of NEO Scaling Law

The loss curves of the Chinchilla law prediction and the NEO scaling law prediction for the DeepSeek LLMs. We use loss values from both the 7B and 67B models for fitting and prediction.

  • Generalization of NEO Scaling Law. The NEO scaling law is applicable to a broader range of models beyond MAP-Neo. Specifically, in the figure above, we illustrate the fit of the Chinchilla scaling law (yellow dashed line) and the NEO scaling law (red solid line) to the DeepSeek LLMs with 7B and 67B parameters, which are also pre-trained on multiple corpora, including Chinese, English, and code. We observe that for the largest model sizes (i.e., MAP-Neo-7B and DeepSeek-67B), the Chinchilla law tends to underestimate the actual loss when the dataset size $D$ is small and to overestimate it as model parameters and training data scale up. In contrast, the NEO scaling law yields better fits than the Chinchilla law for both MAP-Neo-7B and DeepSeek-67B. A toy usage of the fitting sketch above, checking a prediction on a held-out larger model, follows.
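To mimic this kind of generalization check, the toy example below reuses the hypothetical fit_scaling_law and predicted_log_loss helpers from the earlier sketch: fit on small-model points, then compare the prediction against a held-out larger model. All numbers are placeholders, not MAP-Neo or DeepSeek measurements.

# Toy generalization check using the helpers sketched earlier.
import numpy as np

N_fit = np.repeat([2.5e8, 4.6e8, 9.8e8], 3)       # small-model parameter counts
D_fit = np.tile([2.5e11, 5.0e11, 1.0e12], 3)      # tokens seen at each checkpoint
loss_fit = np.array([3.10, 3.02, 2.95,
                     2.95, 2.87, 2.80,
                     2.80, 2.72, 2.65])            # placeholder training losses

params, r2_fit = fit_scaling_law(N_fit, D_fit, loss_fit)

# Held-out point from a larger model (placeholder values).
N_big, D_big, loss_big = np.array([7.8e9]), np.array([3.07e12]), np.array([2.20])
pred_big = np.exp(predicted_log_loss(params, N_big, D_big))
print(f"fit R^2 = {r2_fit:.3f}; held-out prediction {pred_big[0]:.3f} vs. observed {loss_big[0]:.3f}")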

The performance of base models on different benchmarks

  • Data Quality. MAP-Neo demonstrates significantly better performance on math, code, and complex reasoning by incorporating high-quality data, compared to previous transparent LLMs (e.g., Amber and Pythia) that adopt (presumably) lower-quality data.

  • Gap between our MAP-Neo and other transparent LLMs
    We note that transparent LLMs still significantly lag behind frontier open-weight LLMs of similar sizes (e.g., LLaMA-3-8B and Mistral-7B). In contrast, our MAP-Neo can match or even surpass them on some automatic benchmarks for math, code, and Chinese knowledge. We call for increased participation in the development of transparent LLMs to further advance LLM democratization.

The performance of aligned models on different benchmarks

  • The effectiveness of Iterative DPO. In Table 10, compared to Neo-7B-SFT, Neo-7B-Instruct shows significant improvements on the chat-related benchmarks (e.g., AlignBench, AlpacaEval, Arena-Hard, and CHC-Bench), which further demonstrates the effectiveness of our Iterative DPO.

  • The performance of the chat model
    Table 10 shows that Amber-7B-Chat and OLMo-7B-Instruct perform poorly on chat benchmarks. We assume that the limited capabilities of the base models may severely limit the performance of the corresponding instruction-tuned models on chat benchmarks.

Contact Information

For further communication, please scan the WeChat and Discord QR codes below:
[WeChat QR code]    [Discord QR code]

BibTeX

Please kindly cite our paper if you use our code, data, models or results:


@article{zhang2024mapneo,
    title   = {MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series},
    author  = {Ge Zhang and Scott Qu and Jiaheng Liu and Chenchen Zhang and Chenghua Lin and Chou Leuang Yu and Danny Pan and Esther Cheng and Jie Liu and Qunshu Lin and Raven Yuan and Tuney Zheng and Wei Pang and Xinrun Du and Yiming Liang and Yinghao Ma and Yizhi Li and Ziyang Ma and Bill Lin and Emmanouil Benetos and Huan Yang and Junting Zhou and Kaijing Ma and Minghao Liu and Morry Niu and Noah Wang and Quehry Que and Ruibo Liu and Sine Liu and Shawn Guo and Soren Gao and Wangchunshu Zhou and Xinyue Zhang and Yizhi Zhou and Yubo Wang and Yuelin Bai and Yuhan Zhang and Yuxiang Zhang and Zenith Wang and Zhenzhu Yang and Zijian Zhao and Jiajun Zhang and Wanli Ouyang and Wenhao Huang and Wenhu Chen},
    year    = {2024},
    journal = {arXiv preprint arXiv:2405.19327}
}