EB-14: The Code Switcher (2024-06-14)

Here’s today at a glance:

EB-14: The Code Switcher

Who

Junyang “Justin” Lin, the Chief Evangelist Officer at Alibaba’s Qwen open-source language model program.

What

  • Qwen 2, the latest iteration of Alibaba's Qwen language model series, is presented with a focus on its capabilities in coding, mathematics, and multilingualism.

  • Data Curation: The interview dives deep into the importance of data selection and processing, with a particular emphasis on using smaller models to test different data recipes and improving the quality of coding data.

  • Commit Data: The conversation highlights commit data from code repositories like GitHub, which records the reasoning behind code changes and how people fix problems.

  • Benchmarking: The team uses a variety of benchmarks, including MMLU and Arena, to measure model performance at different stages of development.

  • Alignment: The podcast highlights the key concept of aligning pre-training and instruction-tuning to ensure the model performs well on desired tasks.

  • Multilingualism: The team actively invests in acquiring and annotating instruction-tuning data for specific languages to improve multilingual performance.

When

June 7th, 2024

Highlights

  • On the initial mission when they began the project: “We hope the model can model the distribution of human knowledge.”

  • On where they stand in comparison with American AI developers: “Our Qwen2-70B is at least competitive with [Meta’s] Llama3-70B.”

  • On their focus for this model version: “We have some improvements in coding and mathematics… we have paid more attention to multilingualism, specifically 27 languages, especially Western European and East Asian.”

  • On the history of LLM development at Alibaba: “We have been developing for around one and a half years. Our first release was in April last year in mainland China. Then, in August last year, our first open-source, Qwen-7B, we accumulated some experience. Then in November, we have a Qwen-72B. Then, when we come to the next iteration, we have enough experience to solve a lot of problems. For Qwen 1.5, we have a lot of developers.”

  • On how improving data curation in pre-training means the models get better over time: “If you put too much coding data inside the pretraining data set without processing it very carefully, you'll find that the general language capability drops drastically. We suffered from this and did not know how to balance these things. In the next iteration, we found a lot of data in GitHub not related to coding. You need to remove irrelevant data very carefully. You also have some very precious data that you thought was noise. For example, the commit data.”

  • On why commit data is important for coding: “Because it reflects the process, how people think about the code, and how people fix the problems. If you would like to label instruction tuning data for how to fix a bug inside a GitHub repo, it is hard because you need to check the repo and see how many files are in there, which files, and what content is there. You need to have a plan. What's the step-by-step? It is complex, and you have an interaction with the environment. If you do not have enough good data like commit data, you cannot align the model to align with human trajectory for fixing a bug.”

  • On what they learned from Greg Brockman’s Twitter: “Evaluation is all you need.”

  • On why evaluation is so important: “Because if you do not have evaluation, what you have done is meaningless because you don't know where the goal is. You can always find a lot of bad cases inside your instruction tuning data. You just sample something and you'll find that, well, it is rubbish. Why didn't you find this thing? Maybe you have some quality problem inside your training. This is our bitter lesson.”

  • On the GPU rich/poor problem: “Everyone still has the GPU poor problem even if they are very very [GPU] rich because when you have more GPUs you would like to build very, very large models. I think it is very, very good for people to test things on small models at the beginning. We were just so stupid at the beginning of last year because we used a 7 billion parameter model to do data experiments. When we used the 1.8 billion parameter model, it is quite good and we can do 10 times more experiments of data curation.”

  • On leaking test data into the training set: “We are actually quite afraid of contamination. When people find the contamination, we did not intentionally do it. We use 10-grams or 13-grams to filter the similar data. But as for what is actually inside the pre-training data, we cannot fix. Is that really bad for the large language models? If you put a lot of textbook examples into it, there are always a lot of similar cases.”

  • On how they track how accepted the models are: “We visit the customers, we find that they are actually using it, and then we have confidence. You can check the [Hugging Face] numbers, but the numbers do not really reflect things. We just try to feel how users feel about our model.”

  • On the quality of Chinese data: “For web data, English is better because there are a lot of people improving the English data and the Chinese web has not been developing for as long as the English web.”

  • On Chinese data stuck in closed platforms: “This problem is actually quite serious. There is good data inside WeChat, but people cannot access it. Except for Tencent.”

  • On the team’s open and collaborative approach: “At the beginning, everyone is messed up together. Generally, now they separate into groups. Now we have vision language groups, we have audio language groups. They were generally separate, but we try to make them communicate together. We make everyone equal. We did not say I am the tech lead, you should follow me. We say if you have a good idea, we'll follow you.”

  • On their biggest challenge being hiring: “If you have talents, you will have GPUs, you will have the money because they are really the talents. We are not international enough so it is a bit hard to hire foreign people to our company.”
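
For readers curious what the n-gram filtering mentioned in the contamination highlight might look like, here is a minimal sketch. It is not the Qwen team's pipeline: the function names, the word-level tokenization, and the "drop a document on any shared n-gram" rule are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]

def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Word-level n-grams of `text`, lowercased with punctuation dropped.

    Assumption: simple regex word splitting stands in for a real tokenizer.
    Texts shorter than n words yield an empty set.
    """
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 13) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(doc: str, index: Set[NGram], n: int = 13) -> bool:
    """Flag a document that shares at least one n-gram with the benchmark set."""
    return not ngrams(doc, n).isdisjoint(index)

def decontaminate(docs: Iterable[str], benchmark_texts: Iterable[str], n: int = 13) -> List[str]:
    """Keep only the documents with no n-gram overlap against the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in docs if not is_contaminated(doc, index, n)]

if __name__ == "__main__":
    benchmark = ["What is the capital of France? The capital of France is Paris."]
    corpus = [
        "Trivia: the capital of France is Paris, as everyone knows.",
        "Rust has ownership and borrowing rules for memory safety.",
    ]
    # n=5 only so the toy strings are long enough to overlap.
    print(decontaminate(corpus, benchmark, n=5))
```

A longer n (10 or 13, as in the quote) trades recall for precision: it lets through paraphrased test items but avoids discarding ordinary text that merely shares a common phrase with a benchmark.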

Listen Here

EB-14, the fourteenth episode of our podcast, dropped this week. Before I continue, the rules of the game are:

  • Pods that CHART stay alive

  • Pods that get a Follow on Apple Podcasts CHART

So FIRST, CLICK on the link below (opens up your Apple Podcasts app) and click “+Follow” (in the upper right-hand corner)

Then go ahead and listen to the podcast any way you want to on your preferred app, capiche mon ami?

Why

  • Improving Model Performance: The team aims to create a high-performing, general-purpose language model that excels in tasks like coding, mathematics, and multilingual understanding.

  • Open Source: The team believes in the power of open source to drive innovation and collaboration, making their models accessible to the broader AI community.

  • Addressing Challenges: The interview sheds light on the challenges of data contamination, acquiring high-quality data for specific languages, and dealing with code-switching issues.

How

  • Iterative Development: The Qwen series is built through iterative development, starting with small models to test different data strategies and gradually increasing the model size.

  • Benchmarking and Evaluation: The team uses a rigorous evaluation process, incorporating multiple benchmarks and continuous monitoring of model performance.

  • Attention to Detail: The team carefully considers the impact of specific decisions on model performance, such as the choice of tokenizer and the selection of pre-training data.

What are the limitations and what's next?

  • Data Availability: The availability of high-quality data remains a significant challenge, particularly for languages with limited online resources.

  • Code Switching: While progress has been made, the team continues to work on improving the model's ability to correctly handle code-switching.

  • Future Development: The team plans to continue improving the Qwen model series, focusing on expanding its capabilities and addressing emerging challenges in the field.
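
One way to keep an eye on the code-switching issue above is a simple script check on model outputs. The sketch below is an illustration only, not something described in the interview: it assumes "code-switching" here means a response drifting into a writing system other than the expected one, and the function names, the coarse script labels, and the 5% threshold are all hypothetical.

```python
import unicodedata
from typing import FrozenSet

# Coarse mapping from Unicode character-name prefixes to script labels.
# Assumption: name prefixes are a good-enough proxy for script membership.
_SCRIPT_PREFIXES = (
    ("CJK", "HAN"),
    ("HIRAGANA", "KANA"),
    ("KATAKANA", "KANA"),
    ("HANGUL", "HANGUL"),
    ("LATIN", "LATIN"),
    ("CYRILLIC", "CYRILLIC"),
    ("ARABIC", "ARABIC"),
)

def script_of(ch: str) -> str:
    """Return a coarse script label for a letter, or "OTHER" if unknown."""
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    for prefix, label in _SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return label
    return "OTHER"

def unexpected_script_ratio(text: str, expected: FrozenSet[str] = frozenset({"LATIN"})) -> float:
    """Fraction of recognized letters whose script is outside `expected`."""
    labels = [script_of(ch) for ch in text]
    scripted = [label for label in labels if label != "OTHER"]
    if not scripted:
        return 0.0
    unexpected = sum(1 for label in scripted if label not in expected)
    return unexpected / len(scripted)

def flags_code_switching(text: str,
                         expected: FrozenSet[str] = frozenset({"LATIN"}),
                         threshold: float = 0.05) -> bool:
    """Flag a response when too many letters fall outside the expected scripts."""
    return unexpected_script_ratio(text, expected) > threshold

if __name__ == "__main__":
    print(flags_code_switching("The answer is 42."))
    print(flags_code_switching("The answer is 正确"))
```

A check like this only catches switches between writing systems. Switching between languages that share a script (say, English and French) would need a language-identification model instead.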

Why It Matters

  • Open-Source Innovation: The Qwen project promotes open-source innovation, allowing researchers and developers to build upon Alibaba's work and contribute to the advancement of language models.

  • Multilingual Accessibility: The Qwen team's focus on multilingualism aims to make AI technology more accessible and useful to a wider global audience.

  • Real-World Applications: The improved coding and mathematical capabilities of Qwen 2 have the potential to be applied to a wide range of real-world applications, from software development to scientific research.

Additional Notes

  • Junyang Lin mentions the importance of a diverse team for catching errors in data and tokenizers, especially in languages like Japanese and Chinese.

  • The interview highlights the vibrant AI community in mainland China, with companies like Baidu, Tencent, and ByteDance actively developing their own large language models.

  • There's a humorous anecdote about the challenges of choosing a model size, with the team eventually settling on 72 billion parameters due to a desire for a "better-looking" number.
