The How of the Data Matters

An Apple internship paper shows that reformatting training data before pre-training yields better results

🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.

This Apple internship paper discloses a now-classic technique: reformatting the training data before pre-training yields better results. In this case, that means rephrasing noisy web documents in the style of Wikipedia entries:

[Figure: rephrasing the training data prior to pre-training]

Notably, this process produced a pre-trained language model that was both more efficient to train and more capable.
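
For the curious, here is a minimal sketch of what that rephrasing step looks like in practice, assuming a Hugging Face instruction-tuned model as the rewriter. The model name and prompt below are illustrative assumptions, not the paper's exact recipe:

```python
from transformers import pipeline

# Off-the-shelf instruction-tuned rephraser. The model name and prompt are
# illustrative assumptions, not necessarily the paper's exact setup.
rephraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

WIKI_PROMPT = (
    "For the following paragraph, give a paraphrase of it in high-quality "
    "English, in the style of a Wikipedia entry:\n\n{doc}\n\nParaphrase:"
)

def rephrase(doc: str, max_new_tokens: int = 300) -> str:
    """Rewrite one noisy web document as clean, Wikipedia-style text."""
    out = rephraser(
        WIKI_PROMPT.format(doc=doc),
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_full_text=False,  # keep only the generated paraphrase
    )
    return out[0]["generated_text"].strip()

# The pre-training corpus then mixes the original web text with its
# rephrasings, rather than replacing the real data outright.
web_docs = ["ur gonna LOVE this trick!!1 click here b4 it's gone..."]
corpus = web_docs + [rephrase(d) for d in web_docs]
```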

While the Apple team calls this synthetic data... at this point, it seems like just another step in data cleaning. We have long suspected that the data-gritwork pipelines at OpenAI, Mistral, and other large firms must be pretty impressive, as both datasets and algorithms are fairly generic at this point. Open source, of course, doesn't mean disclosing all of your build secrets... some of which, I'm sure, may be intellectually embarrassing (it works!! but we don't know why it works!).

Become a subscriber for daily breakdowns of what's happening in the AI world.
