2024-03-22: AI Test Automation

How quickly is this going to happen?

🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.

Here’s today at a glance:

🦾 Automate It All

Meta has put an AI code generation tool for test code into production, and this paper describes what they did.

The system is an LLM-driven unit test generator. Once candidate tests are generated, they pass through a series of filters (sketched in code after this list):

  • confirmed to be buildable; this filters out LLM hallucinations that simply don’t compile

  • confirmed to pass; a failing test may be failing because of a mistake in the test or because of a bug in the code under test. Since the goal is fully automatic generation, Meta does not bother to find out which, and simply discards failing tests

  • confirmed to pass at least 5 times, to screen out “flaky” tests that only sometimes fail

  • confirmed to increase test coverage when added; Meta needs a quality metric to judge whether a test is worth adding to the suite, and this is it
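Here is a minimal sketch of what such a filtering pipeline could look like. The function and helper names (compile, runTest, measureCoverage) are assumptions for illustration, not APIs from Meta's paper:

```kotlin
// Hypothetical filtering pipeline for LLM-generated tests.
// compile(), runTest(), and measureCoverage() are assumed helpers.

data class CandidateTest(val className: String, val source: String)

fun acceptCandidate(
    candidate: CandidateTest,
    baselineCoverage: Double,
    compile: (CandidateTest) -> Boolean,
    runTest: (CandidateTest) -> Boolean,
    measureCoverage: (CandidateTest) -> Double,
    repeats: Int = 5,
): Boolean {
    // 1. Must build: discards hallucinated code that doesn't compile.
    if (!compile(candidate)) return false

    // 2 & 3. Must pass, and pass repeatedly, to screen out flaky tests.
    repeat(repeats) {
        if (!runTest(candidate)) return false
    }

    // 4. Must raise coverage relative to the existing suite.
    return measureCoverage(candidate) > baselineCoverage
}
```

Only candidates that survive all of these checks would go on to human review.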

A test that passes all of these filters is a useful addition for regression testing. Meta rolled the system out progressively at larger and larger internal venues, and the prompt below is where they ended up.

Extend Coverage

Here is a Kotlin unit test class and the class that it tests: {existing_test_class} {class_under_test}. Write an extended version of the test class that includes additional unit tests that will increase the test coverage of the class under test.
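For concreteness, here is an invented example, not taken from the paper, of the shape of output the prompt asks for: a toy Calculator class and an extended JUnit test class that adds a test for a previously uncovered branch.

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Assertions.assertThrows
import org.junit.jupiter.api.Test

// Illustrative only: a toy class under test.
class Calculator {
    fun divide(a: Int, b: Int): Int {
        require(b != 0) { "division by zero" }
        return a / b
    }
}

class CalculatorTest {
    // Pre-existing test, as it might appear in {existing_test_class}.
    @Test
    fun dividesEvenly() = assertEquals(2, Calculator().divide(4, 2))

    // The kind of additional test the LLM is asked to append,
    // targeting an uncovered branch (here, the zero-divisor guard).
    @Test
    fun rejectsZeroDivisor() {
        assertThrows(IllegalArgumentException::class.java) {
            Calculator().divide(1, 0)
        }
    }
}
```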

Meta’s tests indicate that roughly 1 in every 20 LLM-generated tests passed the automated filters, after which 5 in every 6 (83%) were deemed acceptable by human reviewers and added to the test suite. Taken together, that is roughly 1 in 24 (about 4%) of generated tests ultimately landing in the codebase.

Meta noted:

  • Tests were often similar, in the sense that the LLM was clearly reproducing training data.

  • Test coverage sometimes increased in classes other than the one under test, because mocking code added by the generated test exercised those classes.

  • Sometimes the LLMs just left TODOs indicating that assertions were still needed, which increased coverage but didn’t actually test anything (see the illustration after this list).

  • Sometimes re-prompting the LLM could have filled remaining gaps in coverage, but the system did not do so.
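To illustrate that TODO failure mode with made-up names: a test like the following exercises the class under test, so line coverage goes up, yet it verifies nothing.

```kotlin
import org.junit.jupiter.api.Test

// Hypothetical class under test; not from the paper.
class OrderService {
    fun computeTotal(prices: List<Int>): Int = prices.sum()
}

// The failure mode described above: the call exercises the code,
// so coverage increases, but the TODO means nothing is checked.
class OrderServiceTest {
    @Test
    fun computesTotal() {
        val total = OrderService().computeTotal(listOf(10, 20))
        // TODO: assert something about `total`
    }
}
```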

They identify several avenues for future improvement:

  • Assessing improvement - how do you judge that one piece of code is better than another? The test coverage metric has obvious flaws, and Meta proposes mutation coverage, i.e. introducing small seeded faults into the code and checking whether the tests detect them (see the sketch after this list), but it is uncertain whether this would work at Meta’s scale.

  • Application-aware generation - the LLMs tend to generate similar pieces of code, which reduces the usefulness of the tests. Meta wonders whether higher temperature settings, which introduce more variability into the output, would help.

  • LLMs are “fashion” followers - they follow coding-style templates very well, and this largely pleased human reviewers… unless the LLMs replicated deprecated standards.
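Here is a worked illustration of the mutation coverage idea; the function and the seeded fault are invented, and on the JVM, tools such as PIT automate this kind of check. Flip an operator in the code under test and see whether the existing tests notice.

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// Original code under test (illustrative).
fun discountedPrice(price: Int, percentOff: Int): Int =
    price - price * percentOff / 100

// A "mutant": the same function with a small seeded fault (+ instead of -).
// A mutation testing tool would swap this in and rerun the suite.
fun discountedPriceMutant(price: Int, percentOff: Int): Int =
    price + price * percentOff / 100

class DiscountTest {
    // Passes against the original (100 - 10 = 90) but fails against
    // the mutant (100 + 10 = 110), so it "kills" the mutant.
    @Test
    fun appliesTenPercentOff() = assertEquals(90, discountedPrice(100, 10))
}
```

By this metric, a test suite that kills more mutants is a stronger suite, which is a more demanding bar than line coverage alone.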

One can hope that, over time, this leads to better, more secure software for everyone.

🌠 Enjoying this edition of Emergent Behavior? Send this web link with a friend to help spread the word of technological progress and positive AI to the world!


🗞️ Things Happen

  • A man, paralyzed from the shoulders down, used his mind to control a computer and play Civilization 6. He was using Elon Musk’s Neuralink. Are we feeling the acceleration yet?

  • The 8-year anniversary of DeepMind’s AlphaGo beating Go world champion Lee Sedol. Go is now a post-AI game: a game, like chess, that has adapted to non-human experts.

Go became even more popular in Korea. And AlphaGo itself has changed what counts as a record-setting game by raising the bar. Teaching Go in the AI era looks very different from the pre-AlphaGo world, because students can learn far more by studying games played by AI.

🖼️ AI Artwork Of The Day

Deep In The Woods - u/MasterSlimFat in r/Midjourney

That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world:
