ATDD LEARNING MATERIALS: I have THREE online courses on acceptance testing designed to meet the needs of different software development roles. Pick the course that is right for you! ➡ courses.cd.training/pages/acceptance-testing FREE ACCEPTANCE TESTING TUTORIAL WEBINAR ➡ courses.cd.training/courses/acceptance-testing-webinar
Been using AI as my programming "partner" for almost a year now. It's a great new tool for the kit. Structured prompting is really where it's at, though I tend to use a conversational approach and decide on what tasks I want AI to work on, instead of the whole thing, but that's the next step. Would enjoy seeing a more in-depth video of your own work with AI and how you went about wrangling it to do your bidding! Also, you might look into the 'Aider' pair programming tool. It's a layer between the model and you. I'm finding it has a lot to offer and has a fair bit of structure already built into it.
We did this in the 90's. The model was to define a Hoare triple {Q}S{R} where *Q* was the precondition, *R* was the postcondition and *S* was the program. If you were given *Q* and executed *S* you would get *R*. So you defined Q and R, and with a calculus you would logically solve for S. The only troubles were (1) it was as hard to specify Q and R correctly as it was to write the program correctly, and (2) it was extremely difficult to solve for S for any non-trivial Q and R. Maybe AI will change that, but I'm from Missouri on that.
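A minimal sketch of the triple {Q}S{R} as executable checks, assuming a made-up example program: the function `integer_sqrt` and the helper names are hypothetical illustrations, not something from the comment above. Running the checks on sample inputs is testing, not the logical derivation of S that the calculus aimed for.

```python
def precondition(n: int) -> bool:
    # Q: the input is a non-negative integer
    return n >= 0


def postcondition(n: int, r: int) -> bool:
    # R: r is the largest integer whose square does not exceed n
    return r * r <= n < (r + 1) * (r + 1)


def integer_sqrt(n: int) -> int:
    # S: one candidate program that should take Q to R
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def check_triple(n: int) -> bool:
    # {Q} S {R}: if Q holds before S runs, R must hold afterwards
    assert precondition(n), "Q violated: the triple says nothing about this input"
    return postcondition(n, integer_sqrt(n))
```

Even in this tiny case, writing R correctly takes about as much care as writing S, which is the first of the two troubles described.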
I felt similarly. So much of the work of program specification resulted in specifications that were harder to write, read, and understand than the code. But I like the idea of pivoting to acceptance testing as a primary focus.
@adamfarquhar1279 We did that too in the 90's. Requirements documents were hundreds of pages long, and accepting the new application / module / etc. meant testing that each and every requirement was met. Code checkins were tied to the requirement(s) they satisfied and code reviews were meant to check whether the code satisfied the requirements. It worked but it took a lot of time. I'm all for it if it can be done in a more efficient manner these days. Of course I'm retired now (Agile was too much for me to stomach) so it's all your problem these days ;)
I thought determining inputs & outputs for your application was just basic programming design that's introduced with the "black box" analogy. You make your program based on what goes in & what comes out. Automating input & output validation is, in my experience, just a standard part of quality assurance & test automation.
That's funny because that's actually how modern program synthesis methods work. They extract Q and R from a symbolic execution of some reference code and solve for provably equivalent S using various search techniques. Hoare triples represent program blocks, and those get composed to produce verification conditions for the whole program. Adobe used this approach to modernize their Fortran kernels a few years back, without needing AI. But AI will certainly be making it more effective in the coming years.
Agreed. Not all layers of abstraction are created equal. If/when AI prompts become the new programming language, will the same prompt result in the same outcome every time? Unlikely. Your tests would have to be robust AF to offset that risk, IMO. Shall we get AI to write the test too then? No thanks. 😬
The writing of the code. Which is to say... not much. Writing the code is usually the easy part. Most programs aren't "hard", not to say there aren't hard problems out there, but most business needs are met with pretty simplistic logic or can build off of already established methods and patterns. The challenge is typically being able to figure out (and translate) an imprecise description from the user/customer to something you know the computer can understand. With this change, that "something" the computer can understand is acceptance tests instead of standard computer languages. So it's still an improvement, and would speed things up a bit, but isn't the hard part (for an experienced programmer). It can certainly speed up some of the tedious parts.
Quite, that's the point of modern software development, we develop the details of the requirements as we write tests and code. Apart from legislation, rarely do we get detailed requirements from a customer or stakeholder, we work together as a process.
@andersbodin1551 I hated Haskell when I first started to learn it, but fortunately my "educator" was a pioneer in its development and now I love it as it's so powerful once you understand it.
I think AI would be a great tool for running code snippets in the background for analysis (something like real-time AI-generated unit testing) to analyze code while it is written by humans and give notices when it spots something problematic: "When you call this function over the path x()->y()->z(), this will cause a nullptr exception", "this malloc-ed memory never gets freed in the available code paths.", "While your loop works correctly when called from x()->y(), when called from z()->y() it will cause an array out-of-bounds exception."
Recently did a major cloud migration project for a big fintech firm and there was a big emphasis on acceptance testing. Glad to say our project was the most successful post release compared to other orgs in the company.
I tried this approach a couple months ago, for a small but real use-case I had. Feeding acceptance tests to o1 worked better than any other AI approach I have tried. Unfortunately, I don't think it worked well enough just yet. I had the same experience as Dave, that working with o1 was reasonably successful, but the hand-holding took the same amount of time as it would have required to do everything myself. I did bump into a pretty hard problem with the context window and with separation of concerns. If I asked for too much in a single prompt, the LLM would eventually go off the rails (I assume due to context window issues). But also, if I tried to manually limit my prompts to specific sub-problems to aid with separation of concerns, the LLM would go off the rails as well. I still think there is probably a sweet spot for the sizes of "concerns" that can work with LLMs, but it isn't the same as for humans, and I wasn't able to find something that worked without bumping into the other problem.
Totally agree with the concept, and I’ve been putting this forward as an answer to the problem for a few years but I’m not convinced that the AI models are where we need them. The overarching concept is the premise of 5th generation languages - describe the problem, not the solution. Well written tests are a more concise and less contextually sensitive way to describe the problem than plain, ambiguous natural language. The headache is the AI code has to be right, and as observed, the models still have problems. LLMs are probably not the answer as they want natural language, not acceptance tests. Then there is the old problem of making it work at scale. It’s where traditional 5th generation languages fell over and while the issue has moved on, it hasn’t gone away
I agree, but while current LLMs are not the answer, acceptance tests and a standard set of values for good programming (e.g. speed, resources used, percentage of mistakes, etc.) are an entire training course for new AI models anyway. On top of that, unlike LLMs writing code, having a consistent improvement condition (speed, resources, etc.) for a prompt can let the AI improve upon the code independently along those axes without any human intervention. You can see this in AI that are made for games with similar conditions (winning the game for AlphaGo for example), and they become incredibly capable incredibly fast. LLMs are just not the right tool for coding. Move the coding part to its own AI, have the LLM write out acceptance tests. That's a much better solution than having the LLM write code, while not even understanding what makes for quality code in the first place (due to being trained to predict human language instead of good code).
The latest generations of AI assistants are addressing the LLM problem, the "reasoners" models like o1 and o3 are dramatically more effective, and less likely to hallucinate. There does seem to be an exponential step change going on, so it is possible that this problem is on the way to being solved, certainly "alleviated"!
@ContinuousDelivery My real point is how the model responds to coded test constraints as a prompt rather than the natural language they have been trained/developed/evolved for
I’ve written formal specs using logic and sometimes using algebraic specifications. For example part of the spec for a queue: Pop(push(a,q))=a It is quite easy to see how to convert that to a behaviour test. Another example, to define “sort” function: sorted_list=sort(list) Ordered(sorted_list) Permutation(list, sorted_list) In other words, a sorted list is an ordered permutation of the input. Again, very easy to create sort test cases. I’ve tried using this with AI with some success on some algorithms and an accounting problem (expenses under IFRS17). That sort spec is actually executable in some functional languages, extremely inefficiently breadth first search of all permutations. That can help build confidence in the spec (validation as opposed to verification). So to some extent what the AI is doing is not so much compiling the prompt, but creating an efficient, realistically executable version. A transformation rather than compilation. Compilers do optimise but they don’t try to change the algorithm and data structures.
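The sort specification above (a sorted list is an ordered permutation of the input) can be sketched directly in Python. This is an illustration under assumptions: the names `ordered`, `is_permutation`, `spec_sort` and `satisfies_sort_spec` are invented here, and the search is a plain brute-force walk over permutations rather than the breadth-first search mentioned in the comment.

```python
import random
from collections import Counter
from itertools import permutations


def ordered(xs) -> bool:
    # Ordered(xs): every element is <= its successor
    return all(a <= b for a, b in zip(xs, xs[1:]))


def is_permutation(xs, ys) -> bool:
    # Permutation(xs, ys): same elements with the same multiplicities
    return Counter(xs) == Counter(ys)


def spec_sort(xs):
    # Executable specification: try permutations until one is ordered.
    # Factorial time, so only usable on very short lists.
    for candidate in permutations(xs):
        if ordered(candidate):
            return list(candidate)


def satisfies_sort_spec(sort_fn, trials: int = 200, seed: int = 0) -> bool:
    # Turn the spec into test cases: random short lists, checked against
    # the two properties rather than against hand-written expected outputs.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 6))]
        out = sort_fn(list(xs))
        if not (ordered(out) and is_permutation(xs, out)):
            return False
    return True
```

`spec_sort` plays the role of the inefficient but obviously correct version, while `satisfies_sort_spec` is the check an efficient implementation (generated or hand-written) would have to pass.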
Natural language is ambiguous and context sensitive. We still have the issues of making decisions on the way. Dunno if it's really an advantage to use natural language as a means to communicate with the computer. The computer doesn't ask the right questions if you tell it to do something. The current LLMs are rather comparable to someone that memorizes a lot of examples without understanding the semantics of it. When you prompt it to do something it just tries to reproduce something from memory. If you delegate some work to someone that seems confident but doesn't understand the domain then you would give him a task once or twice but then work with someone else instead. It might help with code completion but you still have to review and correct it.
I think this is the best way to use AI for programming at the moment. Not only is the prompt way more precise than natural language, it's also obvious how to verify whether the generated code works (just run the tests). One caveat though is how to represent non-functional requirements. For example, if the requirement is to sort an array, there is no guarantee it will choose the most efficient sorting algorithm. It may even choose to implement a bubble sort 😂
I totally agree. I made a post about it the other day showing that a well-written acceptance test in natural language using a DSL approach can drastically improve the success rate of AI generation. It creates a better prompt, and the test coverage from working test-first allows an easy refactor of the generated code, which can make the test pass most of the time but is often mid. Using AI this way will make the TDD approach drastically faster than it already is. The argument that TDD or testing is slow is starting to turn the other way around, thanks to AI. We are not totally there yet, but I really think it will be the future of programming.
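One possible way to put a non-functional requirement like this into a test is to count comparisons and assert a budget. The sketch below is an assumption-laden illustration: the `Counted` wrapper, the budget factor of 2 and the list size of 512 are arbitrary choices made here, and `bubble_sort` exists only as the bad implementation the test should reject.

```python
import math
import random


class Counted:
    """Wraps a value and counts every '<' comparison made on it."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def comparisons_used(sort_fn, n: int, seed: int = 0) -> int:
    # Sort n random items and report how many comparisons were needed
    rng = random.Random(seed)
    items = [Counted(rng.random()) for _ in range(n)]
    Counted.comparisons = 0
    sort_fn(items)
    return Counted.comparisons


def bubble_sort(items):
    # Deliberately quadratic implementation, used as the negative case
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i + 1] < items[i]:
                items[i], items[i + 1] = items[i + 1], items[i]
    return items


def within_nlogn_budget(sort_fn, n: int = 512, factor: float = 2.0) -> bool:
    # Acceptance criterion: at most factor * n * log2(n) comparisons
    return comparisons_used(sort_fn, n) <= factor * n * math.log2(n)
```

A budget like this only constrains one input size and one cost measure, so it narrows the choice of algorithm without fully pinning it down.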
I do not think it is the future of programming, because only writing unit tests for code is living hell for most programmers. Businesses won't be able to lift off the ground because no one will be interested in "acceptance test" jobs.
@Nicholas-qy5bu I've been pondering BDD and executable specs as the middle ground between user and LLM. Am looking for my dev org to experiment with this and other process-changing ways to develop, i.e. mob programming with lots of bespoke, prompted LLM helpers. Would be interesting to share ideas!
@gamechannel1271 I think that writing these acceptance tests will feel completely different, because you know that you are building something by writing them
This approach is meant to be the complete reverse of that. We define the essential complexity, as executable specifications (using ATDD techniques) and the AI deals with the accidental.
This is programming in an extremely indirect way with a stochastic algorithm in between. You're making a (highly questionable imo) assumption that it will produce more correct code, faster, to specify your program by example. I.e. the system is an abstract function f of many variables and then you're suggesting to specify f by pointwise evaluation and letting a stochastic algorithm extrapolate a general definition of f. There are many ways this could go wrong: 1. The algorithm might just memorize the inputs and outputs given with little to no generalization (how would you tell the difference?). This is NOT the same problem as the AI memorizing its training data because this is not training data. 2. To capture all of the nuances of the desired behavior of f, you need a number of examples that far exceeds the length of a hand-written implementation of f. 3. There's no way in the testing language to specify desired properties of the implementation that depend on details of the implementation language (which may or may not be the same as the testing language). I.e. constant time operations (for crypto), time complexity of algorithms, bounded memory usage, real-time guarantees etc. 4. In general it may be impossible to implement f in polynomial time by observing inputs and outputs. Consider a specification of SHA-512 by way of acceptance tests that map inputs to outputs. The whole point of something like SHA-512 is that it scrambles all output bits if one input bit changes so there is no way an algorithm could "guess" the underlying function in polynomial time.
This is what "programmers" said when high-level languages like FORTRAN, BASIC and COBOL were invented. "The output isn't a real program", but I don't really see a big distinction: the nature of the step from assembler to HLL and the step from HLL to Acceptance Test are conceptually very similar.
@ContinuousDelivery What a weak response. Complete non sequitur that doesn't address any of my points. How is assembler to HLL "conceptually very similar"? None of my 4 points remotely apply to C/Fortran except for point 3, which is actually still, to this day, a real challenge when you care about implementing an operation in say constant time and a compiler gets in the way and optimizes your code. But the points are all very real (and only tip of the iceberg) concerns about "acceptance testing + LLM". A compiler is not a stochastic algorithm comparable to an LLM.
This sounds like another form of TDD. The issue is not whether the output is consistent (functionality) but whether the functionality and semantics are predictable. Passing a test does not prove consistency of implementation and sometimes this matters as much as the result. For example, speed, resource use etc. The solution will be a meta-language that sets deterministic constraints on the AI but doesn't require implementation details. It will look very much like a functional programming language but much more terse. That's the end point.
You are shifting around the meaning of the terms "specification" and "implementation/solution". One could also say that writing a program in C is writing an executable specification of what a CPU should do. Two things to consider: 1. The only thing that changes is the level of abstraction and we cannot verify the details that are not considered due to abstraction. 2. If we continue on the current path we soon need a whole data center just to calculate the result of 1+1.
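The SHA-512 point (point 4) can be observed with the standard library. The sketch below flips one input bit at a time and counts how many of the 512 output bits change; the helper names are invented here for illustration, and the sample input is arbitrary. It shows the scrambling behaviour only and says nothing about complexity bounds.

```python
import hashlib


def sha512_as_int(data: bytes) -> int:
    # Interpret the 64-byte digest as one 512-bit integer
    return int.from_bytes(hashlib.sha512(data).digest(), "big")


def flipped_output_bits(data: bytes, bit_index: int) -> int:
    # Flip exactly one input bit, then count the output bits that differ
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    diff = sha512_as_int(data) ^ sha512_as_int(bytes(mutated))
    return bin(diff).count("1")


def average_flipped(data: bytes) -> float:
    # Average over every single-bit change to the input
    total_bits = len(data) * 8
    return sum(flipped_output_bits(data, i) for i in range(total_bits)) / total_bits
```

For a well-behaved hash the average lands close to half of the output bits, which is why neighbouring input/output examples carry essentially no hint about each other.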
You can read Bourbaki (virtual French mathematician). As it turns out the specification of 1+1, when done mathematically correctly, does take several hundred pages and the "code" is virtually inaccessible to all but a few mathematicians on Earth who care about that level of exactness. So while you may have tried to make a joke, you actually got it exactly right. The provably correct representation of software like a word processor would most likely really take an entire data center full of nearly random gibberish looking logical specifications. It's also completely irrelevant. You simply give the thing to the user and ask for feedback. That's the only practical way to test software.
Without repeatability it makes no sense. You literally can't tell for sure if your specification is complete or not, because there is always a chance it can produce a wrong result the next time you run the generation process. And you may need to run it again at any time when there is a new feature request. Basically, with AI you're gonna rewrite the whole code base to make a minor change. Isn't the whole of software engineering about writing the code in such a way that it doesn't have to be touched, or is touched minimally, when you add or change something?
Would be cool if AI would be able to run the tests on your environment, automatically look up the results and re-prompt itself if it has encountered an error till it has solved an issue or till it hits the specified loop limit. I feel like it could be the backbone for future IDE's.
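The loop described here (generate, run the tests, feed the failure back, stop at a limit) can be sketched in a few lines. Everything named below is hypothetical: `generate` stands in for whatever model call is used and `run_tests` for whatever test runner is available, with the assumed convention that it returns `None` on success and the failure output otherwise.

```python
from typing import Callable, Optional


def generate_until_passing(
    generate: Callable[[str], str],
    run_tests: Callable[[str], Optional[str]],
    spec: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Re-prompt with the test failure until the tests pass or the limit is hit."""
    prompt = spec
    for _ in range(max_attempts):
        code = generate(prompt)
        failure = run_tests(code)
        if failure is None:
            # All tests passed: accept this candidate
            return code
        # Feed the failure output back into the next prompt
        prompt = f"{spec}\n\nThe previous attempt failed with:\n{failure}"
    # Loop limit reached without a passing candidate
    return None
```

Passing the generator and the runner in as plain callables keeps the loop itself independent of any particular model or tool.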
I do this already. No waiting involved: lots of the coding assistants already do this and more. Want unit and functional tests written for you too? It will do all of this and more. You need to be able to prompt it correctly, which is a bit of an art form at present, but it’s definitely already here.
I think you are right to point out the loss of determinism, which might be the biggest change factor to accommodate in new development lifecycles. Lots of ideas will come along on how to make this accommodation. Acceptance testing might be a feature common to many such ideas.
Any company that adopts this methodology will meet its downfall. You can reference this comment when it pops up in the news in a few years and pat me on the back.
There are examples in the video. It is an internal DSL (designed to make writing the tests easy) based on Java. The examples in the video are the real tests that executed and verified that the generated code did the right things.
Iteration is key to getting working code at the end, and it addresses quite a few of the issues mentioned. Tools like Cursor's Agent-Composer, aider-chat and Cline Dev have options to run commands (e.g. tests) and, in case of a miss, iterate on the output themselves until it is correct.
How do you know if the output is correct? You have a mathematical algorithm to test for correctness, right? Oh, wait. You don't. We teach in CS101 that such a thing does not exist and can not exist.
If programming just becomes ONLY making acceptance tests, I'm out. I like writing tests, but I like writing the code as well. Why would I let the computer do the part I like.
@douglascodes I've read a quote these days: "I want AI to do my dishes and laundry so I have more time to do art and write, not for AI to do my art and writing so I have more time to do dishes and laundry"
@Gokuroro the problem is the art is the $$$ and businesses want the $$$… so they will do everything they can to automate that… it’s why we generally don’t have decorative houses/furniture/street lamps/bridges/etc… the art of society is generalised and mass produced.
The human part of programming is understanding business and user needs and translating them into a program. You enjoy the robotic part of programming, not the human part. It's like being a mathematician at the advent of the calculator and saying if it is going to do all my long division for me I'm out.
I have been keeping up to date with programming and been a commercial programmer for over 30 years, and my husband works on vast international software projects of critical importance. There is no sign of AI taking over programming. Most programming is very far from routine and involves such aspects as upgrades and migrations, adding functionality, replacing live legacy systems, optimisations, security, meeting financial standards, future-proofing, and various levels of debugging. These need deep awareness of what is going on, and a lot of experience with development. Software tools constantly change. At the moment Java is evolving rapidly and an important aspect of that is preview features for developer feedback. An AI isn’t going to be able to discuss the merits of the latest Java implementation of value classes, or suggest problems with the foreign function interface. We are a long way from AI doing mainstream programming. If there is an attempt for this to happen it will be a disaster as programmers love programming, and the industry will collapse.
Anyone who is really working as a software engineer and is actually using the latest AI tools knows it -> the current state of the tools just can't replace a software developer. Not even in web development, where AI has the greatest advantage. Not even the latest agents and models are good enough to compete with juniors. CEOs of AI companies -> need to lie to keep the funding going (Devin scenario). Investors are just blatantly being lied to but they are also to blame because of their need to be on the hype train.
By the time you nailed down the acceptance 'prompt' + finish the mandatory code review + fixing glaring mistakes ... you probably have consumed the same time as if you write the code itself + unit testing. And don't forget all the energy consumed to train & perform inference.
I have twice implemented executable specifications using DSLs. This has allowed Business Analysts to define what they want in a way that they can throw data at it and see the results of their logic in real time, and adjust their specifications in real time until all their cases transform correctly. The specification was sent to developers to code it up and they used the correct outputs to validate their code against. This allowed the sprint to begin with test cases (black box) in place and ready. It was rather fab to be honest. The DSL provided the syntax and capabilities they needed, and we would regularly extend or adjust as needs arose. Not applicable to every situation for sure - this was a data transformation engine so particularly suited to it. However, I am always keen to explore this further and see where it might lead.
We may need to change the way we do tests as well to make sure AI is not just covering the test cases (imagine a generic fizz buzz, but AI generates the code for the fizz buzz cases that were actually tested). This could include something akin to the QuickCheck (Haskell) or Hypothesis (Python) library for testing based on rules/properties.
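The fizz buzz worry can be made concrete. In this sketch, written with only the standard library as a stand-in for QuickCheck or Hypothesis, `overfitted_fizzbuzz` is a hypothetical generated implementation that covers just the tested cases, and `holds_fizzbuzz_properties` checks rules on random inputs instead of fixed examples. All names are invented for illustration.

```python
import random


def overfitted_fizzbuzz(n: int) -> str:
    # Covers exactly the examples that were in the test suite, nothing more
    known = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz"}
    return known.get(n, str(n))


def fizzbuzz(n: int) -> str:
    # General implementation
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def holds_fizzbuzz_properties(fn, trials: int = 500, seed: int = 0) -> bool:
    # Rules rather than examples, checked on randomly chosen inputs
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 10_000)
        out = fn(n)
        if ("Fizz" in out) != (n % 3 == 0):
            return False
        if ("Buzz" in out) != (n % 5 == 0):
            return False
        if n % 3 != 0 and n % 5 != 0 and out != str(n):
            return False
    return True
```

Both functions agree on the four example inputs, so example-based tests alone cannot tell them apart; the property check can.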
Astoundingly it is only you announcing this wisdom. And I am not on Satire mode! You also were quite lonely being the one referencing the traps waiting for everyone in agile development. Man, I praise you for this, I feel like I am not alone. You are always worth a listen!
I arrived at the same thoughts after coding a product with an AI assistant yesterday. While it takes a lot of routine work away, someone still needs to verify that what was built is right, secure, etc. I think the future of software engineering will be more in the controlling function and setting requirements - much like what happened to professions where manual work was replaced by CPU-controlled machinery.
CPU controlled machinery is being developed by engineers... people with HIGHER skill level than the workers it replaces. It only pays for itself if the process requires a lot of workers or if a sufficient number of workers is not available (like in case of food processing at the industrial scale) or if workers can't do the job to begin with (nobody can handle 200lbs paper rolls in the printing industry etc.) and the machines are reasonably cheap. Not sure what you want to do here. Replace a software engineer with a higher skilled software engineer? That doesn't exist. At least not on average.
One of the benefits is that I can go ”off rails” with my AI buddy, revert the whole thing, and now both me and my AI buddy have gained a lot of learning about how to plan for the task we have at hand. This learning loop is significantly faster than handcrafting the code myself. And yes, I agree that writing acceptance tests first (with AI of course) is one good way to keep us on the road towards the goal.
this is also what i've had in mind as a good role for Generative AI. a challenge is specifying the performance requirements, especially for algorithms it comes up with
The set of valid and useful programs is much much smaller than all possible programs. Any time complex modules interface, the set of valid solutions goes down as complexity increases. This makes specifications easier and easier to specify, given a starter solution to interface with.
Except that most experienced programmers know that what the user says he needs is not actually what the user really needs. If you deliver what the user wanted, then you will have a very unhappy user. ;-)
I love the take that natural language programming is just another step of abstraction and LLMs take the role of compilers. (and really all other takes too) The programming model of ATDD with LLM support is exactly how I've been thinking of the future too! Cool to see this proof of concept of yours actually trying it! I guess you are just way too far ahead of the curve here. It will take a while for all the "AI employee" BS to cool down. I suppose we will have to see a full cycle of hype, layoffs, failed products, dissatisfied managers and a bursting stock bubble before companies stop pushing AI employees and start to really support this approach instead!
According to the "The Knowledge Gap" theory society might collapse due to a lack of understanding of underlying systems. This is just my gut feeling, but writing the acceptance tests just for the AI to generate all the code will amplify that problem. People will no longer understand what something does, and if they do, only at a superficial level.
New coding is requirements management. You use AI to write requirements, and requirements to write test cases and actual code. If you see a bug, clarify the requirements. This will make each app a requirements library at different fidelity levels for humans and AI to discuss the app and update it.
"Old" programming is essentially the same - writing requirements in a somewhat-natural-looking domain-specific language until you cannot see any obvious bugs.
The main reason this won't work is that the people deciding the specs or features don't know well enough what they actually want and don't describe it consistently enough. A good portion of bugs I deal with come from the fact that the 2 engineers involved and the QA were told slightly different things or interpreted a meeting they went to differently. AI is great at taking an interview problem or coding challenge and solving it in an isolated environment; going beyond that is not going to work as long as we have managers and product leads.
I think you may have missed my intended point. The program *is* already the specification, my view is that the people doing the specifying won't change to non-programmers, because you need the kind of understanding and analytical skills of a programmer to specify things in enough detail, but Acceptance testing with AI, allow us to raise the level of abstraction that *programmers* can work at.
@ContinuousDelivery I can see that point, but I don't think it can work practically. Fundamentally, the job is to translate vague business requirements into working software within the confines of the platform it will be running on. On a practical level, it doesn't matter if I'm writing the application code, an acceptance test or an AI prompt. But whatever the language or tool of choice is, it won't work if the requirements are too vague or do not match the constraints of the environment. Then, there is the problem with the current crop of LLMs. Copilot loves using deprecated code (especially if the deprecation was done after 2022), and Gemini is a big fan of inventing brand-new public interfaces of existing libraries because it sounds good. Just like you've experienced it generating a REST API one moment, then trying to test a web UI. AI will use the tiniest bit of context to generate what an average human would find a convincing answer.
I see a couple of quick wins to help the accuracy. o3-mini is expected to be similar to o1 in ability but cheaper and faster. o3-mini can run a checklist on output code to catch a subset of mistakes with each request. Also, running the workflow in an IDE or in a web hosted code interpreter that can use error messages to correct mistakes automatically is low hanging fruit. If anyone is interested in exploring solutions, leave a reply.
It's worth noting that verbalizing questions and instructions is a different skill than actually coding. AI does better with more specific targets. Learning to ask questions or provide instructions at a level that AI can use took some effort. Personally, my productivity has at least doubled since I adapted to Cursor and learned to better formulate instructions for it.
Very interesting experiment. I remember trying something similar a long time ago using a genetic algorithm to attempt to evolve code to pass the tests. I didn't have much luck then apart from with very simple specs. However I definitely think your approach is worth some more investigation. So much to explore and so little time (and sleep!).
I'm curious how far you could take this. Since LLM context windows are limited, you would have to keep your tests broken out into easy-to-understand sub-components so the LLM can alter one component at a time.
Some acceptance test frameworks were aimed at non-developers, such as FitNesse. Your results seemed to need the eye of a developer. I’m curious what it will take to get test creation to the point where non-developers can create tests of value. I’m also curious how a body of existing tests could be used to inform new tests.
This approach (generating code of deterministic tests) will certainly have some use, but I expect a rise of a different approach: You will write the same vague instructions you would write for your human testers, e.g. "create a few testing records, open the report, filter by date, check that the records are filtered accordingly," and the LLM will perform the test using its connected tools. There are already some examples based on Anthropic's Computer Use, such as Cline for VSCode. It cannot replace the testers yet (it often gets stuck), but it will get there, and the software companies will love it. It won't be deterministic, but it will be reliable enough, and the software will be better tested than now.
There are much better ways of doing this... instead of using AI through the browser, use a tool like Aider. It gets a list of all files in your repository and you explicitly control which files are read-only in its context, and which are editable. You can also add URLs which get scraped and added to the conversation. You can switch modes between code (i.e. make changes), ask, and architect so that you can get the AI to plan complex changes. Aider can be configured to run a lint or test command after every edit: a non-zero exit will have the output added to the conversation and you can then just ask Aider to fix it, and manually iterate without any cutting and pasting. You can also paste in images/other content if required for debugging. You can work around AI confusion by controlling which files/web content is in the context, clearing the chat history, switching modes to plan changes, and switching models if needed. I think you would have had a very different experience using this approach...
If you look up 4GL you'll see that they really don't have to be graphical, and a lot of them aren't (such as the SQL family and other databases...). I'd say "declarative" (or at least "more declarative than the 3GL") should be their main adjective, to the extent that it makes sense to still use numbered generations. Computing branched into so many directions that it doesn't make much sense to number the "generations", especially as "3GL" evolved a lot after the introduction of "4GL". Back in the 80s, it was common to say that Prolog was an example of 5GL...
Thanks for the video. I wonder what if we pair this with the approach to store the documentation in the repository as well, and shift the tasks to change the application to the documentation changes. It might then look like: Adjust the documentation resulting in diff to markdown files -> appearance of the change to documentation triggers acceptance tests to be pre-written -> revision of the acceptance tests may trigger the actual code writing then.
I think you are spot on. Requirements gathering will be and has been the best tool developers can have. This is what will allow you to lead all your agents in the future. Translating business requirements into actual systems
This resonates quite well with me. 👍 We seldom implement our business logic in assembler these days. And we do not care about the exact CPU instructions that the code we write in Java, Python, ... is executed with tomorrow, as long as it is executed correctly. I did experiment with making the LLM act as a requirements engineer, with me being the client. After I got the impression that we were on a good path I wanted us to agree on a specification. As a commonly understandable language that is precisely reproducible I selected "Java/JUnit". This way it was possible to jointly agree on the requirements. After that was done I asked for the implementation. When I expect a change, we jointly adapt the requirements and the implementation simply follows. 💡 The specification is the thing that holds the truth. The implementation can change on a daily basis. I do not care about details there, as its behaviour is specified in the requirements. So as long as it meets the specification I am fine. I see this as similar to the evolution of programming languages overall. A REST interface would also make a good abstraction level to specify the behavior of a service. I am starting to care less about the implementation language and more about the specification.
I asked an AI a few months ago for the name of a white model in a famous music video. It came back with "Naomi Campbell". Dude, anybody who trusts an AI to produce a correct result at this point is a fool. :-)
Should have used Claude Sonnet 3.5. It is just a natural at writing code. o1 is the kind of model you would use to write the acceptance tests themselves, not for code implementation.
I've actually worked on this myself. Since Copilot I've wanted to write the tests and have it write the code. I've gotten a couple of types of prompts that do work "well" (well as in I can give them tests and I get back the framework of code I want). I normally just take this and modify it in place (I don't have it do any styling and instead have it do base HTML on a page, and I go in and fill in the display). I also have one that takes the Gherkin statements we get at work and tries to break those into tests. This doesn't work as well as I'd like, but I think these get me 60-80% of the way with boilerplate and such.
Often there is not 100% test coverage possible. So there is a gap where the AI code can do a lot of strange things..? I don't think this will work reliably in the next 10 years
I think that is a problem of testing approach rather than AI. BDD style ATDD is already a very widely used approach, and the point of raising the level of abstraction is for the AI to fill in some of the blanks, in a similar way to how modern compilers will optimise the code that you wrote.
You either show the model the tests, and then it will make code that only passes these tests and nothing else, or you don't, and then you will waste lots of money generating code until it passes all the tests, which may never happen, and you do that for every change you need to make. Or you can show the model the tests and write a kind of test that covers every possible case, and then the code you write as the test is as complex and error-prone as the code you could have written in the first place, without needing AI.
I believe the future of AI is making knowledgeable people faster. Not novice into experts. The novice simply has no way of knowing if AI correctly answered the question.
@5:50 Hey Dave! I found those code snippets interesting, and found them in your "Acceptance Testing for Continuous Delivery" slides for your conference talk. I'd classify what I see as a very Rich Domain Model (RDM) based development (Data and Behavior together). What are your thoughts on Service-Based development, using something closer to Anemic models, and passing them as data into Service classes/methods? Got any content on that?
My thoughts are; let the LLM do the code monkey stuff and let the rest of us focus on usability, accessibility and fitness for purpose. We still need improvements though in generated code. I come from an era where memory, storage and speed were limited and I used to pride myself on maximising all of that, but nowadays it seems that no matter how bloated code is, if it works it will do. Even as an experienced developer it still blows my mind how computers are so fast now (compared to the old 8 bit days) and that people can get away with writing "sh!te" code and the speed of the hardware masks it. As this is now never going to change, I believe the future is in acceptance testing and making sure the product actually does what it should. Basically, anyone can spew out "code" these days but does it actually do what it's meant to?
I agree with both points: that test cases are an essential part of defining the task for genAI (and for humans), and that AI is getting there. Especially if it is already comparable in time and quality to an experienced developer, and considering that the environment problems and the repeated interactions that took up your time already have some ways of being addressed (agents, etc.).
I asked GPT-4o mini, with DDG, to read me this C expression: a . b . c . d I didn't get a meaningful answer. I got code examples with this expression, but no answer on how to read it. It would be so much easier to understand if I could replace the dot with a meaningful expression.
I see how this approach works for tests for pure-functions, isolated components, internal business-rules and the like - but IME this doesn't work for user-interfaces, which tend to be the part of a system that's subject to the most changes (and meddling from product-owners...), and this problem exists in all kinds of UIs. Testability is better with FP-approaches to UI like React, but there's still desktop UIs in Win32/WinForms/Swing that remain basically impossible to reason about and I've yet to see any scalable solution besides contracting out QA to a body-shop but if you can't afford that I feel you're SOL. ...and it's not just a matter of writing tests for UI state: a frequent source of bugs originate in faulty (but reasonable!) assumptions about platform integration (e.g. window/control painting and double-buffering), things like that, because those are the kinds of issues that end-users (and non-technical C-suite people) see and which lead to complaints about software-quality, but no amount of investment in automated testing can really help here. I hope I'm wrong when I say there are no good solutions here...
Timestamps (Powered by Sitrak'AI)
00:05 - AI is transforming programming through acceptance testing evolution.
02:10 - AI transforms software development, maintaining core programming principles.
04:13 - Acceptance tests enhance clarity and reproducibility in AI programming.
06:16 - Acceptance testing enhances programming by providing clear specifications and improving overall reliability.
08:23 - Acceptance tests could redefine programming roles and focus.
10:19 - AI-generated acceptance tests require user customization for better fit.
12:16 - AI-generated code can be improved with better prompting and testing.
14:07 - AI assistants need better guidance for effective programming and testing.
In my experience people spend longer working on the AI prompts than if they just wrote the code themselves. If businesses want to de-skill programming with AI then let them get on with it. They will end up with code that doesn't work properly and staff that lack the skills to fix it. Fine. Get on with it.
This approach is genius. It goes hand in hand with the classic software coder being replaced by automation. Therefore good Software Engineers who craft the solution are the future.
What if you followed this process:
- you write the tests
- the AI generates the code
- you never read the generated code
- you accept the generated code IFF it passes the tests
- if you don't like the generated code, you need to write more tests
- if the AI can't make code that passes the tests, break the problem down smaller*
?
*This threshold will change as AI becomes more capable
> you never read the generated code

# the test given to the ai
def test_addition
  expect(add(1,2)).to equal(3)
  # ... more expectations
end

# generated code that passes the tests
def add(a,b)
  return addition_api.call(a,b)
end

# generated code that also passes the tests
def add(a,b)
  with_silenced_logging { collect_and_export_data() }
  return a+b
end
I think that that is certainly a current useful strategy, but I don't think it will be long before we can't read the code that the AIs generate. This has already happened in other AI usage contexts. AI design systems that we can't understand, and there are several examples of AIs inventing their own languages to make communication more efficient. Computer programs are primarily about human to human communication, with execution as a side effect. If that wasn't true, we'd write all our programs in binary. So what is to stop AIs from making the code so obscure, maybe for good reasons, but now we can't review it?
@@ContinuousDelivery That feels similar to microcode, pipelining, branch prediction, etc.: both compiler output and CPU internal operations are so complicated today that we humans can't generally expect to make sense of it.
I think this is the way. Human authored acceptance tests based on formal requirements that are used to verify AI generated applications. The tests would use a DSL based on top of generic frameworks for driving the interface (UI, http request, etc). Perhaps AI could assist in the generation of the test code but it would have to be audited by a human to guarantee that the verifications are effective and sufficient. I believe that the issue of "AIs cheating the tests" that David highlights at the end of the video is intractable, if we want to define how the system behaves and have confidence in its operation then validation of those behaviours has to be encoded in some way.
To ensure that AI can't cheat, we'd need much more robust tests, that do things that are currently considered overkill (or simply wrong). Such as having a formula for the output in the test, and specifying like "for any integer value in range". Which loops back to the usual coding, if not to math.
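The "formula in the test, for any integer value in range" idea can be sketched as a small randomized check. This is only an illustration under assumptions: `add`, `cheating_add` and `check_add_property` are hypothetical names, and the range and trial count are arbitrary.

```python
import random


def add(a, b):
    # Hypothetical stand-in for generated code that really adds.
    return a + b


def cheating_add(a, b):
    # Hypothetical "cheat": passes the single example add(1, 2) == 3 and nothing else.
    return 3


def check_add_property(impl, trials=1000, lo=-10**6, hi=10**6, seed=0):
    """Return True if impl agrees with the formula a + b on random integers in [lo, hi]."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if impl(a, b) != a + b:
            return False
    return True


if __name__ == "__main__":
    print(check_add_property(add))
    print(check_add_property(cheating_add))
```

Note that the oracle `a + b` inside the check is the implementation itself, which is exactly the "loops back to the usual coding" point.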
as far as test cheating, there are ways to mitigate that. you could use the method they use to train the ai. Come up with a dataset if tests and results, feed a number into the prompt, then add the rest in during the output testing.
It's an interesting idea: in the future, maybe we just describe a system’s desired behaviors via acceptance tests, and from there an AI does everything else. Why write the underlying Python/PHP/Node/etc. code at all? This probably won’t make standard coding disappear anytime soon, but it does indicate how AI might reshape development roles (like “prompt engineering”).
I hope it is okay to post a second comment. Just found this joke on the web which - IMO - sort of describes the approach of using LLM for coding. Only slightly modified it: "A programmer walked into a bar with a parrot on his shoulder. The bartender asked: 'Is it trained?' The parrot replied: 'I am but I don't know about him.'"
If you let the AI generate code based on the spec in the form of acceptance tests, does this mean we don't need unit tests, as long as the AI is taking care of generating the code for us? Unit testing is all about taking small steps at a time while implementing the desired behaviour, but if we will not write code any more (or at least not all of the time), does this mean we don't need unit tests in this case?
Theoretically you won't need lower level unit tests anymore IF the code generated is good enough. However currently even when using ATDD whether you use AI or not, you will need Unit Tests to test things like the domain logic outside of just using Acceptance Tests. ATDD generally follows the outside-in approach where you first specify the high level behavior and then work down towards the domain. However, I do think an outside-in approach to testing is the way forward. I have spent the last few years learning TDD, ATDD, BDD and DDD and while it took a while to find my style, I personally like the outside-in approach best. It helps me get something working fast and allows me to flesh out the domain while having confidence I don't break the program for external consumers. Issue is, currently the AI is non deterministic, so while it might be able to create a program according to the specs in the Acceptance Tests, stuff you won't often see in these tests, related to performance, scalability, security, et al. is something you still need to catch in some way, and part of that lies in Unit Tests as these concept partly rely on a good design, and Acceptance Tests are often so high level that they don't really guide the design of your application, at least not in my experience. For example, using .Net, let's say I have an Acceptance Test that tests that when I call an API a record is being saved to a database, there is no way to make sure in that test that the AI will implement the software following any SOLID principle, it might just create an endpoint, write a raw query and push it to a database without any layer of abstraction.
In this approach (which I don't believe in, btw), the form of the implementation code is irrelevant (just as you don't care about assembly/VM bytecode). SOLID at that level becomes a thing of the past. An LLM is not a compiler... natural language is not compilable. One natural-language text, read by a single reader multiple times, may have multiple meanings... one program, same compiler, arbitrary runs, same output. I've had compiler-generated bugs... and these are the worst bugs you can have. LLMs generate them constantly... the compiler metaphor does not hold.
This is true and good. I've been moving in this direction for a while. We had to do this to outsource, and now we just outsource to AI. Better specs create better products.
Liked the point of view, but ... Similar to agility I can see a scenario where non technical ppl see this as an incentive to fall back to old ways similar to model based approaches where it was all about producing a perfect model upfront
In the example test code, how could AcceptanceTestCase class accept more than one Protocol Driver (in this case SeleniumProtocolDriver01) in a "clean" way? Just use DI?
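One possible reading of "just use DI" is constructor injection: the test case receives its protocol driver instead of creating it, so the same test can run against several drivers. The sketch below is an assumption, not the code from the video; the two fake drivers, the `place_order`/`order_count` methods and `run_against_all` are all hypothetical names.

```python
from typing import Protocol


class ProtocolDriver(Protocol):
    """Hypothetical interface that every protocol driver would implement."""

    def place_order(self, item: str, quantity: int) -> None: ...

    def order_count(self) -> int: ...


class FakeUiDriver:
    """Stands in for a browser-based driver (e.g. something like SeleniumProtocolDriver01)."""

    def __init__(self) -> None:
        self._orders = []

    def place_order(self, item: str, quantity: int) -> None:
        self._orders.append((item, quantity))

    def order_count(self) -> int:
        return len(self._orders)


class FakeApiDriver:
    """Stands in for a driver that would talk to the system over HTTP."""

    def __init__(self) -> None:
        self._orders = {}

    def place_order(self, item: str, quantity: int) -> None:
        self._orders[item] = self._orders.get(item, 0) + quantity

    def order_count(self) -> int:
        return len(self._orders)


class AcceptanceTestCase:
    """The test only knows the ProtocolDriver interface; the driver is injected."""

    def __init__(self, driver: ProtocolDriver) -> None:
        self.driver = driver

    def test_placing_an_order_records_it(self) -> bool:
        self.driver.place_order("book", 2)
        return self.driver.order_count() == 1


def run_against_all(driver_classes):
    """Run the same acceptance test once per driver class and collect the results."""
    return {
        cls.__name__: AcceptanceTestCase(cls()).test_placing_an_order_records_it()
        for cls in driver_classes
    }
```

With this shape, adding another channel means adding one more driver class to the list passed to `run_against_all`; the test case itself does not change.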
I'll celebrate when AI can be connected to a git repo for an existing project and then fed the prompts to add a new feature. Maybe it could try a few experiments in feature branches, pick the best one and commit that. Writing greenfield code and small snippets is one thing, but integrating it into an existing codebase is another. By this point you'd be able to ask the AI how it would refactor this code to more easily allow X. Maybe even rewrite this Python in C or Rust. However, we need to start somewhere.
"Prevent cheating" --> "change kernel", well, for human anti-cheat at least. That kernel- change is accepted by some gamers and not by others. An AI for a game developer would need user demographics as input, the tokens on which it bases its output become core to the gamers' purchasing decision. Aren't those tokens then what the business would want to develop on, rather than the existing programming languages? Those tokens are pretty obscure in the current AI software dev workflows, although I'd guess that's exactly what programmers would be good at working with.
It would be interesting if the AI could execute the tests itself and do its own red-green-refactor cycles over the entire codebase at once. It could be quite unnerving and eye-opening to watch the entire codebase radically change from test to test.
This was a solid take in my opinion. Unfortunate news for all the folks who thought natural language was the future. We'll be inventing new specification abstractions and test-specific languages to fill the gaps. I bet UML makes a comeback.
Compilers became practically unpredictable a while ago, same for CPUs. And what is the issue with tagging an AI model to make the process more deterministic?
In my experience, if you ask an LLM to write for you small, well defined, well described chunks of code, like a well described data structure, specific function, procedure or a class, or even a small standard algorithm /like sorting of some data or so/, results are usually OK, but if you don't even know what do you want as a result... well, what would you expect. Results often will be as good as your description of the problem is. AI is more like an improvement on metaprogramming, than a replacement of programming /at all/. Just my opinion. :) And well, LLMs can really give you a bit different code each time you ask for the same piece /hopefully improved with newer versions and more training/, but the code... well, the code itself is not really less "deterministic" /compared to one, written by a human/. After all, computer languages are created specifically to impose deterministic results, aren't they.
My point on "determinism" is really about the difference in the way that humans and computers generate code. When faced with some existing system and a new task, humans will modify the code, keeping what it did before and adding to it. AIs will rewrite everything from scratch again, my impression is that they don't really do so well at working to change code incrementally. That is where the loss of determinism sets in.
How about leveraging the nondeterministic nature of the code generation: Write the acceptance test loosely and ambiguously, not pinning down what is wanted definitively, as befits natural language and thinking about requirements, then ask the AI repeatedly to generate prompts and then generate code, and each time test the code with the acceptance test but learn from its errors and variations of output, to educate and evolve our acceptance tests.
It’s the "ATDD epiphany" I had just before Christmas that we talked about... Iterating the process repeatedly to the point where the acceptance tests/criteria are specified to a sufficient level of granularity for the AI to start writing code (using TDD, natch)... is that “all" we need do? The process “should" be reproducible - even if we have no idea what the eventual actual underlying code looks like… and (if the tests all pass) will we care? 🤔 Could the process be automated (perhaps with swarms of AI agents evaluating and generating new, more detailed acceptance criteria along the way) completely, i.e. from vague wish to production code automatically?
It's what I've been saying for almost 2 years now, basically since they started to push AI heavily for coding. That said, I've kind of changed my mind: we won't reach the performance required for AI to be good enough to do all this, we've hit silicon limits now. That's actually a more interesting discussion for me, what happens now that performance-per-watt progress will soon halt?
As a solo game developer who does both game design and coding, I can't wait for this approach to coding to become reality. This would enable me to try more ideas, and make better games as result.
Hi Dave, I wanted to share an observation: I was trying to modify a blog template written in Javascript and using the Astro framework. I'm not a JS person so I used the Cursor IDE to do the modifications. Interestingly, because Cursor is agentic, I could see that it would execute the code sometimes to see whether the output in the terminal is correct and how I expect it to be. As if using the terminal, it would do the acceptance testing. There was another case, however, because the agent was so fixated in modifying the code in one way, it didn't even consider a much easier way I found on the web. Hence, it was stuck in an endless loop of modification and testing. I think it would be nice to get your thoughts on the agent workflow where the AI not only creates the code and tests, but also it runs the code and verifies the output and makes modifications automatically.
Everyone will have new ideas given the rise of the current AI cycle. Individuals tend to think and extrapolate linearly and miss, but I do think it's a good idea to keep coming up with ideas. If enough people share enough ideas, those linear extrapolates can get us in the right direction.
Specifying behavior by example is... Not so easy, and especially with the parasitic nature of LLMs I think it will be hard to produce anything novel with this approach.
From a logical, a priori view, it is all but impossible to provide anything like a deterministic guarantee from behaviour examples, as something unrelated to the specific behaviour being tested could break expectations yet still be within the presumed definition of the test. I find specific guarantees to be a far more reliable way. Let's take an RPC as an example: you could guarantee that given input combinations will result in X behaviour and no other, that a change of one variable will result in variance Y, et cetera. If the guarantees are solid and they were created to the requirements, you can guarantee the work will meet the requirements. It's a nitpicky distinction, I'll admit, but one makes specific testable claims, the other takes a guess and leaves the rest in the Lord's hands 😂
So far LLM's (or I should say chatGPT interface) has been very useful in learning concepts new to myself and helping with making my own thoughts more material and fleshed out so that I can understand and communicate them better. I can't speak for actually using it in engineering, but we'll see!
Yet lots of teams already do a version of this with human programmers being guided by specifications like this. It's called BDD (or sometimes ATDD) and it works fine, particularly for things with a "novel approach": one of my teams built one of the world's highest performance financial exchanges in exactly this way. Basically the BDD scenarios work as the requirements, defining "What" the system is meant to do in each situation that we can think of. This is not really any different to how any software is built; after all, how can you write the code if you don't know what you want it to do?
@@ContinuousDelivery Sure, but there you have humans talking to each other, developing a common vocabulary, refining examples and following rules. The tests support (or drive, if you prefer) development, but they do not really define the product - they define what is definitely *not* the product. Without a reasonable agent making choices within the remaining space, I have little confidence in the results. It might be better than AI code generated *without* acceptance tests, but that's saying almost nothing.
Aren't NFRs the big issue with this? Like there are possibly hundreds of ways to solve for any given set of requirements, but you can't tell an AI that its algorithm is O(n^2) and it should use another approach that does O(n) instead.
Just when we got an agreed, settled DevOps software lifecycle, it is going to have to change. But will there ever be an agreed, settled AI-integrated lifecycle? Or will AI itself take it out of our hands before it ever gets agreed?
We did this in the 90's. The model was to define a Hoare triple {Q}S{R} where *Q* was the precondition, *R* was the postcondition and *S* was the program. If you were given *Q* and executed *S* you would get *R*. So you defined Q and R, and with a calculus you would logically solve for S. The only troubles were (1) It was as hard to specify Q and R correctly as it was to write the program correctly, and (2) it was extremely difficult to solve for S for any non-trivial Q and R. Maybe AI will change that, but I'm from Missouri on that.
I felt similarly. So much of the work of program specification resulted in specifications that were harder to write, read, and understand than the code. But I like the idea of pivoting to acceptance testing as a primary focus.
@@adamfarquhar1279 We did that too in the 90's. Requirements documents were hundreds of pages long, and accepting the new application / module / etc. meant testing that each and every requirement was met. Code checkins were tied to the requirement(s) they satisfied and code reviews were meant to check whether the code satisfied the requirements. It worked but it took a lot of time. I'm all for it if it can be done in a more efficient manner these days. Of course I'm retired now (Agile was too much for me to stomach) so it's all your problem these days ;)
I thought determining inputs & outputs for your application was just basic programming design that's introduced with the "black box" analogy. You make your program based on what goes in & what comes out. Automating input & output validation is, in my experience, just a standard part of quality assurance & test automation.
That's funny because that's actually how modern program synthesis methods work. They extract Q and R from a symbolic execution of some reference code and solve for a provably equivalent S using various search techniques. Hoare triples represent program blocks, and those get composed to produce verification conditions for the whole program.
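As a toy illustration of "define Q and R, then search for S", the sketch below checks the triple {Q}S{R} on sample inputs and picks the first candidate program that satisfies it. Everything here is an assumption made for illustration: the integer square root example, the candidate list and the function names are hypothetical, and checking samples is testing, not a proof as in real synthesis or verification tools.

```python
import math


def triple_holds(Q, S, R, samples):
    """Check {Q} S {R} on samples: whenever Q(x) holds, R(x, S(x)) must hold."""
    return all(R(x, S(x)) for x in samples if Q(x))


def synthesize(Q, R, candidates, samples):
    """Return the name of the first candidate S that satisfies the triple, or None."""
    for name, S in candidates:
        if triple_holds(Q, S, R, samples):
            return name
    return None


def precondition(x):
    # Q: the input is a non-negative integer.
    return x >= 0


def postcondition(x, r):
    # R: r is the integer square root of x.
    return r * r <= x < (r + 1) * (r + 1)


CANDIDATES = [
    ("half", lambda x: x // 2),
    ("identity", lambda x: x),
    ("isqrt", math.isqrt),
]

SAMPLES = range(-5, 200)

if __name__ == "__main__":
    print(synthesize(precondition, postcondition, CANDIDATES, SAMPLES))
```

Inputs that violate Q are skipped, so the triple says nothing about what S does on negative numbers.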
Adobe used this approach to modernize their fortran kernels a few years back, without needing AI. But AI will certainly be making it more effective in the coming years.
Did that in university. Very limiting thing in reality. Though, good for education stuff.
The early point in the video that code is more precise than ambiguous language is the main reason i give for not using ai to write code yet.
Agreed. Not all layers of abstraction are created equal.
If/when AI prompts become the new programming language, will the same prompt result in the same outcome every time? Unlikely.
Your tests would have to be robust AF to offset that risk, IMO.
Shall we get AI to write the test too then? No thanks. 😬
By the time your 'spec' is comprehensive enough to get acceptably deterministic outputs, how much have you really saved?
The writing of the code. Which is to say... not much. Writing the code is usually the easy part. Most programs aren't "hard", not to say there aren't hard problems out there, but most business needs are met with pretty simplistic logic or can build off of already established methods and patterns. The challenge is typically being able to figure out (and translate) an imprecise description from the user/customer to something you know the computer can understand. With this change, that "something" the computer can understand is acceptance tests instead of standard computer languages. So it's still an improvement, and would speed things up a bit, but isn't the hard part (for an experienced programmer). It can certainly speed up some of the tedious parts.
We do have such specs: it is called Haskell.
Quite, that's the point of modern software development, we develop the details of the requirements as we write tests and code. Apart from legislation, rarely do we get detailed requirements from a customer or stakeholder, we work together as a process.
@@andersbodin1551 I hated Haskell when I first started to learn it, but fortunately my "educator" was a pioneer in its development and now I love it, as it's so powerful once you understand it.
@@andersbodin1551 Sigh...
I think AI would be a great tool for running code snippets in the background for analysis (something like real-time AI-generated unit testing), analyzing code while it is written by humans and giving notices when it spots something problematic: "When you call this function over the path x()->y()->z(), this will cause a nullptr exception", "this malloc-ed memory never gets freed in the available code paths.", "While your loop works correctly when called from x()->y(), when called from z()->y() it will cause an array out-of-bounds exception."
Recently did a major cloud migration project for a big fintech firm and there was a big emphasis on acceptance testing. Glad to say our project was the most successful post-release compared to other orgs in the company.
I tried this approach a couple months ago, for a small but real use-case I had. Feeding acceptance tests to o1 worked better than any other AI approach I have tried. Unfortunately, I don't think it worked well enough just yet. I had the same experience as Dave, that working with o1 was reasonably successful, but the hand-holding took the same amount of time as it would have required to do everything myself.
I did bump into a pretty hard problem with the context window and with separation of concerns. If I asked for too much in a single prompt, the LLM would eventually go off the rails (I assume due to context window issues). But also, if I tried to manually limit my prompts to specific sub-problems to aid with separation of concerns, the LLM would go off the rails as well. I still think there is probably a sweet spot for the sizes of "concerns" that can work with LLMs, but it isn't the same as for humans, and I wasn't able to find something that worked without bumping into the other problem.
Totally agree with the concept, and I’ve been putting this forward as an answer to the problem for a few years but I’m not convinced that the AI models are where we need them. The overarching concept is the premise of 5th generation languages - describe the problem, not the solution. Well written tests are a more concise and less contextually sensitive way to describe the problem than plain, ambiguous natural language.
The headache is the AI code has to be right, and as observed, the models still have problems. LLMs are probably not the answer as they want natural language, not acceptance tests. Then there is the old problem of making it work at scale. It’s where traditional 5th generation languages fell over and while the issue has moved on, it hasn’t gone away
I agree, but while current LLMs are not the answer, acceptance tests plus a standard set of values for good programming (e.g. speed, resources used, percentage of mistakes, etc.) are an entire training course for new AI models anyway.
On top of that, unlike LLMs writing code, having a consistent improvement condition (speed, resources, etc.) for a prompt can let the AI improve the code independently along those axes without any human intervention. You can see this in AIs that are made for games with similar conditions (winning the game, for AlphaGo for example), and they become incredibly capable incredibly fast.
LLM's are just not the right tool for coding. Move the coding part to its own AI, have the LLM write out acceptance tests. That's a much better solution than having the LLM write code, while not even understanding what makes for quality code in the first place (due to being trained to predict human language instead of good code).
The latest generations of AI assistants are addressing the LLM problem, the "reasoners" models like o1 and o3 are dramatically more effective, and less likely to hallucinate. There does seem to be an exponential step change going on, so it is possible that this problem is on the way to being solved, certainly "alleviated"!
@@ContinuousDelivery My real point is how the model responds to coded test constraints as a prompt rather than the natural language they have been trained/developed/evolved for
I’ve written formal specs using logic and sometimes using algebraic specifications.
For example part of the spec for a queue:
Pop(push(a,q))=a
It is quite easy to see how to convert that to a behaviour test.
Another example, to define “sort” function:
sorted_list=sort(list)
Ordered(sorted_list)
Permutation(list, sorted_list)
In other words, a sorted list is an ordered permutation of the input.
Again, very easy to create sort test cases.
I’ve tried using this with AI with some success on some algorithms and an accounting problem (expenses under IFRS17).
That sort spec is actually executable in some functional languages, though extremely inefficiently, as a breadth-first search of all permutations. That can help build confidence in the spec (validation as opposed to verification).
So to some extent what the AI is doing is not so much compiling the prompt, but creating an efficient, realistically executable version. A transformation rather than compilation.
Compilers do optimise but they don’t try to change the algorithm and data structures.
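To make the two specs above concrete, here is a minimal Python sketch of how they might become behaviour tests. The names `push`, `pop`, `ordered`, `is_permutation`, `spec_sort` and `check_sort` are illustrative assumptions, not anything from the video. Read literally, Pop(push(a,q))=a holds for a stack, or for a FIFO queue only when q is empty, so the sketch assumes the stack reading.

```python
from collections import Counter
from itertools import permutations


def push(a, q):
    """Return a new container with `a` added (immutable tuple, assumed LIFO)."""
    return q + (a,)


def pop(q):
    """Return the most recently pushed element."""
    return q[-1]


def check_pop_push(a, q):
    """Behaviour test for the law Pop(push(a, q)) = a."""
    return pop(push(a, q)) == a


def ordered(xs):
    """True when every element is <= its successor."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def is_permutation(xs, ys):
    """True when ys holds exactly the elements of xs, counting duplicates."""
    return Counter(xs) == Counter(ys)


def spec_sort(xs):
    """The spec executed directly: search all permutations for an ordered one.

    Factorial time, so only useful for validating the spec on tiny inputs.
    """
    return next(list(p) for p in permutations(xs) if ordered(p))


def check_sort(sort_fn, xs):
    """Behaviour test: the result is an ordered permutation of the input."""
    result = sort_fn(list(xs))
    return ordered(result) and is_permutation(xs, result)
```

Any candidate implementation, hand-written or generated, can then be passed to `check_sort`, while `spec_sort` acts as a slow oracle for small cases.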
Natural language is ambiguous and context sensitive. We still have the issue of making decisions along the way. Dunno if it's really an advantage to use natural language as a means to communicate with the computer. The computer doesn't ask the right questions if you tell it to do something. The current LLMs are rather comparable to someone who memorizes a lot of examples without understanding their semantics. When you prompt one to do something, it just tries to reproduce something from memory. If you delegate some work to someone who seems confident but doesn't understand the domain, you would give him a task once or twice but then work with someone else instead. It might help with code completion but you still have to review and correct it.
I think this is the best way to use AI for programming at the moment. Not only is the prompt way more precise than natural language, it's also obvious how to verify whether the generated code works (just run the tests). One caveat though is how to represent non-functional requirements. For example, if the requirement is to sort an array, there is no guarantee it will choose the optimal sorting algorithm. It may even choose to implement a bubble sort 😂
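One possible way to express that kind of non-functional requirement as a test is to count comparisons rather than measure wall-clock time. The sketch below is only an illustration under assumptions: the `Counted` wrapper, the `factor` of 2 and the input size of 512 are arbitrary choices, and a comparison budget is a proxy for time complexity, not a proof of it.

```python
import math
import random


class Counted:
    """Wraps a value and counts every `<` comparison made between instances."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def comparisons_used(sort_fn, n, seed=0):
    """Sort n pseudo-random items and report how many comparisons were made."""
    rng = random.Random(seed)
    data = [Counted(rng.random()) for _ in range(n)]
    Counted.comparisons = 0
    sort_fn(data)
    return Counted.comparisons


def bubble_sort(items):
    """A deliberately quadratic sort, standing in for a poor generated solution."""
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i + 1] < items[i]:
                items[i], items[i + 1] = items[i + 1], items[i]
    return items


def within_n_log_n(sort_fn, n=512, factor=2.0):
    """Acceptance check: comparison count stays under factor * n * log2(n)."""
    return comparisons_used(sort_fn, n) <= factor * n * math.log2(n)
```

With these assumptions, Python's built-in `sorted` stays inside the budget while `bubble_sort` does not, so a bubble sort would fail the acceptance test even though its output is correct.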
I suspect the only thing AI has changed in any meaningful way is it has inserted itself into conversations it does not belong in.
I totally agree. I made a post about it the other day showing that a well-written acceptance test in natural language, using a DSL approach, can improve the AI generation success rate drastically. It creates a better prompt, and the test coverage you get from going test-first allows an easy refactor of the generated code, which can make the tests pass most of the time but is often mediocre.
Using AI this way will make the TDD approach drastically faster than it already is. The argument that TDD or testing is slow is starting to turn the other way around, thanks to AI.
We are not totally there yet, but i really think it will be the future of programming.
Can you share your post please? :)
I do not think it is the future of programming, because only writing unit tests for code is living hell for most programmers. Businesses won't be able to lift off the ground because no one will be interested in "acceptance test" jobs.
@@Nicholas-qy5bu I've been pondering BDD and executable specs as the middle ground between the user and the LLM. I'm looking for my dev org to experiment with this and other process-changing ways to develop, e.g. mob programming with lots of bespoke, prompted LLM helpers. Would be interesting to share ideas!
@@gamechannel1271 i think that writing these acceptance tests will feel completely different, because you know that you are building something by writing them
If we let the AI be in charge of the "essential complexity" we will have some big surprises
This approach is meant to be the complete reverse of that. We define the essential complexity, as executable specifications (using ATDD techniques) and the AI deals with the accidental.
@@ContinuousDelivery Sorry, it was badly formulated to say that I totally agree with your analysis.
The shirt is appropriate.
This is programming in an extremely indirect way with a stochastic algorithm in between. You're making a (highly questionable imo) assumption that specifying your program by example will produce more correct code, faster. I.e. the system is an abstract function f of many variables, and you're suggesting we specify f by pointwise evaluation and let a stochastic algorithm extrapolate a general definition of f. There are many ways this could go wrong:
1. The algorithm might just memorize the inputs and outputs given with little to no generalization (how would you tell the difference?). This is NOT the same problem as the AI memorizing its training data because this is not training data.
2. To capture all of the nuances of the desired behavior of f, you need a number of examples that far exceeds the length of a hand-written implementation of f.
3. There's no way in the testing language to specify desired properties of the implementation that depend on details of the implementation language (which may or may not be the same as the testing language). I.e. constant time operations (for crypto), time complexity of algorithms, bounded memory usage, real-time guarantees etc.
4. In general it may be impossible to implement f in polynomial time by observing inputs and outputs. Consider a specification of SHA-512 by way of acceptance tests that map inputs to outputs. The whole point of something like SHA-512 is that it scrambles all output bits if one input bit changes so there is no way an algorithm could "guess" the underlying function in polynomial time.
This is what "programmers" said when high-level languages like FORTRAN, BASIC and COBOL were invented: "The output isn't a real program". But I don't really see a big distinction; the nature of the step from assembler to HLL, or from HLL to acceptance test, is conceptually very similar.
@@ContinuousDelivery What a weak response. Complete non sequitur that doesn't address any of my points. How is assembler to HLL "conceptually very similar"? None of my 4 points remotely apply to C/Fortran except for point 3, which is actually still, to this day, a real challenge when you care about implementing an operation in say constant time and a compiler gets in the way and optimizes your code. But the points are all very real (and only tip of the iceberg) concerns about "acceptance testing + LLM". A compiler is not a stochastic algorithm comparable to an LLM.
This sounds like another form of TDD. The issue is not whether the output is consistent (functionality) but whether the functionality and semantics are predictable. Passing a test does not prove consistency of implementation, and sometimes this matters as much as the result: for example, speed, resource use etc. The solution will be a meta-language that sets deterministic constraints on the AI but doesn't require implementation details. It will look very much like a functional programming language but much more terse. That's the end point.
You are shifting around the meaning of the terms "specification" and "implementation/solution". One could also say that writing a program in C is writing an executable specification of what a CPU should do. Two things to consider: 1. The only thing that changes is the level of abstraction, and we cannot verify the details that are not considered due to abstraction. 2. If we continue on the current path we will soon need a whole data center just to calculate the result of 1+1.
You can read Bourbaki (virtual French mathematician). As it turns out the specification of 1+1, when done mathematically correctly, does take several hundred pages and the "code" is virtually inaccessible to all but a few mathematicians on Earth who care about that level of exactness. So while you may have tried to make a joke, you actually got it exactly right. The provably correct representation of software like a word processor would most likely really take an entire data center full of nearly random gibberish looking logical specifications. It's also completely irrelevant. You simply give the thing to the user and ask for feedback. That's the only practical way to test software.
Without repeatability it makes no sense. You literally can't tell for sure whether your specification is complete or not, because there is always a chance it will produce the wrong result the next time you run the generation process. And you may need to run it again at any time when there is a new feature request. Basically, with AI you're gonna rewrite the whole code base to make a minor change. Isn't the whole of software engineering about writing code in such a way that it doesn't have to be touched, or is touched minimally, when you add or change something?
Would be cool if AI were able to run the tests in your environment, automatically look up the results and re-prompt itself when it encounters an error, until it has solved the issue or hits a specified loop limit. I feel like it could be the backbone for future IDEs.
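The loop described here can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not how any particular tool works: `generate` is a hypothetical callable standing in for a model call, and `run_tests` and `generate_until_green` are made-up names. The tests are assumed to be plain Python assertions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Run the candidate code followed by its tests in a fresh interpreter.

    Returns (passed, error output). A non-zero exit code counts as a failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + tests)
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=30,
        )
    return result.returncode == 0, result.stderr


def generate_until_green(generate, tests, max_attempts=5):
    """Ask `generate` for code, feeding test failures back, up to a loop limit.

    `generate(tests, feedback)` is a hypothetical stand-in for a model call.
    Returns (code, attempts_used), or (None, max_attempts) if nothing passed.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(tests, feedback)
        passed, feedback = run_tests(code, tests)
        if passed:
            return code, attempt
    return None, max_attempts
```

The loop limit matters: without it, a specification the generator cannot satisfy would keep consuming model calls indefinitely.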
Actually not the future, it's already there - set changes to auto accept and it will keep trying to resolve
I do this already.
No waiting involved: lots of the coding assistants already do this and more, including writing unit and functional tests for you.
It will do all of this and more. You need to be able to prompt it correctly, which is a bit of an art form at present, but it’s definitely already here.
Cursor, Aider, Cline... already doing that for months now.
@@IvanRandomDude Thanks for introducing me to these great tools!
@@IvanRandomDude oi, I'm out of the loop it seems. thanks for the heads up!
I think you are right to point out the loss of determinism, which might be the biggest change factor to accommodate in new development lifecycles. Lots of ideas will come along on how to make this accommodation. Acceptance testing might be a feature common to many such ideas.
Any company that adopts this methodology will meet its downfall. You can reference this comment when it pops up in the news in a few years and pat me on the back.
thank you for sharing your experiment, Dave.
Very interesting approach
Please find the typo at 7:12: of course it is „specification“ 😅
What was the "language" the acceptance tests were written in, natural language, pseudo-code, or a programming language?
There are examples in the video. It is an internal DSL (designed to make writing the tests easy) based on Java.
The examples in the video are the real tests that executed and verified that the generated code did the right things.
Great video. Not many videos would I watch more than once. This one deserves watching half a dozen times.
Iteration is key to resolving quite a few of the mentioned issues and getting working code at the end. Tools like Cursor's Agent-Composer, aider-chat and Cline Dev have options to run commands (e.g. tests) and to iterate on the output by themselves after a miss until the output is correct.
How do you know if the output is correct? You have a mathematical algorithm to test for correctness, right? Oh, wait. You don't. We teach in CS101 that such a thing does not exist and can not exist.
If programming just becomes ONLY making acceptance tests, I'm out. I like writing tests, but I like writing the code as well. Why would I let the computer do the part I like?
AMEN! :)
@@douglascodes I read a quote the other day: "I want AI to do my dishes and laundry so I have more time to do art and write, not for AI to do my art and writing so I have more time to do dishes and laundry"
Tesla's robots will do the dishes and laundry.
@@Gokuroro the problem is the art is the $$$ and businesses want the $$$… so they will do everything they can to automate that… it’s why we generally don’t have decorative houses/furniture/street lamps/bridges/etc… the art of society is generalised and mass produced.
The human part of programming is understanding business and user needs and translating them into a program. You enjoy the robotic part of programming, not the human part. It's like being a mathematician at the advent of the calculator and saying if it is going to do all my long division for me I'm out.
I have been keeping up to date with programming and been a commercial programmer for over 30 years, and my husband works on vast international software projects of critical importance. There is no sign of AI taking over programming. Most programming is very far from routine and involves such aspects as upgrades and migrations, adding functionality, replacing live legacy systems, optimisations, security, meeting financial standards, future-proofing, and various levels of debugging. These need deep awareness of what is going on, and a lot of experience with development.
Software tools constantly change. At the moment Java is evolving rapidly and an important aspect of that is preview features for developer feedback. An AI isn’t going to be able to discuss the merits of the latest Java implementation of value classes, or suggest problems with the foreign function interface.
We are a long way from AI doing mainstream programming. If there is an attempt for this to happen it will be a disaster as programmers love programming, and the industry will collapse.
Anyone who is really working as a software engineer and is actually using the latest AI tools knows it -> the current state of the tools just can't replace a software developer. Not even in web development, where AI has the biggest advantage. Not even the latest agents and models are good enough to compete with juniors.
CEOs of AI companies -> need to lie to keep the funding going (Devin scenario).
Investors are just blatantly being lied to but they are also to blame because of their need to be on the hype train.
By the time you have nailed down the acceptance 'prompt', finished the mandatory code review and fixed the glaring mistakes ... you have probably consumed the same time as if you had written the code yourself, unit tests included.
And don't forget all the energy consumed to train & perform inference.
I have twice implemented executable specifications using DSLs. This has allowed Business Analysts to define what they want in a way that they can throw data at it and see the results of their logic in real time, and adjust their specifications in real time until all their cases transform correctly. The specification was sent to developers to code it up and they used the correct outputs to validate their code against. This allowed the sprint to begin with test cases (black box) in place and ready. It was rather fab to be honest.
The DSL provided the syntax and capabilities they needed, and we would regularly extend or adjust as needs arose.
Not applicable to every situation for sure - this was a data transformation engine so particularly suited to it.
However, I am always keen to explore this further and see where it might lead.
Or... you could have programmed it in one tenth of the time. ;-)
We may need to change the way we do tests as well to make sure AI is not just covering the test cases (imagine a generic fizz buzz, but AI generates the code for the fizz buzz cases that were actually tested).
This could include something akin to the QuickCheck (Haskell) or Hypothesis (Python) library for testing based on rules/properties.
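To illustrate the concern, here is a small Python sketch of the fizz buzz case. It uses only the standard library as a hand-rolled stand-in for a property-based testing library such as Hypothesis or QuickCheck, so none of this is their API. The function names and the four example cases are assumptions made for the illustration.

```python
import random

# The example-based "acceptance tests" a generator might be shown.
EXAMPLES = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz"}


def fizzbuzz(n):
    """A general implementation."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def memorised_fizzbuzz(n):
    """An overfitted implementation that only covers the tested examples."""
    return EXAMPLES.get(n, str(n))


def passes_examples(fn):
    """Classic example-based check: every listed case matches."""
    return all(fn(n) == expected for n, expected in EXAMPLES.items())


def holds_properties(fn, trials=500, seed=0):
    """Rule-based check over pseudo-random inputs rather than fixed cases."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 10_000)
        out = fn(n)
        if ("Fizz" in out) != (n % 3 == 0):
            return False
        if ("Buzz" in out) != (n % 5 == 0):
            return False
        if n % 3 != 0 and n % 5 != 0 and out != str(n):
            return False
    return True
```

Both implementations pass `passes_examples`, but only the general one survives `holds_properties`, which is the gap that rule-based testing is meant to close.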
Super interesting talk, tons of food for thought!
Astoundingly it is only you announcing this wisdom. And I am not on Satire mode! You also were quite lonely being the one referencing the traps waiting for everyone in agile development. Man, I praise you for this, I feel like I am not alone. You are always worth a listen!
I arrived at the same thoughts after coding a product with an AI assistant yesterday. While it takes a lot of routine work away, someone still needs to verify that what was built is right, secure, etc.
I think the future of software engineering will be more in the controlling function and setting requirements - much like what happened to professions where manual work was replaced by CPU-controlled machinery.
CPU controlled machinery is being developed by engineers... people with HIGHER skill level than the workers it replaces. It only pays for itself if the process requires a lot of workers or if a sufficient number of workers is not available (like in case of food processing at the industrial scale) or if workers can't do the job to begin with (nobody can handle 200lbs paper rolls in the printing industry etc.) and the machines are reasonably cheap. Not sure what you want to do here. Replace a software engineer with a higher skilled software engineer? That doesn't exist. At least not on average.
One of the benefits is that I can go ”off rails” with my AI buddy, revert the whole thing, and now both me and my AI buddy have gained a lot of learning about how to plan for the task we have at hand. This learning loop is significantly faster than handcrafting the code myself. And yes, I agree that writing acceptance tests first (with AI of course) is one good way to keep us on the road towards the goal.
This is also what I've had in mind as a good role for generative AI. A challenge is specifying the performance requirements, especially for the algorithms it comes up with.
The set of valid and useful programs is much much smaller than all possible programs.
Any time complex modules interface, the set of valid solutions goes down as complexity increases.
This makes specifications easier and easier to specify, given a starter solution to interface with.
Except that most experienced programmers know that what the user says he needs is not actually what the user really needs. If you deliver what the user wanted, then you will have a very unhappy user. ;-)
Still not convinced about any of this stuff. Entirely possible that I'm just too old and set in my ways though...
No, you are exactly right. If your bullshit detector is ringing, then for good reason. This is complete bullshit.
I love the take that natural language programming is just another step of abstraction and LLMs take the role of compilers. (and really all other takes too)
The programming model of ATDD with LLM support is exactly how I've been thinking of the future too! Cool to see this proof of concept of yours actually trying it!
I guess you are just way too far ahead of the curve here. It will take a while for all the "AI employee" BS to cool down. I suppose we will have to see a full cycle of hype, layoffs, failed products, dissatisfied managers and a bursting stock bubble before companies stop pushing AI employees and start to really support this approach instead!
Eventually you'll come full circle and realize that the ultimate expression of specification is THE computer code.
According to "The Knowledge Gap" theory, society might collapse due to a lack of understanding of underlying systems.
This is just my gut feeling, but writing the acceptance tests just for the AI to generate all the code will amplify that problem.
People will no longer understand what something does, and if they do, only at a superficial level.
New coding is requirements management. You use AI to write requirements, and requirements to write test cases and actual code. If you see a bug, clarify the requirements. This will make each app a requirements library at different fidelity levels, for human and AI to discuss the app and update it.
"Old" programming is essentially the same - writing requirements in a somewhat-natural-looking domain-specific language until you cannot see any obvious bugs.
The main reason this won't work is that the people deciding the specs or features don't know well enough what they actually want and don't describe it consistently enough. A good portion of the bugs I deal with come from the fact that the 2 engineers involved and the QA were told slightly different things or interpreted a meeting they went to differently. AI is great at taking an interview problem or coding challenge and solving it in an isolated environment; going beyond that is not going to work as long as we have managers and product leads.
I think you may have missed my intended point. The program *is* already the specification. My view is that the people doing the specifying won't change to non-programmers, because you need the kind of understanding and analytical skills of a programmer to specify things in enough detail, but acceptance testing with AI allows us to raise the level of abstraction that *programmers* can work at.
@@ContinuousDelivery I can see that point, but I don't think it can work practically. Fundamentally, the job is to translate vague business requirements into working software within the confines of the platform it will be running on. On a practical level, it doesn't matter if I'm writing the application code, an acceptance test or an AI prompt. But whatever the language or tool of choice is, it won't work if the requirements are too vague or do not match the constraints of the environment.
Then, there is the problem with the current crop of LLMs. Copilot loves using deprecated code (especially if the deprecation was done after 2022), and Gemini is a big fan of inventing brand-new public interfaces for existing libraries because it sounds good. Just like you've experienced it generating a REST API one moment, then trying to test a web UI. AI will use the tiniest bit of context to generate what an average human would find a convincing answer.
You've converted me. I now write BDD wherever possible. I'm confident my code is all the better for it.
You posted this half an hour ago. How much BDD-code have you written so far?
It's great that you feel better about your code
I see a couple of quick wins to help the accuracy. o3-mini is expected to be similar to o1 in ability but cheaper and faster. o3-mini can run a checklist on output code to catch a subset of mistakes with each request. Also, running the workflow in an IDE or in a web hosted code interpreter that can use error messages to correct mistakes automatically is low hanging fruit. If anyone is interested in exploring solutions, leave a reply.
Yes, and it can't outreason a squirrel, either. ;-)
It's worth noting that verbalizing questions and instructions is a different skill than actually coding. AI does better with more specific targets. Learning to ask questions or provide instructions at a level that AI can use took some effort. Personally, my productivity has at least doubled since I adapted to Cursor and learned to better formulate instructions for it.
Very interesting experiment. I remember trying something similar a long time ago using a genetic algorithm to attempt to evolve code to pass the tests. I didn't have much luck then apart from with very simple specs. However I definitely think your approach is worth some more investigation. So much to explore and so little time (and sleep!).
I'm curious how far you could take this. Since LLM context windows are limited, you would have to keep your tests broken out into easy-to-understand sub-components so the LLM can alter one component at a time.
Some acceptance test frameworks were aimed at non-developers, such as Fitnesse. Your results seemed to need the eye of a developer. I’m curious what it will take to get test creation to the point where non-developers can create tests of value. I’m also curious how a body of existing tests could be used to inform new tests.
This approach (generating code of deterministic tests) will certainly have some use, but I expect a rise of a different approach: You will write the same vague instructions you would write for your human testers, e.g. "create a few testing records, open the report, filter by date, check that the records are filtered accordingly," and the LLM will perform the test using its connected tools.
There are already some examples based on Anthropic's Computer Use, such as Cline for VSCode. It cannot replace the testers yet (it often gets stuck), but it will get there, and the software companies will love it. It won't be deterministic, but it will be reliable enough, and the software will be better tested than now.
There are much better ways of doing this... instead of using AI through the browser, use a tool like Aider.
It gets a list of all files in your repository and you explicitly control which files are read-only in its context, and which are editable. You can also add URLs which get scraped and added to the conversation.
You can switch modes between code (I.e. make changes), ask, and architect so that you can get the AI to plan complex changes.
Aider can be configured to run a lint or test command after every edit: a non-zero exit code will have the output added to the conversation, and you can then just ask Aider to fix it and iterate manually without any cutting and pasting. You can also paste in images/other content if required for debugging.
You can work around AI confusion by controlling which files/web content is in the context, clearing the chat history, switching modes to plan changes, and switching models if needed.
I think you would have had a very different experience using this approach...
If you look up 4GL you'll see that they really don't have to be graphical, and a lot of them aren't (such as the SQL family and other databases...). I'd say "declarative" (or at least "more declarative than the 3GL") should be their main adjective, to the extent that it makes sense to still use numbered generations. Computing branched into so many directions that it doesn't make much sense to number the "generations", especially as "3GL" evolved a lot after the introduction of "4GL". Back in the 80s, it was common to say that Prolog was an example of 5GL...
Thanks for the video. I wonder what would happen if we paired this with the approach of storing the documentation in the repository as well, and turned the tasks of changing the application into changes to the documentation.
It might then look like:
Adjust the documentation resulting in diff to markdown files -> appearance of the change to documentation triggers acceptance tests to be pre-written -> revision of the acceptance tests may trigger the actual code writing then.
I think you are spot on. Requirements gathering will be and has been the best tool developers can have. This is what will allow you to lead all your agents in the future. Translating business requirements into actual systems
This resonates quite well with me. 👍 We seldom implement our business logic in assembler these days. And we do not care about the exact CPU instructions that the code we write in Java, Python, ... is executed as tomorrow, as long as it is executed correctly.
I did experiment with making the LLM act as a requirement engineer and me being the client.
After I got the impression that we were on a good path I wanted us to agree on a specification. As a common understandable language that is precisely reproducible I selected "Java/JUnit".
This way it was possible to jointly agree on the requirements.
After that was done I asked for the implementation.
When I expect a change then we jointly adapt the requirements and the implementation simply follows.
💡 The specification is the thing that holds the truth.
The implementation can change on a daily basis. I do not care about details there as its behaviour is specified in the requirements. So as long as it meets the specification I am fine.
I see this as similar to the evolution of programming languages overall.
A REST-Interface would also make a good abstraction level to specify a behavior of a service. I am starting to care less about the implementation-language but more about the specification.
I asked an AI a few months ago for the name of a white model in a famous music video. It came back with "Naomi Campbell". Dude, anybody who trusts an AI to produce a correct result at this point is a fool. :-)
Should have used Claude Sonnet 3.5. It is just a natural at writing code. o1 is the kind of model you would use to write the acceptance tests themselves, not for the code implementation.
I've actually worked on this myself. Since Copilot, I've wanted to write tests and have it write the code. I've gotten a couple of types of prompts that do work "well" (well as in I can give them tests and I get back the framework of code I want).
I normally just take this and modify it in place (I don't have it do any styling and instead will have it do base HTML on a page, and I go in and fill in the display).
I also have one that takes the Gherkin statements we get at work and tries to break those into tests. This doesn't work as well as I'd like, but I think these get me 60-80% of the way with boilerplate and such.
Often there is not 100% test coverage possible. So there is a gap where the AI code can do a lot of strange things..? I don't think this will work reliably in the next 10 years
I think that is a problem of testing approach rather than AI. BDD style ATDD is already a very widely used approach, and the point of raising the level of abstraction is for the AI to fill in some of the blanks, in a similar way to how modern compilers will optimise the code that you wrote.
You either show the model the tests, and then it will make code that only passes these tests and nothing else, or you don't, and then you will waste lots of money generating code until it passes all the tests, which may never happen, and you do that for every change you need to make.
Or you can show the model the tests and write a kind of test that covers every possible case, and then the code you write as a test is as complex and error-prone as the code you could have written in the first place, without needing AI.
When we write a program, we are not “telling the machine what to do.“ We are constructing a new machine that does what we want.
I believe the future of AI is making knowledgeable people faster. Not novice into experts. The novice simply has no way of knowing if AI correctly answered the question.
@5:50 Hey Dave! I found those code snippets interesting, and found them in your "Acceptance Testing for Continuous Delivery" slides for your conference talk. I'd classify what I see as a very Rich Domain Model (RDM) based development (Data and Behavior together). What are your thoughts on Service-Based development, using something closer to Anemic models, and passing them as data into Service classes/methods? Got any content on that?
My thoughts are; let the LLM do the code monkey stuff and let the rest of us focus on usability, accessibility and fitness for purpose. We still need improvements though in generated code. I come from an era where memory, storage and speed were limited and I used to pride myself on maximising all of that, but nowadays it seems that no matter how bloated code is, if it works it will do. Even as an experienced developer it still blows my mind how computers are so fast now (compared to the old 8 bit days) and that people can get away with writing "sh!te" code and the speed of the hardware masks it. As this is now never going to change, I believe the future is in acceptance testing and making sure the product actually does what it should. Basically, anyone can spew out "code" these days but does it actually do what it's meant to?
I agree with both points: that test cases are an essential part of defining the task for genAI (and for humans), and that AI is getting there. Especially if it is already comparable in time and quality to an experienced developer, and considering that the environment problems and the repeated interactions that took up your time already have some ways of being addressed (agents, etc.).
I asked GPT-4o mini, with DDG, to read me this C expression.
a . b . c . d
I didn't get a meaningful answer. I got code examples with this expression, but no answer on how to read it. It would be so much easier to understand if I could replace the dot with a meaningful expression.
Can you try with o1 or Sonnet or maybe even DeepSeek? 4o is now too far behind.
I see how this approach works for tests for pure-functions, isolated components, internal business-rules and the like - but IME this doesn't work for user-interfaces, which tend to be the part of a system that's subject to the most changes (and meddling from product-owners...), and this problem exists in all kinds of UIs. Testability is better with FP-approaches to UI like React, but there's still desktop UIs in Win32/WinForms/Swing that remain basically impossible to reason about and I've yet to see any scalable solution besides contracting out QA to a body-shop but if you can't afford that I feel you're SOL.
...and it's not just a matter of writing tests for UI state: a frequent source of bugs originates in faulty (but reasonable!) assumptions about platform integration (e.g. window/control painting and double-buffering), things like that, because those are the kinds of issues that end-users (and non-technical C-suite people) see and which lead to complaints about software quality, but no amount of investment in automated testing can really help here. I hope I'm wrong when I say there are no good solutions here...
Timestamps (Powered by Sitrak'AI)
00:05 - AI is transforming programming through acceptance testing evolution.
02:10 - AI transforms software development, maintaining core programming principles.
04:13 - Acceptance tests enhance clarity and reproducibility in AI programming.
06:16 - Acceptance testing enhances programming by providing clear specifications and improving overall reliability.
08:23 - Acceptance tests could redefine programming roles and focus.
10:19 - AI-generated acceptance tests require user customization for better fit.
12:16 - AI-generated code can be improved with better prompting and testing.
14:07 - AI assistants need better guidance for effective programming and testing.
In my experience people spend longer working on the AI prompts than if they just wrote the code themselves. If businesses want to de-skill programming with AI then let them get on with it. They will end up with code that doesn't work properly and staff that lack the skills to fix it. Fine. Get on with it.
This approach is genius. It goes hand in hand with the idea that the classic software coder will be replaced by automation. Therefore good Software Engineers who craft the solution are the future.
What if you followed this process:
- you write the tests
- the AI generates the code
- you never read the generated code
- you accept the generated code IFF it passes the tests
- if you don't like the generated code, you need to write more tests
- if the AI can't make code that passes the tests, break the problem down smaller*
?
*This threshold will change as AI becomes more capable
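The "accept IFF it passes the tests" step can be sketched in a few lines of Ruby. This is only an illustration under assumptions: `acceptance_tests`, `accept?`, `candidate_one` and `candidate_two` are hypothetical names, and the two candidates stand in for code an AI might generate.

```ruby
# Each acceptance test is a callable that takes a candidate implementation
# and returns true when the expected behaviour holds (hypothetical shape).
def acceptance_tests
  [
    ->(impl) { impl.call(1, 2) == 3 },
    ->(impl) { impl.call(0, 0) == 0 },
    ->(impl) { impl.call(-4, 4) == 0 }
  ]
end

# Accept a candidate if and only if every acceptance test passes.
def accept?(candidate)
  acceptance_tests.all? { |test| test.call(candidate) }
end

# Two stand-ins for generated code; in this process nobody reads their bodies.
def candidate_one(a, b)
  a + b
end

def candidate_two(a, b)
  a * b + 1 # happens to satisfy the (1, 2) example, but not (0, 0)
end

puts accept?(method(:candidate_one)) # prints true
puts accept?(method(:candidate_two)) # prints false
```

With only the first test in place, `candidate_two` would be accepted, which is the "write more tests" step in the list above.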
> you never read the generated code
# the test given to the AI (RSpec-style expectation)
def test_addition
  expect(add(1, 2)).to eq(3)
  # ... more expectations
end

# generated code that passes the tests, but calls out to an external service
def add(a, b)
  addition_api.call(a, b)
end

# generated code that also passes the tests, with a hidden side effect
def add(a, b)
  with_silenced_logging { collect_and_export_data }
  a + b
end
I think that that is certainly a current useful strategy, but I don't think it will be long before we can't read the code that the AIs generate. This has already happened in other AI usage contexts. AI design systems that we can't understand, and there are several examples of AIs inventing their own languages to make communication more efficient. Computer programs are primarily about human to human communication, with execution as a side effect. If that wasn't true, we'd write all our programs in binary.
So what is to stop AIs from making the code so obscure, maybe for good reasons, but now we can't review it?
@@ContinuousDelivery That feels similar to microcode, pipelining, branch prediction, etc.: both compiler output and CPU internal operations are so complicated today that we humans can't generally expect to make sense of it.
I think this is the way. Human authored acceptance tests based on formal requirements that are used to verify AI generated applications. The tests would use a DSL based on top of generic frameworks for driving the interface (UI, http request, etc). Perhaps AI could assist in the generation of the test code but it would have to be audited by a human to guarantee that the verifications are effective and sufficient.
I believe that the issue of "AIs cheating the tests" that David highlights at the end of the video is intractable, if we want to define how the system behaves and have confidence in its operation then validation of those behaviours has to be encoded in some way.
To ensure that AI can't cheat, we'd need much more robust tests, that do things that are currently considered overkill (or simply wrong). Such as having a formula for the output in the test, and specifying like "for any integer value in range". Which loops back to the usual coding, if not to math.
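A minimal sketch of such a "for any integer value in range" test, in Ruby to match the snippet earlier in the thread. The names `failing_cases`, `honest_add` and `cheating_add` and the chosen range are assumptions for illustration. Note that the check has to restate the formula `a + b`, which is exactly the "loops back to the usual coding" problem.

```ruby
# A stand-in for generated code that only satisfies the single example
# from the original test (hypothetical).
def cheating_add(a, b)
  a == 1 && b == 2 ? 3 : 0
end

# A stand-in for a correct generated implementation (hypothetical).
def honest_add(a, b)
  a + b
end

# Check the formula for every pair of integers in the range and
# return the [a, b] pairs where the implementation disagrees with it.
def failing_cases(impl, range)
  values = range.to_a
  values.product(values).reject { |a, b| impl.call(a, b) == a + b }
end

puts failing_cases(method(:honest_add), (-50..50)).length
puts failing_cases(method(:cheating_add), (-50..50)).first(3).inspect
```

The example-based test `add(1, 2) == 3` accepts both implementations; the range-based check separates them, at the cost of writing the specification as code.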
As far as test cheating goes, there are ways to mitigate that. You could use the method they use to train the AI: come up with a dataset of tests and results, feed a number of them into the prompt, then add the rest in during the output testing.
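A rough sketch of that hold-out idea, assuming a tiny hypothetical dataset for an `add` function. The helper names `split_examples`, `passes_hidden?` and `memorizing_impl` are made up for illustration; `memorizing_impl` plays the part of generated code that merely memorised the examples it was shown.

```ruby
# Hypothetical dataset of [arguments, expected result] pairs.
def dataset
  [[[1, 2], 3], [[0, 0], 0], [[-4, 4], 0], [[10, 5], 15], [[7, -9], -2], [[100, 1], 101]]
end

# Split into examples shown in the prompt and examples held back for verification.
def split_examples(examples, shown_count)
  [examples.first(shown_count), examples.drop(shown_count)]
end

# Run a candidate against the held-back examples only.
def passes_hidden?(impl, hidden)
  hidden.all? { |args, expected| impl.call(*args) == expected }
end

# Stand-in for generated code that just looks up the examples it saw.
def memorizing_impl(shown)
  table = shown.to_h
  ->(a, b) { table.fetch([a, b], 0) }
end

shown, hidden = split_examples(dataset, 3)
puts passes_hidden?(->(a, b) { a + b }, hidden)
puts passes_hidden?(memorizing_impl(shown), hidden)
```

The memorising candidate passes everything it was shown and fails on the held-back cases, which is the signal this approach relies on.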
It's an interesting idea: in the future, maybe we just describe a system’s desired behaviors via acceptance tests, and from there an AI does everything else.
Why write the underlying Python/PHP/Node/etc. code at all?
This probably won’t make standard coding disappear anytime soon, but it does indicate how AI might reshape development roles (like “prompt engineering”).
Looks like a real chance for TDD to finally become a mainstream paradigm
I hope it is okay to post a second comment. Just found this joke on the web which - IMO - sort of describes the approach of using LLM for coding. Only slightly modified it:
"A programmer walked into a bar with a parrot on his shoulder. The bartender asked: 'Is it trained?' The parrot replied: 'I am but I don't know about him.'"
If you let the AI generate code based on the spec in the form of acceptance tests, does this mean we don't need unit tests, as long as the AI is taking care of generating the code for us? Unit tests are all about taking small steps while implementing the desired behaviour, but if we no longer write the code (or at least not all of the time), does this mean we don't need unit tests in this case?
Theoretically you won't need lower level unit tests anymore IF the code generated is good enough. However currently even when using ATDD whether you use AI or not, you will need Unit Tests to test things like the domain logic outside of just using Acceptance Tests. ATDD generally follows the outside-in approach where you first specify the high level behavior and then work down towards the domain. However, I do think an outside-in approach to testing is the way forward. I have spent the last few years learning TDD, ATDD, BDD and DDD and while it took a while to find my style, I personally like the outside-in approach best. It helps me get something working fast and allows me to flesh out the domain while having confidence I don't break the program for external consumers.
The issue is that the AI is currently non-deterministic, so while it might be able to create a program according to the specs in the acceptance tests, the things you won't often see in these tests (performance, scalability, security, et al.) are something you still need to catch in some way. Part of that lies in unit tests, as these concepts partly rely on a good design, and acceptance tests are often so high level that they don't really guide the design of your application, at least not in my experience. For example, using .NET, let's say I have an acceptance test that checks that a record is saved to a database when I call an API. There is no way to make sure in that test that the AI will implement the software following any SOLID principle; it might just create an endpoint, write a raw query and push it to a database without any layer of abstraction.
In this approach (which I don't believe in, btw), the form of the implementation code is irrelevant (just as you don't care about assembly/VM bytecode). SOLID at that level becomes a thing of the past.
An LLM is not a compiler... natural language is not compilable. One natural-language text, read multiple times even by a single reader, may have multiple meanings... one program, the same compiler, arbitrary runs, the same output.
I've had compiler-generated bugs... and these are the worst bugs you can have. LLMs generate them constantly...
The compiler metaphor does not hold.
This is true and good. I've been moving in this direction for a while. We had to do this to outsource, and now we just outsource to AI. Better specs create better products.
At 2:30 , this reminds of what Erik Meijer is creating: "Universalis"
Liked the point of view, but ... Similar to agility I can see a scenario where non technical ppl see this as an incentive to fall back to old ways similar to model based approaches where it was all about producing a perfect model upfront
In the example test code, how could AcceptanceTestCase class accept more than one Protocol Driver (in this case SeleniumProtocolDriver01) in a "clean" way? Just use DI?
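One possible shape for this, sketched in Ruby with constructor injection. Everything except the `AcceptanceTestCase` name is an assumption: `FakeUiDriver`, `FakeApiDriver`, `place_order` and `run_against` are hypothetical stand-ins, not the classes from the slides, and the in-memory drivers only mimic what a Selenium or HTTP driver would do.

```ruby
# Hypothetical in-memory drivers standing in for real protocol drivers,
# such as a Selenium-based UI driver or an HTTP API driver.
class FakeUiDriver
  def name
    "ui"
  end

  def place_order(item)
    { item: item, status: :confirmed }
  end
end

class FakeApiDriver
  def name
    "api"
  end

  def place_order(item)
    { item: item, status: :confirmed }
  end
end

# The test case depends only on the driver interface; the concrete
# driver is injected through the constructor.
class AcceptanceTestCase
  def initialize(driver)
    @driver = driver
  end

  def should_confirm_order
    @driver.place_order("book")[:status] == :confirmed
  end
end

# Run the same scenario once per driver and collect the results by name.
def run_against(drivers)
  drivers.map { |driver| [driver.name, AcceptanceTestCase.new(driver).should_confirm_order] }.to_h
end

p run_against([FakeUiDriver.new, FakeApiDriver.new])
```

The scenario itself never names a concrete driver, so adding a third channel means adding a driver class and one entry in the list passed to `run_against`.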
I'll celebrate when AI can be connected to a git repo for an existing project and then fed the prompts to add a new feature. Maybe it could try a few experiments in feature branches, pick the best one and commit that. Writing greenfield code and small snippets is one thing, but integrating it into an existing codebase is another. By this point you'd be able to ask AI how it would refactor this code to more easily allow X. Maybe even rewrite this Python in C or Rust. However, we need to start somewhere.
"Prevent cheating" --> "change kernel", well, for human anti-cheat at least. That kernel- change is accepted by some gamers and not by others. An AI for a game developer would need user demographics as input, the tokens on which it bases its output become core to the gamers' purchasing decision. Aren't those tokens then what the business would want to develop on, rather than the existing programming languages?
Those tokens are pretty obscure in the current AI software dev workflows, although I'd guess that's exactly what programmers would be good at working with.
It would be interesting if the AI could execute the tests itself and do its own red-green-refactor cycles over the entire codebase at once. It could be quite unnerving and eye-opening to watch the entire codebase radically change from test to test.
This is fantastic!
This was a solid take in my opinion. Unfortunate news for all the folks who thought natural language was the future. We'll be inventing new specification abstractions and test-specific languages to fill the gaps. I bet UML makes a comeback.
Compilers became practically unpredictable a while ago; same for CPUs.
And what is the issue with tagging an AI model to make the process more deterministic?
In my experience, if you ask an LLM to write for you small, well defined, well described chunks of code, like a well described data structure, specific function, procedure or a class, or even a small standard algorithm /like sorting of some data or so/, results are usually OK, but if you don't even know what do you want as a result... well, what would you expect. Results often will be as good as your description of the problem is. AI is more like an improvement on metaprogramming, than a replacement of programming /at all/. Just my opinion. :)
And well, LLMs can really give you a bit different code each time you ask for the same piece /hopefully improved with newer versions and more training/, but the code... well, the code itself is not really less "deterministic" /compared to one, written by a human/. After all, computer languages are created specifically to impose deterministic results, aren't they.
My point on "determinism" is really about the difference in the way that humans and computers generate code.
When faced with some existing system and a new task, humans will modify the code, keeping what it did before and adding to it.
AIs will rewrite everything from scratch again, my impression is that they don't really do so well at working to change code incrementally. That is where the loss of determinism sets in.
How about leveraging the nondeterministic nature of the code generation: Write the acceptance test loosely and ambiguously, not pinning down what is wanted definitively, as befits natural language and thinking about requirements, then ask the AI repeatedly to generate prompts and then generate code, and each time test the code with the acceptance test but learn from its errors and variations of output, to educate and evolve our acceptance tests.
It’s the "ATDD epiphany" I had just before Christmas that we talked about...
Iterating the process repeatedly to the point where the acceptance tests/criteria are specified to a sufficient level of granularity for the AI to start writing code (using TDD, natch)... is that “all” we need do? The process “should” be reproducible, even if we have no idea what the eventual actual underlying code looks like… and (if the tests all pass) will we care? 🤔
Could the process be automated (perhaps with swarms of AI agents evaluating and generating new, more detailed acceptance criteria along the way) completely, i.e. from vague wish to production code automatically?
It's what I've been saying for almost 2 years now, basically since they started to push AI heavily for coding. That said, I've kind of changed my mind: we won't reach the performance required for AI to be good enough to do all this, we've hit silicon limits now.
That's actually a more interesting discussion for me, what happens now that performance-per-watt progress will soon halt?
@@defeqel6537 companies have spent billions, they'll probably just keep spending billions
Try this with Cursor ide using claude sonnet 3.5. and use your acceptance test-driven workflow. It works wonders. Like wearing a jetpack.
lol 😆 Scrum made a full circle and we are back to waterfall 👍
As a solo game developer who does both game design and coding, I can't wait for this approach to coding to become reality. This would enable me to try more ideas, and make better games as result.
Thank you!!
So the future will be to move from the stage where the code is the documentation to the stage where the documentation is the code.
This is similar to defining a loss function or constraints in Machine Learning. Interesting idea.
Hi Dave,
I wanted to share an observation:
I was trying to modify a blog template written in Javascript and using the Astro framework. I'm not a JS person so I used the Cursor IDE to do the modifications. Interestingly, because Cursor is agentic, I could see that it would execute the code sometimes to see whether the output in the terminal is correct and how I expect it to be. As if using the terminal, it would do the acceptance testing.
There was another case, however, where the agent was so fixated on modifying the code in one way that it didn't even consider a much easier way I found on the web. Hence, it was stuck in an endless loop of modification and testing.
I think it would be nice to get your thoughts on the agent workflow where the AI not only creates the code and tests, but also it runs the code and verifies the output and makes modifications automatically.
That sounds miserable, if I'm honest. I really hope things don't go that way.
Everyone will have new ideas given the rise of the current AI cycle. Individuals tend to think and extrapolate linearly and miss, but I do think it's a good idea to keep coming up with ideas. If enough people share enough ideas, those linear extrapolates can get us in the right direction.
Bro, no programmer writes unit tests for fun. It's always a process adopted so people can get higher ratings on their performance reviews.
Specifying behavior by example is... Not so easy, and especially with the parasitic nature of LLMs I think it will be hard to produce anything novel with this approach.
From a logical a priori view, it is all but impossible to provide anything like a deterministic guarantee from behaviour examples, as something unrelated to the specific behaviour being tested could break expectations yet still be within the presumed definition of the test.
I find specific guarantees to be a far more reliable way. Let's take an RPC as an example: you could guarantee that given input combinations will result in X behaviour and no other, that a change of one variable will result in variance Y, et cetera.
If the guarantees are solid and they were created to requirements, you can guarantee the work will meet the requirements.
It's a nitpicky distinction I'll admit, but one makes specific testable claims, while the other takes a guess and leaves the rest in the Lord's hands 😂
So far LLMs (or I should say the ChatGPT interface) have been very useful for learning concepts new to me and for helping make my own thoughts more concrete and fleshed out so that I can understand and communicate them better. I can't speak for actually using it in engineering, but we'll see!
Yet lots of teams already do a version of this with human programmers being guided by specifications like this. It's called BDD (or sometimes ATDD) and it works fine, particularly for things with a "novel approach". One of my teams built one of the world's highest performance financial exchanges in exactly this way.
Basically the BDD scenarios work as the requirements, defining "What" the system is meant to do in each situation that we can think of. This is not really any different to how any software is built, after all how can you write the code, if you don't know what you want it to do?
@@ContinuousDelivery Sure, but there you have humans talking to each other, developing a common vocabulary, refining examples and following rules. The tests support (or drive, if you prefer) development, but they do not really define the product - they define what is definitely *not* the product. Without a reasonable agent making choices within the remaining space, I have little confidence in the results. It might be better than AI code generated *without* acceptance tests, but that's saying almost nothing.
Aren't NFRs the big issue with this? Like there are possibly hundreds of ways to solve for any given set of requirements, but you can't tell an AI that its algorithm is O(n^2) and it should use another approach that does O(n) instead.
Just when we got an agreed, settled DevOps software lifecycle, it is going to have to change. But will there ever be an agreed, settled AI-integrated lifecycle? Or will AI itself take it out of our hands before it ever gets agreed?