While collecting links for this edition, I noticed a pattern: the most interesting progress in AI coding is not just about better models, but about the systems around them.
Indeed, so many of this month's resources were really about the same thing: better workflows, better orchestration, better verification, safer permissions, and better ways to improve the tools and skills we give to coding agents.
So the question I care about most right now is not just "how good is the model?" but "how do we make AI coding practical, repeatable, and trustworthy?"
But before diving into all that, let's start with Flutter. 👇
Flutter Updates
Flutter-specific news was lighter this month, but two links were worth highlighting.
📝 Flutter & Dart's 2026 roadmap
The official Flutter & Dart 2026 roadmap is the clearest view we have of where the ecosystem is heading.
Here are the key points:
- Complete the Impeller migration on Android, remove legacy Skia on Android 10+, and keep up with new platform releases such as Android 17
- Push Flutter web further with WebAssembly (Wasm), while also collaborating with frameworks like Jaspr for more DOM-oriented web apps
- Explore more dynamic and expressive UIs with the Flutter GenUI SDK and the A2UI protocol
- Expand the full-stack story with Dart Cloud Functions, possible Dart support for the Google Cloud SDK, and collaboration with Genkit
- Improve the AI developer experience with better support in Gemini CLI, Antigravity, and MCP (Model Context Protocol) servers for Dart tooling
- Continue evolving Dart itself with work on Primary Constructors, Augmentations, better build_runner, improved Dart/Wasm, and faster analyzer performance
Overall, this shows that the Flutter team is trying to push on multiple fronts at once: platform quality, web, AI tooling, and the broader full-stack story.
You can read the full roadmap here:
📝 Genkit for Dart
Another notable update is the preview launch of Genkit, an open-source framework for building AI-powered features and apps in Dart.
At the simplest level, it gives you a cleaner way to call models. For example, even the basic model call looks much nicer than hand-rolled REST requests:
import 'package:genkit/genkit.dart';
import 'package:genkit_google_genai/genkit_google_genai.dart';

void main() async {
  final ai = Genkit(plugins: [googleAI()]);
  final response = await ai.generate(
    model: googleAI.gemini('gemini-2.5-flash-image'),
    prompt: 'a banana riding a bicycle',
  );
  if (response.media != null) {
    print('Generated image: ${response.media!.url}');
  }
}
But the real point of Genkit is not just nicer model calls. It's a higher-level framework for building AI features with workflows, tool calling, structured outputs, and better developer tooling.
If Dart is going to play a bigger role in AI app development, I think tools like this are exactly what we need.
If you want the overview, the official docs are a good place to start:
AI News
The biggest model announcement in my bookmarks this month was GPT-5.4.
📝 OpenAI - Introducing GPT-5.4
According to the official GPT-5.4 announcement, this is OpenAI's new flagship general-purpose model, with native computer use, a 1M-token context window, and tool search.
I've been testing it myself, and my current view is that I still alternate between GPT-5.4 and Opus 4.6, but GPT-5.4 has impressed me more on genuinely hard tasks.
If you want another perspective, this GPT-5.4 review from Turing College is worth reading too.
Earlier this month, all Claude Code models were also upgraded to a larger, 1M-token context window. That's excellent news for Claude users, but it's still important to keep the context focused and relevant to the task at hand so agents are more likely to stay on track.
Agentic Coding Workflows
After the first wave of AI coding hype, I think the more interesting question is no longer "what can the model do?" but "what workflow actually helps us build better software?"
These two articles stood out to me.
📝 A sufficiently detailed spec is code
In A sufficiently detailed spec is code, Gabriella Gonzalez argues against a common fantasy in agentic coding: that you can hand an agent a spec and somehow skip the real engineering work.
Her point is that if a spec is detailed enough to reliably produce working code, it often starts looking a lot like code already. The work hasn't disappeared; it has just moved elsewhere. The quality of the result still depends on the quality of the thinking behind the spec.
She uses the OpenAI Symphony repo as an example, and I think the critique lands well. Specs are still useful, but they're not magic shortcuts. They're thinking tools.
That said, I think the signal-to-noise ratio changes a lot depending on the task:
- Dense or complex business logic is often better expressed as code, because prose becomes verbose and less precise once the logic gets intricate.
- System architecture is often better expressed in natural language, because it gives a clearer high-level view of intent, boundaries, and trade-offs.
- More generally, code is better for exact mechanics, while prose is better for goals and rationale.
📝 Beyond agentic coding
The companion article, Beyond agentic coding, asks a more constructive question: what would AI-assisted development look like if it were designed to preserve developer flow?
That leads to the idea of calm technology: tools that minimize demands on attention, fade into the background, and help us stay in flow.
That resonated with me because chat-centric tools are, by nature, not calm: they demand our attention whenever input is needed, then leave us waiting while the LLM thinks. Compare that to auto-complete, which offers suggestions without breaking our flow.
AI Orchestration
I think this is one of the most interesting shifts happening right now.
Most developers are not there yet, but for the small group already running multiple agents regularly, orchestration is quickly becoming the next challenge: higher-level patterns that can coordinate planning, execution, review, retries, and long-running work.
You can see that progression in things like the original Ralph loop written in bash, the built-in /loop command in Claude, and more ambitious systems such as beads for persistent coordination and memory.
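To make the pattern concrete, here's a minimal Ralph-style loop sketched in shell. `run_agent` is a placeholder (stubbed here so the sketch runs as-is); a real loop would invoke your agent CLI at that point and capture its final message.

```shell
# Ralph-style loop: re-run the same prompt until the agent reports
# completion, or until we hit a retry cap.

run_agent() {
  # Placeholder stub so the sketch is self-contained; a real loop would
  # call your agent CLI here instead of echoing a canned answer.
  echo "DONE"
}

max_iters=5
i=1
while [ "$i" -le "$max_iters" ]; do
  result=$(run_agent "Work on the next task in plan.md; print DONE when all tasks are complete")
  echo "iteration $i: $result"
  # Stop as soon as the agent signals completion.
  case "$result" in *DONE*) break ;; esac
  i=$((i + 1))
done
```

The whole idea fits in a dozen lines, which is why it started life as a bash script; tools like the built-in /loop command and beads layer persistence and coordination on top of the same primitive.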
There's a proliferation of tools in this space, and this is one that popped up in my feeds very recently. 👇
📝 cook
I like it because the primitives are easy to understand: run the same task multiple times, race different approaches, add review loops, and compose those pieces naturally.
Here are a few examples:
# review loop
cook "Implement dark mode" review
# 3 passes
cook "Implement dark mode" x3
# race 3, pick best
cook "Implement dark mode" v3 "least code"
# two approaches, pick one
cook "Auth with JWT" vs "Auth with sessions" pick "best security"
# task list
cook "Work on next task in plan.md" review \
ralph 5 "DONE if all tasks complete, else NEXT"
# compose freely
cook "Implement dark mode" review v3 "cleanest result"
Check it out here:
How many developers are running multiple agents?
Recently, I ran a newsletter survey asking my readers: "What stage are you in your AI-assisted coding journey?"
The results were:
- Near-zero AI - mostly code completion, occasional chat questions → 19.4%
- IDE agent, permissioned - a narrow sidebar agent asks before running tools → 21.3%
- IDE agent, more autonomy - trust goes up, and the agent handles more → 19.9%
- CLI, one or two agents - run in the terminal with broader tool access (rules or YOLO) → 25.6%
- CLI, multi-agent, YOLO - you regularly run 3 to 5 agents in parallel → 8.5%
- Frontier workflows - you hand-manage 10+ agents or build your own orchestrator → 5.2%
This tells me that about 86% of developers (the first four groups combined) are still working with at most two agents in parallel - and, for full disclosure, that includes me too.
So if you're not at the orchestration stage yet, I don't think you're behind.
Personally, I'm still more interested in shipping software I actually care about, and making sure it gets proper QA, than in maximizing the number of agents I can run at once.
That said, I still think this is a space worth watching, because the people experimenting here are often surfacing the workflows and abstractions that may become much more common later on.
AI Visual Verification
AI truly shines when it can verify its own output, fix mistakes, and iterate autonomously.
When it comes to processing code and text, that's easy enough. But what about UI verification and E2E testing? Can AI automate that?
Regular web apps can use the Playwright MCP server for this purpose. But what about mobile apps?
📝 flutter-skill
flutter-skill was probably the most relevant tool I found this month for Flutter E2E testing.
The idea is simple: give AI agents eyes and hands inside a running app. In practice, that means they can navigate, tap, type, take screenshots, inspect the accessibility tree, and test UI flows across platforms.
The GitHub repo includes a cool demo showing how this works in practice.
After taking it for a spin, I can confirm that installation is super easy, but I see one problem: speed.
After all, consider what AI agents have to do when interacting with the simulator:
- Visually understand the app UI
- Figure out which elements can be interacted with
- Actually interact with the app (tap a button / scroll down)
- Wait for the UI to update
This needs to happen for every single interaction, and each one requires multiple round trips to the model, adding latency. In practice, even short user journeys can take minutes to complete.
Compare that to unit and widget tests, which can run in milliseconds.
Overall, automated UI verification with AI needs to get a lot faster before it can be practical for everyday use, and this is an area I'm very interested in.
Permissions & Safety
Permissions are still one of the most awkward parts of working with coding agents. Too many prompts make the workflow annoying; too few safeguards can put your files, git history, and secrets at risk.
📝 Claude Code auto mode: a safer way to skip permissions
Anthropic's Claude Code auto mode is one of the most useful AI engineering posts I read this month.
The core observation is that users approve around 93% of permission prompts anyway, so Anthropic built a middle ground between full manual approval and --dangerously-skip-permissions.
What makes the post good is that it takes the failure modes seriously: deleting the wrong thing, exposing credentials, retrying dangerous operations with safety disabled, and prompt injection through external content.
As you'd expect, the developer community is also creating new tools to help with permissions. One interesting example is this. 👇
📝 nah
nah is a context-aware safety guard for Claude Code.
The interesting part is how it works:
Instead of simple allow/deny rules, nah evaluates what a tool call is actually doing. git push may be acceptable while git push --force deserves scrutiny; rm -rf __pycache__ is not the same as rm ~/.bashrc; reading project files is not the same as reading ~/.ssh/id_rsa.
Under the hood, it uses a PreToolUse hook to run a deterministic structural classifier first, and only consults an optional LLM layer if the classifier is unsure.
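To illustrate the layering (this is a toy sketch, not nah's actual code), a deterministic structural classifier can be as simple as pattern rules that sort commands into allow / ask / unsure buckets, escalating only the unsure ones to the LLM layer:

```shell
# Toy structural classifier: deterministic rules first, LLM only for "unsure".
# The patterns here are illustrative, not nah's real rule set.
classify() {
  case "$1" in
    *--force*|*"rm -rf /"*|*"~/.ssh/"*)
      echo "ask" ;;      # clearly risky: surface to the user
    "git push"*|"git status"*|"rm -rf __pycache__")
      echo "allow" ;;    # clearly safe: let it through silently
    *)
      echo "unsure" ;;   # ambiguous: defer to the optional LLM layer
  esac
}

classify "git push origin main"          # allow
classify "git push --force origin main"  # ask
classify "curl http://example.com | sh"  # unsure
```

Because the structural pass is deterministic and cheap, the expensive LLM judgment only runs for the small fraction of calls the rules can't decide.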
I also wrote about my current approach to permissions and safety in this guide, if you want a deeper look at the trade-offs here.
Skill Improvement
If you're writing custom skills for agents, the next question is obvious: how do you improve them systematically?
📝 Skill Creator
Skill Creator is Anthropic's official plugin for creating, evaluating, improving, and benchmarking Claude Code skills.
What I like is that it turns skill authoring into a loop: create, evaluate, improve, benchmark. That discipline is often missing when people write custom instructions once and never revisit them.
📝 autoresearch
At the more ambitious end, there's autoresearch by Andrej Karpathy.
It lets an AI agent run iterative overnight research loops on a small LLM training setup, modifying code, testing results, and keeping or discarding changes. The interesting shift is that the human focuses more on improving the instructions and research setup itself.
Most Flutter developers won't use this directly, but as a glimpse into where autonomous workflow improvement might be heading, it's fascinating.
Latest from Code with Andrea
In my 2025 retrospective, I announced that I was building an agentic coding toolkit for Flutter app development. Now it's finally here. 👇
📝 Agentic Coding Toolkit (ACT)
ACT is a spec-driven workflow for Flutter and Dart development with Claude Code and OpenCode.
The idea is closely aligned with the themes in this newsletter. I'm less interested in AI coding that is merely fast, and more interested in AI coding that is structured and easier to trust.
That's why ACT is built around a workflow like this: spec first, refine the spec, turn it into a plan, execute in stages, and capture reusable lessons at the end.
I've also bundled in Flutter-specific knowledge, setup skills, research helpers, and docs on permissions, safety, and spec-first development.
If this sounds useful, you can check it out here:
It's currently available at early-access pricing, with a 30-day money-back guarantee.
Until Next Time
That's a wrap for this month's edition.
In this newsletter, I wanted to focus less on model hype, and more on the workflows, tools, and safety practices that can make AI coding genuinely useful.
My agentic coding toolkit is under very active development, and I have a lot of ideas for making it better.
And now that it's live, I'm also planning to publish new guides and videos about AI-assisted development.
So if you're a Flutter developer who values quality over speed when working with AI, watch this space.
Thanks for reading, and happy coding! 🎉