
Beyond Vibe Coding: Building Software With AI as a Collaborator

TechnoQuiz started as a reader activity for Technodabbler. It became a case study in product judgment, audits, mistakes, and disciplined AI-assisted development.

The debate around AI coding is often framed in extremes. Some developers accept generated code too easily, while others dismiss the tools entirely. TechnoQuiz is an exploration of what exists between those positions. Built as an activity for Technodabbler's readers, it became a practical test of whether AI agents could be used inside a process that stayed maintainable as the project grew more complex.

TechnoQuiz was shaped by standards, context, small iterations, tests, audits, and different LLMs challenging each other’s work. That process exposed both the strengths and the limits of coding agents. The sections that follow trace how the project evolved, where human oversight mattered most, and what it suggests about using AI as a collaborator rather than a replacement.

Why Build a Quiz at All?

TechnoQuiz started with a goal: create a small activity that gave readers another way to engage with the site, revisit older material, and test what they had picked up along the way. The idea wasn't completely new, as this author had spent many early years writing quizzes for role-playing games.

Above all else, the quiz had to feel connected to Technodabbler. The questions needed to come from the site’s technology themes and source articles, and the leaderboard needed to reflect the reader community. Previous quiz projects had also shown that tooling matters as much as the quiz itself. High-quality questions take effort, and the admin workflow needed to make production easier without blindly publishing generated content.

That is where the AI coding experiment became interesting. Previous experiments with agentic coding tools, including Copilot, had shown real potential. They also showed a clear risk: left unchecked, these tools could produce a mound of code that worked in the moment but became difficult to maintain. TechnoQuiz became an opportunity to test the earlier five rules for fixing vibe coding and see whether an AI agent could become part of a more deliberate collaboration.

From Prototype to Product

The project started with two working documents: an AI coding standards file and a context file. One defined how the agent should write code, with an emphasis on maintainability, documentation, testing, and explicit error handling. The other tracked the evolving specification, including the stack, data model, implementation phases, and security requirements. Together, they set expectations for the agent and gave the project direction.

# AI Coding Standards - Universal Best Practices

HOW TO USE THIS FILE:
> These are universal coding standards, best practices, and requirements that the AI should follow when writing code.

AI ROLE PROMPT
Act as an expert senior software engineer and technical documentation specialist.
Your task is to write code and documentation that strictly follows the coding standards, best practices, and instructions provided below.
Be concise, clear, secure, and professional in your code. Prioritize maintainability, clarity, and robustness over brevity or cleverness.
Now, apply the following coding instructions to all your outputs:

1. General Coding Standards

- Always use the latest stable version of programming languages and libraries.
- Follow best practices for the language in use.
- Prioritize readability, maintainability, security, and robustness.
- Handle errors explicitly and comprehensively.
- Use constants instead of hard-coded "magic values".
- Write modular, clean, and logically structured code.

Start of the ai-standard.md file in the project.

That setup paid off immediately. The first prototype came together in about fifteen minutes and proved the core idea: a timed quiz with random questions, one page at a time, and a score at the end. It was simple, but it made the project tangible.

The main screen of the quiz prototype, built with Bootstrap.

The first important correction came in the data model. The early version stored answers as fixed A, B, C, and D choices. That made the quiz feel very static, because the order of choices never changed. Once a player had seen a question, the quiz lost replay value and became easier in the least interesting way. The fix was to store one correct answer and three incorrect answers, then shuffle the presentation at runtime.

In this version of the quiz, answers are hard-coded as A, B, C, and D and always appear in the same order.
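The corrected data model can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `Question` class, its field names, and the helper functions are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Question:
    """One quiz question: a single correct answer plus three distractors."""
    prompt: str
    correct_answer: str
    incorrect_answers: Tuple[str, str, str]


def present_choices(question: Question,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Return the four choices in a fresh random order for one presentation."""
    rng = rng or random.Random()
    choices = [question.correct_answer, *question.incorrect_answers]
    rng.shuffle(choices)  # order is decided at runtime, never stored
    return choices


def is_correct(question: Question, selected: str) -> bool:
    """Grade by answer text, so the on-screen position never matters."""
    return selected == question.correct_answer
```

Because grading compares the selected text against the stored correct answer, the position of a choice on screen carries no meaning, and memorizing "the answer is C" stops working.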

Another early mistake was treating the nickname of the player as the unique identifier. That choice worked for a proof of concept, but it broke down as soon as features were added. Once Ghost authentication was added, the model had to shift toward stable player records with IDs, while nicknames became editable display data. That was one clear moment where the AI had produced code that worked initially but did not rest on best practices.

The TechnoQuiz leaderboard, where nicknames matter.
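A rough sketch of that shift, assuming an in-memory store instead of the project's real database models. The `Player` and `ScoreBook` names, and the use of a random UUID as the stable ID, are hypothetical choices for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Player:
    """A player record: the ID is permanent, the nickname is display data."""
    nickname: str
    player_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ScoreBook:
    """Keeps scores keyed by player ID so a rename never orphans them."""

    def __init__(self) -> None:
        self._players: Dict[str, Player] = {}
        self._best_scores: Dict[str, int] = {}

    def register(self, nickname: str) -> Player:
        player = Player(nickname=nickname)
        self._players[player.player_id] = player
        return player

    def rename(self, player_id: str, new_nickname: str) -> None:
        self._players[player_id].nickname = new_nickname

    def record_score(self, player_id: str, score: int) -> None:
        previous = self._best_scores.get(player_id, 0)
        self._best_scores[player_id] = max(previous, score)

    def standings(self) -> List[Tuple[str, int]]:
        """Return (current nickname, best score) pairs, highest score first."""
        ranked = sorted(self._best_scores.items(),
                        key=lambda item: item[1], reverse=True)
        return [(self._players[pid].nickname, score) for pid, score in ranked]
```

With this shape, two players can share a nickname without colliding, and a renamed player keeps their history.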

Unit testing also entered the workflow earlier than in past AI-assisted projects. It gave the build an automated way to validate changes as the quiz flow, leaderboard behavior, and session handling became more complex. Codex 5.3 turned out to be noticeably better at writing pytest coverage than earlier AI models had been. The tests did not replace review, but they made each iteration faster.

Where the Collaboration Started to Matter

As mentioned, the first serious complexity arrived with nicknames. A prototype can treat a nickname as a harmless field. However, the moment readers are allowed to type their own name, some of them will use offensive language, try to evade filters, impersonate other users, or probe the field with malicious input such as SQL injection attempts. That meant TechnoQuiz needed a moderation and validation model.
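For the SQL injection part of that threat list, the standard defense is to pass user input as a bound parameter rather than building SQL strings. The sketch below uses Python's built-in `sqlite3` module as a stand-in; the article does not say which database or ORM TechnoQuiz uses, and the table and function names are assumptions.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a minimal players table for the example."""
    conn.execute(
        "CREATE TABLE players (id INTEGER PRIMARY KEY, nickname TEXT NOT NULL)"
    )


def add_player(conn: sqlite3.Connection, nickname: str) -> int:
    """Insert a nickname as a bound parameter; it is never parsed as SQL."""
    cursor = conn.execute(
        "INSERT INTO players (nickname) VALUES (?)", (nickname,)
    )
    conn.commit()
    return cursor.lastrowid


def find_by_nickname(conn: sqlite3.Connection, nickname: str) -> list:
    """Look up players by exact nickname, again using a placeholder."""
    return conn.execute(
        "SELECT id, nickname FROM players WHERE nickname = ?", (nickname,)
    ).fetchall()
```

A hostile nickname is then stored and matched as plain text. That closes the injection path, but it does nothing about offensive content, which is a separate problem.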

Technodabbler's build of Cookie Quest is another example of AI use for coding.

The first instinct was to stay inside the AI tooling mindset and use an LLM for moderation. The agent helped write prompts meant to classify nicknames and reject offensive inputs. In practice, that turned out to be much harder than it looked. Writing a moderation prompt that is both reliable and difficult to trick is a different problem from generating code. The LLM proved easy to trick, and its output was unpredictable. The project eventually moved to a simpler Python profanity library instead.

You are a strict content moderation classifier.

Your task is to evaluate a single nickname and determine whether it could reasonably be interpreted as offensive, inappropriate, harmful, or unsafe in a public setting.

You must err on the side of caution.

Consider:
- Slurs (explicit or disguised)
- Hate speech
- Harassment
- Sexual content (explicit, implicit, or coded)
- Profanity (including stylized spellings)
- Violence or threats
- Cultural, racial, gender, religious, or disability mockery
- Offensive wordplay, leetspeak, phonetic tricks, hidden meanings
- Double meanings or slang that may be inappropriate

If there is ANY reasonable possibility of offense, output:

TRUE

If it is clearly safe and neutral, output:

FALSE

Output only TRUE or FALSE. Do not explain.

The prompt that attempted to moderate nicknames. Unfortunately, it was pretty easy to trick.
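A deterministic check is easier to reason about than a prompt. The article does not name the profanity library the project adopted, so the sketch below substitutes a tiny placeholder blocklist. The length limit, allowed characters, substitution table, and function names are all assumptions for illustration.

```python
import re
import unicodedata
from typing import Tuple

MAX_NICKNAME_LENGTH = 20  # assumed limit
ALLOWED_PATTERN = re.compile(r"[A-Za-z0-9 _\-]+")
# Placeholder terms; a real deployment would rely on a maintained word list.
BLOCKED_TERMS = frozenset({"badword", "rudeword"})
# Undo a few common digit-for-letter substitutions before matching.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e",
                          "4": "a", "5": "s", "7": "t"})


def normalize(nickname: str) -> str:
    """Fold case, map look-alike digits to letters, and drop non-letters."""
    folded = unicodedata.normalize("NFKC", nickname).lower().translate(LEET_MAP)
    return re.sub(r"[^a-z]", "", folded)


def validate_nickname(nickname: str) -> Tuple[bool, str]:
    """Return (accepted, reason). The same input always gives the same result."""
    candidate = nickname.strip()
    if not candidate:
        return False, "empty"
    if len(candidate) > MAX_NICKNAME_LENGTH:
        return False, "too long"
    if not ALLOWED_PATTERN.fullmatch(candidate):
        return False, "invalid characters"
    normalized = normalize(candidate)
    if any(term in normalized for term in BLOCKED_TERMS):
        return False, "blocked term"
    return True, "ok"
```

Substring matching like this will occasionally reject innocent names that happen to contain a blocked term, which is one reason a human-controlled fallback remains useful.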

That moderation problem also exposed why a nickname should not be a permanent identifier: it is untrusted user input. It can be offensive, temporary, evasive, or malicious. The data model was shifted toward player records with stable IDs.

TechnoQuiz allows players to change their nickname. Admins can also shadow ban a player.

The same area led to one of the more unexpected product decisions in the project: shadow banning. As a fail-safe, TechnoQuiz added a mode where abusive players could continue to play and still see themselves on the leaderboard, while remaining invisible to other readers.
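The visibility rule behind shadow banning fits in one filter. This is a sketch under assumed names (`LeaderboardEntry`, `visible_leaderboard`, a `shadow_banned` flag), not the project's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LeaderboardEntry:
    player_id: str
    nickname: str
    score: int
    shadow_banned: bool = False


def visible_leaderboard(entries: Iterable[LeaderboardEntry],
                        viewer_id: Optional[str] = None) -> List[LeaderboardEntry]:
    """Hide shadow-banned entries from everyone except the banned player."""
    visible = [
        entry for entry in entries
        if not entry.shadow_banned or entry.player_id == viewer_id
    ]
    return sorted(visible, key=lambda entry: entry.score, reverse=True)
```

The banned player sees an ordinary leaderboard that includes their own entry, so nothing signals that a new account is needed, while every other viewer gets the filtered list.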

The second major challenge was reusing the blog's authentication to strengthen the community aspect. Because the quiz depended on JWTs coming from an external Ghost-based flow, the agent was working inside a system it did not fully control. That made the implementation more challenging for the LLM. It could see the Flask app, but it could not directly reason from full ownership of the surrounding authentication stack. As a result, some of its early suggestions leaned toward bypass strategies. Even though they would have made development easier for the LLM, all of those suggestions would have quietly created security openings. They were rejected.

A page on Technodabbler blog picks up the JWT from the current session and forwards it to TechnoQuiz.

Instead, the collaboration had to become more directed. The agent was guided through the parts of the stack it was less familiar with: what the Ghost callback was actually returning, what JWKS could and could not prove, where labels and permissions came from, and why validation could not simply be skipped for convenience. This illustrated an important limitation of coding agents: they move quickly when they understand the complete system, but when the work depends on an external platform, the human has to supply a clear map.
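One way to picture "validation that cannot be skipped" is a claim check that fails closed. The sketch assumes the token's signature has already been verified against the publisher's JWKS by a JWT library, which is not shown here. The claim names `iss`, `aud`, `exp`, and `sub` are the registered JWT claims; the expected values, the leeway, and the `TokenRejected` and `check_claims` names are hypothetical, since the article does not list what the Ghost flow actually returns.

```python
import time
from typing import Any, Mapping, Optional


class TokenRejected(Exception):
    """Raised whenever a claim is missing or does not match expectations."""


def check_claims(claims: Mapping[str, Any],
                 expected_issuer: str,
                 expected_audience: str,
                 now: Optional[float] = None,
                 leeway_seconds: int = 30) -> str:
    """Validate already-verified claims and return the subject, or raise."""
    now = time.time() if now is None else now

    if claims.get("iss") != expected_issuer:
        raise TokenRejected("unexpected issuer")

    # "aud" may be a single string or a list of strings.
    audience = claims.get("aud")
    audiences = [audience] if isinstance(audience, str) else list(audience or [])
    if expected_audience not in audiences:
        raise TokenRejected("unexpected audience")

    expires_at = claims.get("exp")
    if not isinstance(expires_at, (int, float)) or expires_at + leeway_seconds < now:
        raise TokenRejected("missing or expired 'exp'")

    subject = claims.get("sub")
    if not isinstance(subject, str) or not subject:
        raise TokenRejected("missing subject")
    return subject
```

There is no flag to turn the checks off: every path either returns a subject or raises, which is the opposite of the bypass strategies the agent first proposed.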

Auditing the AI’s Work

Speed was never the goal of the project. Development of TechnoQuiz initially moved quickly, but then slowed down repeatedly for reviews. Feature work was often paused so the codebase could be audited, much like a pull request. This counters the tendency of AI-assisted coding to accumulate weak assumptions and then build on them.

Most of the implementation work was done with Codex 5.3. It was effective at producing routes, models, migrations, templates, and especially tests. Claude Sonnet 4.6 was then brought in to audit the code, challenge the security model, and look for weaknesses that were easy to miss while feature work was still moving.

Security Issues

1. Hardcoded fallback `SECRET_KEY` — High

File: config/__init__.py: line 26

SECRET_KEY = os.environ.get('SECRET_KEY') or 'dev-secret-key-change-in-production'

If `SECRET_KEY` is unset in production, Flask silently falls back to a publicly known string, making all session cookies forgeable by anyone who knows the default. There is no guard that refuses to start when this default is used with `DEBUG=False`.

Recommendation: Add a startup assertion in `ProductionConfig` or in `create_app` that raises `RuntimeError` if the key matches the known default (or is shorter than a safe minimum length).

2. Auth callback leaks unnecessary sensitive data to the browser — Low/Medium

File: app/routes/auth.py: line 59

return jsonify({
    'ok': True,
    'ghost_id': identity.ghost_id,
    'email': identity.email,
    'name': identity.name,
    'labels': identity.labels,
    'is_app_admin': identity.is_app_admin,
})

The JavaScript in app/templates/auth/callback.html only checks "payload.ok", all other fields are unused by the client. The email address, Ghost ID, Ghost labels, and admin flag are unnecessarily exposed to the browser's JavaScript context, widening the impact of any future XSS.

Recommendation: Trim the response to {'ok': True} only.

Example of a security issue found by Claude Sonnet 4.6.
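The first recommendation in that audit could be implemented roughly as follows. This is a sketch rather than the fix that was actually merged: the `resolve_secret_key` helper and the 32-character minimum are assumptions, while the default string is the one quoted in the finding.

```python
from typing import Mapping

DEV_DEFAULT_SECRET_KEY = "dev-secret-key-change-in-production"
MIN_SECRET_KEY_LENGTH = 32  # assumed minimum; choose to match your own policy


def resolve_secret_key(environ: Mapping[str, str], debug: bool) -> str:
    """Return a usable secret key, refusing to start with an unsafe one."""
    key = environ.get("SECRET_KEY")

    if debug:
        # Local development may fall back to the well-known default.
        return key or DEV_DEFAULT_SECRET_KEY

    if not key:
        raise RuntimeError("SECRET_KEY must be set when DEBUG is off")
    if key == DEV_DEFAULT_SECRET_KEY:
        raise RuntimeError("SECRET_KEY is still the development default")
    if len(key) < MIN_SECRET_KEY_LENGTH:
        raise RuntimeError("SECRET_KEY is shorter than the required minimum")
    return key
```

Calling a helper like this with `os.environ` during application startup turns a silent misconfiguration into an immediate, visible failure.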

Using a second AI model changed the review process, automating the search for unsafe, fragile, or conceptually incorrect code. Session handling, origin checks, callback behavior, and deployment settings were all reviewed this way. These audits removed risky assumptions, and pushed the project toward code that could survive exposure to real users.

That may be the most reusable lesson from the project. AI coding works best when implementation and review are treated as separate jobs. One model can help build quickly and another can help critique. The human still has to decide which criticisms matter, which fixes are sound, and when the system is ready to move forward. The process can then be supplemented with another round of human auditing.

Where Do the Five Rules Fit In?

TechnoQuiz turned the five rules from theory into a useful test with AI agents. Start with the desired outcome, because a project only stays coherent when the product goal is clear. In this case, the goal was to build a reader activity that fit Technodabbler. Every change can then be judged based on that goal: did it improve the quiz, or did it simply add complexity? Small iterations made it easier to catch drift early, especially when the agents produced something plausible that still rested on the wrong idea.

The Five Rules provide a starting point for AI use in coding.

The other rules held up just as well. Only accept code that can be easily read, because some of the most expensive mistakes were reasonable-looking choices built on weak assumptions. Keep the coding standards file and context document updated, as they keep the project aligned as requirements change. Protect the work in a source repository, because fast-moving AI workflows can break things.

As mentioned in the beginning, software development has largely split into two loud camps: those who accept generated code more or less as-is, and those who reject AI coding outright. Both positions skip the harder question: how should these tools be used well? TechnoQuiz suggests that the answer is neither blind trust nor total refusal, but a process where AI accelerates implementation, different models are used to challenge each other, and humans keep responsibility for intent, review, and correction.

Have AI coding tools changed the way projects are planned and reviewed, or are they still being treated too much like faster autocomplete? Share your experience in the comments.

For readers who want to see where this experiment ended up, TechnoQuiz is now live on Technodabbler. It is the clearest way to see how this collaboration model translated into a real reader-facing feature.
