I built a sliding puzzle game in 48 hours to stress-test AI as a design partner.
Here's what actually happened.
The goal was never to ship something polished. I wanted a low-risk sandbox: a fast, contained experiment where I could test AI as a creative collaborator without the pressure of a real product on the line. To be clear, my goal wasn't to become a game designer. It was to understand how generative AI could support design judgment, speed up iteration, and sharpen product thinking, and to use that understanding to lead design teams through an AI-enabled process more effectively.
A trivial sliding puzzle felt right. Ten images, ten fun facts, a daily loop, some difficulty progression. Simple enough to build in 48 hours. Complex enough to surface real design decisions. This is exactly the kind of sandbox I'd want teams to use to build AI fluency.
What I found was messier, more interesting, and more honest than I expected.

I framed the brief clearly: one theme, a daily puzzle, replay mechanics. The experience I had in mind was specific. I wanted it to feel nostalgic, like a childhood puzzle pulled out on a rainy afternoon. Quiet satisfaction. Something you'd return to not because it demands your attention, but because it earns it.
What I didn't anticipate was how nonlinear the process would actually be. The chat history I kept with ChatGPT tells a very different story to the clean experiment I'd described to myself afterwards. I had over thirty back-and-forths. The concept shifted from user-uploaded photos to curated image packs. Grid sizes started at 9×9 and kept shrinking until the largest options were dropped entirely. The theme changed. At one point I just typed "change of plan", and deliberately narrowed everything down to a single theme, ten images, and one daily puzzle. Not because I'd run out of ideas, but because I realised that proving the experience mattered more than building the product. That decision shaped everything that followed.
I leaned on ChatGPT throughout as a co-brainstorming partner: scoping what was realistic, mapping the daily loop, exploring difficulty and reward mechanics. For assets, I generated images and curated them based on clarity, contrast, and how well they'd work when split into tiles. The split felt clean in theory: AI would accelerate the output, my human judgment would decide what landed.

This is where it got interesting. Not because the division was clean, but because of where it broke down.
AI contributed:
• Rotation logic and initial difficulty structure
• Personalisation concepts and puzzle layout variations
• Fast iteration on fun facts and mechanics
• An unprompted accessibility feature: numbered tiles for visually impaired users. I hadn't asked for it, but it made me think harder about inclusive design than I would have otherwise
I controlled:
• The core experience intent: nostalgic, calm, discovery-led
• Scope and constraints: what stayed out mattered as much as what went in
• Theme, asset curation, and the cultural references that shaped the aesthetic
• Every decision tied to habit formation, emotional tone, and trust
Where AI genuinely helped:
• Brainstorming mechanics and flows quickly
• Rapidly generating variations to pressure-test ideas
• Testing multiple difficulty models in minutes rather than days
Where it was wrong or risky:
• Overly complex grids (7×7 and 8×8) that broke the habit loop by being too hard
• A bland, emoji-heavy UI that signalled noise and dated design instead of calm and discovery
• Multi-theme systems that would have diluted the MVP before it proved anything
• Fun facts that needed human checking (AI's confidence is not the same as accuracy)
The clearest lesson: AI isn't inherently trustworthy. Predictability, transparency, and curation still matter. Every hallucination became a design lesson, but only because I was paying attention.
The first versions of Tiletopia were littered with WhatsApp-style emojis. Used in bullet points, in headings, just everywhere! They felt immediately wrong. The game was meant to evoke calmness and discovery; the feeling of completing something beautiful. The emojis said something else entirely: loud, chatty, and frantic. They pulled attention away from the puzzles rather than drawing you in.
I pushed back and asked for something more modern and timeless. What came back felt like a blank notepad: clinical, cold, no personality. So I pushed further, this time with something specific. I'd been to Portugal, where I came across Azulejo tiles: geometric, handcrafted, quietly beautiful. For a game called Tiletopia, it felt almost too obvious to use that as inspiration. The aesthetic worked, not because AI discovered it, but because I brought the reference from somewhere real. The AI didn't have that memory. I did.
This came up again when AI recommended Sports as the strongest MVP theme due to clear silhouettes, action shots, and easy sourcing. Objectively sensible, I agree. But having looked up various image collections, I'd already decided on World Landmarks. I'd thought carefully about what discovery feels like as an experience: unlocking somewhere real, learning something true, the satisfaction of a place revealed tile by tile. Sports couldn't carry that. The landmark choice wasn't just aesthetic preference; it shaped the emotional logic of the whole game, including the language: "Today's Discovery" instead of "Today's Puzzle". "Discovery Unlocked" instead of "Completed". Small changes, I know, but they came from a human decision about what the experience was actually for.
The difficulty curve told a similar story. AI suggested 7×7 and 8×8 grids for harder levels. Logically, it made sense. In practice, the first few people I gave it to simply gave up. What was meant to be a light daily challenge became effort in the wrong direction. The kind that stops habits forming. That failure led directly to the hint system idea: not something I'd planned, but a direct response to watching AI logic collide with how real people actually play (and get annoyed).
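The game's own code isn't part of this write-up, but a back-of-envelope sketch shows why those bigger grids were never going to feel "light". For an n×n sliding puzzle with one blank, a classic result is that exactly half of all tile permutations are reachable, so the state space is (n²)!/2. This is a hypothetical illustration, not code from Tiletopia:

```python
from math import factorial

def solvable_states(n: int) -> int:
    """Reachable configurations of an n*n sliding-tile puzzle
    (one blank): half of all (n*n)! tile permutations are solvable."""
    return factorial(n * n) // 2

# Compare a casual grid to the ones AI suggested for "harder levels".
for n in (3, 4, 7, 8):
    print(f"{n}x{n}: ~{solvable_states(n):.2e} states")
```

A 3×3 grid has roughly 1.8×10⁵ states; a 7×7 grid has around 10⁶². The jump isn't a gentle difficulty ramp, it's a different category of problem, which matches what happened when real players hit it.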
Sound was the most persistent struggle! The first audio felt like an alarm. When I asked for relaxation, it went so far the other way it felt like a meditation app for people already asleep. And when I eventually asked AI to choose a royalty-free calm track, it listed sources and told me to pick one myself. That's the honest shape of the boundary: AI could describe what I needed, but it couldn't make the judgment call about what actually felt right. Even now the music isn't quite right. But the failure produced something useful: making audio toggleable, giving users control rather than forcing them into whatever mood the game had decided on.
There was one moment where AI did something I hadn't asked for and it genuinely made me think differently. While building the game, it added numbered tiles unprompted, as an accessibility feature for visually impaired and colourblind users.
I want to be honest about how I frame this. The feature has not been tested against WCAG standards. I don't know if it actually helps those who need it, or whether it does so in the right way. Shipping an untested accessibility feature is not the same as building an accessible product. What AI did do was prompt me to think more seriously about something I'd deprioritised. Whether the execution holds up is a different and still-open question.
This is the tension I found the hardest to resolve: AI can open up thinking you wouldn't have done yourself, but it can also create a false sense that something important has been handled. Those are very different things, and as a leader and designer, the responsibility for knowing the difference sits with you, not the tool.

There were moments in this project where I genuinely couldn't tell where the AI's thinking ended and mine began. Choosing the music was one. Naming the states for completed and discovered puzzles was another. The first version of the collection screen appeared almost fully formed from Lovable, and because it looked like what I'd seen before, I didn't question it. I just accepted it.
It was only later, when I asked for a horizontal progress bar and blurred locked images, that I felt something shift. Those were specific, deliberate choices overriding defaults, and asserting a point of view. The blurred images in particular felt mine: a way of showing that something is there to be uncovered without spelling it out. It wasn't one big moment of reclaiming ownership. It happened gradually, through the decisions where I pushed back and the output changed because of my judgment, not the model's.
This is something I think every design leader, and every designer, needs to watch for. When AI generates something that looks right quickly, people stop questioning it. Things look good, the loop feels productive, decisions feel made, but if you're not careful, you are curating rather than creating. The question of what is genuinely yours versus the model's becomes harder to answer than it should be. As a leader, it's your responsibility to spot that drift before the work goes out to the wider team.
I'll be straight about the numbers. Of 20 users, only 3 came back the next day. Only 1 played consistently over 6 to 8 days. That's a weak retention signal for something built around a daily habit.
Some of this was a distribution problem: several people had saved the game as a WhatsApp link buried in their messages, not as an app on their home screen, which might have prompted them to play more often. The difference in daily pull is significant! But one user made a more pointed observation: "replaying the same puzzle at a harder difficulty felt like a cheat when the fun fact didn't change or show a deeper fact. I've already read it." The reward felt hollow.
I think both things are true. There's a distribution problem, and the loop itself needs some more work. I'm not going to frame thin retention as a success just because the project goal was process over product. It didn't land the way I'd hoped, and that's actually the more useful finding.
This didn't just test AI as a tool. It reshaped how I think about leading design in an AI-first environment. If I were leading or advising a design team starting out with AI today, here's what I'd do:
• I'd institutionalise AI as a co-pilot and design partner while preserving the judgment, checkpoints, and review processes needed to cover the whole creative journey: prompts, assumptions, and decisions made along the way, not just the final screens. Doing this experiment, I had to go back through my own chat history to understand what I'd actually decided versus what had just happened by default. Leaders need to build that kind of visibility into how their teams work with AI.
• Team structure will change. Senior designers become judgment curators. AI handles volume, iteration, speed, and low-risk ideation. I think this leads to smaller design teams with more AI leverage, not larger ones. The value shifts from output to oversight.
• Quality needs guardrails before work starts, not after. Without clear principles and constraints set upfront, AI will fill the space with whatever is statistically likely, not what is right for the product. I experienced this directly with the emoji UI, the over-complex grids, and the generic sound. The guardrails I didn't set out early cost me time later.
• Hiring will shift. I think the designers who will matter the most are those who can challenge AI output from a strong human-centred position, people who can bring real taste, real references, and real knowledge of users and design sensitivities. That's harder to screen for than portfolio craft, but it's what will actually determine quality.
• Trust is becoming more of a design responsibility. As AI realism increases through deepfakes, synthetic content, and AI-generated experiences, people are becoming more sceptical about what's real and what's manufactured. Designers will need to actively safeguard credibility and build trustworthy experiences, not just usable ones. That's a leadership concern, not just a craft one.
For me, though, one tension stands out above all: AI is very good at remixing what already exists, and its speed can easily be mistaken for originality. But without strong judgment and a genuine point of view that comes from real experience, designers and teams risk producing work that is fast, familiar, and quietly average. At scale, that's a serious problem, and it's one that design leaders, not AI tools, are responsible for preventing.
In practice, that means:
• Using AI to reduce low-level manual work, so designers can focus on strategy, principles, and orchestration.
• Protecting scope, simplicity, and user-centred decision-making (especially in MVPs).
• Reinforcing across the team that AI amplifies design decisions; it doesn't replace them or real user insight.
• Building AI competency through short experiments like this one: small, constrained, and with humans owning quality and direction.
This wasn't about building a sliding puzzle. It was about working out how I'd lead design responsibly in a world where AI is already part of the process, and what changes when it is.
The honest summary: it was messier than I expected, more nonlinear than I'd admit in a presentation, and full of moments where I had to reclaim ground I hadn't realised I'd ceded. The music still isn't quite right. The loop needs some more work. The accessibility feature raises more questions than it answers. And somewhere in the middle of it all I lost track of what was mine, and had to find my way back to it deliberately.
I'm genuinely excited about what AI makes possible in design. But I've moved from theoretical caution to real caution. Caution that comes from having actually done it and seen where it pulls you if you're not paying attention.
AI accelerates ideas and iteration. It also accelerates risk. Judgment is the real differentiator, and this experiment showed me exactly what happens when you let it slip, and what it takes to get it back.
The beta is still ongoing: 5 of 20 users active over 7 days. Happy to add more testers.