Turning YouTube videos into searchable knowledge

Designing and developing a SaaS product end-to-end

Gistra

Gistra was built to solve a specific problem: existing YouTube transcript tools were either poorly designed, couldn't handle bulk operations, or lacked AI-powered features. The goal was to create something better for content marketers, students, and researchers who need to work with large amounts of video content.

At its core, Gistra lets users extract transcripts in bulk and then interact with them using AI. You can search semantically across your entire library, chat with transcripts, and generate study materials like flashcards, quizzes, and guides from multiple videos at once. The design blends clean UI with subtle neo-brutalist touches to stand out from generic AI SaaS tools.

The core challenge was building a system that could process thousands of YouTube transcripts quickly and reliably while keeping performance fair for all users. Most tools sacrifice speed for reliability or vice versa. Gistra needed to give users fast access to their transcripts while managing a job queue that wouldn't let large requests clog the system. On top of that, it had to be secure (server-side auth and payment handling), scalable, and polished throughout.

Industry

AI SaaS

Technologies

  • Astro
  • Custom microservices
  • Supabase
  • AI

Services

  • Backend
  • AI integration
  • Design
  • Microservices


Finding the gap in the market

What existing tools got wrong

Most YouTube transcript tools fall into one of three buckets: poor user experience, single-video limitations, or missing AI features. Some require users to manually paste URLs one at a time. Others have great features but are clunky and unpleasant to use. And the ones that do offer bulk processing often lack the ability to actually do anything meaningful with all that data.

Worse still, many tools force users to download every video from a channel or playlist without the ability to selectively choose what they need. That's inefficient and frustrating when you only want specific content.

Gistra was designed to address all of these issues simultaneously. The bulk extraction system is two-step: first, it pulls a list of all available videos, then lets users select exactly what they want in a modal interface. This gives users control while keeping the workflow fast.

Designing the architecture

A security-first, full-stack approach

The system is built on Astro in SSR mode, which handles the frontend and acts as a backend-for-frontend. All authentication happens server-side through Supabase, which also stores transcripts and user data. Dodo Payments handles billing, connected to both Astro and Supabase. A separate microservice stack handles the actual transcript fetching and preparation.

From day one, security was a priority. The client stays "dumb"—it never has access to auth-critical information. All auth events, payment webhooks, and account management happen in protected server actions and endpoints, and rate limiting guards sensitive actions like login, signup, and password recovery.
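As a sketch of the rate limiting described above, a fixed-window limiter keyed by client and action could look like the following. The class, thresholds, and key format are illustrative assumptions, not Gistra's actual implementation:

```typescript
// Minimal fixed-window rate limiter, keyed by e.g. "ip:action".
// Hypothetical sketch: the real limiter lives inside Gistra's
// protected server endpoints; all names here are illustrative.
type WindowState = { count: number; resetAt: number };

class RateLimiter {
  private windows = new Map<string, WindowState>();

  constructor(
    private maxAttempts: number, // e.g. 5 login attempts...
    private windowMs: number,    // ...per 15-minute window
  ) {}

  // Returns true if the attempt is allowed, false if rate-limited.
  allow(key: string, now: number = Date.now()): boolean {
    const state = this.windows.get(key);
    if (!state || now >= state.resetAt) {
      // No window yet, or the old window expired: start a fresh one.
      this.windows.set(key, { count: 1, resetAt: now + this.windowMs });
      return true;
    }
    if (state.count >= this.maxAttempts) return false;
    state.count += 1;
    return true;
  }
}

// Example: guard a login action with 5 attempts per 15 minutes.
const loginLimiter = new RateLimiter(5, 15 * 60 * 1000);
```

Because the state lives server-side and the key includes the action, a burst of password-recovery attempts can't consume the login budget, and the client never sees any of it.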

Gistra system architecture diagram
We prioritized a practical, simple, but secure architecture

Building a fault-tolerant extraction engine

Fast, fair, and scalable

When a user requests hundreds of transcripts, they need some results immediately, but their request shouldn't overwhelm the system for everyone else. The solution is a priority queue system. The first 5-10 requests from any user get priority 1, meaning they're executed immediately. The rest are assigned lower priority. This gives users a quick win while protecting system performance.
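The priority rule above reduces to a small pure function. The job shape and the `PRIORITY_WINDOW` constant below are assumptions for illustration, not Gistra's internal schema:

```typescript
// First N jobs in a user's batch run at top priority; the rest queue
// behind them so one large request can't starve other users.
const PRIORITY_WINDOW = 5; // hypothetical cutoff within the stated 5-10 range

interface TranscriptJob {
  videoId: string;
  userId: string;
  priority: number; // 1 = execute immediately, higher = later
}

function assignPriorities(userId: string, videoIds: string[]): TranscriptJob[] {
  return videoIds.map((videoId, i) => ({
    videoId,
    userId,
    priority: i < PRIORITY_WINDOW ? 1 : 2,
  }));
}
```

The user sees their first handful of transcripts arrive right away, while the long tail of the batch competes fairly in the queue.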

Reliability was equally important. The primary method fetches video metadata from the YouTube Data API (under 200 ms) and uses a microservice for transcripts (under 2 seconds for a batch of 5-10). The microservice retries up to three times, depending on the error type. If a video has no captions, we surface that to the user and refund the credit immediately.

When everything fails, we call an Apify worker as a final fallback. It's slower but very reliable. Testing across 10,000 transcripts showed this approach achieves over 98% reliability, which is strong for a service depending on proxies and external APIs.
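A sketch of this layered strategy, with hypothetical fetcher signatures standing in for the real microservice and Apify calls, might look like:

```typescript
// Retry the primary microservice (up to three times for transient
// errors), then fall back to the slower but reliable Apify worker.
// Everything here is an illustrative assumption, not Gistra's API.
type Fetcher = (videoId: string) => Promise<string>;

// Permanent failure: the video simply has no captions. Don't retry;
// surface it to the user and refund the credit upstream.
class NoCaptionsError extends Error {}

async function fetchTranscript(
  videoId: string,
  primary: Fetcher,
  fallback: Fetcher,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await primary(videoId);
    } catch (err) {
      if (err instanceof NoCaptionsError) throw err; // not worth retrying
      // Transient error (proxy, network, upstream): retry until exhausted.
    }
  }
  return fallback(videoId); // final fallback: slower, very reliable
}
```

Classifying errors before retrying is what keeps the 98% figure honest: retrying a caption-less video three times would only waste time and credits.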

Clear feedback as data loads

Users always know what's happening. Postgres realtime connections monitor transcript status: when viewing a library, unfinished transcripts display a loading skeleton and pop in smoothly once ready. This gives users a clear picture of what's available, even when downloading hundreds of transcripts at once.
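The skeleton logic boils down to a small piece of state handling. The status values and row shape below are assumptions for illustration; in production the updates would arrive over Supabase's Postgres realtime channel rather than being applied by hand:

```typescript
// Hypothetical transcript states; only the finished ones render content.
type Status = "queued" | "processing" | "ready" | "failed";

interface TranscriptRow {
  id: string;
  status: Status;
}

// Apply a single realtime status update to the in-memory library view.
function applyUpdate(rows: TranscriptRow[], update: TranscriptRow): TranscriptRow[] {
  return rows.map((r) => (r.id === update.id ? update : r));
}

// A card shows a loading skeleton while its transcript is still in flight.
function showsSkeleton(row: TranscriptRow): boolean {
  return row.status === "queued" || row.status === "processing";
}
```

Keeping this as a pure function of row status means the UI never needs to poll: each realtime event swaps one row, and the affected skeleton disappears on the next render.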

Details of loading skeletons and components
Some examples of UI elements that communicate loading state.

AI-powered insights and interaction

Once transcripts are extracted, the real work begins. A job queue generates embeddings for each video in the background via OpenRouter. This doesn't block users from accessing their transcripts—the embeddings arrive shortly after. The embeddings power semantic search across entire libraries, letting users find relevant content across hundreds of videos.
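In production this ranking would live in the database (for example as a pgvector query in Supabase). Purely to illustrate the underlying idea, an in-memory cosine-similarity search over stored embeddings could be sketched as:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical shape for a video with a precomputed embedding.
interface EmbeddedVideo {
  videoId: string;
  embedding: number[];
}

// Rank the library by similarity to the query embedding, best first.
function semanticSearch(query: number[], library: EmbeddedVideo[], topK = 5): string[] {
  return [...library]
    .sort((x, y) =>
      cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, topK)
    .map((v) => v.videoId);
}
```

Because the embeddings are generated in a background job, this search works over whatever portion of the library has been embedded so far, and results simply improve as the queue drains.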

From there, users can generate bulk AI interactions: chat with transcripts, create flashcards, build quizzes, and develop study guides from multiple videos at once.

Since launching, Gistra has processed over 100,000 transcripts. The time between submitting a request and receiving the first transcript typically averages under one second, though it varies with location and network conditions.

Gistra now has a strong foundation on which to grow: a robust, stable, but easily maintainable architecture.