How Gemini Agents Orchestrate Document Actions: From Transformers to Client WASM
By LLM Practitioner
When you tell TellPDF to 'merge files, delete the third page, and add a watermark,' how does a language model turn that sentence into concrete operations with concrete parameters? The underlying model is built on the Transformer architecture, introduced by researchers at Google in the 2017 paper 'Attention Is All You Need.' Transformers replace the step-by-step recurrence of recurrent neural networks with self-attention, which relates every token in a prompt to every other token in parallel. Because a phrase like 'delete third page' is read in the context of the whole command, the model can match your intent to the tool schema that fits it.
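For readers who want the mechanism in one line, the scaled dot-product attention defined in that paper can be written as follows. Here Q, K, and V are the query, key, and value matrices computed from the prompt's tokens, and d_k is the dimension of the keys; this is only the core operation, not a description of Gemini's internals.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Each row of the softmax output holds one token's weights over every token in the prompt, and all rows come out of a single matrix product. That is what lets the model process the whole command at once instead of word by word.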
At runtime, your text command is sent to our Gemini endpoint. The model does not execute code; it acts as a planner. It maps your instructions to a sequence of JSON-structured tool calls whose schemas are defined in our codebase (for example, calling the DELETE_PAGES function with the parameter pageIndex: 2, where the index counts from zero, so 2 refers to the third page). The Next.js API acts as a proxy that keeps the Gemini API key on the server and sends the planner's structured output back to the browser.
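To make the plan format concrete, here is a minimal sketch of what validating the planner's output might look like before it is handed to the browser. Only DELETE_PAGES and pageIndex come from the description above; MERGE_FILES, ADD_WATERMARK, their fields, and the function names parsePlan and validateStep are hypothetical and chosen purely for illustration.

```typescript
// Hypothetical step shapes. Only DELETE_PAGES and pageIndex appear in the
// article; the other tools and fields are assumptions for illustration.
type PlanStep =
  | { tool: "MERGE_FILES"; fileIds: string[] }
  | { tool: "DELETE_PAGES"; pageIndex: number }
  | { tool: "ADD_WATERMARK"; text: string };

// Check a single untrusted step and rebuild it with only the known fields.
function validateStep(step: unknown, i: number): PlanStep {
  if (typeof step !== "object" || step === null) {
    throw new Error(`step ${i} is not an object`);
  }
  const s = step as Record<string, unknown>;
  const fileIds = s["fileIds"];
  const pageIndex = s["pageIndex"];
  const text = s["text"];
  switch (s["tool"]) {
    case "MERGE_FILES":
      if (Array.isArray(fileIds) && fileIds.every((f) => typeof f === "string")) {
        return { tool: "MERGE_FILES", fileIds: fileIds as string[] };
      }
      break;
    case "DELETE_PAGES":
      // Page indexes are zero-based, so they must be non-negative integers.
      if (typeof pageIndex === "number" && Number.isInteger(pageIndex) && pageIndex >= 0) {
        return { tool: "DELETE_PAGES", pageIndex };
      }
      break;
    case "ADD_WATERMARK":
      if (typeof text === "string" && text.length > 0) {
        return { tool: "ADD_WATERMARK", text };
      }
      break;
  }
  throw new Error(`step ${i} is invalid: ${JSON.stringify(step)}`);
}

// Parse the model's raw JSON text into a validated list of steps.
function parsePlan(raw: string): PlanStep[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) {
    throw new Error("plan must be a JSON array");
  }
  return data.map((step, i) => validateStep(step, i));
}
```

Treating the model's output as untrusted input is the point of the sketch: a step with an unknown tool name or a malformed parameter is rejected before any document is touched.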
Once the JSON plan returns to your browser, our client-side orchestration layer takes over. The JavaScript coordinator loads the WASM modules and runs the operations one after another, in plan order. This design keeps your documents private: the LLM is used only to plan the steps, while the actual document bytes are read and modified inside your browser's memory. Beyond commercial API endpoints, this planner pattern is increasingly deployed with open-weight models (such as Gemma, Llama, and Mistral) that can run locally inside client environments.
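The sequential execution described above can be sketched roughly as follows. This is not TellPDF's actual coordinator: the names Step, Handler, runPlan, and toyHandlers are hypothetical, and the toy DELETE_PAGES handler treats each byte as one "page" as a stand-in for a real WASM call.

```typescript
// Hypothetical step shape: a tool name plus arbitrary parameters.
interface Step {
  tool: string;
  [param: string]: unknown;
}

// A handler takes the current document bytes and returns the modified bytes.
// In a real app this would wrap a WASM export; here it is an assumption.
type Handler = (doc: Uint8Array, step: Step) => Promise<Uint8Array>;

// Run every step in order, feeding each step's output into the next one.
async function runPlan(
  doc: Uint8Array,
  plan: Step[],
  handlers: Record<string, Handler>,
): Promise<Uint8Array> {
  let current = doc;
  for (let i = 0; i < plan.length; i++) {
    const step = plan[i];
    const handler = handlers[step.tool];
    if (!handler) {
      throw new Error(`step ${i}: unknown tool ${step.tool}`);
    }
    // Await so the next step always sees the previous step's result.
    current = await handler(current, step);
  }
  return current;
}

// Toy stand-in for a WASM module: each byte plays the role of one page.
const toyHandlers: Record<string, Handler> = {
  DELETE_PAGES: async (doc, step) => {
    const index = Number(step["pageIndex"]);
    return doc.filter((_, i) => i !== index);
  },
};
```

With the toy handler, running the single step { tool: "DELETE_PAGES", pageIndex: 2 } against the bytes [10, 20, 30, 40] removes the third one. The document bytes only ever pass between local functions, which is the property the architecture relies on.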