I am working on a declarative CLI for Google Docs/Sheets/Slides etc. The general idea is a "pull" command that converts the proprietary document into local files like TSV or XML. The agent (Claude Code) then simply edits these files in place and calls "push". The library then figures out the diff and applies only the diff, taking care to preserve formatting and comments.
The hypothesis is that LLMs are better off getting the "big picture" by reading local files. They can then spend tokens to edit the document as per the business needs rather than spending tokens to figure out how to edit the document.
Another aspect is the security model. Extrasuite assigns a permission-less service account per employee. The agent gets this service account to make API calls. This means the agent only gets access to documents explicitly shared with it, and any changes it makes show up in version history separate from the user's changes.
It provides a git like pull/push workflow to edit sheets/docs/slides. `pull` converts the google file into a local folder with agent friendly files. For example, a google sheet becomes a folder with a .tsv, a formula.json and so on. The agent simply edits these files and `push`es the changes. Similarly, a google doc becomes an XML file that is pure content. The agent edits it and calls push - the tool figures out the right batchUpdate API calls to bring the document in sync.
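To make the push side concrete, here is a minimal, hypothetical sketch of the diff step for a sheet. The function names and TSV handling are my assumptions, not extrasuite's internals; the request shape, however, is the real Sheets `batchUpdate` / `updateCells` format.

```python
import csv

def load_tsv(path):
    """Read a pulled .tsv snapshot into a list of rows."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f, delimiter="\t")]

def diff_to_requests(before, after, sheet_id=0):
    """Compare the pulled snapshot against the agent-edited grid and
    emit one updateCells request per changed cell (real Sheets
    batchUpdate request shape)."""
    requests = []
    for r, (old_row, new_row) in enumerate(zip(before, after)):
        for c, (old, new) in enumerate(zip(old_row, new_row)):
            if old != new:
                requests.append({
                    "updateCells": {
                        "start": {"sheetId": sheet_id,
                                  "rowIndex": r, "columnIndex": c},
                        "rows": [{"values": [
                            {"userEnteredValue": {"stringValue": new}}]}],
                        "fields": "userEnteredValue",
                    }
                })
    return requests

before = [["name", "qty"], ["apples", "3"]]
after  = [["name", "qty"], ["apples", "5"]]
print(diff_to_requests(before, after))  # one request, for the changed cell only
```

Only the changed cell is touched, which is what lets unrelated formatting and comments survive the round trip.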
None of the existing tools we found let an agent edit documents in place. Invoking batchUpdate directly is error-prone and token-inefficient. Extrasuite solves both issues.
In addition, Extrasuite also uses a unique service token that is 1:1 mapped to the user. This means that edits show up as "Alice's agent" in google drive version history. This is secure - agents can only access the specific files or folders you explicitly share with the agent.
This is still very much alpha - but we have been using it internally with our 100-member team. Google Sheets, Docs, Forms and Apps Script work great - all using the same pull/push metaphor. Google Slides needs some work.
We have been using something similar for editing Confluence pages. Download XML, edit, upload. It is very effective, much better than direct edit commands. It’s a great pattern.
You can use the Copilot CLI with the Atlassian MCP to super easily edit/create Confluence pages. After having the agent complete a meaningful amount of work, I have it go create a Confluence page documenting what has been done. Super useful.
I'm afraid I can't easily share this, as we have embedded a lot of company-specific information in our setup, particularly for cross-linking between Confluence/Jira/Zendesk and other systems. I can try to explain it though, and then Claude Code is great at implementing these simple CLI tools and writing the skills.
We wrote CLIs for Confluence, Jira, and Zendesk, with skills to match. We use a simple OAuth flow for users to log in (e.g., they would run `jira login`). Then Confluence/Jira/Zendesk each have REST APIs to query pages/issues/tickets and submit changes, which is what our CLIs use. Claude Code was exceptional at finding the documentation for these and implementing them. It only took a couple of days to set these up, and Claude Code is now remarkably good at loading the skills and using the CLIs. We use the skills to embed a lot of domain-specific information about projects, organisation of pages, conventions, standard workflows, etc.
Being able to embed company-specific links between services has been remarkably useful. For example, we look for specific patterns in pages like AIT-553 or zd124132 and then can provide richer cross-links to Jira or Zendesk that help agents navigate between services. This has made agents really efficient at finding information, and it makes them much more likely to actually read from multiple systems. Before we made changes like this, they would often rabbit-hole only looking at confluence pages, or only looking at jira issues, even when there was a lot of very relevant information in other systems.
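A cross-linker like the one described can be a few lines of code. In this illustrative sketch the regex patterns and base URLs are assumptions, not the actual setup above:

```python
import re

# Each entry: (pattern for a bare reference, URL template to link it to).
# The example.* hostnames are placeholders, not real endpoints.
LINKERS = [
    (re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b"),   # Jira keys like AIT-553
     "https://example.atlassian.net/browse/{0}"),
    (re.compile(r"\bzd(\d+)\b"),                # Zendesk refs like zd124132
     "https://example.zendesk.com/agent/tickets/{0}"),
]

def enrich(text):
    """Replace bare cross-references with markdown links agents can follow."""
    for pattern, url in LINKERS:
        text = pattern.sub(
            lambda m, u=url: f"[{m.group(0)}]({u.format(m.group(1))})", text)
    return text

print(enrich("See AIT-553 and ticket zd124132 for context."))
```

Running the enrichment once over page content at fetch time means the agent sees ready-made links instead of opaque IDs it might ignore.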
My favourite is the confluence integration though, as I like to record a lot of worklog-style information in there that I would previously write down as markdown files. It's nicer to have these in Confluence as then they are accessible no matter what repo I am working in, what region I am working in, or what branch or feature I'm working on. I've been meaning to try to set something similar up for my personal projects using the new Obsidian CLI.
We have been doing something similar, but it sounds like you have come further along this way of working. We (with help from Claude) have built a tool like the one you describe to interface with our task- and project-management system, and use it together with the GitLab and GitHub CLI tools to let agents read tickets, formulate a plan, create solutions, and open MRs/PRs to the relevant repos. For most of our knowledge base we use Markdown, but some of it is tied up in Confluence, which is why I have an interest in that part. And some workflows are even in Google Docs, which makes the OP's tool interesting as well -- currently our tool outputs Markdown and we just "paste from markdown" into Gdocs. We might be able to revise and improve that too.
Thank you! Sounds like a fantastic setup. Are the Claude Code agents acting autonomously from trigger conditions, or is this all manual work with them? And how do you manage write permissions for documents amongst team members/agents, presumably multiple people have access to this system?
(Not OP, but have been looking into setting up a system for a similar use case)
This is all manual, so people ask their agent to load Jira issues, edit Confluence pages, etc. Users sign in with their own accounts using the CLIs, so the agents inherit their own permissions. Then we have the permissions in Claude Code set up so any write commands are in Ask mode, so it always prompts the user before running them.
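For anyone wanting to replicate the Ask-mode gating, Claude Code's settings file supports `allow`/`ask` permission lists. The CLI subcommand names below are my guesses at what such a setup might look like, not a published config:

```json
{
  "permissions": {
    "allow": ["Bash(jira view:*)", "Bash(confluence read:*)"],
    "ask": ["Bash(jira update:*)", "Bash(confluence publish:*)"]
  }
}
```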
Excellent project! I see that the agent modifies Google Docs using an interesting technique: convert the doc to HTML, the AI operates over the HTML, then diff the original HTML against the AI-modified HTML and send the diff as a batchUpdate to gdocs.
IMO, this is a better approach than the one used by Anthropic docx editing skill.
1. Did you compare this one with other document editing agents? Did you have any other ideas on how to make AI see and make edits to documents?
2. What happens if the document is a big book? How do you manage context when loading big documents?
PS: I'm working on an AI agent for Zoho Writer (a Google Docs alternative) and I've landed on a similar HTML-based approach. The difference is I ask the AI to use my minimal commands (addnode, replacenode, removenode) to operate over the HTML, and I convert those into ops.
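A toy interpreter for that kind of minimal command set might look like the sketch below. The op payload shape (`cmd`/`id`/`html`) is my guess at the idea, not Zoho-specific code, and it only handles well-formed XHTML via the stdlib:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<body><p id="p1">old text</p><p id="p2">keep</p></body>')

def find_parent_and_node(root, node_id):
    """Locate a node by id, plus its parent (needed for removal)."""
    for parent in root.iter():
        for child in list(parent):
            if child.get("id") == node_id:
                return parent, child
    raise KeyError(node_id)

def apply_op(root, op):
    """Apply one AI-emitted command to the HTML tree."""
    if op["cmd"] == "replacenode":
        _, node = find_parent_and_node(root, op["id"])
        node.text = op["html"]          # simplified: text-only replacement
    elif op["cmd"] == "removenode":
        parent, node = find_parent_and_node(root, op["id"])
        parent.remove(node)
    elif op["cmd"] == "addnode":
        root.append(ET.fromstring(op["html"]))

for op in [{"cmd": "replacenode", "id": "p1", "html": "new text"},
           {"cmd": "removenode", "id": "p2"},
           {"cmd": "addnode", "html": '<p id="p3">added</p>'}]:
    apply_op(doc, op)

print(ET.tostring(doc, encoding="unicode"))
```

Constraining the model to a few verbs like this keeps its output easy to validate before anything is translated into real document ops.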
re. comparing with other editing agents - actually, I didn't find any that could work with google docs. Many workflows were basically "replace the whole document" - and that was a non-starter.
re. what happens if it's a big book - each "tab" in the google doc is a folder with its own document.xml. A top-level index.xml captures the table of contents across tabs. The agent reads index.xml and then decides what else to read. I am now improving this by giving it XPath expressions so it can directly pick the specific sections of interest.
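As a sketch, the index-then-XPath flow could look like this. The `index.xml`/`document.xml` schemas shown are hypothetical, not extrasuite's actual format:

```python
import xml.etree.ElementTree as ET

# Hypothetical top-level index: one entry per tab, with section metadata.
index = ET.fromstring("""
<index>
  <tab name="Chapter 1" path="tab1/document.xml">
    <section id="intro" title="Introduction"/>
    <section id="methods" title="Methods"/>
  </tab>
</index>""")

# Hypothetical per-tab content file.
document = ET.fromstring("""
<document>
  <section id="intro"><p>Short intro.</p></section>
  <section id="methods"><p>Long methods text...</p></section>
</document>""")

# 1. The agent skims the index to decide what it needs...
titles = [(s.get("id"), s.get("title")) for s in index.iter("section")]
print(titles)

# 2. ...then loads only that section via an XPath-style query,
#    instead of pulling the whole book into context.
methods = document.find(".//section[@id='methods']")
print(ET.tostring(methods, encoding="unicode"))
```

The token win is that context cost scales with the sections the agent actually needs, not with document size.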
Philosophically, we wanted "declarative" instead of "imperative". Our key design goal: the agent should "think" in terms of the business, and not worry about how to edit the document. We move all the reconciliation logic into the library, freeing the agent from worrying about the Google Doc. Same approach in the other libraries as well.
Struggling with the same issues with junior developers. I've been asking for an implementation plan and iterating on it. The typical workflow is to commit the implementation plan and review it as part of a PR. It takes 2-3 iterations to get right. Then the developer asks Claude Code to implement it based on the markdown plan. I've seen good results with this.
Another thing I do is ask for the Claude session log file. The inputs and thoughts they provided to Claude give me a lot more insight than Claude's output. Quite often I can correct the thought process once I know how they are thinking. I've found junior developers treat Claude like SMS - small, ambiguous messages with very little context, hoping it will perform magic. By reviewing the Claude session file, I try to fix this superficial prompting behaviour.
And third, I've realized Claude works best if the code itself is structured well and has tests, tools to debug, and documentation. So I spend more time on tooling, so that Claude can use these tools to investigate issues, write tests, and iterate faster.
Still a long way to go, but this seems promising right now.
Take a look at JinjaSQL (https://github.com/hashedin/jinjasql). Supports conditional where clauses as well as in statements. It uses Jinja templates, so you get a complete template language to create the queries.
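For readers unfamiliar with the library, this stdlib-only sketch mimics what JinjaSQL's `prepare_query` does conceptually: render conditional WHERE clauses and expand IN-lists while collecting bind parameters, so user values never get concatenated into the SQL string. (JinjaSQL itself does this with Jinja templates and its `inclause` filter; the code below is an illustration of the idea, not its implementation.)

```python
def prepare_query(filters):
    """Build a query with conditional clauses, returning (sql, params)."""
    clauses, params = ["1 = 1"], []
    if "min_age" in filters:                      # conditional WHERE clause
        clauses.append("age >= %s")
        params.append(filters["min_age"])
    if filters.get("cities"):                     # IN-clause expansion
        placeholders = ", ".join(["%s"] * len(filters["cities"]))
        clauses.append(f"city IN ({placeholders})")
        params.extend(filters["cities"])
    sql = "SELECT name, age FROM users WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = prepare_query({"min_age": 21, "cities": ["Pune", "Delhi"]})
print(sql)
print(params)
```

The payoff is that adding or dropping a filter changes only the template/data, never hand-maintained placeholder bookkeeping.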
I have been working on an idea to leverage SQL + REST APIs instead of GraphQL to solve similar problems, and I would love to get some feedback on it.
GraphQL is great during development, but in production you often use "persisted queries" to prevent abuse. Persisted queries are also a way to use HTTP GET instead of POST, and thereby leverage HTTP caching. As such, if you swap out graphql and use sql during the development phase, you perhaps can get similar benefits.
My solution (https://github.com/hashedin/squealy) uses SQL queries to generate the API response. Squealy uses a template language for the SQL queries, so you can directly embed WHERE clauses built from the API parameters. The library binds the parameters internally, so it is free from SQL injection.
Handling many-to-many relations, or multiple one-to-many relations, is done in-memory after fetching data from the relational database. This provides the same flexibility as GraphQL, but of course requires you to know SQL.
This looks really cool! Particularly for the embedded analytics use case.
Do you have an example of how this would work on the front end with relationships? For example, if I want to fetch posts and their comments, what would the front-end code look like?
In short, with 1 GET API call that uses 3 SQL queries, you are getting a list of questions, related answers and related comments.
You should parse that example as follows -
1. The first SQL query fetches all the questions, filtered by API parameters
2. The second SQL query fetches all comments for the questions selected in the first query. See the `WHERE c.postid in {{ questions.id | inclause }}` part - questions.id is coming from the output of the first query
3. The output of the first and second query is merged together on the basis of the questionid that is present in both query outputs.
4. Similar approach is used for all the answers
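The merge in step 3 can be sketched in a few lines of plain Python (the column names here are assumptions for illustration):

```python
# Result of query 1: the parent rows, filtered by API parameters.
questions = [{"id": 1, "title": "How do I X?"},
             {"id": 2, "title": "Why does Y?"}]

# Result of query 2: child rows fetched with the IN-clause.
comments  = [{"postid": 1, "body": "Try Z."},
             {"postid": 1, "body": "See the docs."},
             {"postid": 2, "body": "Known bug."}]

# Group child rows by foreign key, then attach to each parent --
# one pass over each result set instead of N+1 queries.
by_question = {}
for c in comments:
    by_question.setdefault(c["postid"], []).append(c)

for q in questions:
    q["comments"] = by_question.get(q["id"], [])

print(questions[0])
```

The answers query from step 4 would be grouped and attached the same way, giving a nested GraphQL-like response from flat SQL results.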
I have been working on JinjaSQL[1] for a while to simplify complex SQL queries.
JinjaSQL is a templating language, so you can use interpolation, conditionals, macros, includes and so on. It automatically identifies bind parameters, so you don't have to keep track of them.
Re. guide comments - I usually try to make a smaller function with an appropriate name and then call the function. So long functions become a series of invocations of smaller functions. This eliminates most guide comments IMO.
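As a toy illustration of the trade-off being discussed, here are the two styles side by side:

```python
# Guide-comment style: one function, comments mark the steps.
def process_order_commented(order):
    # validate the order
    if not order.get("items"):
        raise ValueError("empty order")
    # compute the total
    return sum(i["price"] * i["qty"] for i in order["items"])

# Extracted-function style: the comments become names.
def validate(order):
    if not order.get("items"):
        raise ValueError("empty order")

def compute_total(order):
    return sum(i["price"] * i["qty"] for i in order["items"])

def process_order(order):
    validate(order)
    return compute_total(order)

order = {"items": [{"price": 10, "qty": 2}]}
print(process_order(order))  # both styles return 20
```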
I came to make a similar observation re guide comments, with the difference that guide comments with inline code is better in most cases than a sequence of small function calls.
I've often wondered why my code style is moderate inline blocks with guide comments when I know about the benefits of good naming and self-documenting code. In the end, I find it more effective. By encoding in function names, you've (technically) eliminated comments but now buried what the code actually does somewhere else, meaning the reader must navigate away to verify it does exactly and only what each reader infers from the name. I much prefer having a function of a small-to-medium logical size that I can hold in my head all at once and see relevant details without navigating away from where I'm reading/editing.
My experience is that less depth and fewer single-use functions aids overall comprehension similarly to how flattening as in list comprehensions or filter/map pipelines buys more headroom. I can imagine building/using DSL where everything is well-named and free of guide comments but I have never encountered such a thing where there are multiple committers using a procedural language.
Early release can be found over here - https://github.com/think41/extrasuite.