Rewriting and filtering history

No, not what you think. This is about git, not politics.
July 1, 2020 by Michael

In my role as a Spring library developer at Neo4j, I spent the last year, together with Gerrit, creating the next version of Spring Data Neo4j. Our name for it so far has been Spring Data Neo4j⚡️RX but in the end, it will be SDN 6.

Anyway. Part of the module is our Neo4j Cypher-DSL. After working with jOOQ, a fantastic tool for writing SQL in Java, and seeing what our friends at VMware are doing with an internal SQL DSL for Spring Data JDBC, I never wanted to create Cypher queries via string operations in our mapping code ever again.

So, we gave it a shot and started modeling a Cypher-DSL after openCypher, but with Neo4j extensions supported.

You’ll find the result these days at neo4j-contrib/cypher-dsl.

Wait, what? This repository is nearly ten years old.

Yes, that is correct. My friend Michael started it back in the day. There are only a few things where you won’t find him involved. He even created jequel, a SQL-DSL as well, and was an author on this paper: On designing safe and flexible embedded DSLs with Java 5, which in turn had influence on jOOQ.

Therefore, when Michael offered that Gerrit and I could extract our Cypher-DSL from SDN/RX into a new home under the coordinates org.neo4j:neo4j-cypher-dsl, I was more than happy.

Now comes the catch: It would have been easy to just delete the main branch, create a new one, dump our stuff into it and call it a day. But: I actually wanted to honor history. The one of the original project as well as ours. We always tried to have meaningful commits and also put a lot of effort into commit messages, and I didn’t want to lose that for the times when things are not working.

Adding content from one repository into an unrelated one is much easier than it sounds:

# Get yourself a fresh copy of the target
git clone git@wherever/whatever.git targetrepo
cd targetrepo
# Add the source repo as a new remote
git remote add sourceRepo git@wherever/somethingelse.git
# Fetch and merge the branch in question from the sourceRepo as unrelated history into the target
git pull sourceRepo master --allow-unrelated-histories

Done.

But then, one does get everything from the source. Not what I wanted.
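To try the steps above without touching a real project, here is a minimal sandbox sketch. Everything in it is made up for illustration: the temporary directories, the file names, the branch name main and the new_repo helper. It creates two repositories that share no history, pulls one into the other and counts what ends up in the target.

```shell
#!/bin/sh
# Sandbox sketch: merge two unrelated local repositories.
# All paths, file names and the branch name "main" are hypothetical.
set -eu

# Give git an identity so the commits work on a machine without global config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.org"

sandbox=$(mktemp -d)

# Hypothetical helper: create a repository with one commit on branch "main"
new_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/main
    echo "$2" > "$1/$2.txt"
    git -C "$1" add .
    git -C "$1" commit -q -m "Add $2"
}

new_repo "$sandbox/targetrepo" target
new_repo "$sandbox/sourcerepo" source

cd "$sandbox/targetrepo"
git remote add sourceRepo "$sandbox/sourcerepo"
# --no-rebase and --no-edit keep the pull non-interactive and create a merge commit
git pull -q --no-rebase --no-edit sourceRepo main --allow-unrelated-histories

# Both histories plus the merge commit are now reachable from HEAD
commit_count=$(git rev-list --count HEAD)
# Each original history contributes its own root commit
root_count=$(git rev-list --max-parents=0 HEAD | wc -l | tr -d ' ')
echo "commits: $commit_count, roots: $root_count"
```

The two root commits are the visible sign of unrelated histories: the merge commit simply has one parent from each side, and every file from the source shows up in the target.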

The original repository needed some preparation.

git filter-branch to the rescue. filter-branch works with the “snapshot” model of commits in a repository, where each commit is a snapshot of the tree, and rewrites these commits. This is in contrast to git rebase, which actually works with diffs. The command will apply filters to the snapshots and create new commits, creating a new, parallel graph. It won’t care about conflicts.

Manish has a great post about the whole topic: Understanding Git Filter-branch and the Git Storage Model.

For my use case above, the built-in subdirectory-filter was most appropriate. It makes a given subdirectory the new repository root, keeping the history of that subdirectory. Let’s see:

# Clone the source, I don't want to mess with my original copy
git clone git@wherever/somethingelse.git sourceRepo
cd sourceRepo
# Remove the origin, just in case I screw up AND accidentally push things
git remote rm origin
# Execute the subdirectory filter for the openCypher DSL
git filter-branch --subdirectory-filter neo4j-opencypher-dsl -- --all

Turns out, this worked well, despite this warning:

WARNING: git-filter-branch has a glut of gotchas generating mangled history
rewrites. Hit Ctrl-C before proceeding to abort, then use an
alternative filtering tool such as 'git filter-repo'
(https://github.com/newren/git-filter-repo/) instead. See the
filter-branch manual page for more details; to squelch this warning,
set FILTER_BRANCH_SQUELCH_WARNING=1.

I ended up with a rewritten repo, containing only the subdirectory I was interested in as the new root. I could have stopped here, but I noticed that some of my history was missing: The filtering only looks at the actual snapshots of the files in question, not at the history you get when using git log --follow. As we had already moved those files around a bit, I lost all that valuable information.
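The effect is easy to reproduce in a throwaway repository. The following is only a sketch with invented names (the directories elsewhere and dsl, the file Cypher.java); it is not our real layout. A file is created in one place, moved into the directory that should become the new root and then changed once more. git log --follow sees all of that, while the subdirectory filter only keeps the commits that touched the directory itself.

```shell
#!/bin/sh
# Sandbox sketch: the subdirectory filter drops history from before a move.
# Directory and file names are hypothetical.
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.org"
# Skip the filter-branch warning and its delay inside this sandbox
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir elsewhere
echo "first draft" > elsewhere/Cypher.java
git add .
git commit -q -m "Add Cypher.java in its original location"

mkdir dsl
git mv elsewhere/Cypher.java dsl/Cypher.java
git commit -q -m "Move Cypher.java into dsl/"

echo "second draft" >> dsl/Cypher.java
git commit -q -am "Extend Cypher.java"

# History of the file across the move, as git log --follow reports it
with_follow=$(git log --oneline --follow -- dsl/Cypher.java | wc -l | tr -d ' ')

# Make dsl/ the new root; only commits that touched dsl/ survive
git filter-branch --subdirectory-filter dsl -- --all >/dev/null 2>&1

after_filter=$(git rev-list --count HEAD)
root_files=$(git ls-tree --name-only HEAD)
echo "with --follow: $with_follow, after filter-branch: $after_filter"
```

The commit that originally added the file under elsewhere/ is gone after the rewrite, which is exactly the kind of loss I ran into.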

Well, let’s read the above warning again and we find filter-repo. On a Mac, for example, filter-repo can be installed with brew install git-filter-repo, and it turns out it does exactly what I want, provided I know roughly where the stuff I want in my new root originally lived:

# Use git filter-repo to make some content the new repository root
git filter-repo --force \
    --path neo4j-opencypher-dsl \
    --path spring-data-neo4j-rx/src/main/java/org/springframework/data/neo4j/core/cypher \
    --path spring-data-neo4j-rx/src/main/java/org/neo4j/springframework/data/core/cypher \
    --path spring-data-neo4j-rx/src/test/java/org/springframework/data/neo4j/core/cypher \
    --path spring-data-neo4j-rx/src/test/java/org/neo4j/springframework/data/core/cypher \
    --path-rename neo4j-opencypher-dsl/:

This takes a couple of paths into consideration, tracks the history and renames the one path (the empty target after the : makes it the new root). Turns out that git-filter-repo is also way faster than git-filter-branch.
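Here is the same idea as a small sandbox sketch. It assumes git-filter-repo is installed and skips the rewrite otherwise. The directories elsewhere and dsl and the file Cypher.java are invented for the example. The old location of the file is listed as an additional --path, so the commit from before the move is kept, and --path-rename turns dsl/ into the new root.

```shell
#!/bin/sh
# Sandbox sketch: keep the history from before a move with git filter-repo.
# Directory and file names are hypothetical; git-filter-repo is assumed to be installed.
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir elsewhere
echo "first draft" > elsewhere/Cypher.java
git add .
git commit -q -m "Add Cypher.java in its original location"

mkdir dsl
git mv elsewhere/Cypher.java dsl/Cypher.java
git commit -q -m "Move Cypher.java into dsl/"

echo "second draft" >> dsl/Cypher.java
git commit -q -am "Extend Cypher.java"

if command -v git-filter-repo >/dev/null 2>&1; then
    # Keep the new root candidate and the old location, then lift dsl/ to the root.
    # --force is needed because this sandbox is not a fresh clone.
    git filter-repo --force \
        --path dsl \
        --path elsewhere/Cypher.java \
        --path-rename dsl/:
    after_filter=$(git rev-list --count HEAD)
    root_files=$(git ls-tree --name-only HEAD)
else
    echo "git-filter-repo is not installed, skipping the rewrite"
    after_filter="skipped"
    root_files="skipped"
fi
echo "commits after filter-repo: $after_filter"
```

In this sketch the oldest commit still shows the file under elsewhere/, because only dsl/ is renamed. That matches what I wanted: the old locations stay visible in the history, while the current content lives in the new root.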

With the source repository prepared in that way, I cleaned up some meta and build information, added one more commit and incorporated it into the target as described at the first step.

I’m writing this down because I found it highly useful and also because we are gonna decompose the repository of SDN/RX further. Gerrit described our plans in his post Goodbye SDN⚡️RX. We will do something similar with SDN/RX and Spring Data Neo4j. While we have to manually transplant our Spring Boot starter into the Spring Boot project via PRs, we want to keep the history of SDN/RX for the target repo.

Long story short: While I was skeptical at first ripping the work of a year apart and distributing it on a couple of projects, I’m seeing it now more as a positive decomposing of things (thanks Nigel for that analogy).

Featured image courtesy of Nathan Dumlao on Unsplash.
