I have been working for Neo4j since 2018. At Neo4j I maintain both Spring Data Neo4j and Neo4j-OGM, which are object mappers and entity managers for our database product. This is a great job in a great company with awesome colleagues such as my friends Gerrit and Michael.
Some other projects I created on the job are the Cypher-DSL, a builder for our query language Cypher, which is used extensively inside Spring Data Neo4j, in products of our partner Graphaware, and by a whole bunch of other customers. Neo4j-Migrations, a database migration toolkit, is headed for its 100th star. Last but not least, I created the original Neo4j Testcontainers module and was recognized as a Testcontainers Champion for it.
However, I got “socialized” in a previous lifetime in an SQL shop (actually, an Oracle shop), and while I swore more than 20 years ago during my studies that I would never do anything with SQL, there I was. For whatever reason, my head actually works pretty well with the relational model and the questions I can answer with it. I spent about 15 years in that company doing all kinds of things, such as geospatial applications, applications with energy forecasts based on past measurements, and stuff like that. What all of that had in common was SQL.
Just a couple of months prior to joining Neo4j, I gave a talk titled Live with your SQL-fetish and choose the right tool for the job, in which I presented jOOQ, way back before it became the hot topic. Anyhow, I eventually reused a lot of the dataset from that talk in Neo4j-specific talks too, taking my musical habits into a knowledge graph.
What I am trying to say here is: I am deeply involved in the Graph ecosystem, we have a great product and tools, and I’m glad I have a part in those. But I also think that other query languages have their benefits and also that other projects and companies are doing great work, too.
So, DuckDB, what is it? DuckDB is an “in-process SQL OLAP database management system” and you can go ahead and try it out in your browser, because it can be compiled to WASM.
DuckDB’s query engine is vectorized: every bit of data that flows through it is a collection of vectors, on which algorithms can often be applied in parallel; see Execution Format. That’s the groundwork for DuckDB’s fast, analytical queries. From Why DuckDB: “Online analytical processing (OLAP) workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables.” In the vectorized query execution engine, queries are interpreted and processed in large batches of values in one operation. DuckDB can also query foreign stores such as Postgres and SQLite, and there are scenarios in which its engine is actually faster while doing this than native Postgres.
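As a minimal sketch of what querying a foreign store looks like (the connection string and the orders table are made up for illustration), you can attach a running Postgres instance through DuckDB’s postgres extension and query it with DuckDB’s own engine:

```sql
-- Hypothetical example: attach a running Postgres database and query it
-- directly from DuckDB; connection string and table names are made up.
INSTALL postgres;
LOAD postgres;

ATTACH 'dbname=shop host=localhost user=duck' AS pg (TYPE postgres);

-- The aggregation runs in DuckDB's vectorized engine
-- while the rows are scanned from Postgres.
SELECT customer_id, count(*) AS orders
FROM pg.public.orders
GROUP BY customer_id;
```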
When going through their SQL documentation you will be surprised how much you get essentially for free in one small executable. It’s all there: CTEs, window functions, ASOF joins, PIVOT, and many things that make SQL friendlier. Back then, we had to run a big Oracle installation for similar things.
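As a small taste of those friendlier bits, here is a sketch combining a CTE, DuckDB’s GROUP BY ALL shorthand and a window function; the measurements.csv file and its ts and kwh columns are assumptions for the sake of the example:

```sql
-- Hypothetical example: a 7-day rolling average over a CSV file,
-- which DuckDB queries like a table.
WITH daily AS (
    SELECT date_trunc('day', ts) AS day,
           sum(kwh)              AS kwh
    FROM 'measurements.csv'
    GROUP BY ALL   -- no need to repeat the non-aggregated columns
)
SELECT day,
       kwh,
       avg(kwh) OVER (ORDER BY day
                      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d
FROM daily
ORDER BY day;
```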
If you follow me on social media you might have noticed that my private focus has shifted over the last few years; I have been doing a lot of sport and training, and fewer side projects that involve any big setups. This is where a small tool that runs without a server installation comes in super handy: I was able to define a neat schema for my photovoltaic systems with a bunch of views and now have a good-enough dashboard.
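The actual schema is nothing fancy, but the idea looks roughly like this; table and column names are invented here, and the conversion assumes 15-minute readings:

```sql
-- Hypothetical sketch: raw inverter measurements plus a view that
-- aggregates production per day, ready to be used by a dashboard.
CREATE TABLE measurements (
    measured_at TIMESTAMP NOT NULL,
    inverter    VARCHAR   NOT NULL,
    power_w     DECIMAL(8, 2)
);

CREATE VIEW daily_production AS
SELECT date_trunc('day', measured_at)      AS day,
       inverter,
       round(sum(power_w) / 4 / 1000.0, 2) AS kwh  -- 15-minute samples: W / 4 = Wh
FROM measurements
GROUP BY ALL;
```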
My biking page has been remodeled to be a static page these days, but it uses a DuckDB database underneath; see biking.michael-simons.eu.
At the moment, I neither need nor want more. And as sad as it might be, I don’t have a graph problem in either of those applications. I want to aggregate measurements, do analytics, and that’s it. It’s a lot of time-series data that I’m working with, and graph doesn’t really help me there.
Those are use cases that are unrelated to work.
But I often have use for DuckDB at work, too. Neo4j can natively ingest CSV files. You need to write some Cypher to massage them into the graph you want, but still, that CSV must be properly formatted.
DuckDB, on the other hand, can read CSV, Parquet, JSON and other formats as if they were tables. These sources can be files or URLs. It does not require you to create a schema and persist data in its own store; it can just query things. Querying data without persisting it might be unusual for a database and seems counter-intuitive at first, but it is useful in the right situations. DuckDB does not need to persist the content, but you are free to create views over it. You can happily join a CSV file with a JSON file and literally copy the result to a new CSV file that is well suited for Neo4j.
This can all be done either in the CLI or as part of a pipeline in a script.
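A sketch of such a step might look like this; the file names and columns are invented for illustration. The same statement can be fed to the duckdb CLI from a shell script, for example via stdin, which makes it easy to use in a pipeline:

```sql
-- Hypothetical example: join a CSV file with a JSON file and write a
-- new, well-formatted CSV that Neo4j's LOAD CSV can ingest.
COPY (
    SELECT p.id, p.name, m.reading, m.taken_at
    FROM read_csv_auto('people.csv')         AS p
    JOIN read_json_auto('measurements.json') AS m
      ON m.person_id = p.id
) TO 'for_neo4j.csv' (HEADER, DELIMITER ',');
```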
DuckDB replaced a bunch of tools in my daily usage, such as xsv and, to some extent, jq. DuckDB’s JSON processing capabilities allow you to query and normalize many complex and denormalized JSON documents, but in some cases it’s not enough… It’s just that you can do so many “interesting” things in JSON 😉
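For example, flattening a nested document into one row per entry is the kind of job I would previously have reached for jq for; the document shape and field names below are assumptions:

```sql
-- Hypothetical example: flatten a document like
-- {"items": [{"name": "a", "value": 1}, {"name": "b", "value": 2}]}
-- into one row per item.
SELECT item.name, item.value
FROM (
    SELECT unnest(items) AS item
    FROM read_json_auto('api_response.json')
);
```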
Anyhow, a long read, but maybe an explanation of why you have seen so many toots from me dealing with DuckDB, or even my profile at DuckDB snippets.
I think it’s valuable to know other technologies and to master more than one (query) language. I value both my Cypher knowledge and everything I learned about SQL in the past. Hopefully, I can teach a bit about both, through my current work and any future publications.
And last but not least: Mark Needham, Michael Hunger and I have worked together to bring out “DuckDB in Action”. The Manning Early Access Program (MEAP) started in October 2023 and the book will be released in 2024. We have more than 50% of the content ready and we would love to hear your feedback.
One comment
Hey Michael,
yes, we also use DuckDB as a lightweight engine to parse and load CSV and sometimes JSON, and to validate data formats.
Since we use Neo4j as a core component for our Information System Cartography (see our #nodes22 talk: https://dev.to/optnc/our-speech-about-it-holism-at-nodes22-1bpl), we had to spend too much time reviewing incoming pull request data (we rely on CSV).
So we have developed a custom GH Action to delegate the review process to DuckDB: https://dev.to/optnc/effortless-data-quality-wduckdb-on-github-2mkb
Thanks for the book and the passion you put into it, with Mark (he is the one who made me discover DuckDB with his YT channel Data w/ Mark) and Michael, for the Neo4j community.
You have a dream team… and I’m a big fan of the three of you.
Yours friendly.
Adrien aka. @rastadidi