Stepping out of my comfort zone: No Rest for the Wicked

“Oh please no, Michael, not sports content here”… Well, sorry, it’s my blog and right now I don’t have a much better place for it. Bear with me, though, there will be some cool database querying later on.

In late 2024 I stumbled upon No Rest for the Wicked 2025. The idea is simple, even though the page describes it in a somewhat complicated way. The challenge is as follows:

Throughout February:

  • Do 24 runs in total
  • Run once each hour of the day
  • Run once on each weekday
  • Each run (or walk / hike) must be at least 5km and at least 45 minutes
  • Take a break of at least 90 minutes between activities

That sounds like “fun”, let’s go:

There’s a race result event that tracks the progress, and we’ll come to that later. At the time of writing I was in the top twenty with one activity to go; I eventually finished eleventh.

The challenge was quite an emotional ride for me, somewhere between “oh my gosh, what did I sign up for?” and “I’ll never manage this”. My favorite hours of the day for running are actually anything between 6 and 10am. Afternoon is ok; evenings I really dislike, especially after long working hours spent sitting. I have a really odd way of putting my feet down while sitting, so in the evening my ankles are always kinda busted. Night running? I had never tried it. As a matter of fact, I try to avoid activities with a high heart rate in the evenings.

So there was a lot going on outside of my comfort zone and I am very content that I did this:

  • I felt good enough to go out running in the middle of the freaking night, several times
  • Running at night was a lot more satisfying than walking; the latter is just too boring alone, and I didn’t want to put on headphones
  • The silence was so nice, and so were the starry nights: a quite unique, deeply satisfying experience
  • Getting up and out is tough beforehand, but doing it is uplifting
  • You are capable of many things you never thought you were
  • I’m happy that I live in a place where I can go out at any time without having to be afraid of anything

Honestly, after two weeks I felt less depressed and down than I usually do in February. In each of the last three years I travelled to the Canaries for a week to get some sun and cycling in, which I couldn’t do this year for personal reasons. This challenge was a great antidote.


The only picture of myself I know of between 3 and 4am

So what does running here have to do with running database queries? Remember that race result event above? I wanted to have a proper overview myself, without manual work, so I created this:


My report of No Rest for the Wicked 2025

Looking at those numbers, would I do things differently? Yeah, I would probably not push back the wee small hours between midnight and 3am to the very end, but do them in the middle, in two nights, instead of spreading them over several days. I wasn’t sure whether I would push through the challenge, and my reasoning was that I didn’t want to fall for sunk costs. Anyhow, how did I create the above report? I went to connect.garmin.com, pulled down my activities as CSV for the past weeks, fired up DuckDB and ran this query:

WITH
  src AS (
    SELECT DISTINCT ON (Datum, Titel)
           *, EXTRACT('hour' FROM Zeit)*60*60 + EXTRACT('minute' FROM Zeit)*60 + EXTRACT('second' FROM Zeit) AS Duration
    FROM read_csv('*.csv', union_by_name=TRUE, filename=FALSE)
    WHERE Titel LIKE 'NRFTW%'
  ),
  activities AS (
    SELECT "Day #"      : dense_rank() OVER (ORDER BY Datum::DATE),
           "Day"        : Datum::DATE,
           "Run #"      : ROW_NUMBER() OVER (),
           "Break"      : age(Datum, COALESCE(date_add(lag(Datum) OVER starts, INTERVAL (lag(Duration) OVER starts) SECOND), Datum))::VARCHAR,
           "Hour of day": HOUR(Datum),
           "Weekday"    : dayname(Datum),
           "Sport"      : CASE Aktivitätstyp
                            WHEN 'Gehen'  THEN 'Walking'
                            WHEN 'Laufen' THEN 'Running'
                            ELSE Aktivitätstyp
                          END,
           "Distance"   : REPLACE(Distanz, ',', '.')::NUMERIC,
           "Duration"   : EXTRACT('hour' FROM Zeit)*60*60 + EXTRACT('minute' FROM Zeit)*60 + EXTRACT('second' FROM Zeit),
           "Progress"   : 100/24,
    FROM src
    WINDOW starts AS (ORDER BY Datum)
  ),
  sums AS (
    SELECT * REPLACE(
               round(SUM(Distance), 2)   AS Distance,
               to_seconds(SUM(Duration)) AS Duration,
               SUM(Progress) OVER dates  AS Progress
           ),
           "p_weekdays" : COUNT(DISTINCT Weekday) OVER (ORDER BY DAY GROUPS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
           "Weekdays"   : COUNT(DISTINCT Weekday) OVER dates
    FROM activities
    GROUP BY GROUPING SETS (("Day #", "Run #", DAY, "Hour of day", Progress, Weekday, Break, Sport), ("Day #", DAY), ())
    WINDOW dates AS (ORDER BY DAY)
  )
SELECT * EXCLUDE(p_weekdays) REPLACE (
           Break::INTERVAL AS Break,
           CASE WHEN "Hour of day" IS NULL AND p_weekdays <> 7 AND Weekdays = 7 THEN '✅' END AS Weekdays,
           CASE
             WHEN DAY           IS NULL THEN lpad(printf('%.2f%%', Progress), 7, ' ')
             WHEN "Hour of day" IS NULL THEN lpad(printf('%.2f%%', Progress), 7, ' ') || ' ' || bar(Progress, 0, 100, 20) END AS Progress
         )
FROM sums
ORDER BY DAY, "Hour of day" NULLS LAST;

It’s so funny how much value one can get out of basically three columns alone. I am really working only with the start date, the duration and the distance; everything else is a derived value. Things to note: the progress is per day and not per run (the latter would always be 1/24 of course), the checkmark when I have hit all the weekdays, and the summary of distance and duration per day and overall. The nice formatting of the break time is a sweet bonus (using the age function). The DuckDB team just released 1.2.0 and put out a couple of blog posts that I really dig and that I was able to utilize in my script:

  • Vertical Stacking as the Relational Model Intended: UNION ALL BY NAME Great blog, great feature. I used it to union all the CSV files (they had different columns, because some of them included cycling watts and some didn’t; see FROM read_csv('*.csv', union_by_name=TRUE, filename=FALSE) in the query)
  • Catching up with Windowing Again, great content and I am a bit jealous because I think it’s more complete than the explanation in our book. I used the GROUPS framing to figure out when I hit all the weekdays
  • Announcing DuckDB 1.2.0 New features in 1.2.0. I am mostly using the alternative projection syntax “X: U” compared to “U AS X” in the inner projection. It just reads better with long expressions.

Apart from that, the query demonstrates once more the power of window functions, computing for example the break between activities as well as the day and activity numbers. It also makes extensive use of the asterisk-expression additions EXCLUDE and REPLACE, which avoid juggling all the columns all over again. Also, take note of the GROUP BY with GROUPING SETS, which is the longer form of a GROUP BY ROLLUP and gives you control over the sets. The last bit is the sweet DISTINCT ON: I can just add more CSV files, and it deduplicates not on all columns, but only on date and title.
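If you haven’t used those last two features much, here is a minimal, self-contained sketch on toy data (not the actual Garmin export) showing how DISTINCT ON deduplicates on a subset of columns and how GROUPING SETS spells out explicitly what a ROLLUP would generate:

-- Toy data standing in for the Garmin CSV export
CREATE TABLE runs (ride_date DATE, title VARCHAR, km NUMERIC);
INSERT INTO runs VALUES
  ('2025-02-01', 'NRFTW 1', 5.2),
  ('2025-02-01', 'NRFTW 1', 5.2), -- duplicate row, e.g. from downloading the same CSV twice
  ('2025-02-02', 'NRFTW 2', 6.0);

-- DISTINCT ON keeps one row per (ride_date, title), regardless of the other columns
SELECT DISTINCT ON (ride_date, title) * FROM runs;

-- GROUPING SETS is the explicit form of ROLLUP(ride_date):
-- one row per date plus the grand total produced by the empty set ()
SELECT ride_date, round(SUM(km), 2) AS km
FROM (SELECT DISTINCT ON (ride_date, title) * FROM runs)
GROUP BY GROUPING SETS ((ride_date), ());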

This is what I like running: challenges and algorithms, locally, on my personal data. Those small and “crazy” challenges are so much nicer than the super commercialised side of running you find with the six-star marathons or super halves. And the same applies to reports like the one I created here, for myself and, through FLOSS and actually sharing how things are done, for others too. For me, being able to do this locally without any big vendor analytics is empowering and brings back a lot of joy that is missing today in many settings.

Challenge accepted and put to bed with walking and running roughly 175km in 24 hours total.

| Comments (0) »

14-Feb-25


Let’s deadlock all the things.

Hej there, yes it’s me, I’m still alive and doing things, such as neo4j-jdbc with its automatic SQL to Cypher translation, or building things like this for my own enjoyment. However, with the tons of meaningless posts and stuff around “Hey ma, look how I do AI with $tool”, I felt a bit out of time these days, just enjoying artisanal coding, requirements analysis and architectural work, and didn’t have much incentive to write more prose. And giving the rest out for AI to swallow and regurgitate, well… 🤷

In the last days, however, a rookie mistake I made gave me some food for thought, and I write this post for anyone new to the Java ecosystem or to programming in general: regardless of how many years of experience you have acquired or how good you think you are, you will always make some stupid mistakes such as the following. Hopefully, you’re as lucky as me and find yourself surrounded by colleagues who help you roll out a fix. And yes, the blast radius of something that runs in $cloud is bigger than that of a bug in a client-side library, especially with the slow update cycles in many industries.

Enough talking, let’s go. I was refactoring some code to not use Java’s HttpURLConnection of old anymore. Here’s basically how it looked:

import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
 
class S1 {
  public static void main(String... a) throws Exception {
    var uri = URI.create("http://randomapi.io");
    var body = "Hello";
 
    // Nice, I get an output stream I can just write to, yeah…
    var connection = (HttpURLConnection) uri.toURL().openConnection();
    connection.setDoOutput(true);
    connection.setRequestMethod("POST");
    try (var out = connection.getOutputStream()) {
      // Reality: objectMapper.writeValue(out, Map.of("some", "content"));
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    connection.connect();
  }
}

This works, but the URL connection is not a great API to work with. Also, it does not support HTTP/2 out of the box (or rather, not at all). The recommended replacement is the Java HttpClient. The same behaviour as above, but with the HttpClient, looks like this; it defaults to HTTP/2:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.net.http.HttpResponse.BodyHandlers;
import java.util.concurrent.Executors;
 
class S2 {
  public static void main(String... a) throws Exception {
    var uri = URI.create("http://randomapi.io");
    var body = "Hello";
 
    try (var client = HttpClient.newBuilder().executor(Executors.newVirtualThreadPerTaskExecutor()).build()) {
      var response = client
        .send(
          HttpRequest.newBuilder(uri).POST(BodyPublishers.ofString(body)).build(),
          BodyHandlers.ofString())
        .body();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

The request body is prepared by BodyPublishers.ofString(body). There are quite a few other publishers, creating bodies from byte arrays, files, reactive streams and more. What does not exist is a way to just get an output stream from the HTTP client that one can write to directly, like in the above example using the URL connection. So, assuming I am a Jackson user and I want to write some rather large JSON as the request body, I would need to materialise this as a byte array in memory or deal with the shenanigans of ByteBuffer, which I really didn’t feel like. Instead, I thought, hey, there’s the pair of PipedInputStream and PipedOutputStream, let’s use a body publisher from an input stream supplier like the following:

var publisher = HttpRequest.BodyPublishers.ofInputStream(() -> {
  var in = new PipedInputStream();
  try (var out = new PipedOutputStream(in)) {
    outputStreamConsumer.accept(out);
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
  return in;
});

Easy, right? While the pair is made exactly for the above purpose, this usage is 100% wrong, and the JavaDoc is quite explicit about it:

Attempting to use both objects from a single thread is not recommended, as it may deadlock the thread. The piped input stream contains a buffer, decoupling read operations from write operations, within limits.

I did actually read this, set it up like above nevertheless, tested it, and it “worked”. I thought, yeah, the HttpClient is asynchronous under the hood anyway, let’s go. Stupid me: I used its blocking API, and of course reading and writing occurred on the same thread. Why did it not show up in tests? Rookie mistake number two: not testing proper extremes, like 0, -1, Integer.MAX_VALUE, a lizard and null. “Within limits” means: there’s a default buffer of 1024 bytes, and any request body bigger than that, coming from wherever, would blow up.

Do this often enough, and you have a procedure that happily drains all the available threads from a system by deadlocking them, congratulations.

What would have prevented that? Taking a step down from mount stupid to begin with. A second pair of eyes. Better testing. Would AI have helped me here? I am unsure. Maybe in generating test input. On API usage: most likely not. It’s all correct, and Java has no way in the language to make me aware of it. Maybe an IDE with a proper base model for semantic analysis could spot it; IntelliJ does not.

How did I solve it in the end? I was not happy using a byte[] array, so I really wanted to keep that pipe, and a working solution looks like this:

import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.charset.StandardCharsets;
 
class S3 {
  public static void main(String... a) throws Exception {
    var uri = URI.create("http://randomapi.io");
    var body = "Hello";
 
    try (var client = HttpClient.newBuilder().build()) {
      var bodyPublisher = BodyPublishers.ofInputStream(() -> {
        var in = new PipedInputStream();
        var out = new PipedOutputStream();
        try {
          out.connect(in);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
        Thread.ofVirtual().start(() -> {
          try (out) {
            // Here the stream can be passed to Jackson or whatever you have 
            // that's generating output
            out.write(body.getBytes(StandardCharsets.UTF_8));
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
        return in;
      });
      var response = client
        .send(
          HttpRequest.newBuilder(uri).POST(bodyPublisher).build(),
          BodyHandlers.ofString()
        ).body();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

The important pieces here are:

  • Create the input and output streams independently
  • Connect them outside both the reading and writing thread
  • Kick off the writing thread, and since we are on modern Java, just use a virtual thread for it. In the end, they exist exactly for that purpose: to be used when something potentially blocking is happening
  • Just return the input stream to the HTTP Client, it will take care of using and later closing it.
  • Close the output stream inside the writing thread (note the try-with-resources statement that “imports” the stream)

A runnable solution is below. I am using an implicitly declared class with a plain main method here, so that I can have a proper Java script without the boilerplate of old. It brings up an HTTP server for testing as well, which just mirrors its input. It then uses both methods described above to POST to that server, gets the response and asserts it. Use Java 24 to run it with java -ea --enable-preview Complete.java.

Fun was had the last days.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.Executors;
 
import com.sun.net.httpserver.HttpServer;
 
void main() throws IOException {
  var port = 4711;
  var server = HttpServer.create(new InetSocketAddress(port), 0);
  try {
    server.createContext("/mirror", exchange -> {
      byte[] response;
      try (var in = new BufferedInputStream(exchange.getRequestBody())) {
        response = in.readAllBytes();
      }
 
      exchange.sendResponseHeaders(HttpURLConnection.HTTP_OK, response.length);
      var outputStream = exchange.getResponseBody();
      outputStream.write(response);
      outputStream.flush();
      outputStream.close();
    });
    server.setExecutor(null);
    server.start();
 
    var secureRandom = new SecureRandom();
    var buffer = new byte[(int) Math.pow(2, 16)];
    secureRandom.nextBytes(buffer);
    var requestBody = Base64.getUrlEncoder().withoutPadding().encodeToString(buffer);
 
    var uri = URI.create("http://localhost:%d/mirror".formatted(port));
 
    var responseBody = useUrlConnection(uri, requestBody);
    assert responseBody.equals(requestBody);
    responseBody = useHttpClient(uri, requestBody);
    assert responseBody.equals(requestBody);
  } finally {
    server.stop(0);
  }
}
 
private static String useUrlConnection(URI uri, String body) throws IOException {
  var urlConnection = (HttpURLConnection) uri.toURL().openConnection();
  urlConnection.setDoOutput(true);
  urlConnection.setRequestMethod("POST");
  try (var out = urlConnection.getOutputStream()) {
    out.write(body.getBytes(StandardCharsets.UTF_8));
  }
  urlConnection.connect();
  try (var in = new BufferedInputStream(urlConnection.getInputStream())) {
    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
  }
}
 
private static String useHttpClient(URI uri, String body) throws IOException {
  try (var client = HttpClient.newBuilder().executor(Executors.newVirtualThreadPerTaskExecutor()).build()) {
    return client.send(HttpRequest.newBuilder(uri).POST(
      BodyPublishers.ofInputStream(() -> {
        var in = new PipedInputStream();
        //noinspection resource
        var out = new PipedOutputStream();
        try {
          out.connect(in);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
        Thread.ofVirtual().start(() -> {
          try (out) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
        return in;
      })).build(), BodyHandlers.ofString()).body();
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new IOException(e.getMessage());
  }
}

| Comments (1) »

05-Feb-25


Some thoughts about user-defined functions (UDFs) in databases

Last week I came across a Twitter thread about using stored procedures, or in other words, user-defined functions in databases.

Every myth about them was already debunked some 13 years ago in this excellent post: Mythbusters: Stored Procedures Edition. I can completely get behind it, but I would also like to add some personal thoughts and share some memories:

In the past I worked for years, in fact over a decade, on big systems in the utilities sector. IHL evolved from an Oracle Forms 6i application into a Java desktop application and, from what I hear, into a web application over the last 24 years. While the clients changed, the database model stayed pretty consistent through the years (a relational model, actually, and maybe you understand why the paper What comes around by Andrew Pavlo and Michael Stonebraker resonates with me, even though I work at Neo4j these days).
Even in the first decade of the 2000s we followed the Pink database paradigm, even though we didn’t call it that. A lot of our API schema was enriched or even defined by various stored procedures, written in PL/SQL and *gosh* Java (yes, the Oracle database could do this even back in 2010; these days, powered by GraalVM, you can even run proper JavaScript in the thing). Why was this useful? We had a lot of complicated computations, such as “if that power pole falls, in what radius will it fall, where will the lines end, does it kill a cow or hit a kindergarten?”, and we could use the result straight in a view without first having to grab all the necessary data, compute, and write back… Furthermore, it allowed us to move quite quickly all the time: instead of rolling out new client versions all the time, we could just roll out improved PL/SQL code, without any downtime.

The thing I find most funny is that one of the loudest arguments against stored procedures is “but you cannot test them…”: of course you can. But, and this is my main point, I would bet money that a fair share of the people who bring up this argument are the same ones who don’t write tests anyway, because $reasons, such as “my manager does not allow it”, “no time”, etc…

Of course, if you have fallen for the “we need to rewrite our app every year, replace all the frameworks each quarter” mindset, then neither my ranting, the myth buster above nor the linked paper will convince you that there might be a different, more stable solution, and I can keep on using this slide more often in the future.

Anyway, this week I realized that the 1.x releases of DuckDB also support UDFs, or macros as they call them, and I was like sh*t, we don’t have that in the book… Maybe a good opportunity to set a reminder for a 2nd edition, if we ever sell that many… Nevertheless, I needed to try this. And what better target than my sports page, which received its 3rd major overhaul since 2014: biking.michael-simons.eu/. I wanted to add my age group to my favorite achievements. In Germany there are some halfway complicated rules to compute it, and I did so in a stored procedure. That diff creates the function and calls it in a view, which defines the API; the only thing I had to touch in the app is the actual view layer, as the client code is just one statement that couldn’t be simpler: con.execute('FROM v_reoccurring_events').fetchall().
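The macro itself is roughly in this spirit; a minimal sketch with a made-up events table and a deliberately simplified age-group rule, not the actual code from the diff linked above:

-- Hypothetical, simplified age-group rule; the real German rules are more involved
CREATE OR REPLACE MACRO age_group(birth_year, event_year) AS
  CASE
    WHEN event_year - birth_year < 30 THEN 'Open'
    ELSE 'M' || CAST((event_year - birth_year) // 5 * 5 AS VARCHAR)
  END;

-- The macro is then used straight inside the view that forms the API towards the application
-- (table name and birth year are placeholders)
CREATE OR REPLACE VIEW v_reoccurring_events AS
SELECT title, event_year, age_group(1979, event_year) AS age_group
FROM events;

With the macro and the view in place, I can now flex a bit like this: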



What about Neo4j? I am so glad you asked: of course we offer a framework to enhance our query language Cypher with user-defined procedures, and I happen to own the example repository, neo4j-procedure-template, which includes a proper test setup (btw, one of the *rare* cases in which I would actually *not* use Testcontainers but our dedicated testing framework, because the overall turnaround is faster: you skip the repackaging).

For Neo4j, enhanceability is key: we do evolve Cypher (and GQL) with great care, but often not as fast as a customer, the necessities of a deal or a support case might require. Neo4j APOC started as a set of curated functions and procedures, and we have promoted it ever since. From Neo4j 5 onwards, parts of it even became something like a standard library, delivered with the product.

Is there something I personally would add? Actually, yes: we only have compiled procedures at the moment, written in any JVM language. This is ok to some extent, but I’d love to see an enhancement here so that I could *just* do a create or replace function in our database as well. If it were any of the seasonal holidays that bring gifts, maybe even in something other than Java. A while back I started an experiment using GraalVM for that, Neo4j polyglot stored procedures, and I would love to revive it at some point.

My little rant does not even come close to the two I read last month (1, 2), but I hope you enjoyed it nevertheless. If you take one thing away: there’s always more than one solution to a problem. Making better use of a database product you most likely pay for anyway is a proper good idea. Be it optimizing your model (there are probably a ton of books on SQL, fewer for Neo4j, hence I can really recommend The definitive Guide to Neo4j by my friends Christophe and Luanne, which has excellent content on modeling), understanding its query language and capabilities better, or enhancing it to your needs: it will be beneficial in the long run.

To quote one of the rants above: Happy Apathetic coding and don’t forget to step outside for a proper run sometime.

Title picture by Jan Antonin Kolar on Unsplash.

| Comments (1) »

06-Jul-24


Run your integration tests against Testcontainers with GraalVM native image

I have started working with a small team on an exciting project at Neo4j. The project is about database connectivity (what else) and we use Testcontainers in our integration tests, asserting the actual network connectivity and eventually the API.

The thing we are creating should of course also work when being compiled as part of an application into a native executable by GraalVM. For a bunch of older projects with a somewhat convoluted test-setup I used to create dedicated, small applications that produce some output. I compiled these apps into a native binary and used a scripted run to eventually assert the output.

For this new project I wanted to take the opportunity to use the Maven plugin for GraalVM Native Image building and its test capabilities directly.
The plugin works great and the maintainers are, like the whole GraalVM team, quite quick at fixing any issue. We already use it on several occasions to produce native binaries as part of a distribution, but so far not for testing.

I personally find the documentation linked above not sufficient to create a proper test setup. Especially the section “Testing support” does not work for me: neither the latest Surefire nor Failsafe plugins bring in the required dependency org.junit.platform:junit-platform-launcher. This artifact contains a test execution listener, org.junit.platform.launcher.listeners.UniqueIdTrackingListener, that tracks each executed test and stores its unique id in a file. The GraalVM plugin will use that file to discover the tests it needs to run in native mode. If the file is not generated, it will yell at you with

[ERROR] Test configuration file wasn’t found. Make sure that test execution wasn’t skipped.

I can’t share the thing I am actually testing right now, so here’s something similar. The code below uses the Neo4j Java Driver and creates a data access object interacting with the Neo4j database:

package demo;
 
import java.util.Map;
import java.util.stream.Collectors;
 
import org.neo4j.driver.Driver;
 
public record Movie(String id, String title) {
 
	public static final class Repository {
 
		private final Driver driver;
 
		public Repository(Driver driver) {
			this.driver = driver;
		}
 
		public Movie createOrUpdate(String title) {
			return this.driver.executableQuery("MERGE (n:Movie {title: $title}) RETURN n")
				.withParameters(Map.of("title", title))
				.execute(Collectors.mapping(r -> {
					var node = r.get("n").asNode();
					return new Movie(node.elementId(), node.get("title").asString());
				}, Collectors.toList()))
				.stream()
				.findFirst()
				.orElseThrow();
		}
 
	}
}

and the integration test looks like this. There are no surprises in it: it is disabled without Docker support, uses a per-class lifecycle so that I can keep a reusable test container around for all tests, and contains a sample test:

package demo;
 
import static org.junit.jupiter.api.Assertions.assertNotNull;
 
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.TestInstance;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.testcontainers.containers.Neo4jContainer;
import org.testcontainers.junit.jupiter.Testcontainers;
 
@Testcontainers(disabledWithoutDocker = true)
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class RepositoryIT {
 
	protected final Neo4jContainer<?> neo4j = new Neo4jContainer<>("neo4j:5.13.0")
		.waitingFor(Neo4jContainer.WAIT_FOR_BOLT)
		.withReuse(true);
 
	protected Driver driver;
 
	@BeforeAll
	void startNeo4j() {
		this.neo4j.start();
		this.driver = GraphDatabase.driver(this.neo4j.getBoltUrl(),
				AuthTokens.basic("neo4j", this.neo4j.getAdminPassword()));
	}
 
	@AfterAll
	void closeDriver() {
		if (this.driver == null) {
			return;
		}
		this.driver.close();
	}
 
	@Test
	void repositoryShouldWork() {
 
		var repository = new Movie.Repository(driver);
		var newMovie = repository.createOrUpdate("Event Horizon");
		assertNotNull(newMovie.id());
	}
 
}

Now, let’s walk through the pom.xml. For my dependencies, I usually check whether they have a BOM file and import those into dependency management, so that I can rely on other projects organising their releases and transitive dependencies properly. Here I have the latest JUnit and Testcontainers plus SLF4J. From the latter I’ll only use the simple logger later, so that I can see the Testcontainers logging:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.junit</groupId>
            <artifactId>junit-bom</artifactId>
            <version>5.10.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-bom</artifactId>
            <version>2.0.9</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>testcontainers-bom</artifactId>
            <version>1.19.1</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

The relevant test dependencies then look like this. I kinda grouped them together, but please note how I include the JUnit launcher mentioned above explicitly. It is not part of the JUnit core dependencies and, at least for me, was not brought in transitively by either Failsafe or Surefire:

<dependencies>
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.junit.platform</groupId>
        <artifactId>junit-platform-launcher</artifactId>
        <scope>test</scope>
    </dependency>
 
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.testcontainers</groupId>
        <artifactId>junit-jupiter</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.testcontainers</groupId>
        <artifactId>neo4j</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

“But Michael, didn’t you think about the fact that you usually won’t get the latest Failsafe and Surefire plugins with default Maven?” Of course I did. This is how I configure Failsafe. Take note how I set a (Java) system property for the integration tests. While the UniqueIdTrackingListener is on the class path, it is disabled by default and must be enabled with the property below (yes, I did read the sources for that). The rest is just the usual dance for setting up integration tests:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-failsafe-plugin</artifactId>
    <version>3.2.1</version>
    <configuration>
        <systemPropertyVariables>
            <junit.platform.listeners.uid.tracking.enabled>true</junit.platform.listeners.uid.tracking.enabled>
        </systemPropertyVariables>
    </configuration>
    <executions>
        <execution>
            <goals>
                <goal>integration-test</goal>
                <goal>verify</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Now onto the GraalVM native-maven-plugin. I usually wrap this into a dedicated profile to be activated with a system property, like in the listing below. The documentation says that one must use <extensions>true</extensions> as part of the configuration in order to use the recommended JUnit Platform test listener mode, but that didn’t work for me. I guess it should in theory avoid having to set the above system property.

The next important part in the listing, at least when you want to use Testcontainers, is enabling the GraalVM Reachability Metadata Repository. This repository contains the required configuration shims for quite a number of libraries, including Testcontainers. If you don’t enable it, Testcontainers won’t work in native mode:

<profile>
    <id>native-image</id>
    <activation>
        <property>
            <name>native</name>
        </property>
    </activation>
    <build>
        <plugins>
            <plugin>
                <groupId>org.graalvm.buildtools</groupId>
                <artifactId>native-maven-plugin</artifactId>
                <version>0.9.28</version>
                <extensions>true</extensions>
                <configuration>
                    <metadataRepository>
                        <enabled>true</enabled>
                    </metadataRepository>
                </configuration>
                <executions>
                    <execution>
                        <id>test-native</id>
                        <goals>
                            <goal>test</goal>
                        </goals>
                        <phase>verify</phase>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</profile>

One cannot praise the people from VMware’s Spring and Spring Boot teams enough for bringing their knowledge of so many libraries into that repository.

With that in place, you can run your integration tests on the JVM and as a native image like this:

mvn -Dnative verify

You first see the usual dance of integration tests, then the GraalVM compilation

[1/8] Initializing...                                                                                    (5,9s @ 0,25GB)
 Java version: 17.0.9+11-LTS, vendor version: Oracle GraalVM 17.0.9+11.1
 Graal compiler: optimization level: 2, target machine: armv8-a, PGO: off
 C compiler: cc (apple, arm64, 15.0.0)
 Garbage collector: Serial GC (max heap size: 80% of RAM)
 1 user-specific feature(s)
 - org.graalvm.junit.platform.JUnitPlatformFeature
[junit-platform-native] Running in 'test discovery' mode. Note that this is a fallback mode.
[2/8] Performing analysis...  [*****]                                                                   (21,0s @ 1,48GB)
  13.384 (87,45%) of 15.304 types reachable
  22.808 (62,52%) of 36.483 fields reachable
  73.115 (60,68%) of 120.500 methods reachable
   4.250 types, 1.370 fields, and 3.463 methods registered for reflection
      99 types,   102 fields, and   102 methods registered for JNI access
       5 native libraries: -framework CoreServices, -framework Foundation, dl, pthread, z
[3/8] Building universe...                                                                               (2,6s @ 1,40GB)
[4/8] Parsing methods...      [**]                                                                       (2,6s @ 1,62GB)
[5/8] Inlining methods...     [***]                                                                      (1,4s @ 1,49GB)
[6/8] Compiling methods...    [******]                                                                  (37,1s @ 2,99GB)
[7/8] Layouting methods...    [**]                                                                       (4,0s @ 4,12GB)
[8/8] Creating image...       [**]                                                                       (4,4s @ 1,49GB)
  37,40MB (58,34%) for code area:    40.731 compilation units
  25,78MB (40,21%) for image heap:  338.915 objects and 80 resources
 951,36kB ( 1,45%) for other data
  64,11MB in total
------------------------------------------------------------------------------------------------------------------------
Top 10 origins of code area:                                Top 10 object types in image heap:
  14,74MB java.base                                            7,80MB byte[] for code metadata
   5,94MB testcontainers-1.19.1.jar                            3,21MB byte[] for java.lang.String
   3,85MB java.xml                                             2,46MB java.lang.String
   3,79MB svm.jar (Native Image)                               2,44MB byte[] for general heap data
   1,12MB neo4j-java-driver-5.13.0.jar                         2,39MB java.lang.Class
   1,08MB netty-buffer-4.1.99.Final.jar                        1,50MB byte[] for embedded resources
 938,35kB docker-java-transport-zerodep-3.3.3.jar            879,17kB byte[] for reflection metadata
 683,00kB netty-transport-4.1.99.Final.jar                   627,38kB com.oracle.svm.core.hub.DynamicHubCompanion
 655,80kB netty-common-4.1.99.Final.jar                      438,19kB java.util.HashMap$Node
 490,98kB jna-5.12.1.jar                                     384,69kB c.o.svm.core.hub.DynamicHub$ReflectionMetadata
   3,95MB for 58 more packages                                 3,40MB for 2755 more object types
------------------------------------------------------------------------------------------------------------------------

And shortly after that:

[main] INFO org.testcontainers.DockerClientFactory - Checking the system...
[main] INFO org.testcontainers.DockerClientFactory - ✔︎ Docker server version should be at least 1.6.0
[main] INFO tc.neo4j:5.13.0 - Creating container for image: neo4j:5.13.0
[main] INFO tc.neo4j:5.13.0 - Reusing container with ID: 56cc1b02f9b0ebfcc8670f5cdc54b5b3a85a4720e8d810548986369a557482ca and hash: aa81fad313c4f8e37e5b14246fe863c7dbc26db6
[main] INFO tc.neo4j:5.13.0 - Reusing existing container (56cc1b02f9b0ebfcc8670f5cdc54b5b3a85a4720e8d810548986369a557482ca) and not creating a new one
[main] INFO tc.neo4j:5.13.0 - Container neo4j:5.13.0 started in PT0.280264S
demo.RepositoryIT > repositoryShouldWork() SUCCESSFUL
 
 
Test run finished after 608 ms
[         2 containers found      ]
[         0 containers skipped    ]
[         2 containers started    ]
[         0 containers aborted    ]
[         2 containers successful ]
[         0 containers failed     ]
[         1 tests found           ]
[         0 tests skipped         ]
[         1 tests started         ]
[         0 tests aborted         ]
[         1 tests successful      ]
[         0 tests failed          ]

What I like here is the fact that I don’t have anything special in my test classes, no weird hierarchies nor any additional annotations. In the project we are working on, we keep all our integration tests in a separate Maven module, as we want to make sure we are testing the packaged jar proper (also because we have integration tests for both the Java class path and the module path in separate Maven modules). This setup now gives us the additional advantage that the packaging of our library is tested under native image, too. It will let you discover issues such as missing resources in the native image as well.

Anyway, the whole project is shared as a gist; it’s only three files.

The image of this post was generated with Dall-E by my friend Michael.

| Comments (1) »

25-Oct-23


Why would a Neo4j person be so fond of an embedded, relational database?

I have been working for Neo4j since 2018. At Neo4j I maintain both Spring Data Neo4j and Neo4j-OGM, object mappers and entity managers for our database product. This is a great job in a great company, with awesome colleagues such as my friends Gerrit and Michael.

Some other projects I created on the job are the Cypher-DSL, a builder for our query language Cypher, which is used extensively inside Spring Data Neo4j, in products of our partner GraphAware and by a whole bunch of other customers. Headed for its 100th star is Neo4j-Migrations, a database migration toolkit. Last but not least, I created the original Neo4j Testcontainers module and was recognized as a Testcontainers Champion for that.

However, in a previous lifetime I got “socialized” in an SQL shop (actually, an Oracle shop), and while I swore more than 20 years ago during my studies that I would never do anything with SQL, there I was. For whatever reason, my head actually works pretty well with the relational model and the questions I can answer with it. I spent about 15 years in that company doing all kinds of things, such as geospatial applications, applications computing energy forecasts based on past measurements, and stuff like that. What all of that had in common was SQL.

Just a couple of months prior to joining Neo4j, I did a talk under the title Live with your SQL-fetish and choose the right tool for the job, in which I presented jOOQ, way back before it became the hot topic. Anyhow, a lot of the dataset from that talk I eventually used in Neo4j-specific talks too, taking my musical habits into a knowledge graph.

What I am trying to say here is: I am deeply involved in the Graph ecosystem, we have a great product and tools, and I’m glad I have a part in those. But I also think that other query languages have their benefits and also that other projects and companies are doing great work, too.

So, DuckDB, what is it? DuckDB is an “in-process SQL OLAP database management system”, and you can go ahead and try it out in your browser, because it can be compiled to WASM.

DuckDB’s query engine is vector based. Every bit of data that flows through it is a collection of vectors on which algorithms can often be applied in parallel; see Execution Format. That’s the groundwork for DuckDB’s fast, analytical queries. From Why DuckDB: “Online analytical processing (OLAP) workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables.” In its vectorized query execution engine, queries are interpreted and processed in large batches of values in one operation. DuckDB can also query foreign stores, such as Postgres and SQLite, and there are scenarios in which its engine is actually faster while doing so than native Postgres.
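As a rough illustration of querying a foreign store, this is what attaching a Postgres database can look like with a recent DuckDB and its postgres extension; the connection string and table name are made up:

INSTALL postgres;
LOAD postgres;

-- Attach a running Postgres instance read-only and query it straight from DuckDB
ATTACH 'dbname=shop user=postgres host=127.0.0.1' AS pg (TYPE postgres, READ_ONLY);
SELECT count(*) FROM pg.public.orders;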

When going through their SQL documentation, you will be surprised how much you get essentially for free in one small executable. It’s all there: CTEs, window functions, ASOF joins, PIVOT and many things that make SQL friendlier. Back then we had to run a big Oracle installation for similar things.
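ASOF joins are a good example of something I would previously have emulated with window functions and a self-join; here is a small sketch on two toy tables that matches every ride with the most recent weather reading at or before its start:

CREATE TABLE rides   (started_at TIMESTAMP, km NUMERIC);
CREATE TABLE weather (measured_at TIMESTAMP, temperature NUMERIC);

-- For each ride, pick the latest weather row with measured_at <= started_at
SELECT r.started_at, r.km, w.temperature
FROM rides r
ASOF JOIN weather w ON r.started_at >= w.measured_at;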

If you follow me on social media you might have noticed that my private focus has shifted over the last years; I have been doing a lot of sport and training, and fewer side projects that involve any big setups. This is where a small tool that runs without a server installation comes in super handy: I was able to define a neat schema for my photovoltaic systems with a bunch of views and now have a good-enough dashboard.

My biking page has been remodeled to be a static page these days but uses a DuckDB database beneath, see biking.michael-simons.eu.

At the moment, I neither need nor want more. And as sad as it might be, I don’t have a graph problem in either of those applications. I want to aggregate measurements, do analytics, and that’s it. It’s a lot of time series that I’m working with, and graph doesn’t really help me there.

Those are use cases that are unrelated to work.

But I often do have use for DuckDB at work, too. Neo4j can natively ingest CSV files. You need to write some Cypher to massage them into the graph you want, but still: that CSV must be properly formatted.

DuckDB, on the other hand, can read CSV, Parquet, JSON and other formats as if they were tables. These sources can be files or URLs. It does not require you to create a schema and persist the data in its own store first; it can just query things. Querying data without persisting it might be unusual for a database and seems counter-intuitive at first, but it is useful in the right situations. DuckDB does not need to persist the content, but you are free to create views over it. You can happily join a CSV file with a JSON file and literally copy the result to a new CSV file that is well suited for Neo4j.
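A sketch of that workflow, with made-up file and column names:

-- Join a CSV file with a JSON file in place and write a Neo4j-friendly CSV
COPY (
  SELECT p.name, p.born, m.title
  FROM read_csv_auto('people.csv') p
  JOIN read_json_auto('movies.json') m ON m.director = p.name
) TO 'movies_and_directors.csv' (HEADER, DELIMITER ',');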

This can all be done either in the CLI or as part of a pipeline in a script.

DuckDB replaced a bunch of tools in my daily usage, such as xsv and, to some extent, jq. DuckDB’s JSON processing capabilities allow you to query and normalize many complex and denormalized JSON documents, but in some cases it’s not enough… It’s just that you can do so many “interesting” things with JSON 😉

Anyhow, a long read, but maybe an explanation of why you have seen so many toots by me dealing with DuckDB, or even my profile at DuckDB snippets.

I think it’s valuable to know other technologies and to master more than one (query) language. I value both my Cypher knowledge and everything I learned about SQL in the past. Hopefully, I can teach a bit about both, with my current work and any future publications.

And last but not least: Mark Needham, Michael Hunger and I have worked together to bring out “DuckDB in Action”. The Manning Early Access Program (MEAP) started in October 2023 and the book will be released in 2024. We have more than 50% of the content ready and we would love to hear your feedback:

| Comments (2) »

05-Oct-23