No silver bullets here: Accessing data stored in Neo4j on the JVM

In the previous post I presented various ways how to get data into Neo4j. Now that you have a lot of connected data and it’s attributes, how to access, manipulate, add to them and delete them?

I’m working with and in the Spring ecosystem quite a while now and for me the straight answer is – without much surprises – just use the Spring Data Neo4j module if you work inside the Spring ecosystem. But to surprise of some, there’s more than just Spring there outside.

In this blog post I walk you through

  • Using the Neoj4 Java-Driver directly
  • Creating an application based on Micronaut, which went 1.0 GA these days, the Neo4j Java-Driver and Neo4-OGM
  • A full blown Spring Boot application using Spring Data Neo4j

Before we jump right into some of the options you as an application developer have to access data inside Neo4j, we have to get a clear idea of some of the building blocks and moving parts involved. Let’s get started with those.

Building blocks and moving parts

Neo4j Java-Driver

The most important building block for access Neo4j on the JVM is possibly the Neo4j Java Driver. The Java driver is open source and is available on Github under the Apache License. This driver uses the binary “Bolt” protocol.

You can think of that driver as analogue to a JDBC driver that available for a relational database. Neo4j also offers drivers for different languages based on the Bolt protocol.

As with Java’s JDBC driver, there’s a bit of ceremony involved when working with this driver. First you have to acquire a driver instance and then open a session from which you can query the database:

try (
    Driver driver = GraphDatabase.driver( uri, AuthTokens.basic( user, password ) );
    Session session = driver.session()
) {
    List<String> artistNames =
        session
            .readTransaction(tx -> tx.run("MATCH (a:Artist) RETURN a", Map.of()))
            .list(record -> record.get("a").get("name").asString());
}

With that code, one connects against the database and retrieves the names of all artists, I imported in my previous post. What I omitted here is the fact that the driver does connection pooling and one should not open and close it immediately. Instead, you would have to write some boiler plate code to handle this.

There are some important things to notice here: The code speaks of a driver. That is org.neo4j.driver.v1.Driver. The session is also from the same package: org.neo4j.driver.v1.Session. Those both are types from the driver itself. You have to know this things, because those terms will pop up later again. Neo4j-OGM, the object graph mapper, also speaks about drivers and session, but those are completely different things.

The Java driver has a nice type system (see The Cypher type system) and gets you quite far.

Most of the time however, people in the Java ecosystem prefer nominal typing over structural typing and want to map “all the things database” to objects of some kind. Let’s not get into bikeshedding here but just accept things as they are. Given a database model where a musical artist has multiple links to different wikipedia sites, represented like this (I omitted getter and setter for clarity):

public class WikipediaArticleEntity implements Comparable<WikipediaArticleEntity> {
 
    private Long id;
 
    private String site;
 
    private String title;
 
    private String url;
 
    public WikipediaArticleEntity(String site, String title, String url) {
        this.site = site;
        this.title = title;
        this.url = url;
    }
}
 
public class ArtistEntity {
 
    private String name;
 
    private String wikidataEntityId;
 
    private Set<WikipediaArticleEntity> wikipediaArticles = new TreeSet<>();
 
    public ArtistEntity(String name, String wikidataEntityId, Set<WikipediaArticleEntity> wikipediaArticles) {
        this.name = name;
        this.wikidataEntityId = wikidataEntityId;
        this.wikipediaArticles = wikipediaArticles;
    }
}

To fill such a model directly by interacting purely with the driver, you’ll have to do something like this: A driver session get’s opened, than we write a query in Neo4j’s declarative graph query language called Cypher, execute and map all the returned records and nodes:

public List<ArtistEntity> findByName(String name) {
    try (Session s = driver.session()) {
        String statement
            = " MATCH (a:Artist) "
            + " WHERE a.name contains $name "
            + " WITH a "
            + " OPTIONAL MATCH (a) - [:HAS_LINK_TO] -> (w:WikipediaArticle)"
            + " RETURN a, collect(w) as wikipediaArticles";
 
        return s.readTransaction(tx -> tx.run(statement, Collections.singletonMap("name", name)))
            .list(record -> {
                final Value artistNode = record.get("a");
                final List<WikipediaArticleEntity> wikipediaArticles = record.get("wikipediaArticles")
                    .asList(node -> new WikipediaArticleEntity(
                        node.get("site").asString(), node.get("title").asString(), node.get("url").asString()));
 
                return new ArtistEntity(
                    artistNode.get("name").asString(),
                    artistNode.get("wikidataEntityId").asString(),
                    new HashSet<>(wikipediaArticles)
                );
            });
    }
}

(This code is part of my example how to interact with Neo4j from a Micronaut application, find its source here and the whole application here.)

While this works, it’s quite an effort: For a simple thing (one root aggregate, the artist, with some attributes), a query that is not that simple anymore and a lot of manual mapping. The query makes good use of a standardized multiset (the collect-statement), to avoid having n+1 queries or deduplication of things on the client site, but all this mapping is kinda annoying for a simple READ operation.

Enter

Neo4j-OGM

Neo4j-OGM stands for Object-Graph-Mapper. It’s on the same level of abstraction as JPA/Hibernate are for relational databases. There’s extensive documentation: Neo4j-OGM – An Object Graph Mapping Library for Neo4j. An OGM maps nodes and relationships in the graph to objects and references in a domain model. Object instances are mapped to nodes while object references are mapped using relationships, or serialized to properties. JVM primitives are mapped to node or relationship properties.

Given the example from above, we only have to add a handful of simple annotations to make our domain usable with Neo4j-OGM:

@NodeEntity("WikipediaArticle")
public class WikipediaArticleEntity implements Comparable<WikipediaArticleEntity> {
 
    @Id
    @GeneratedValue
    private Long id;
 
    private String site;
 
    private String title;
 
    private String url;
 
    WikipediaArticleEntity() {
    }
 
    public WikipediaArticleEntity(String site, String title, String url) {
        this.site = site;
        this.title = title;
        this.url = url;
    }
}
 
 
@NodeEntity("Artist")
public class ArtistEntity {
 
    @Id
    @GeneratedValue
    private Long id;
 
    private String name;
 
    private String wikidataEntityId;
 
    @Relationship("HAS_LINK_TO")
    private Set<WikipediaArticleEntity> wikipediaArticles = new TreeSet<>();
 
    ArtistEntity() {
    }
 
    public ArtistEntity(String name, String wikidataEntityId, Set<WikipediaArticleEntity> wikipediaArticles) {
        this.name = name;
        this.wikidataEntityId = wikidataEntityId;
        this.wikipediaArticles = wikipediaArticles;
    }
}

Notice @NodeEntity on the classes, @Relationship on the attribute wikipediaArticles of the ArtistEntity-class and some technical details, mainly @Id @GeneratedValue, needed to map Neo4j's internal, technical ids to instances of the classes and vice-versa.

@NodeEntity and @Relationship are used not only to mark the classes and attributes as something to store in the graph, but also to specify labels to be used for the nodes and names for the relationship.

The whole query than folds together into something like this:

public Iterable<ArtistEntity> findByName(String name) {
    return this.session
        .loadAll(ArtistEntity.class, new Filter("name", ComparisonOperator.CONTAINING, name), 1);
}

Quite a different, right? Dealing with the driver, the driver's session and Cypher has been abstracted away. Take note that the above session attribute is not a Driver's session, but OGM's session. This is a bit confusing when you start using those things.

Again, this code is part of my example how to interact with Neo4j from a Micronaut application. The complete source of the above is here and the whole application here.

To be fair, Neo4j-OGM needs to be configured as well. This is done in it's simplest form with a drivers instance and a list of packages that contains domain entities as described above, for example like this:

public SessionFactory createSessionFactory(Driver driver) {
    return new org.neo4j.ogm.session.SessionFactory(
        new BoltDriver(driver), "ac.simons.music.micronaut.domain");
}

The driver instance in the example above is instantiated by Micronaut. With Micronaut's configuration support, it would have been manually configured as in the very first example.

In a Spring Boot application, Spring Boot takes care of the driver and Spring Data Neo4j creates the OGM session and deals with transactions, among other things:

Spring Data Neo4j

Let's start with quoting Spring Data:

Spring Data’s mission is to provide a familiar and consistent, Spring-based programming model for data access while still retaining the special traits of the underlying data store. It makes it easy to use data access technologies, relational and non-relational databases, map-reduce frameworks, and cloud-based data services.

That goes so far, that Craig Walls is fairly correct when he says, that many stores "are mostly the same from a Spring Data perspective":

Spring Data Neo4j has some specialities, but on a superficial level, the above statement is correct.

Spring Data depends on the Spring Framework and given that, it's kinda hard to get it to work in environments other than Spring. If you're however using Spring Framework already, I wouldn't think twice to add Spring Data to the mix, regardless whether I have to deal with a relational database or Neo4j.

Given the entity ArtistEntity above, one can just declare a repository as this:

interface ArtistRepository extends Neo4jRepository<ArtistEntity, Long> {
    List<ArtistEntity> findByNameContaining(String name);
}

There is no need to add an implementation for that interface, this is done by Spring Data. Spring Data also wires up a Neo4j-OGM session that is aware of Spring transactions.

From an application developers point you don't have to deal with mapping, opening and closing sessions and transactions any longer, but only with one single "repository" as abstraction over a set of given entities.

Please be aware that the idea behind Spring Data and its repository concept is not having a repository for each entity there is, but only for the root aggregates. To quote Jens Schauder: "Repositories persist and load aggregates. An aggregate is a cluster of objects that form a unit, which should always be consistent. Also, it should always get persisted (and loaded) together. It has a single object, called the aggregate root, which is the only thing allowed to touch or reference the internals of the aggregate." (see Spring Data JDBC, References, and Aggregates).

In my "music" example, I deal with albums released in a given year. The release year is an integral part of the album and it would be weird having an additional repository for it.

So what are the specialities of Spring Data Neo4j? First of all, in the pure Neo4-OGM example you might have noticed the single, lone "1". That specifies the fetch depth in which entities should be loaded. Depending on how entities are modeled, you could ran in the problem, that you fetch your whole graph with hone single query. Specifying the depth means specifying how deep relationships should be fetch. The repository method can be declared analog:

interface ArtistRepository extends Neo4jRepository<ArtistEntity, Long> {
    List<ArtistEntity> findByNameContaining(String name, @Depth int depth);
}

People familiar with Spring Data know that derive query method like the findByNameContaining can be much more complicated. You could even write down

interface ArtistRepository extends Neo4jRepository<ArtistEntity, Long> {
    List<ArtistEntity> findByNameContainingOrWikipediaArticlesTitleIs(String name, String title, @Depth int depth);
}

and so on. I have seen some interesting finder methods here and there. While this is technically possible, I would recommend using the @Query annotation on the method name, write down the query myself and chose a method name that corresponds to the business.

Different abstraction levels

At this point it should be clear, that Neo4j Java-Driver, Neo4j-OGM and Spring Data act on different abstraction levels:



In your application, you have to decide which level of abstraction you need. You can come along way with direct interaction with the driver, especially for all kind of queries that facilitates your database for more than simple CRUD operations. However, I don't think that you want to deal with all the cruft of CRUD yourself throughout your application.

When to use what?

All three abstractions can execute all kind of Cypher queries. If you want to deal with result sets and records yourself and don't mind mapping stuff as you go along, use the Java driver. It has the least overhead. Not mapping stuff to fixed objects has the advantage that you can freely traverse relationships in your queries and use the results as needed.

As soon as you want to map nodes with the same labels and their relationship to other nodes more often than not, you should consider Neo4j-OGM. It takes away the "boring" mapping code from you and helps you to concentrate on your domain. Also, Neo4j-OGM is not tied to Spring. I didn't write application outside the Spring ecosystem for quite a while now. For this post, I needed an example where I don't have Spring, so I came up with the Micronaut demo, that uses both plain Java-Driver access and OGM access. Depending on what you want to achieve, you can combine both approaches: Mapping the boring stuff with Neo4j-OGM, handling "special" results yourself.

If you're writing an application in the Spring-Eco-System and decided for OGM, please also add Spring Data Neo4j to the mix. While it doesn't put any further abstraction layer on the mapping itself and thus is not slowing things down, it takes away the burden dealing with the session and transaction from you.

I do firmly believe that Spring Data Neo4j is the most flexible solution.

  1. Start with a simple repository, relying on the CRUD methods
  2. If necessary, declare your queries with @Query
  3. To differentiate between write and read models, execute writes through mapped @NodeEntities and reads through read-only @QueryResults
  4. Write a custom repository extension and interact directly with the Neo4j-OGM or Neo4j Java-Driver session
  5. d

To complete this post, I'll show you option 2 and 3. Given my AlbumEntity, TrackEntity and a AlbumRepository.

First of all I want a query that retrieves all the albums containing one specific track. That is pretty easy to write in Cypher:

interface AlbumRepository extends Neo4jRepository<AlbumEntity, Long> {
    @Query(value
        = " MATCH (album:Album) - [:CONTAINS] -> (track:Track)"
        + " MATCH p=(album) - [*1] - ()"
        + " WHERE id(track) = $trackId"
        + "   AND ALL(relationship IN relationships(p) WHERE type(relationship) <> 'CONTAINS')"
        + " RETURN p"
    )
    List<AlbumEntity> findAllByTrack(Long trackId);
}

By declaring this additional method on the repository, I know have mapped a simple Cypher query that does complex thinks (Here match all albums that contain a specific track and all the relationships of that album and return that all apart from the other tracks) to my entity. I benefit from SDNs mapping and have all the queries in one place.

In my domain, I didn't model the track as part of the album. Those tracks should be explicitly read and not all the time. I therefore added an additional class, called AlbumTrack. Again, accessors omitted for brevity:

@QueryResult
public class AlbumTrack {
 
	private Long id;
 
	private String name;
 
	private Long discNumber;
 
	private Long trackNumber;
}

Notice the @QueryResult annotation. This is special to Spring Data Neo4j. It marks this as a class that is instantiated from arbitrary query result but doesn't have a lifecycle. It then can be used as in a declarative query method, similar to the first one:

interface AlbumRepository extends Neo4jRepository<AlbumEntity, Long> {
    @Query(value
        = " MATCH (album:Album) - [c:CONTAINS] -> (track:Track) WHERE id(album) = $albumId"
        + " RETURN id(track) AS id, track.name AS name, c.discNumber AS discNumber, c.trackNumber AS trackNumber"
        + " ORDER BY c.discNumber ASC, c.trackNumber ASC"
    )
    List<AlbumTrack> findAllAlbumTracks(Long albumId);
}

while this query is indeed much simpler as the first one, it's important to be able to do such things for designing an application that performs well. Think about it: Is it really necessary to have all the relations to all other possible nodes at hands all the time?

In the end, you might have guess it: There are no silver bullets. There are situations where an approach close to the database is more appropriate than another, sometimes a higher abstraction level is better. Whatever you chose, try not be to dogmatic.

All the examples are part of my bootiful music project, more specifically, the "knowledge" submodule. With the building blocks described here, you can develop an web application that is used for reading and writing data.

The example application uses a simple, server side rendered approach for the frontend, but Spring Data Neo4j plays well with Spring Data Rest and that makes many different approaches possible.

In the next installment of this series, we have a look at the concrete domain modeling with Spring Data Neo4j.

| Comments (4) »

29-Oct-18


How to get data into Neo4j?

This is the second post in the series of From relational databases to databases with relations. Check the link to find all the other entries, too. The source code for everything, including the relational dataset is on GitHub: https://github.com/michael-simons/bootiful-music.
The post has also been featured in This Week in Neo4j.

To feel comfortable in a talk, I need to have a domain in front of me I’m familiar with. Neo4j just announced the Graph Gallery which is super nice to explore Graphs, but I wanted to have something of my own and I keep continue using the idea of tracking my musical habits. In the end I want to have knowledge base in addition to my chart applications (both linked above).

There are several ways to get data into your Neo4j database. One is a simple Cypher statement like this

CREATE (artist:Artist {name: 'Queen'}) RETURN artist

which creates a node with the label Artist and a property name for one of my favorite bands of all time. I can use the MERGE statement, which ensures the existence of a whole pattern (note only a node), too. More on that later, though. In any case, that would be a real effort todo manually. Therefor, let’s have a look how to import data. I found several options:

I find it quite fascinating in how many ways CSV data can be processed from within Neo4j itself. It reads through CSV and provides each row in a way that you can interact with from a Cypher statement like it comes from the graph itself. Those CSV files can be put into a dedicated import folder in the database instance but can also be retrieved from a URL. You can come a long way with that import if you already have CSV or a willing to massage your data a bit so that it fits a plain structure. Rik did this with a beer related data, check it out here: Fun with Beer – and Graphs in Part 1: Getting Beer into Neo4j.

JSON is an option if you want to work agains the many nice APIs out there, but my data sits in a relational database. It looks like this:



This tables store all tracks I have listened to, their artists and genres and in a time series table all plays. If you read the previous post closely, you’ll notice that this database, dubbed statsdb is only in NF2. There is no separate relation for the albums of an artist. They are stored with the tracks. I actually forgot why I modeled it that way. I vaguely remember that I was enjoyed that album names are not necessarily unique and I could find a good business primary key. Anyway, the schema is still running and gives me each month something like that which btw takes only three SQL queries:



In my property graph model I wanted to have separate albums nodes, though, so I had to massage and aggregate my data a bit before creating nodes. As I didn’t want to do this based on CSV, I looked at all the options JDBC. First, Neo4j ETL Tool:

Neo4j ETL Tool

The Neo4j ETL Tool can be downloaded through Neo4j Desktop and needs a free activation key. The link guides you through all the steps necessary to connect the tool to both the Graph database as well as the relational database. The nice thing here: You don’t have to install a driver for popular databases like MySQL or PostgreSQL as it comes with our tool.

I ran the tool agains my database and stopped here:



The ETL Tool recognized my tables and also the foreign keys and proposed a node and relationship structure. This is nice, but doesn’t fit exactly what I want. The relationships can easily be renamed and so can the node labels, but I don’t want the structure. I could just ran this and than transform everything via Cypher again, but that feels weird.

What I want is something like this (plus the play counts, for sure):

MATCH path1=(album:Album {name: 'Jazz ist anders'}) - [:RELEASED_BY] -> (:Artist)
MATCH path2=(album) - [:RELEASED_IN] -> (:Year) - [:PART_OF] -> (:Decade)
MATCH (album) - [:CONTAINS] -> (tracks:Track) 
RETURN path1, path2, tracks


Getting all the artists is simple. It’s basically a “select * from artists” and the corresponding Cypher. This is what Apoc does:

Importing data with JDBC into Neo4j

APOC is not only the name of a technician in the famous Movie “The Matrix”, but also an acronym for “Awesome Procedures on Cypher”. Neo4j is extensible via custom, stored procedures and functions much like PostgreSQL and Oracle. I have been a friend of those for years, I even read and generated Microsoft Excel files from within an Oracle Database. Now I realize, that Michael Hunger does the same for Neo4j 🙂

Anyway. APOC offers “Load JDBC” and Michael has a nice introduction at hand.

To use APOC you have to find the plugin folder of your Neo4j installation. Download the APOC release matching your database version from the above GitHub link and add it to the plugin folder. APOC doesn’t distribute the JDBC drivers itself, those have to be installed as well. As my stats db is PostgreSQL, I grabbed the PostgreSQL JDBC Driver and put it into the plugin folder as well.

Things are super easy from there on. With the following Cypher statement, the Neo4j instances connects agains the PostgreSQL instance, executes a select * from artists and returns each row. YIELD is a subclause to CALL and selects all nodes, relationships or properties returned from the procedure for further processing as bound variables in a Cypher query. I then used them to merge all artists. In case they already existed, I updated some auditing attributes. The associations between some artists have already been there, so merge didn’t remove that:

WITH "jdbc:postgresql://localhost:5432/bootiful-music?currentSchema=dev&user=statsdb-dev&password=dev" as url
CALL apoc.load.jdbc(url,"artists") YIELD row
MERGE (artist:Artist {name: row.name})
   ON CREATE SET artist.createdAt = localdatetime()
   ON MATCH SET artist.updatedAt = localdatetime()
RETURN artist


The merge-clause is really fascinating. It matches a pattern and the merge is applied to the full pattern, meaning it must exist as a whole. In the case above, the pattern is simple, it’s just a single node. If you want to create or update an album released by an artist and use something like this MERGE (album:Album {name: $someAlbumName}) - [:RELEASED_BY] -> (artist:Artist {name: $someNewArtistName}), the statement will always create a new artist node, regardless wether the artist already exists or not as the whole pattern is checked. You have to do those things in separate steps like this:

MERGE (artist:Artist {name: $someNewArtistName})
MERGE (album:Album {name: $someAlbumName}) - [:RELEASED_BY] -> (artist)

First, merge the artist. Then, merge the album and the relationship to the existing artist node. For me, coming from with SQL background, this feels unusual at first, much like I can use something in an imperative AND declarative way.

Now, the above example is still not what I want. apoc.load.jdbc basically gives me the whole table and I have to do every post-processing in afterwards. Luckily, that second parameter can also be a full SQL statement. Now, to create a graph structure like above (minus the tracks though), I did something like this. Behold, all the SQL and Cypher in one statement:

WITH "jdbc:postgresql://localhost:5432/bootiful-music?currentSchema=dev&user=statsdb-dev&password=dev" as url,
     "SELECT DISTINCT a.name as artist_name, t.album, g.name as genre_name, t.year
      FROM tracks t JOIN artists a ON a.id = t.artist_id JOIN genres g ON g.id = t.genre_id
      WHERE t.compilation = 'f'" as sql
CALL apoc.load.jdbc(url,sql) YIELD row
MERGE (decade:Decade {value: row.year-row.year%10})
MERGE (year:Year {value: row.year})
MERGE (year) - [:PART_OF] -> (decade)
MERGE (artist:Artist {name: row.artist_name})
   ON CREATE SET artist.createdAt = localdatetime()
   ON MATCH SET artist.updatedAt = localdatetime()	 
MERGE (album:Album {name: row.album}) - [:RELEASED_BY] -> (artist)
   ON CREATE SET album.createdAt = localdatetime()
   ON MATCH SET album.updatedAt = localdatetime()
MERGE (genre:Genre {name: row.genre_name})
   ON CREATE SET genre.createdAt = localdatetime()
   ON MATCH SET genre.updatedAt = localdatetime()
MERGE (album) - [:HAS] -> (genre)
MERGE (album) - [:RELEASED_IN] -> (year)

Read: Bind the JDBC url and a SQL statement to variables. The sql selects a distinct set of all artists and album, including genre and year, thus denormalizing my relational schema by joining the artists and genres. The resulting tuples are then used to create decades and years, merge artists, genres and albums and finally relating them to each other. I was deeply impressed after I ran that.

First: It’s amazing how well Neo4j integrates with other stuff and secondly, it’s actually pretty readable I think.

In my talk, I will dig deeper in the Cypher used. Also, the statements for generating the track data in the above Graph are bit more complex (you find them here). But in the end, I could have stopped here. And probably, I would have done so, if I had started with APOC in the beginning.

But me being the mad scientist I am, started somewhere else:

Writing your own tool

I did this right after my first look at the ETL tool. Neo4j can be extended with procedures and functions. Procedures stream a result with several attributes, functions return single values. When I joined Neo4j, my dear colleague Gerrit impressed me with knowledge about that stuff and I wanted to keep up with his pace, so I saw a good learning opportunity here.

One does not have to extend from a dedicated class or have to implement an interface for writing a Neo4j stored procedure. It’s enough to annotate the methods that should be exposed with either @Procedure or @UserFunction. It’s necessary that org.neo4j:neo4j is a provided dependency on the class path.

Neo4j provides detailed information how to create a package that just can be dropped into Neo4js plugin folder, find the instructions here. They worked as promised.

What I did was to apply one of my favorite libraries again, run jOOQ against my stats database, generating a type safe DSL dedicated to my domain, packaged it up and than used it in a Neo4j stored procedure. In the following listing you’ll find that I use Neo4js @Context to get hold of the graph database service in which the procedure is running, pass in user name, password and JDBC url to my procedure. Those are used to connect against PostgresSQL (try (var connection = DriverManager.getConnection(url, userName, password);) {} (yeah, that’s Java 11 code), open a Neo4j transaction and than just executing my SQL and manually creating nodes.

public class StatsIntegration {
	@Context
	public GraphDatabaseService db;
 
	@Context
	public Log log;
 
	@Procedure(name = "stats.loadPlayCounts", mode = Mode.WRITE)
	@Description("Loads the play counts for tracks. Note: All the tracks and albums need to be there, use loadAlbumData before.")
	public void loadPlayCounts(
			@Name("userName") final String userName,
			@Name("password") final String password,
			@Name("url") final String url) {
 
		try (var connection = DriverManager.getConnection(url, userName, password);
			 var neoTransaction = db.beginTx()) {
 
			var statsDb = DSL.using(connection);
			var isNoCompilation = TRACKS.COMPILATION.eq("f");
 
			final String columnNameYear = "yearValue";
			final String columnNameMonth = "monthValue";
			final String columnNameNewPlayCount = "newPlayCount";
 
			var year = extract(PLAYS.PLAYED_ON, DatePart.YEAR).as(columnNameYear);
			var month = extract(PLAYS.PLAYED_ON, DatePart.MONTH).as(columnNameMonth);
 
			log.info("Deleting existing playcounts");
			executeQueryAndLogResults("MATCH (:Track) - [playCount:HAS_BEEN_PLAYED_IN] -> () DETACH DELETE playCount", Map.of());
 
			log.info("Loading playcounts from statsdb");
			statsDb
					.select(
							ARTISTS.NAME,
							TRACKS.NAME,
							year, month,
							count().as("newPlayCount")
					)
					.from(PLAYS)
					.join(TRACKS).onKey()
					.join(ARTISTS).onKey()
					.where(isNoCompilation)
					.groupBy(ARTISTS.NAME, TRACKS.NAME, year, month)
					.forEach(r -> {
						var parameters = Map.of(
								"artistName", r.get(ARTISTS.NAME),
								"trackName", r.get(TRACKS.NAME),
								columnNameYear, r.get(columnNameYear),
								columnNameMonth, r.get(columnNameMonth),
								columnNameNewPlayCount, r.get(columnNameNewPlayCount)
						);
 
						executeQueryAndLogResults(CREATE_PLAY_COUNTS, parameters);
					});
 
			log.info("Adding weight to artists");
			executeQueryAndLogResults(WEIGHT_ARTISTS_BY_PLAYCOUNT, Map.of());
			neoTransaction.success();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

The complete sources for that exercise are in the etl module of my project. As interesting as the stored procedure itself is the integration test, that needs both a Neo4j instance and PostgreSQL. For the first I use our own test harness for the later one, Testcontainers.

Conclusion

As much as the last exercise was fun and I learned a ton about extending Neo, working directly with the engine and so on, it’s not only slower (probably due to context switches and the relatively large transaction), but it’s hard to read, by magnitudes. As with a relational database, declarative languages like Cypher and SQL beat explicit programming hand down. If you ever have a need to migrate or aggregate and duplicate data from a relational database into Neo4j, have a look at apoc.load.jdbc. Select the stuff you need in one single context switch and then process as needed, either in a single transaction or also in batches and parallel.

For my music project, I’ll probably keep the ETL Tool around as I like the structure I created for the tool itself, but will rewrite the process from relational to graph based on APOC and plain SQL and Cypher statements, if it ever sees production of any kind.

Overall verdict: It took me a while to find a graph model I like and from which I am convinced I can actually draw some conclusions about the music I like, but after the graph started to resonate with me, it felt very natural and enjoyable.

Next step: Creating an Neo4j-OGM model for the different nodes.


Images in this post: Myself. Featured image – Markus Spiske.

| Comments (0) »

12-Oct-18


From relational databases to databases with relations

In the summer of 2018, I joined Neo4j. This seems odd at first, my “love” for relational databases and SQL is known. I didn’t have this slide in my jOOQ presentation for two years now without reason. But: Looking at the jOOQ and SQL talks from the perspective of early 2000s, they also seemed odd at first. Back then, I never thought doing that much with databases. That changed a lot.

My experience is that all the data your deal with usually has a much longer lifetime than any of your applications sitting on top of that. Knowing one or more database management systems is essential. Being able to query them even more: What Neo4j and relational databases have in common: A great, declarative way to tell them which data to return and not how to return them. In the case of a relational database, this is obviously SQL, which had quite the renaissance for a few years now. Neo4j’s lingua franca is Cypher.

I get to play with Cypher a lot, but in the end this is not what I am working on at Neo4j. My work is focused on our Object Graph Mapper, Neo4j-OGM and the related Spring Data Module, Spring Data Neo4j. We have written about that a bit on medium. Given my experience with Spring, Spring Data and Spring Boot, the role suddenly makes much more sense.

People who entered the IT-conference circus may know the merry-go-round (or should I say “trap”?) of “talks stress me out a lot” – “hey, this is great, just enter another CfP”. I fell for it again and proposed a talk with the above title to several conferences. Now, I have to come up with something.

Back in 2016, when I first held my talk Database centric applications with Spring Boot and jOOQ, I started to write my story down. Back then, it helped me a lot, so here we go again. Join me on my way from relational databases to databases with relations.

Content

What are we talking about here?

The domain will be music. I have been tracking my musical habits for more than 10 years now at my side project Daily Fratze and I enjoy looking back to what I listened like this months but 5 years ago.

Edgar F. Codd

The guy on the left, Edgar F. Codd invented the relational model back in the 1970s. A relation in this model doesn’t describe a relation like the one between two people or a musician associated with a band who released a new track. A relation in the relational model is a table itself. Foreign keys between tables ensure referential integrity, but cannot define relations themselves. I put down some thoughts a while back in this German deck. This is what a relation looks like in a relational database:



One kind of sport to do with relational databases is the process of normalizing data. There are several normal forms. Their goal is to keep a database redundancy free, for several reasons. Back in the 1970, disk space being one of them. First normal form (1NF): All attributes should be atomic. 2NF is 1NF plus no functional dependencies on parts of any candidate keys. That is: There must not be a pair of attributes appearing twice in a relation’s tuples. 3NF forbids transitive dependencies (“Nothing but the key, so help me Codd”) and it gets complicated from there on. I have to say though, that normalization up to 3NF is still relevant today in a relational systems, at least if you’re a friend of (strong) consistent data.

However, as my colleague Rik van Bruggen points out in his book “Learning Neo4j”, relational databases are quite anti-relational.

Why is this? Relations in NF can be queried in many, many ways. Each query send via SQL returns a new relation, what’s the problem? It depends. In a strictly analytical use case, there’s often not a problem. Recreating object hierarchies however, joining things back together, is. A handful of joins is not hard to understand, even without a tool, but self referential joins or a sheer, huge amount, is. It also gets increasingly hard on the database management system.

This is where graphs can come into play. Graphs are another mathematical concept, this time from Graph theory. Sometimes
people call a chart graph by coincident, but this is wrong. A graph is a set of objects with pairs of objects being related. In mathematical terms, those objects are vertices and the relations between them, edges. We call them nodes and relationships in the Neo4j database.

Neo4j is a Property Graph. A property graph adds labels and properties to both nodes and relationships:



The above picture is from our excellent post What is a Graph Database?.

One takeaway from that post is: Neo4j is referred to as a native graph database because it efficiently implements the property graph model down to the storage level. Or in technical terms: Neo4j employs so called index-free adjacency, which is the most efficient means of processing data in a graph because connected nodes physically point to each other in the database. This obliterates the needs for complex joins, either directly or via intersection tables. One just can tell the database to retrieve all nodes connected to another node.

This is not only super nice for simple aggregations of things, but especially for many graph algorithms.

So what has this todo with my talk? My SQL talk was all about doing analytics. That is, retrieving data like in this image from a relational database with build-in analytic functions. Computing running totals, differences from previous windows and so on (read more here).

First of all I’m gonna analyze how to create a graph structure from the very same dataset I used in the SQL talk. There are different tools out there with different approaches. I’ve chosen a technique that resonated for various reasons with me. As I want to enrich the existing dataset, I’ll model a domain around it, with Java, putting Neo4j-OGM to use. I’ll show how Spring Data Neo4j helps me not having to deal with a lot of cruft. In the end, I’ll show that I can build my own music recommendation engine based on 10 years of tracking my musical habits by applying some of the queries and algorithms possible with Neo4j.

Next step: Loading data.


Pictures in this post: Codd and Table – Wikipedia, Property Graph – Neo4j, Featured image – Sarah Cervantes.

| Comments (3) »

11-Oct-18


Validate nested Transaction settings with Spring and Spring Boot

The Spring Framework has had an outstanding, declarative Transaction management for years now.
The configurable options maybe overwhelming at first, but important to accommodate many different scenarios.

Three of them stick out: propagation, isolation and to some lesser extend, read-only mode (more on that a bit later)

  • propagation describes what happens if a transaction is to be opened inside the scope of an already existing transaction
  • isolation determines among other whether one transaction can see uncommitted writes from another
  • read-only can be used as a hint when user code only executes reads

I wrote “to some lesser extend” regarding read-only as read-only transactions can be a useful optimization in some cases, such as when you use Hibernate. Some underlying implementations treat them as hints only and don’t actually prevent writes. For a full description of things, have a look at the reference documentation on transaction strategies.

Note: A great discussion on how setting read-only to true can affect performance in a positive way with Spring 5.1 and Hibernate 5.3 can be find in the Spring Ticket SPR-16956.

Some of the transactions settings are contradicting in case of nested transaction scenarios. The documentation says:

By default, a participating transaction joins the characteristics of the outer scope, silently ignoring the local isolation level, timeout value, or read-only flag (if any).

This service here is broken in my perception. It explicitly declare a read-only transaction and than calls a save on a Spring Data repository:

import org.springframework.transaction.annotation.Transactional;
import org.springframework.stereotype.Service;
 
@Service
public class BrokenService {
	private final ThingRepository thingRepository;
 
	public BrokenService(ThingRepository thingRepository) {
		this.thingRepository = thingRepository;
	}
 
	@Transactional(readOnly = true)
	public ThingEntity tryToWriteInReadOnlyTx() {
		return this.thingRepository.save(new ThingEntity("A thing"));
	}
}

This can be detected by using a PlatformTransactionManager that supports validation of existing transactions. The JpaTransactionManager does this as well as Neo4js Neo4jTransactionManager (both extending AbstractPlatformTransactionManager).

To enable validation for JPA’s transaction manager in a Spring Boot based scenario, just make use of the provided PlatformTransactionManagerCustomizer interface. Spring Boots autoconfiguration calls them with the corresponding transaction manager:

import org.springframework.boot.autoconfigure.transaction.PlatformTransactionManagerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.support.AbstractPlatformTransactionManager;
 
@Configuration
class TransactionManagerConfiguration {
 
	@Bean
	public PlatformTransactionManagerCustomizer<AbstractPlatformTransactionManager> transactionManagementConfigurer() {
		return (AbstractPlatformTransactionManager transactionManager) -> transactionManager
			.setValidateExistingTransaction(true);
	}
}

In the unlikely scenario you’re not using Spring Boot, you can always let Spring inject an EntityManagerFactory or for example Neo4j’s SessionFactory and create and configure the corresponding transaction manager yourself. My Neo4j tip does cover that as well.

If you try to execute the above service now, it’ll fail with an IllegalTransactionStateException indicating “save is not marked as read-only but existing transaction is”.

The question if a validation is possible arose in a discussion with customers. Funny enough, even working with Spring now for nearly 10 years, I never thought of that and always assumed it would validate those automatically but never tested it. Good to have learned something new, again.

Featured image on this post: Transaction by Nick Youngson CC BY-SA 3.0 Alpha Stock Images.

| Comments (0) »

25-Sep-18


Donating to Médecins Sans Frontières (Ärzte ohne Grenzen)

Some weeks ago, my friends Judith and Christian, who write great Steampunk, Fantasy and in the recent time science fiction books for whom I wrote this little Kotlin app had a good idea:

And I was like:

My book Spring Boot 2 has been a bestseller this year and I still got quite a decent revenue from arc42 by example that I co-authored with Gernot Starke and Stefan Zörner.

Keeping my promises, here are the numbers: My share of royalties of the arc42 by example book has been 112,15€ from June to August. Gernot, Stefan and I split revenues, so that is only my share. Gernot himself already donates through Leanpubs causes. I don’t have numbers from my publisher for Spring Boot Book. I therefore decided to round this number to 250€:



Me and my family have been incredibly lucky the last years and I’ more than happy that one can give. We live in Germany and despite some idiots on various (social) media, it’s really fortunate to live here. Health care is working, social system as well. We don’t have war but clean water, food and everything. We should not forget that this is by far not self-evident for many, many people on this planet-

| Comments (0) »

01-Sep-18