Passion.

In “my” industry people speaking a lot of doing things “with passion or not at all.” I’m wonder, are we aware of what passion meant?

The word itself means both in Greek and Latin “to suffer”, which fits very well with the some of the crazy work ethics out there.

I for myself try to stay with the “good feelings” that arose with from the Stoic and might add to them: Engagement instead of passion. I’ll always do my best, but I will not suffer. Or at least, I try not to.

Just a late night thought.

| Comments (0) »

30-Nov-18


Modeling a domain with Spring Data Neo4j and OGM

This is the forth post in this series and I want to keep it short and simple.

A domain can be modeled in many ways and so can databases. As long as I deal with them, I always preferred the approach: Database (model) first. Usually, data is much longer around than applications and I don’t want my first application instance or version define the model for all eternity.

Using an Object-Graph-Mapper or Object-Relational-Mapper can be slightly danger. One tends to write down some class hierarchy and just let the tool do its magic. In the end, there are schemes that sometimes are very hard to read for humans. The danger might be a bit smaller with an OGM as hierarchies and connections map quite nicely onto a graph, but still, I don’t want that to be the default.

My domain can be summarized with a few sentences:

  • There are Artists, that might be highlighted as Bands or Solo Artists
  • Bands have Member
  • Bands are founded in and solo artists are born in countries
  • Sometimes Artists are associated with other Artists
  • Albums are released by Artists in a year which is part of a decade
  • Albums contain multiple tracks that have been played several times in a month of a given year

I spare you the logical ER-diagram for a relational database here and jump straight to the nodes. I highlighted all the important “classes” and their relations.

Modelling data with Neo4j feels a lot like modeling on a whiteboard. And actually, it really is: The whiteboard model ends up being the physical model in the end with Neo4j.

Neo4j is a property graph database. It stores Nodes with one or more Labels, their Relationships with a type among each other and properties for both nodes and relationships.

A label starts with a colon and is usually written with a initial upper case letter, i.e. :Artist and :Album, the type of a relationship is written with a colon and than all uppercase, :RELEASED_BY and properties in camel case, without a colon, i.e. name and firstName. The above list translates in my application to a model like this:



I really find it fascinating how that model reads: Pretty much the same as my verbal description.

How to model this with Neo4-OGM and Spring Data Neo4j? You might want to recap the previous post to get an idea of the moving parts.

Value objects that happen to be persisted

In my domain the most simple objects are probably the year and the decade of the year. Those objects are value objects, they don’t have an identity. A year with a given value is as good as another instance with the same value.

@NodeEntity("Decade")
public class DecadeEntity {
 
    @Id @GeneratedValue
    private Long id;
 
    @Index(unique = true)
    private long value;
 
    public DecadeEntity(long value) {
        this.value = value;
    }
}
 
@NodeEntity(label = "Year")
public class YearEntity {
 
    @Id @GeneratedValue
    private Long id;
 
    @Index(unique = true)
    private long value;
 
    @Relationship("PART_OF")
    private DecadeEntity decade;
 
    YearEntity(final DecadeEntity decade, final long value) {
        this.decade = decade;
        this.value = value;
    }
}

I did model both of them to use them further on in aggregates, for example in the relationship RELEASED_IN, but I don’t see providing dedicated repositories for them. They only have a meaning in connection with other nodes. Things to notice here are: I followed a naming convention: All classes that are mapped to something inside the graph ends in “Entity”. Thus I have to use label attribute of @NodeEntity (or the default attribute) to specify a “nice” label, i.e. @NodeEntity(label = "Year"). I use @Index for completeness. One can configure Neo4j-OGM to automatically create indexes, but TBH, I prefer to create them by hand. There’s also one outgoing relationship from year to decade. A year is part of a decade: @Relationship("PART_OF").

You also notice that I didn’t model any of the other outgoing relationships from the year: Like all the albums released in this year, all the months with play counts in that year or the foundation years of band.

While all of Neo4j, Neo4j-OGM and Spring Data Neo4j could map relationships many levels deep, I don’t think it’s wise from an application performance point of view. I’d rather explicitly select the stuff I need.

A common base class for aggregates

I’m known for my dislike of having common base entities:

For this project however, I included one for several reasons: To not pester every other entity with the technical id, to audit interesting entities and plain simple, to have an example that actually uses our (Spring Data Neo4j) support of Spring Data’s auditing in inheritance scenarios. This is what the class looks like:

public abstract class AbstractAuditableBaseEntity {
    @Id @GeneratedValue // 1
    private Long id;
 
    @CreatedDate // 2
    @Convert(NoOpLocalDateTimeConversion.class) // 3
    private LocalDateTime createdAt;
 
    @LastModifiedDate // 2
    @Convert(NoOpLocalDateTimeConversion.class)
    private LocalDateTime updatedAt;
}
  1. Maps Neo4j’s technical (internal) id to an attribute
  2. Audits the creation respectively modification date of an entity
  3. Due to some issues with the transport, we have to force Neo4j-OGM here to not convert anything. Neo4j 3.4 supports Java 8 time types natively

Simple aggregates or entities

I see something like the Genre as a simple entity:

@NodeEntity("Genre")
public class GenreEntity extends AbstractAuditableBaseEntity {
    @Index(unique = true)
    private String name;
 
    public GenreEntity(String name) {
        this.name = name;
    }
}

It is useful to maintain on its own, has an identity but no mapped relationships. I usually declare a repository for such an entity:

import org.springframework.data.neo4j.repository.Neo4jRepository;
 
public interface GenreRepository extends Neo4jRepository<GenreEntity, Long> {
}

While it’s often preferable to extend such a repository interface not from the concrete store, I do think it’s better having the concrete store at hand in the case of Neo4j. While the concrete implementation brings a lot of CRUD method one doesn’t need all the time, it also brings in overloaded versions of them that take the depth into account as well.

To mitigate the large surface of repository methods, it’s often a good idea to reduce the repositories visibility to a minimum.

I don’t see a problem using such an entity and repository directly from a controller, for example like this:

@Controller
@RequestMapping("/genres")
public class GenreController {
    private final GenreRepository genreRepository;
 
    public GenreController(GenreRepository genreRepository) {
        this.genreRepository = genreRepository;
    }
 
    @GetMapping(value = { "", "/" }, produces = MediaType.TEXT_HTML_VALUE)
    public ModelAndView genres() {
 
        var genres = this.genreRepository.findAll(Sort.by("name").ascending());
        return new ModelAndView("genres", Map.of("genres", genres));
    }
 
    @GetMapping(value = "/{genreId}", produces = MediaType.TEXT_HTML_VALUE)
    public ModelAndView genre(@PathVariable final Long genreId) {
 
        var genre = this.genreRepository.findById(genreId)
            .orElseThrow(() -> new NodeNotFoundException(GenreEntity.class, genreId));
 
        var model = Map.of(
            "genreCmd", new GenreCmd(genre),
            "genre", genre
        );
        return new ModelAndView("genre", model);
    }
 
    @PostMapping(value = { "", "/" }, produces = MediaType.TEXT_HTML_VALUE)
    public String genre(@Valid final GenreCmd genreForm, final BindingResult genreBindingResult) {
        if (genreBindingResult.hasErrors()) {
            return "genre";
        }
 
        var genre = Optional.ofNullable(genreForm.getId())
            .flatMap(genreRepository::findById)
            .map(existingGenre -> {
                existingGenre.setName(genreForm.getName());
                return existingGenre;
            }).orElseGet(() -> new GenreEntity(genreForm.getName()));
        genre = this.genreRepository.save(genre);
 
        return String.format("redirect:/genres/%d", genre.getId());
    }
 
    static class GenreCmd {
 
        private Long id;
 
        @NotBlank
        private String name;
 
        GenreCmd(GenreEntity genreEntity) {
            this.id = genreEntity.getId();
            this.name = genreEntity.getName();
        }
 
        public GenreCmd() {
        }
 
        public Long getId() {
            return this.id;
        }
 
        public String getName() {
            return this.name;
        }
 
        public void setId(Long id) {
            this.id = id;
        }
 
        public void setName(String name) {
            this.name = name;
        }
    }
}

Find the sources here: GenreEntity, GenreRepository and the GenreController.

One use case is definitely “give me all the albums having a specific main genre.” To implement such a case, I rather access that sub collection from the owning site of the relationship. Here, the albums:

@NodeEntity("Album")
public class AlbumEntity extends AbstractAuditableBaseEntity {
 
    @Relationship("RELEASED_BY")
    private ArtistEntity artist;
 
    private String name;
 
    @Relationship("RELEASED_IN")
    private YearEntity releasedIn;
 
    @Relationship("HAS")
    private GenreEntity genre;
 
    private boolean live = false;
 
    public AlbumEntity(ArtistEntity artist, String name, YearEntity releasedIn) {
        this.artist = artist;
        this.name = name;
        this.releasedIn = releasedIn;
    }
}

The repository for that entity declares a method like this:

interface AlbumRepository extends Neo4jRepository<AlbumEntity, Long> {
    List<AlbumEntity> findAllByGenreNameOrderByName(String name, @Depth int depth);
}

It makes use of Spring Data Neo4j’s traversal feature: It traverses the relationship from album to genre and asks for all albums having a relationship of type “HAS” to a genre with the given name.

That’s also a method that I’d wrap in a service, as the method name is super technical and I don’t want that:

@Service
public class AlbumService {
    private final AlbumRepository albumRepository;
 
    public List<AlbumEntity> findAllAlbumsWithGenre(GenreEntity genre) {
        return albumRepository.findAllByGenreNameOrderByName(genre.getName(), 1);
    }
}

Complex aggregates

The Artist is a complex thing. It exists in three different forms: An unspecified artist, solo artist and as bands.

While Neo4j-OGM allows you to add a list of labels to your domain and thus allowing one entity to be mapped to several labels, I don’t like that approach.

Bands and solo artists have quite different attributes, as you can see in the sources linked above and I don’t want them to mix up.

By declaring the Artist class with @NodeEntity("Artist") and the band, which extends from it, with @NodeEntity("Band") and solo artist accordingly, bands and solo artists are stored with this two labels. Polymorphic queries works to some extend with a repository for the base entity, but as Neo4j-OGM applies schema based loading, stuff can be missing from the result.

While polymorphic queries are not a daily use case, this one is:

A band has one or more members:

@RelationshipEntity("HAS_MEMBER")
public static class Member {
    @Id @GeneratedValue
    private Long memberId;
 
    @StartNode
    private BandEntity band;
 
    @EndNode
    private SoloArtistEntity artist;
 
    @Convert(YearConverter.class)
    private Year joinedIn;
 
    @Convert(YearConverter.class)
    private Year leftIn;
 
    Member(final BandEntity band, final SoloArtistEntity artist, final Year joinedIn, final Year leftIn) {
        this.band = band;
        this.artist = artist;
        this.joinedIn = joinedIn;
        this.leftIn = leftIn;
 
    }
}

And in BandEntity:

@NodeEntity("Band")
public class BandEntity extends ArtistEntity {
 
    @Relationship("FOUNDED_IN")
    private CountryEntity foundedIn;
 
    @Relationship("ACTIVE_SINCE")
    private YearEntity activeSince;
 
    @Relationship("HAS_MEMBER")
    private List<Member> member = new ArrayList<>();
 
    public BandEntity(String name, @Nullable String wikidataEntityId, @Nullable CountryEntity foundedIn) {
        super(name, wikidataEntityId);
        this.foundedIn = foundedIn;
    }
 
    BandEntity addMember(final SoloArtistEntity soloArtist, final Year joinedIn, final Year leftIn) {
        Optional<Member> existingMember = this.member.stream()
            .filter(m -> m.getArtist().equals(soloArtist) && m.getJoinedIn().equals(joinedIn)).findFirst();
        existingMember.ifPresentOrElse(m -> m.setLeftIn(leftIn), () -> {
            this.member.add(new Member(this, soloArtist, joinedIn, leftIn));
        });
 
        return this;
    }
 
    public List<Member> getMember() {
 
        return Collections.unmodifiableList(this.member);
    }
}

As you see: No setters for the member and a getter that returns an unmodifiable list. Thus adding (and removing) members goes only through the band. The modified band is then returned. As Neo4j-OGM and Spring Data Neo4j don’t do dirty tracking and don’t save things automatically at the end of a transaction, we have to take care here. Again, I recommend a service layer:

@Service
public class ArtistService {
        @Transactional
        public BandEntity addMember(final BandEntity band, final SoloArtistEntity newMember,  final Year joinedIn, @Nullable final Year leftIn) {
                return this.bandRepository.save(band.addMember(newMember, joinedIn, leftIn));
        }
}

This is again from ArtistService.

To close this up, one final example: When entities are being deleted through Neo4j-OGM, it deletes only relationships, not the target nodes of the relationships. You have to decide wether you want “dangling” nodes in your database or not. Sometimes this is ok, sometimes not. As of today, Neo4j itself has no foreign key constraint on the relationship. And how so? It’s complete ok that a node exists for its own.

In my domain here however, albums without an artist and tracks without albums serve no purpose. To delete them when I delete an artist, I do this again through a service. The session in the following snippet is the autowired OGM session. It’s completely ok to access it. Spring Data Neo4j takes care that it participates in ongoing transactions:

@Service
public class ArtistService {
    private final Session session;
 
    @Transactional
    public void deleteArtist(Long id) {
        this.findArtistById(id).ifPresent(a -> {
                session.delete(a);
                session.query("MATCH (a:Album) WHERE size((a)-[:RELEASED_BY]->(:Artist))=0 DETACH DELETE a", Map.of());
                session.query("MATCH (t:Track) WHERE size((:Album)-[:CONTAINS]->(t))=0 DETACH DELETE t", Map.of());
                session.query("MATCH (w:WikipediaArticle) WHERE size((:Artist)-[:HAS_LINK_TO]->(w))=0 DETACH DELETE w", Map.of());
        });
    }
}

Yes, there is Cypher hidden away in a class. Sometimes there are compromises to be taken, and this one is a comprise that’s ok for me. There’s also JCypher, maybe that would be something to try out in the future.

With all the things here in this post, it’s easy to write a nice application, that deals not only with CRUD, but already presents all the interesting associations:





The complete application is available on GitHub as “bootiful music”. It has some rough edges, also ops wise, but the repository along with the posts of this series should help to get you started.

I’d like to thank Michael a lot for the idea of this query, which results in nice micro genres or categories:



| Comments (1) »

02-Nov-18


No silver bullets here: Accessing data stored in Neo4j on the JVM

In the previous post I presented various ways how to get data into Neo4j. Now that you have a lot of connected data and it’s attributes, how to access, manipulate, add to them and delete them?

I’m working with and in the Spring ecosystem quite a while now and for me the straight answer is – without much surprises – just use the Spring Data Neo4j module if you work inside the Spring ecosystem. But to surprise of some, there’s more than just Spring there outside.

In this blog post I walk you through

  • Using the Neoj4 Java-Driver directly
  • Creating an application based on Micronaut, which went 1.0 GA these days, the Neo4j Java-Driver and Neo4-OGM
  • A full blown Spring Boot application using Spring Data Neo4j

Before we jump right into some of the options you as an application developer have to access data inside Neo4j, we have to get a clear idea of some of the building blocks and moving parts involved. Let’s get started with those.

Building blocks and moving parts

Neo4j Java-Driver

The most important building block for access Neo4j on the JVM is possibly the Neo4j Java Driver. The Java driver is open source and is available on Github under the Apache License. This driver uses the binary “Bolt” protocol.

You can think of that driver as analogue to a JDBC driver that available for a relational database. Neo4j also offers drivers for different languages based on the Bolt protocol.

As with Java’s JDBC driver, there’s a bit of ceremony involved when working with this driver. First you have to acquire a driver instance and then open a session from which you can query the database:

try (
    Driver driver = GraphDatabase.driver( uri, AuthTokens.basic( user, password ) );
    Session session = driver.session()
) {
    List<String> artistNames =
        session
            .readTransaction(tx -> tx.run("MATCH (a:Artist) RETURN a", Map.of()))
            .list(record -> record.get("a").get("name").asString());
}

With that code, one connects against the database and retrieves the names of all artists, I imported in my previous post. What I omitted here is the fact that the driver does connection pooling and one should not open and close it immediately. Instead, you would have to write some boiler plate code to handle this.

There are some important things to notice here: The code speaks of a driver. That is org.neo4j.driver.v1.Driver. The session is also from the same package: org.neo4j.driver.v1.Session. Those both are types from the driver itself. You have to know this things, because those terms will pop up later again. Neo4j-OGM, the object graph mapper, also speaks about drivers and session, but those are completely different things.

The Java driver has a nice type system (see The Cypher type system) and gets you quite far.

Most of the time however, people in the Java ecosystem prefer nominal typing over structural typing and want to map “all the things database” to objects of some kind. Let’s not get into bikeshedding here but just accept things as they are. Given a database model where a musical artist has multiple links to different wikipedia sites, represented like this (I omitted getter and setter for clarity):

public class WikipediaArticleEntity implements Comparable<WikipediaArticleEntity> {
 
    private Long id;
 
    private String site;
 
    private String title;
 
    private String url;
 
    public WikipediaArticleEntity(String site, String title, String url) {
        this.site = site;
        this.title = title;
        this.url = url;
    }
}
 
public class ArtistEntity {
 
    private String name;
 
    private String wikidataEntityId;
 
    private Set<WikipediaArticleEntity> wikipediaArticles = new TreeSet<>();
 
    public ArtistEntity(String name, String wikidataEntityId, Set<WikipediaArticleEntity> wikipediaArticles) {
        this.name = name;
        this.wikidataEntityId = wikidataEntityId;
        this.wikipediaArticles = wikipediaArticles;
    }
}

To fill such a model directly by interacting purely with the driver, you’ll have to do something like this: A driver session get’s opened, than we write a query in Neo4j’s declarative graph query language called Cypher, execute and map all the returned records and nodes:

public List<ArtistEntity> findByName(String name) {
    try (Session s = driver.session()) {
        String statement
            = " MATCH (a:Artist) "
            + " WHERE a.name contains $name "
            + " WITH a "
            + " OPTIONAL MATCH (a) - [:HAS_LINK_TO] -> (w:WikipediaArticle)"
            + " RETURN a, collect(w) as wikipediaArticles";
 
        return s.readTransaction(tx -> tx.run(statement, Collections.singletonMap("name", name)))
            .list(record -> {
                final Value artistNode = record.get("a");
                final List<WikipediaArticleEntity> wikipediaArticles = record.get("wikipediaArticles")
                    .asList(node -> new WikipediaArticleEntity(
                        node.get("site").asString(), node.get("title").asString(), node.get("url").asString()));
 
                return new ArtistEntity(
                    artistNode.get("name").asString(),
                    artistNode.get("wikidataEntityId").asString(),
                    new HashSet<>(wikipediaArticles)
                );
            });
    }
}

(This code is part of my example how to interact with Neo4j from a Micronaut application, find its source here and the whole application here.)

While this works, it’s quite an effort: For a simple thing (one root aggregate, the artist, with some attributes), a query that is not that simple anymore and a lot of manual mapping. The query makes good use of a standardized multiset (the collect-statement), to avoid having n+1 queries or deduplication of things on the client site, but all this mapping is kinda annoying for a simple READ operation.

Enter

Neo4j-OGM

Neo4j-OGM stands for Object-Graph-Mapper. It’s on the same level of abstraction as JPA/Hibernate are for relational databases. There’s extensive documentation: Neo4j-OGM – An Object Graph Mapping Library for Neo4j. An OGM maps nodes and relationships in the graph to objects and references in a domain model. Object instances are mapped to nodes while object references are mapped using relationships, or serialized to properties. JVM primitives are mapped to node or relationship properties.

Given the example from above, we only have to add a handful of simple annotations to make our domain usable with Neo4j-OGM:

@NodeEntity("WikipediaArticle")
public class WikipediaArticleEntity implements Comparable<WikipediaArticleEntity> {
 
    @Id
    @GeneratedValue
    private Long id;
 
    private String site;
 
    private String title;
 
    private String url;
 
    WikipediaArticleEntity() {
    }
 
    public WikipediaArticleEntity(String site, String title, String url) {
        this.site = site;
        this.title = title;
        this.url = url;
    }
}
 
 
@NodeEntity("Artist")
public class ArtistEntity {
 
    @Id
    @GeneratedValue
    private Long id;
 
    private String name;
 
    private String wikidataEntityId;
 
    @Relationship("HAS_LINK_TO")
    private Set<WikipediaArticleEntity> wikipediaArticles = new TreeSet<>();
 
    ArtistEntity() {
    }
 
    public ArtistEntity(String name, String wikidataEntityId, Set<WikipediaArticleEntity> wikipediaArticles) {
        this.name = name;
        this.wikidataEntityId = wikidataEntityId;
        this.wikipediaArticles = wikipediaArticles;
    }
}

Notice @NodeEntity on the classes, @Relationship on the attribute wikipediaArticles of the ArtistEntity-class and some technical details, mainly @Id @GeneratedValue, needed to map Neo4j's internal, technical ids to instances of the classes and vice-versa.

@NodeEntity and @Relationship are used not only to mark the classes and attributes as something to store in the graph, but also to specify labels to be used for the nodes and names for the relationship.

The whole query than folds together into something like this:

public Iterable<ArtistEntity> findByName(String name) {
    return this.session
        .loadAll(ArtistEntity.class, new Filter("name", ComparisonOperator.CONTAINING, name), 1);
}

Quite a different, right? Dealing with the driver, the driver's session and Cypher has been abstracted away. Take note that the above session attribute is not a Driver's session, but OGM's session. This is a bit confusing when you start using those things.

Again, this code is part of my example how to interact with Neo4j from a Micronaut application. The complete source of the above is here and the whole application here.

To be fair, Neo4j-OGM needs to be configured as well. This is done in it's simplest form with a drivers instance and a list of packages that contains domain entities as described above, for example like this:

public SessionFactory createSessionFactory(Driver driver) {
    return new org.neo4j.ogm.session.SessionFactory(
        new BoltDriver(driver), "ac.simons.music.micronaut.domain");
}

The driver instance in the example above is instantiated by Micronaut. With Micronaut's configuration support, it would have been manually configured as in the very first example.

In a Spring Boot application, Spring Boot takes care of the driver and Spring Data Neo4j creates the OGM session and deals with transactions, among other things:

Spring Data Neo4j

Let's start with quoting Spring Data:

Spring Data’s mission is to provide a familiar and consistent, Spring-based programming model for data access while still retaining the special traits of the underlying data store. It makes it easy to use data access technologies, relational and non-relational databases, map-reduce frameworks, and cloud-based data services.

That goes so far, that Craig Walls is fairly correct when he says, that many stores "are mostly the same from a Spring Data perspective":

Spring Data Neo4j has some specialities, but on a superficial level, the above statement is correct.

Spring Data depends on the Spring Framework and given that, it's kinda hard to get it to work in environments other than Spring. If you're however using Spring Framework already, I wouldn't think twice to add Spring Data to the mix, regardless whether I have to deal with a relational database or Neo4j.

Given the entity ArtistEntity above, one can just declare a repository as this:

interface ArtistRepository extends Neo4jRepository<ArtistEntity, Long> {
    List<ArtistEntity> findByNameContaining(String name);
}

There is no need to add an implementation for that interface, this is done by Spring Data. Spring Data also wires up a Neo4j-OGM session that is aware of Spring transactions.

From an application developers point you don't have to deal with mapping, opening and closing sessions and transactions any longer, but only with one single "repository" as abstraction over a set of given entities.

Please be aware that the idea behind Spring Data and its repository concept is not having a repository for each entity there is, but only for the root aggregates. To quote Jens Schauder: "Repositories persist and load aggregates. An aggregate is a cluster of objects that form a unit, which should always be consistent. Also, it should always get persisted (and loaded) together. It has a single object, called the aggregate root, which is the only thing allowed to touch or reference the internals of the aggregate." (see Spring Data JDBC, References, and Aggregates).

In my "music" example, I deal with albums released in a given year. The release year is an integral part of the album and it would be weird having an additional repository for it.

So what are the specialities of Spring Data Neo4j? First of all, in the pure Neo4-OGM example you might have noticed the single, lone "1". That specifies the fetch depth in which entities should be loaded. Depending on how entities are modeled, you could ran in the problem, that you fetch your whole graph with hone single query. Specifying the depth means specifying how deep relationships should be fetch. The repository method can be declared analog:

interface ArtistRepository extends Neo4jRepository<ArtistEntity, Long> {
    List<ArtistEntity> findByNameContaining(String name, @Depth int depth);
}

People familiar with Spring Data know that derive query method like the findByNameContaining can be much more complicated. You could even write down

interface ArtistRepository extends Neo4jRepository<ArtistEntity, Long> {
    List<ArtistEntity> findByNameContainingOrWikipediaArticlesTitleIs(String name, String title, @Depth int depth);
}

and so on. I have seen some interesting finder methods here and there. While this is technically possible, I would recommend using the @Query annotation on the method name, write down the query myself and chose a method name that corresponds to the business.

Different abstraction levels

At this point it should be clear, that Neo4j Java-Driver, Neo4j-OGM and Spring Data act on different abstraction levels:



In your application, you have to decide which level of abstraction you need. You can come along way with direct interaction with the driver, especially for all kind of queries that facilitates your database for more than simple CRUD operations. However, I don't think that you want to deal with all the cruft of CRUD yourself throughout your application.

When to use what?

All three abstractions can execute all kind of Cypher queries. If you want to deal with result sets and records yourself and don't mind mapping stuff as you go along, use the Java driver. It has the least overhead. Not mapping stuff to fixed objects has the advantage that you can freely traverse relationships in your queries and use the results as needed.

As soon as you want to map nodes with the same labels and their relationship to other nodes more often than not, you should consider Neo4j-OGM. It takes away the "boring" mapping code from you and helps you to concentrate on your domain. Also, Neo4j-OGM is not tied to Spring. I didn't write application outside the Spring ecosystem for quite a while now. For this post, I needed an example where I don't have Spring, so I came up with the Micronaut demo, that uses both plain Java-Driver access and OGM access. Depending on what you want to achieve, you can combine both approaches: Mapping the boring stuff with Neo4j-OGM, handling "special" results yourself.

If you're writing an application in the Spring-Eco-System and decided for OGM, please also add Spring Data Neo4j to the mix. While it doesn't put any further abstraction layer on the mapping itself and thus is not slowing things down, it takes away the burden dealing with the session and transaction from you.

I do firmly believe that Spring Data Neo4j is the most flexible solution.

  1. Start with a simple repository, relying on the CRUD methods
  2. If necessary, declare your queries with @Query
  3. To differentiate between write and read models, execute writes through mapped @NodeEntities and reads through read-only @QueryResults
  4. Write a custom repository extension and interact directly with the Neo4j-OGM or Neo4j Java-Driver session
  5. d

To complete this post, I'll show you option 2 and 3. Given my AlbumEntity, TrackEntity and a AlbumRepository.

First of all I want a query that retrieves all the albums containing one specific track. That is pretty easy to write in Cypher:

interface AlbumRepository extends Neo4jRepository<AlbumEntity, Long> {
    @Query(value
        = " MATCH (album:Album) - [:CONTAINS] -> (track:Track)"
        + " MATCH p=(album) - [*1] - ()"
        + " WHERE id(track) = $trackId"
        + "   AND ALL(relationship IN relationships(p) WHERE type(relationship) <> 'CONTAINS')"
        + " RETURN p"
    )
    List<AlbumEntity> findAllByTrack(Long trackId);
}

By declaring this additional method on the repository, I know have mapped a simple Cypher query that does complex thinks (Here match all albums that contain a specific track and all the relationships of that album and return that all apart from the other tracks) to my entity. I benefit from SDNs mapping and have all the queries in one place.

In my domain, I didn't model the track as part of the album. Those tracks should be explicitly read and not all the time. I therefore added an additional class, called AlbumTrack. Again, accessors omitted for brevity:

@QueryResult
public class AlbumTrack {
 
	private Long id;
 
	private String name;
 
	private Long discNumber;
 
	private Long trackNumber;
}

Notice the @QueryResult annotation. This is special to Spring Data Neo4j. It marks this as a class that is instantiated from arbitrary query result but doesn't have a lifecycle. It then can be used as in a declarative query method, similar to the first one:

interface AlbumRepository extends Neo4jRepository<AlbumEntity, Long> {
    @Query(value
        = " MATCH (album:Album) - [c:CONTAINS] -> (track:Track) WHERE id(album) = $albumId"
        + " RETURN id(track) AS id, track.name AS name, c.discNumber AS discNumber, c.trackNumber AS trackNumber"
        + " ORDER BY c.discNumber ASC, c.trackNumber ASC"
    )
    List<AlbumTrack> findAllAlbumTracks(Long albumId);
}

while this query is indeed much simpler as the first one, it's important to be able to do such things for designing an application that performs well. Think about it: Is it really necessary to have all the relations to all other possible nodes at hands all the time?

In the end, you might have guess it: There are no silver bullets. There are situations where an approach close to the database is more appropriate than another, sometimes a higher abstraction level is better. Whatever you chose, try not be to dogmatic.

All the examples are part of my bootiful music project, more specifically, the "knowledge" submodule. With the building blocks described here, you can develop an web application that is used for reading and writing data.

The example application uses a simple, server side rendered approach for the frontend, but Spring Data Neo4j plays well with Spring Data Rest and that makes many different approaches possible.

In the next installment of this series, we have a look at the concrete domain modeling with Spring Data Neo4j.

| Comments (5) »

29-Oct-18


How to get data into Neo4j?

This is the second post in the series of From relational databases to databases with relations. Check the link to find all the other entries, too. The source code for everything, including the relational dataset is on GitHub: https://github.com/michael-simons/bootiful-music.
The post has also been featured in This Week in Neo4j.

To feel comfortable in a talk, I need to have a domain in front of me I’m familiar with. Neo4j just announced the Graph Gallery which is super nice to explore Graphs, but I wanted to have something of my own and I keep continue using the idea of tracking my musical habits. In the end I want to have knowledge base in addition to my chart applications (both linked above).

There are several ways to get data into your Neo4j database. One is a simple Cypher statement like this

CREATE (artist:Artist {name: 'Queen'}) RETURN artist

which creates a node with the label Artist and a property name for one of my favorite bands of all time. I can use the MERGE statement, which ensures the existence of a whole pattern (note only a node), too. More on that later, though. In any case, that would be a real effort todo manually. Therefor, let’s have a look how to import data. I found several options:

I find it quite fascinating in how many ways CSV data can be processed from within Neo4j itself. It reads through CSV and provides each row in a way that you can interact with from a Cypher statement like it comes from the graph itself. Those CSV files can be put into a dedicated import folder in the database instance but can also be retrieved from a URL. You can come a long way with that import if you already have CSV or a willing to massage your data a bit so that it fits a plain structure. Rik did this with a beer related data, check it out here: Fun with Beer – and Graphs in Part 1: Getting Beer into Neo4j.

JSON is an option if you want to work agains the many nice APIs out there, but my data sits in a relational database. It looks like this:



This tables store all tracks I have listened to, their artists and genres and in a time series table all plays. If you read the previous post closely, you’ll notice that this database, dubbed statsdb is only in NF2. There is no separate relation for the albums of an artist. They are stored with the tracks. I actually forgot why I modeled it that way. I vaguely remember that I was enjoyed that album names are not necessarily unique and I could find a good business primary key. Anyway, the schema is still running and gives me each month something like that which btw takes only three SQL queries:



In my property graph model I wanted to have separate albums nodes, though, so I had to massage and aggregate my data a bit before creating nodes. As I didn’t want to do this based on CSV, I looked at all the options JDBC. First, Neo4j ETL Tool:

Neo4j ETL Tool

The Neo4j ETL Tool can be downloaded through Neo4j Desktop and needs a free activation key. The link guides you through all the steps necessary to connect the tool to both the Graph database as well as the relational database. The nice thing here: You don’t have to install a driver for popular databases like MySQL or PostgreSQL as it comes with our tool.

I ran the tool agains my database and stopped here:



The ETL Tool recognized my tables and also the foreign keys and proposed a node and relationship structure. This is nice, but doesn’t fit exactly what I want. The relationships can easily be renamed and so can the node labels, but I don’t want the structure. I could just ran this and than transform everything via Cypher again, but that feels weird.

What I want is something like this (plus the play counts, for sure):

MATCH path1=(album:Album {name: 'Jazz ist anders'}) - [:RELEASED_BY] -> (:Artist)
MATCH path2=(album) - [:RELEASED_IN] -> (:Year) - [:PART_OF] -> (:Decade)
MATCH (album) - [:CONTAINS] -> (tracks:Track) 
RETURN path1, path2, tracks


Getting all the artists is simple. It’s basically a “select * from artists” and the corresponding Cypher. This is what Apoc does:

Importing data with JDBC into Neo4j

APOC is not only the name of a technician in the famous Movie “The Matrix”, but also an acronym for “Awesome Procedures on Cypher”. Neo4j is extensible via custom, stored procedures and functions much like PostgreSQL and Oracle. I have been a friend of those for years, I even read and generated Microsoft Excel files from within an Oracle Database. Now I realize, that Michael Hunger does the same for Neo4j 🙂

Anyway. APOC offers “Load JDBC” and Michael has a nice introduction at hand.

To use APOC you have to find the plugin folder of your Neo4j installation. Download the APOC release matching your database version from the above GitHub link and add it to the plugin folder. APOC doesn’t distribute the JDBC drivers itself, those have to be installed as well. As my stats db is PostgreSQL, I grabbed the PostgreSQL JDBC Driver and put it into the plugin folder as well.

Things are super easy from there on. With the following Cypher statement, the Neo4j instances connects agains the PostgreSQL instance, executes a select * from artists and returns each row. YIELD is a subclause to CALL and selects all nodes, relationships or properties returned from the procedure for further processing as bound variables in a Cypher query. I then used them to merge all artists. In case they already existed, I updated some auditing attributes. The associations between some artists have already been there, so merge didn’t remove that:

WITH "jdbc:postgresql://localhost:5432/bootiful-music?currentSchema=dev&user=statsdb-dev&password=dev" as url
CALL apoc.load.jdbc(url,"artists") YIELD row
MERGE (artist:Artist {name: row.name})
   ON CREATE SET artist.createdAt = localdatetime()
   ON MATCH SET artist.updatedAt = localdatetime()
RETURN artist


The merge-clause is really fascinating. It matches a pattern and the merge is applied to the full pattern, meaning it must exist as a whole. In the case above, the pattern is simple, it’s just a single node. If you want to create or update an album released by an artist and use something like this MERGE (album:Album {name: $someAlbumName}) - [:RELEASED_BY] -> (artist:Artist {name: $someNewArtistName}), the statement will always create a new artist node, regardless wether the artist already exists or not as the whole pattern is checked. You have to do those things in separate steps like this:

MERGE (artist:Artist {name: $someNewArtistName})
MERGE (album:Album {name: $someAlbumName}) - [:RELEASED_BY] -> (artist)

First, merge the artist. Then, merge the album and the relationship to the existing artist node. For me, coming from with SQL background, this feels unusual at first, much like I can use something in an imperative AND declarative way.

Now, the above example is still not what I want. apoc.load.jdbc basically gives me the whole table and I have to do every post-processing in afterwards. Luckily, that second parameter can also be a full SQL statement. Now, to create a graph structure like above (minus the tracks though), I did something like this. Behold, all the SQL and Cypher in one statement:

WITH "jdbc:postgresql://localhost:5432/bootiful-music?currentSchema=dev&user=statsdb-dev&password=dev" as url,
     "SELECT DISTINCT a.name as artist_name, t.album, g.name as genre_name, t.year
      FROM tracks t JOIN artists a ON a.id = t.artist_id JOIN genres g ON g.id = t.genre_id
      WHERE t.compilation = 'f'" as sql
CALL apoc.load.jdbc(url,sql) YIELD row
MERGE (decade:Decade {value: row.year-row.year%10})
MERGE (year:Year {value: row.year})
MERGE (year) - [:PART_OF] -> (decade)
MERGE (artist:Artist {name: row.artist_name})
   ON CREATE SET artist.createdAt = localdatetime()
   ON MATCH SET artist.updatedAt = localdatetime()	 
MERGE (album:Album {name: row.album}) - [:RELEASED_BY] -> (artist)
   ON CREATE SET album.createdAt = localdatetime()
   ON MATCH SET album.updatedAt = localdatetime()
MERGE (genre:Genre {name: row.genre_name})
   ON CREATE SET genre.createdAt = localdatetime()
   ON MATCH SET genre.updatedAt = localdatetime()
MERGE (album) - [:HAS] -> (genre)
MERGE (album) - [:RELEASED_IN] -> (year)

Read: Bind the JDBC url and a SQL statement to variables. The sql selects a distinct set of all artists and album, including genre and year, thus denormalizing my relational schema by joining the artists and genres. The resulting tuples are then used to create decades and years, merge artists, genres and albums and finally relating them to each other. I was deeply impressed after I ran that.

First: It’s amazing how well Neo4j integrates with other stuff and secondly, it’s actually pretty readable I think.

In my talk, I will dig deeper in the Cypher used. Also, the statements for generating the track data in the above Graph are bit more complex (you find them here). But in the end, I could have stopped here. And probably, I would have done so, if I had started with APOC in the beginning.

But me being the mad scientist I am, started somewhere else:

Writing your own tool

I did this right after my first look at the ETL tool. Neo4j can be extended with procedures and functions. Procedures stream a result with several attributes, functions return single values. When I joined Neo4j, my dear colleague Gerrit impressed me with knowledge about that stuff and I wanted to keep up with his pace, so I saw a good learning opportunity here.

One does not have to extend from a dedicated class or have to implement an interface for writing a Neo4j stored procedure. It’s enough to annotate the methods that should be exposed with either @Procedure or @UserFunction. It’s necessary that org.neo4j:neo4j is a provided dependency on the class path.

Neo4j provides detailed information how to create a package that just can be dropped into Neo4js plugin folder, find the instructions here. They worked as promised.

What I did was to apply one of my favorite libraries again, run jOOQ against my stats database, generating a type safe DSL dedicated to my domain, packaged it up and than used it in a Neo4j stored procedure. In the following listing you’ll find that I use Neo4js @Context to get hold of the graph database service in which the procedure is running, pass in user name, password and JDBC url to my procedure. Those are used to connect against PostgresSQL (try (var connection = DriverManager.getConnection(url, userName, password);) {} (yeah, that’s Java 11 code), open a Neo4j transaction and than just executing my SQL and manually creating nodes.

public class StatsIntegration {
	@Context
	public GraphDatabaseService db;
 
	@Context
	public Log log;
 
	@Procedure(name = "stats.loadPlayCounts", mode = Mode.WRITE)
	@Description("Loads the play counts for tracks. Note: All the tracks and albums need to be there, use loadAlbumData before.")
	public void loadPlayCounts(
			@Name("userName") final String userName,
			@Name("password") final String password,
			@Name("url") final String url) {
 
		try (var connection = DriverManager.getConnection(url, userName, password);
			 var neoTransaction = db.beginTx()) {
 
			var statsDb = DSL.using(connection);
			var isNoCompilation = TRACKS.COMPILATION.eq("f");
 
			final String columnNameYear = "yearValue";
			final String columnNameMonth = "monthValue";
			final String columnNameNewPlayCount = "newPlayCount";
 
			var year = extract(PLAYS.PLAYED_ON, DatePart.YEAR).as(columnNameYear);
			var month = extract(PLAYS.PLAYED_ON, DatePart.MONTH).as(columnNameMonth);
 
			log.info("Deleting existing playcounts");
			executeQueryAndLogResults("MATCH (:Track) - [playCount:HAS_BEEN_PLAYED_IN] -> () DETACH DELETE playCount", Map.of());
 
			log.info("Loading playcounts from statsdb");
			statsDb
					.select(
							ARTISTS.NAME,
							TRACKS.NAME,
							year, month,
							count().as("newPlayCount")
					)
					.from(PLAYS)
					.join(TRACKS).onKey()
					.join(ARTISTS).onKey()
					.where(isNoCompilation)
					.groupBy(ARTISTS.NAME, TRACKS.NAME, year, month)
					.forEach(r -> {
						var parameters = Map.of(
								"artistName", r.get(ARTISTS.NAME),
								"trackName", r.get(TRACKS.NAME),
								columnNameYear, r.get(columnNameYear),
								columnNameMonth, r.get(columnNameMonth),
								columnNameNewPlayCount, r.get(columnNameNewPlayCount)
						);
 
						executeQueryAndLogResults(CREATE_PLAY_COUNTS, parameters);
					});
 
			log.info("Adding weight to artists");
			executeQueryAndLogResults(WEIGHT_ARTISTS_BY_PLAYCOUNT, Map.of());
			neoTransaction.success();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

The complete sources for that exercise are in the etl module of my project. As interesting as the stored procedure itself is the integration test, that needs both a Neo4j instance and PostgreSQL. For the first I use our own test harness for the later one, Testcontainers.

Conclusion

As much as the last exercise was fun and I learned a ton about extending Neo, working directly with the engine and so on, it’s not only slower (probably due to context switches and the relatively large transaction), but it’s hard to read, by magnitudes. As with a relational database, declarative languages like Cypher and SQL beat explicit programming hand down. If you ever have a need to migrate or aggregate and duplicate data from a relational database into Neo4j, have a look at apoc.load.jdbc. Select the stuff you need in one single context switch and then process as needed, either in a single transaction or also in batches and parallel.

For my music project, I’ll probably keep the ETL Tool around as I like the structure I created for the tool itself, but will rewrite the process from relational to graph based on APOC and plain SQL and Cypher statements, if it ever sees production of any kind.

Overall verdict: It took me a while to find a graph model I like and from which I am convinced I can actually draw some conclusions about the music I like, but after the graph started to resonate with me, it felt very natural and enjoyable.

Next step: Creating an Neo4j-OGM model for the different nodes.


Images in this post: Myself. Featured image – Markus Spiske.

| Comments (0) »

12-Oct-18


From relational databases to databases with relations

In the summer of 2018, I joined Neo4j. This seems odd at first, my “love” for relational databases and SQL is known. I didn’t have this slide in my jOOQ presentation for two years now without reason. But: Looking at the jOOQ and SQL talks from the perspective of early 2000s, they also seemed odd at first. Back then, I never thought doing that much with databases. That changed a lot.

My experience is that all the data your deal with usually has a much longer lifetime than any of your applications sitting on top of that. Knowing one or more database management systems is essential. Being able to query them even more: What Neo4j and relational databases have in common: A great, declarative way to tell them which data to return and not how to return them. In the case of a relational database, this is obviously SQL, which had quite the renaissance for a few years now. Neo4j’s lingua franca is Cypher.

I get to play with Cypher a lot, but in the end this is not what I am working on at Neo4j. My work is focused on our Object Graph Mapper, Neo4j-OGM and the related Spring Data Module, Spring Data Neo4j. We have written about that a bit on medium. Given my experience with Spring, Spring Data and Spring Boot, the role suddenly makes much more sense.

People who entered the IT-conference circus may know the merry-go-round (or should I say “trap”?) of “talks stress me out a lot” – “hey, this is great, just enter another CfP”. I fell for it again and proposed a talk with the above title to several conferences. Now, I have to come up with something.

Back in 2016, when I first held my talk Database centric applications with Spring Boot and jOOQ, I started to write my story down. Back then, it helped me a lot, so here we go again. Join me on my way from relational databases to databases with relations.

Content

What are we talking about here?

The domain will be music. I have been tracking my musical habits for more than 10 years now at my side project Daily Fratze and I enjoy looking back to what I listened like this months but 5 years ago.

Edgar F. Codd

The guy on the left, Edgar F. Codd invented the relational model back in the 1970s. A relation in this model doesn’t describe a relation like the one between two people or a musician associated with a band who released a new track. A relation in the relational model is a table itself. Foreign keys between tables ensure referential integrity, but cannot define relations themselves. I put down some thoughts a while back in this German deck. This is what a relation looks like in a relational database:



One kind of sport to do with relational databases is the process of normalizing data. There are several normal forms. Their goal is to keep a database redundancy free, for several reasons. Back in the 1970, disk space being one of them. First normal form (1NF): All attributes should be atomic. 2NF is 1NF plus no functional dependencies on parts of any candidate keys. That is: There must not be a pair of attributes appearing twice in a relation’s tuples. 3NF forbids transitive dependencies (“Nothing but the key, so help me Codd”) and it gets complicated from there on. I have to say though, that normalization up to 3NF is still relevant today in a relational systems, at least if you’re a friend of (strong) consistent data.

However, as my colleague Rik van Bruggen points out in his book “Learning Neo4j”, relational databases are quite anti-relational.

Why is this? Relations in NF can be queried in many, many ways. Each query send via SQL returns a new relation, what’s the problem? It depends. In a strictly analytical use case, there’s often not a problem. Recreating object hierarchies however, joining things back together, is. A handful of joins is not hard to understand, even without a tool, but self referential joins or a sheer, huge amount, is. It also gets increasingly hard on the database management system.

This is where graphs can come into play. Graphs are another mathematical concept, this time from Graph theory. Sometimes
people call a chart graph by coincident, but this is wrong. A graph is a set of objects with pairs of objects being related. In mathematical terms, those objects are vertices and the relations between them, edges. We call them nodes and relationships in the Neo4j database.

Neo4j is a Property Graph. A property graph adds labels and properties to both nodes and relationships:



The above picture is from our excellent post What is a Graph Database?.

One takeaway from that post is: Neo4j is referred to as a native graph database because it efficiently implements the property graph model down to the storage level. Or in technical terms: Neo4j employs so called index-free adjacency, which is the most efficient means of processing data in a graph because connected nodes physically point to each other in the database. This obliterates the needs for complex joins, either directly or via intersection tables. One just can tell the database to retrieve all nodes connected to another node.

This is not only super nice for simple aggregations of things, but especially for many graph algorithms.

So what has this todo with my talk? My SQL talk was all about doing analytics. That is, retrieving data like in this image from a relational database with build-in analytic functions. Computing running totals, differences from previous windows and so on (read more here).

First of all I’m gonna analyze how to create a graph structure from the very same dataset I used in the SQL talk. There are different tools out there with different approaches. I’ve chosen a technique that resonated for various reasons with me. As I want to enrich the existing dataset, I’ll model a domain around it, with Java, putting Neo4j-OGM to use. I’ll show how Spring Data Neo4j helps me not having to deal with a lot of cruft. In the end, I’ll show that I can build my own music recommendation engine based on 10 years of tracking my musical habits by applying some of the queries and algorithms possible with Neo4j.

Next step: Loading data.


Pictures in this post: Codd and Table – Wikipedia, Property Graph – Neo4j, Featured image – Sarah Cervantes.

| Comments (3) »

11-Oct-18