Hibernate Search and Spring Boot: Simple yet powerful archiving

Build an archive of Tweets with full text search in no time
September 6, 2016 by Michael

This post has been featured in the Hibernate Community Newsletter 18/2016.

Before my summer holidays, I mentioned my personal Twitter archive on Twitter again…

This time, Vlad from Hibernate reacted to my tweet:

More reactions came from Sanne and Emmanuel and here we go:

Content

  1. Source
  2. Background
  3. Features
  4. Tools used
  5. Application
  6. Database schema
  7. The Tweet entity
  8. Storing new entities
  9. Querying entities
  10. Conclusion
  11. Try it out yourself

Source

The whole project, which has already grown into more than a tech demo, is on GitHub: michael-simons/tweetarchive.

What I skipped is a fancy GUI. So far, it only has a REST interface. But it can be run as a Docker image with local, persistent storage. Check it out, star it, maybe even add stuff to it… Feel free!

Background

I have been running my archive for several years now as part of Daily Fratze. Daily Fratze contains a home-grown crawler that checks my user timeline and stores my tweets in a MySQL database. I’m using JPA with Hibernate as my database access tool, so Hibernate Search fits in nicely and is really easy to implement. Hibernate Search is a super easy way to add an Apache Lucene full text index to your entities.

For large-scale applications, Elasticsearch or something similar may be more fitting, but I’m really content with my “small” (at the end of last year ~50 MB) search index and its performance. It doesn’t add much (if any) overhead in development or in production.

For the demo, I’ve taken my entities but not the parser. For parsing in the demo I use Twitter4J. Twitter4J is apparently not made for parsing static tweets, so there are some ugly constructs for getting a Twitter archive into the app, but that should not be the point here. The entities have been adapted and refreshed according to my current skills. Some things I created years ago should never see the light of day.

Features

  • I want to be able to search my tweets, both with keywords and with full-blown Lucene queries
  • The application should track new tweets
  • The original JSON content should be stored as well

Tools used

In order:

Application

The application is a standard Spring Boot application. It’s 2016, so you should find several really good guides out there, and also on this blog, on how such an application is built.

I also assume that you have an idea what Apache Lucene is about.

Database schema

My migrations are inside src/main/resources/db/migration/ where Flyway automatically finds them. Flyway itself is recognized by Spring Boot if it is on the classpath.

I have a PostgreSQL cast that allows me to store a Java String attribute inside a JSONB column without a bunch of custom converters and without explicitly casting it, but with type checks.

The table definition for tweets looks like this:

Nothing fancy here except the raw_data column, which contains the tweet’s original source. You can use PostgreSQL’s JSON operators to query it, if you like.

The Tweet entity

You’ll find the Tweet entity here: src/main/java/ac/simons/tweetarchive/tweets/TweetEntity.java. Basically, it is a standard JPA entity. I use Project Lombok to get rid of boilerplate code, so you’ll find no getters and setters.

For the following stuff, I assume you know JPA, because I’m not gonna cover that.

To make Hibernate Search aware of an entity that should be indexed, you have to annotate the entity class with @Indexed.

That is already all there is!

Next step: add a simple field. To index the screen name, for example, just annotate it with @Field:

That actually reads: index that field, store the value with the index so that it can be retrieved without hitting the database, but don’t do further analysis.

If you read through the entity, you’ll find several such fields.

Next: analyzing fields. I want to search for similar words in the content of the tweet. While receiving the tweet, the application resolves URLs and the like and replaces the short URLs, see TweetStorageService.

The entity takes this one step further. The content field is annotated with:

Here the @Field annotation says: Index the content, don’t store it, but analyze it. It also says, through @AnalyzerDiscriminator, with which analyzer.

I have defined my analyzers right with the entity, but they can be defined elsewhere, on a package for example, too:

I have 3 analyzers in place: an English analyzer, which tokenizes the input, lowercases it and then does English-based word stemming; the same for German; and last but not least, an analyzer that just tokenizes and filters the content.
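To make the three steps of such an analyzer chain tangible, here is a minimal sketch in plain Java. It is not Lucene and not the code from the repository: the class ToyAnalyzer and its naive "strip a trailing s" stemmer are hypothetical stand-ins for the real tokenizer, lowercase filter and language-specific stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-in for an analyzer chain; NOT Lucene's implementation.
final class ToyAnalyzer {

    // Step 1: tokenize on anything that is not a letter or digit.
    // Step 2: lowercase each token.
    // Step 3: reduce each token to a (very naive) stem.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            tokens.add(stem(raw.toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }

    // Hypothetical, deliberately naive stemmer: strips one trailing plural "s".
    static String stem(String token) {
        boolean pluralLike = token.length() > 3
                && token.endsWith("s")
                && !token.endsWith("ss");
        return pluralLike ? token.substring(0, token.length() - 1) : token;
    }
}
```

With this sketch, ToyAnalyzer.analyze("Archived Tweets, searchable!") yields the tokens archived, tweet and searchable, which is why a search for "tweet" can match a text containing "Tweets". A real stemmer handles far more cases than this toy version.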

The analyzer itself can be dynamically inferred with a discriminator, which looks like this:

Read: If the language of the tweet is available and supported, use the fitting analyzer, otherwise use the default analyzer for undefined languages.
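The decision itself boils down to a lookup with a fallback. The following plain-Java sketch shows only that logic; it is not the discriminator class from the repository. The class name AnalyzerChooser and the analyzer names "english", "german" and "undefined" are assumptions made for illustration. In Hibernate Search, a discriminator returns the name of one of the analyzer definitions, so the returned strings stand in for those names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of the selection logic only; names below are hypothetical.
final class AnalyzerChooser {

    static final String ENGLISH = "english";
    static final String GERMAN = "german";
    static final String UNDEFINED = "undefined";

    // Supported ISO language codes mapped to analyzer definition names.
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();

    static {
        BY_LANGUAGE.put("en", ENGLISH);
        BY_LANGUAGE.put("de", GERMAN);
    }

    // Returns the fitting analyzer name, or the default one if the
    // language is missing or not supported.
    static String analyzerNameFor(String languageCode) {
        if (languageCode == null) {
            return UNDEFINED;
        }
        String normalized = languageCode.trim().toLowerCase(Locale.ROOT);
        return BY_LANGUAGE.getOrDefault(normalized, UNDEFINED);
    }
}
```

A tweet marked as "de" would be analyzed with the German analyzer, while a tweet in an unsupported language such as "fr", or one without any language information, falls back to the analyzer that only tokenizes and filters.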

Hibernate Search allows spatial queries. You can annotate the whole class or an attribute, that returns Coordinates:

Nested entities are supported as well. My example: the information regarding a reply. I have InReplyTo as an @Embeddable class and an attribute inReplyTo:

This reads: Please index the embedded class, add a prefix “reply.” to all fields and otherwise, check for @Field annotations in the embedded class.

So far: Not much!

Storing new entities

If you use Spring Boot together with Hibernate and Spring Data JPA, you have nothing to take care of except configuring the database (and you can even skip this if you use an in-memory database).

This is all the configuration it takes to get Hibernate Search up and running with that setup, if you add org.springframework.boot:spring-boot-starter-data-jpa, org.postgresql:postgresql and org.hibernate:hibernate-search-orm to the classpath:

spring.datasource.platform = postgresql
spring.datasource.driver-class-name = org.postgresql.Driver
spring.datasource.url = jdbc:postgresql://localhost:5432/tweetArchive
spring.datasource.username = tweetArchive
spring.datasource.password = tweetArchive
 
spring.jpa.hibernate.ddl-auto = validate
 
spring.jpa.properties.hibernate.search.default.directory_provider = filesystem
spring.jpa.properties.hibernate.search.default.indexBase = ${user.dir}/var/index/default

Just go ahead and define a repository for the TweetEntity:

This is an interface with no implementation in my application. It inherits from org.springframework.data.repository.Repository, thus already providing the means to access entities. I chose the simplest form of repository so that I don’t clutter my application with methods I wouldn’t need. If I had inherited from CrudRepository instead, I wouldn’t have had to define save or delete methods.

Calling the save or delete method from my tweet storage service already updates my search index.

Querying entities

But take good note that this interface also inherits from TweetRepositoryExt. This is the way recommended by Spring Data JPA to add custom behavior. This interface defines two search methods which I actually have to implement. This is done in TweetRepositoryImpl and I’m gonna walk you through the search method:

First I retrieve a new FullTextEntityManager inside the declarative transaction and instantiate a query builder. The query builder exposes a nice, fluent interface to define my Lucene query. You’ll see how I add a keyword query on one specific field and also, if the user provided a date range, I add some range queries to an enclosing boolean condition.

The FullTextEntityManager is then used again to instantiate a JPA query from the full text query and retrieve the result.
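The shape of that condition (a mandatory keyword clause plus optional range clauses) can also be written down as a query string in Lucene’s classic query syntax. The sketch below uses only the JDK and is not the query builder code from TweetRepositoryImpl. The class name TweetQueryStrings, the field names content and createdAt, and the yyyyMMdd date format are assumptions for illustration and may not match how the index actually stores these fields.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Sketch: composes a Lucene query string; all names are hypothetical.
final class TweetQueryStrings {

    // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    // The keyword clause is always required ("+"); the range clause is
    // only added when at least one bound of the date range is given.
    static String build(String keyword, LocalDate from, LocalDate to) {
        List<String> clauses = new ArrayList<>();
        clauses.add("+content:\"" + escapePhrase(keyword) + "\"");
        if (from != null || to != null) {
            String lower = from == null ? "*" : DAY.format(from);
            String upper = to == null ? "*" : DAY.format(to);
            clauses.add("+createdAt:[" + lower + " TO " + upper + "]");
        }
        return String.join(" ", clauses);
    }

    // Escapes backslashes and double quotes so the keyword stays one phrase.
    static String escapePhrase(String keyword) {
        return keyword.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Calling TweetQueryStrings.build("hibernate", LocalDate.of(2016, 1, 1), null) returns +content:"hibernate" +createdAt:[20160101 TO *]. Without a date range, only the keyword clause remains.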

And that’s all there is: I can use (and hide!) the full text queries inside the same repositories I would use elsewhere.

Conclusion

If you are already using Hibernate as your ORM, have embraced Spring Data repositories and need to search some entities, then Hibernate Search may be the right approach for your project. It’s really easy to implement and also easy to use. One downside for a 12-factor app could be the fact that the index is directory-based in the default setting. You can work around it, though, by using JMS or JGroups.

I have been using Hibernate Search for quite a while now on Daily Fratze and on several other internal projects as well, and for my, or rather our, purposes it has been enough.

Try it out yourself

There’s much more to learn in the demo application. Go to michael-simons/tweetarchive and see for yourself. There’s an extensive README that should guide you through running the application yourself. The easiest way is to use a local Docker-based instance.

If you like it, follow me on Twitter (I am @rotnroll666), leave a comment or a star.

2 comments

  1. Al Grant wrote:

    How do you decide what version of hibernate-search-orm to add to your build.gradle to be compatible with Spring Boot 1.5.2?

    Posted on July 12, 2017 at 8:08 AM | Permalink
  2. Michael wrote:

    Hi Al. I used the one that is compatible with the Hibernate version Spring Boot used at the time of writing this post.

    Posted on July 12, 2017 at 8:45 AM | Permalink
