This post has been featured in the Hibernate Community Newsletter 18/2016.
Before my summer holidays I mentioned my personal Twitter archive on Twitter again…
This time, Vlad from Hibernate reacted to my tweet:
@rotnroll666 If you write a blog post, I'll feature it on our Newsletter
— Vlad Mihalcea (@vlad_mihalcea) 27. Juli 2016
More reactions came from Sanne and Emmanuel and here we go:
Content
- Source
- Background
- Features
- Tools used
- Application
- Database schema
- The Tweet entity
- Storing new entities
- Querying entities
- Conclusion
- Try it out yourself
Source
The whole project, which has already grown into more than a tech demo, is on GitHub: michael-simons/tweetarchive.
What I skipped is a fancy GUI. So far, it only has a REST interface. It can, however, be run as a Docker image with local, persistent storage. Check it out, star it, maybe even add stuff to it… Feel free!
Background
I’ve been running my archive for several years now as part of Daily Fratze. Daily Fratze contains a home-grown crawler that checks my user timeline and stores my tweets in a MySQL database. I’m using JPA with Hibernate as my database access tool, so Hibernate Search fits in nicely and is really easy to implement. Hibernate Search is a super easy way to add an Apache Lucene full-text index to your entities.
For large-scale applications, Elasticsearch or something similar may be more fitting, but I’m really content with my “small” search index (~50 MB at the end of last year) and its performance. It adds little, if any, overhead in development and in production.
For the demo, I’ve taken my entities but not the parser. For parsing in the demo I use Twitter4J. Twitter4J is apparently not made for parsing static tweets, so there are some ugly constructs for getting a Twitter archive into the app, but that should not be the point here. The entities have been adapted and refreshed according to my current skills. Some things I created years ago should never see the light of day.
Features
- I want to be able to search my tweets, both with keywords and with full-blown Lucene queries
- The application should track new tweets
- The original JSON content should be stored as well
Tools used
In order:
- Spring Boot for creating the application
- PostgreSQL 9.5, supporting JSON natively inside the database
- Hibernate 5 as ORM
- Hibernate Search on top to index my entities
- Spring Data JPA for storing and accessing entities
Application
The application is a standard Spring Boot application. It’s 2016, so you should find several really good guides out there, and also on this blog, on how such an application is built.
I also assume that you have an idea what Apache Lucene is about.
Database schema
My migrations are inside src/main/resources/db/migration/, where Flyway automatically finds them. Flyway itself is picked up by Spring Boot if it is on the classpath.
I have defined a PostgreSQL cast that allows me to store a Java String attribute inside a JSONB column without a bunch of custom converters and without explicit casts, but with type checks.
The table definition for tweets looks like this:
Nothing fancy here except the raw_data column, which contains the tweet’s original source. You can use PostgreSQL’s JSON operators to query it, if you like.
The Tweet entity
You’ll find the Tweet entity in src/main/java/ac/simons/tweetarchive/tweets/TweetEntity.java. Basically, it is a standard JPA entity. I use Project Lombok to get rid of boilerplate code, so you’ll find no getters and setters.
For the following, I assume you know JPA, because I’m not going to cover it here.
To make Hibernate Search aware of an entity that should be indexed, you have to annotate the entity with @Indexed:
That is already all there is!
Next step: Add a simple field, for example the screen name, just annotate it with @Field:
That actually reads: Index that field, store the value with the index so that it can be read back without hitting the database, but don’t do any further analysis.
If you read through the entity, you’ll find several such fields.
Next: Analyzing fields. I want to search for similar words in the content of the tweet. While receiving the tweet, the application resolves URLs and the like and replaces the short URLs, see TweetStorageService.
The entity takes this one step further. The content field is annotated with:
Here the @Field annotation says: Index the content, don’t store it, but analyze it. It also says, through @AnalyzerDiscriminator, with which analyzer.
I have defined my analyzers right on the entity, but they can be defined elsewhere too, on a package for example:
I have 3 analyzers in place: an English analyzer, which tokenizes the input, lowercases it and then applies English word stemming; the same for German; and, last but not least, an analyzer that just tokenizes and filters the content.
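To make the first two steps of such a chain concrete, here is a toy in plain Java. It is only an illustration under my own assumptions: the class name ToyAnalyzer is made up, stemming is left out entirely, and the real analyzers are Lucene components configured through Hibernate Search, not hand-written code like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-in for an analyzer chain: tokenize first, then lowercase.
// Stemming is deliberately omitted; the name ToyAnalyzer is hypothetical.
final class ToyAnalyzer {

    private ToyAnalyzer() {
    }

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on every run of characters that is neither a letter nor a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            // A leading separator produces one empty token, which is skipped here.
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

Calling ToyAnalyzer.analyze("Hibernate Search ROCKS!") yields the tokens hibernate, search and rocks. A stemming step would then additionally reduce each token to its word stem, which is what makes "similar words" match.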
The analyzer itself can be dynamically inferred with a discriminator, which looks like this:
Read: If the language of the tweet is available and supported, use the fitting analyzer, otherwise use the default analyzer for undefined languages.
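The decision described above can be sketched in plain Java. This is not the class from the project: it only mirrors the shape of the getAnalyzerDefinitionName method of Hibernate Search’s Discriminator interface without implementing it, so that it compiles on its own. The analyzer names "en", "de" and "und" are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java sketch of the language-to-analyzer decision.
// In a real setup this logic would sit in a class implementing
// org.hibernate.search.analyzer.Discriminator; all names below are hypothetical.
final class LanguageDiscriminatorSketch {

    // Assumed analyzer definition names, one per supported language.
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("en", "de"));

    // Assumed name of the analyzer that only tokenizes and filters.
    private static final String FALLBACK = "und";

    // Same parameter list as Discriminator#getAnalyzerDefinitionName.
    String getAnalyzerDefinitionName(Object value, Object entity, String field) {
        if (value instanceof String) {
            String language = ((String) value).toLowerCase(Locale.ROOT);
            if (SUPPORTED.contains(language)) {
                return language;
            }
        }
        // Language missing or unsupported: use the analyzer for undefined languages.
        return FALLBACK;
    }
}
```

With these assumed names, a tweet tagged "de" is analyzed with the German analyzer, while a tweet without a language or with an unsupported one such as "fr" falls back to "und".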
Hibernate Search allows spatial queries. You can annotate the whole class or an attribute that returns Coordinates:
Nested entities are supported as well. My example: the information regarding a reply. I have InReplyTo as an @Embeddable class and an attribute inReplyTo:
This reads: Please index the embedded class, add the prefix “reply.” to all fields and otherwise check for @Field annotations in the embedded class.
So far: Not much!
Storing new entities
If you use Spring Boot together with Hibernate and Spring Data JPA, you have nothing to take care of except configuring the database (and you can even skip this if you use an in-memory database).
This is all the configuration it takes to get Hibernate Search up and running with that setup, once you add org.springframework.boot:spring-boot-starter-data-jpa, org.postgresql:postgresql and org.hibernate:hibernate-search-orm to the classpath:
spring.datasource.platform = postgresql
spring.datasource.driver-class-name = org.postgresql.Driver
spring.datasource.url = jdbc:postgresql://localhost:5432/tweetArchive
spring.datasource.username = tweetArchive
spring.datasource.password = tweetArchive
spring.jpa.hibernate.ddl-auto = validate
spring.jpa.properties.hibernate.search.default.directory_provider = filesystem
spring.jpa.properties.hibernate.search.default.indexBase = ${user.dir}/var/index/default
Just go ahead and define a repository for the TweetEntity:
This is an interface with no implementation in my application. It inherits from org.springframework.data.repository.Repository, so Spring Data already provides the means to access entities. I chose the simplest form of repository so that I don’t clutter my application with methods I don’t need. Had I inherited from CrudRepository instead, I wouldn’t have had to declare the save or delete methods.
Calling the save or delete method from my tweet storage service already updates my search index.
Querying entities
But take good note that this interface also inherits from TweetRepositoryExt. This is the way recommended by Spring Data JPA to add custom behavior. This interface declares two search methods, which I actually have to implement. That is done in TweetRepositoryImpl, and I’m going to walk you through the search method:
First I retrieve a new FullTextEntityManager inside the declarative transaction and instantiate a query builder. The query builder exposes a nice, fluent interface to define my Lucene query. You’ll see how I add a keyword query on one specific field and, if the user provided a date range, also add range queries to an enclosing boolean condition.
The FullTextEntityManager is then used again to instantiate a JPA query from the full text query and retrieve the result.
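The way the query is put together (one mandatory keyword clause, plus range clauses that are only added when the user supplied a bound) can be shown with an in-memory analogy in plain Java. To be clear, this is not the Hibernate Search query DSL and not the code from TweetRepositoryImpl: SearchSketch, its nested Tweet class and the substring match are hypothetical stand-ins, and nothing here touches a Lucene index.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// In-memory analogy of the query composition; all names are hypothetical.
final class SearchSketch {

    // Stripped-down stand-in for the indexed tweet.
    static final class Tweet {
        final String content;
        final LocalDate createdAt;

        Tweet(String content, LocalDate createdAt) {
            this.content = content;
            this.createdAt = createdAt;
        }
    }

    private SearchSketch() {
    }

    // The keyword clause is always required; each range clause is optional.
    static Predicate<Tweet> buildQuery(String keyword, LocalDate from, LocalDate to) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        // Crude stand-in for a keyword query on the content field.
        Predicate<Tweet> query = t -> t.content.toLowerCase(Locale.ROOT).contains(needle);
        if (from != null) {
            // Lower bound, inclusive.
            query = query.and(t -> !t.createdAt.isBefore(from));
        }
        if (to != null) {
            // Upper bound, inclusive.
            query = query.and(t -> !t.createdAt.isAfter(to));
        }
        return query;
    }

    static List<Tweet> search(List<Tweet> tweets, String keyword, LocalDate from, LocalDate to) {
        Predicate<Tweet> query = buildQuery(keyword, from, to);
        List<Tweet> hits = new ArrayList<>();
        for (Tweet tweet : tweets) {
            if (query.test(tweet)) {
                hits.add(tweet);
            }
        }
        return hits;
    }
}
```

Passing null for both bounds gives a pure keyword search; passing one or both narrows the result, which corresponds to adding range clauses to the surrounding boolean condition.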
And that’s all there is: I can use (and hide!) the full text queries inside the same repositories I would use elsewhere.
Conclusion
If you are already using Hibernate as your ORM, have embraced Spring Data repositories and need to search some entities, then Hibernate Search may be the right approach for your project. It’s really easy to implement and also easy to use. One downside for a twelve-factor app could be the fact that the index is directory-based in the default setting. You can work around it, though, by using JMS or JGroups.
I have been using Hibernate Search for quite a while now on Daily Fratze and on several other internal projects as well, and for my, or rather our, purposes it has been enough.
Try it out yourself
There’s much more to learn in the demo application. Go to michael-simons/tweetarchive and see for yourself. There’s an extensive README that should guide you through running the application yourself. The easiest way is to use a local Docker-based instance.
If you like it, follow me on Twitter, I am @rotnroll666, leave a comment or a star.
2 comments
How do you decide which version of hibernate-search-orm to add to your build.gradle to be compatible with Spring Boot 1.5.2?
Hi Al. I used the one that is compatible with the Hibernate version Spring Boot used at the time of writing this post.