
Sunday, 20 January 2013

Hibernate as Persistence Provider and ORM Solution - VII

 


Caching in Hibernate and its benefits
Hibernate caches persistent objects for later use in the application, which results in fewer database round trips and ultimately enhances overall performance. One issue with caching is the possibility of stale data in the cache, which can be mitigated to some extent by setting a time interval for refreshing the cached objects. Caching can be very helpful indeed if the application data is not likely to change frequently.

How is data stored in and retrieved from the cache, conceptually?
Let's discuss the internal working of caching in the Hibernate framework. It is very important to note that Hibernate stores the state of entities in the cache by indexing them with their corresponding keys, which are Serializable identifiers. Similarly, in order to fetch the data of an entity from the cache, all Hibernate needs is the key. In other words, a unique identifier value, also known as a key, is associated with each entity stored in the cache.
It is important to note that the entity instance itself is never cached; only the values of the individual properties that define the state of that entity are cached.
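The idea above — a cache region maps a Serializable identifier to the entity's "dehydrated" property values, never to the entity instance itself — can be pictured with a minimal sketch. The class and method names below are invented for illustration; they are not Hibernate's internal classes.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of an entity cache region: each entry maps a
// Serializable identifier to the entity's property values, not to
// the entity instance itself.
public class EntityCacheSketch {
    private final Map<Serializable, Object[]> region = new HashMap<>();

    public void put(Serializable id, Object[] propertyValues) {
        region.put(id, propertyValues);
    }

    public Object[] get(Serializable id) {
        return region.get(id);
    }

    public static void main(String[] args) {
        EntityCacheSketch commentRegion = new EntityCacheSketch();
        // "Dehydrated" state of a Comment: visitorName, description, createdDate
        commentRegion.put(1L, new Object[] {"Peter", "Well done!", "2013-01-20"});

        Object[] state = commentRegion.get(1L);
        System.out.println(state[0]); // prints "Peter"
    }
}
```

Note that looking up the entry requires nothing but the id, which is exactly why the identifier is the key of the cache.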

Suppose we have an entity named 'Comment' with attributes id (the Serializable identifier), visitorName, description and createdDate. The following table shows how it is conceptually stored in the cache (the values are illustrative):

Key (id)    Cached state (property values)
1           [visitorName, description, createdDate] of Comment 1
2           [visitorName, description, createdDate] of Comment 2

As all Comment entities are indexed by the id field, which is a Serializable identifier (and the primary key), we need the id value to look up any particular Comment entity in the cache.

Types of Caching available in Hibernate
Now we will look at the different types of caching available in Hibernate and how they play a significant role in improving application performance. We will also look at scenarios in which Hibernate caching may actually result in poor performance.
There are three types of caching available in Hibernate:

  1. Session Cache or First Level Cache
  2. Query Cache
  3. Second Level Cache

Searching an entity data in Cache
Hibernate first looks up the data associated with an entity in the first-level cache. If it is not there, it looks for the same data in the second-level cache, if that is enabled. If the data is not found in either cache, Hibernate queries the underlying database to get the data directly.
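The lookup cascade just described can be sketched in a few lines of plain Java. These are hypothetical classes for illustration only, not Hibernate internals; the string returned by the "database" step merely stands in for a real SELECT.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the lookup order: first-level cache,
// then second-level cache (if enabled), then the database.
public class CacheLookupSketch {
    private final Map<Serializable, Object> firstLevel = new HashMap<>();
    private final Map<Serializable, Object> secondLevel = new HashMap<>();
    private final boolean secondLevelEnabled;
    public int databaseRoundTrips = 0; // counts simulated SELECTs

    public CacheLookupSketch(boolean secondLevelEnabled) {
        this.secondLevelEnabled = secondLevelEnabled;
    }

    public Object find(Serializable id) {
        Object entity = firstLevel.get(id);        // 1. session cache
        if (entity == null && secondLevelEnabled) {
            entity = secondLevel.get(id);          // 2. second-level cache
        }
        if (entity == null) {
            databaseRoundTrips++;
            entity = "row-" + id;                  // 3. stands in for a SELECT
            if (secondLevelEnabled) {
                secondLevel.put(id, entity);
            }
        }
        firstLevel.put(id, entity);                // loaded data joins the session cache
        return entity;
    }
}
```

Calling find(1L) twice on the same instance triggers only one simulated database round trip; the second call is served from the cache.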

Why does Hibernate force the id field of an entity to implement Serializable?
It is really important to understand why the Hibernate get() and load() methods force the primary key field to implement Serializable. There are mainly two reasons for this requirement, as given below.

  1. The entity data being looked up might not be present in the first-level cache or the second-level cache (if enabled).
  2. The underlying database server might be running on a different machine in the network.

Suppose our database server is on another machine and we pass an identifier value to the Hibernate get() or load() method. If the data associated with that identifier is not found in the first-level cache or the second-level cache (if enabled), the underlying database will be queried, which in our case is on a remote machine. Only a Serializable identifier can be sent over the network to fetch the related record from the database; if the id were not Serializable, it could not be passed over the network and the data could never be fetched in this scenario.
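Whatever the transport details, the practical contract is simply that the identifier type must implement Serializable, so it can survive a serialization round trip. A quick self-contained demonstration with a Long id (the class name here is invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Demonstrates that a Serializable identifier (here a Long)
// survives a full serialize/deserialize round trip.
public class SerializableIdDemo {
    public static byte[] serialize(Serializable id) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(id);
        }
        return bytes.toByteArray();
    }

    public static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Long id = 42L;
        Object roundTripped = deserialize(serialize(id));
        System.out.println(roundTripped); // prints 42
    }
}
```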

Session Cache or First Level Cache

Enabled by default
As the name suggests, this cache stores objects within the current session. Since all objects loaded in the current session are stored in this cache, it is enabled by default. This default caching is provided by the Hibernate framework itself, so unlike the second-level cache we do not have to use any third-party solution.

Using Session Cache for better performance
As each session is basically associated with a corresponding database connection (which is short-lived), there is little likelihood of a large number of objects accumulating in the session cache, which is a good thing from a memory perspective. We should never use the same session in multiple threads, as that effectively creates a memory leak by allowing ever more objects to reside in the session cache.

It is not possible to cache the results of an HQL query and its parameters using the session cache alone, because in that case Hibernate cannot associate queries and their parameters with the corresponding identifier values that are actually used to index the cached entities.
Example:
Session session = getSessionFactory().openSession();
Transaction tx = session.beginTransaction();
Post p1 = (Post) session.createQuery("FROM Post p WHERE p.id=1").uniqueResult();

// The session cache will not store the result
// of the above HQL query, as the query cache
// is not enabled
System.out.println(p1.getTitle());

Post p2 = (Post) session.createQuery("FROM Post p WHERE p.id=1").uniqueResult();

// As the result of the first query was not cached,
// the same query requires another database round trip
System.out.println(p2.getTitle());

tx.commit();
session.close();
 
 
In the above example the session cache will not store the result of the first HQL query, which causes the same query to run twice in the background (Hibernate cannot associate HQL queries with valid identifier values, i.e. keys, unless the Query Cache is enabled).

Session session = getSessionFactory().openSession();
Transaction tx = session.beginTransaction();

// Gets Post entity with id 1 from database
Post p1 = (Post) session.get(Post.class, 1L);
 
System.out.println(p1.getTitle());

// Gets Post entity with id 1 again, but this time from the cache
Post p2 = (Post) session.get(Post.class, 1L);
System.out.println(p2.getTitle());
tx.commit();
session.close();
In the above example the session cache stores the data fetched with the get() method. As we have already passed the key (the Serializable identifier) that is internally used to index the entity in the session cache, calling get() again with the same key returns the same entity from the session cache. Only one SELECT query runs in the background, as the data is retrieved from the session cache on the second call.

Note: The Hibernate get() method returns null if no record is found, while the load() method throws an exception in this case. Other than that, both methods serve the same purpose of looking up an entity by its unique, Serializable identifier value.


Query Cache

The Query Cache is very useful for storing the results of queries that run frequently with the same parameters. This cache is not enabled by default. It comprises two cache regions, namely StandardQueryCache and UpdateTimestampsCache. We should always be very careful when using the Query Cache, as it can harm latency and scalability in many common scenarios.
Note: The Query Cache is a Hibernate-specific feature and is not part of the JPA specification.

Enabling Query Cache
This caching is enabled by adding the following line to the Hibernate configuration file.
<property name="hibernate.cache.use_query_cache">true</property>

Along with adding the above property, we also need to mark each query as cacheable, i.e. query.setCacheable(true).

StandardQueryCache Region
The Query Cache associates an HQL query and its parameters with the related identifier values (one or more). The following table shows conceptually how this happens, using the post-comments example mentioned above (the identifier values are illustrative):

Query + parameters                          Identifier values
"FROM Comment c WHERE c.post.id=?", 1       [1, 2, 3]
"FROM Comment c WHERE c.post.id=?", 2       [4, 5]
Now the question arises: if the query cache stores only the identifier values for each query and its parameters, then where is the state of the entities that represents the actual query results stored? This question is partially answered by the Hibernate documentation, quoted below:

Note that the query cache does not cache the state of the actual entities in the result set; it caches only identifier values and results of value type. The query cache should always be used in conjunction with the second-level cache.

Does it mean that the state of entities can only be stored in second level cache?
The answer is no (which is very well described in this post). A session itself is supposed to be short-lived, so it is very unlikely that the same SELECT queries will be executed frequently with the same parameters within a single session. Therefore, although the Query Cache can be used with the session cache alone, without enabling the second-level cache, in real scenarios we should always enable the second-level cache alongside it.
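The division of labour described above — the query cache holding only identifier values, the entity cache holding the actual state — can be sketched as follows. The classes and method names are hypothetical, invented purely to illustrate the two-step lookup; they are not Hibernate internals.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: the query cache maps (query string + parameters) to a list
// of identifier values; the entity state lives in a separate region.
public class QueryCacheSketch {
    private final Map<List<Object>, List<Serializable>> queryCache = new HashMap<>();
    private final Map<Serializable, Object[]> entityCache = new HashMap<>();

    private List<Object> key(String hql, List<Object> params) {
        List<Object> key = new ArrayList<>();
        key.add(hql);
        key.addAll(params);
        return key;
    }

    public void cacheQueryResult(String hql, List<Object> params, List<Serializable> ids) {
        queryCache.put(key(hql, params), ids);
    }

    public void cacheEntity(Serializable id, Object[] state) {
        entityCache.put(id, state);
    }

    public List<Object[]> lookup(String hql, List<Object> params) {
        List<Serializable> ids = queryCache.get(key(hql, params));
        if (ids == null) return null;              // this query was never cached
        List<Object[]> results = new ArrayList<>();
        for (Serializable id : ids) {
            results.add(entityCache.get(id));      // resolve each id to entity state
        }
        return results;
    }
}
```

The lookup only succeeds end-to-end when both halves are populated, which is why the documentation says the query cache should always be used together with the second-level cache.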

Session session = getSessionFactory().openSession();
Transaction tx = session.beginTransaction();
Query query = session.createQuery("FROM Post p WHERE p.id=1");
query.setCacheable(true);
Post p1 = (Post) query.uniqueResult();

// The result of the above HQL query
// is cached with the help of the query cache
System.out.println(p1.getTitle());

Query query2 = session.createQuery("FROM Post p WHERE p.id=1");
query2.setCacheable(true);
Post p2 = (Post) query2.uniqueResult();

// As the result of the above query was already cached,
// the same query will not run again; the result is
// retrieved from the cache this time
System.out.println(p2.getTitle());

tx.commit();
session.close();
UpdateTimestampsCache Region
This cache region keeps track of the timestamp of the most recent update to each queryable table, so that stale results can be identified and invalidated. When a query result is read from the cache, if the corresponding queryable table(s) are found to have changed after that result was cached, the result is invalidated and must be fetched again.
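The staleness check just described can be sketched in a few lines: a cached result is valid only if every table it reads from was last updated before the result was cached. The class below is an invented illustration, not Hibernate's UpdateTimestampsCache implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of timestamp-based invalidation: a cached query result is
// up to date only if none of its tables changed after it was cached.
public class TimestampCacheSketch {
    // table name -> timestamp of its most recent update
    private final Map<String, Long> updateTimestamps = new HashMap<>();

    public void tableUpdated(String table, long timestamp) {
        updateTimestamps.put(table, timestamp);
    }

    public boolean isUpToDate(Iterable<String> tables, long resultCachedAt) {
        for (String table : tables) {
            Long lastUpdate = updateTimestamps.get(table);
            if (lastUpdate != null && lastUpdate >= resultCachedAt) {
                return false; // table changed after caching -> result is stale
            }
        }
        return true;
    }
}
```

This also makes the native-query problem discussed below easy to see: bumping the timestamp of every table invalidates every cached query result at once.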

Worst cases for Query Cache
Case 1: Suppose we are using a native query in Hibernate, and both the query cache and the second-level cache are enabled. The following code shows how a native query is created in Hibernate:

String sqlString = "UPDATE Comment SET description='test query cache behavior'"; 
SQLQuery nativeQuery = getSession().createSQLQuery(sqlString); 
nativeQuery.executeUpdate(); 
 
 
Notice that, as Hibernate does not know which entity is associated with the above query, it updates the UpdateTimestampsCache entry for every table in the database. This is a considerable overhead, particularly if there are a large number of tables. Another impact is that it invalidates all query results cached by the Query Cache.
To avoid this situation we need to tell Hibernate which entity class the above query belongs to. The following is the solution:

String sqlString = "UPDATE Comment SET description='test query cache behavior'";
SQLQuery nativeQuery = getSession().createSQLQuery(sqlString);
nativeQuery.addSynchronizedEntityClass(Comment.class);
nativeQuery.executeUpdate();
Now only the UpdateTimestampsCache entry for the Comment table will be updated, which is correct.

Case 2: We should not use objects as parameters of HQL or Criteria queries, because in that case the object itself, along with all the other objects it references, is stored unnecessarily in the cache. This drastically increases query cache usage until either memory consumption exceeds the configured query cache limits and entries are evicted, or the table data is updated, making the query results stale.
To avoid this situation we should always use simple (primitive or string) values as parameters in HQL and Criteria queries. Since we can reference dotted property paths in Criteria restrictions as well as in HQL, we can always rewrite the code to use simple values as parameters instead of objects.
For illustration purposes, consider the following example:

Comment comment = new Comment();
comment.setVisitorName("Peter");
comment.setDescription("Well done!");

session.createCriteria(Post.class)
    .add( Restrictions.naturalId()
        .set("comment", comment)
    ).setCacheable(true);

// The above code should be optimized as follows

session.createCriteria(Post.class)
    .add( Restrictions.naturalId()
        .set("comment.visitorName", "Peter")
        .set("comment.description", "Well done!")
    ).setCacheable(true);
 



When to use Query Cache
The Query Cache is very helpful for looking up entities by immutable or constant natural keys. A natural key is a field or set of fields that uniquely identifies a record in a table and has some business meaning (i.e., unlike a surrogate key, it is meaningful to the end user). For example, consider a Post table with an auto-generated primary key. In addition to the primary key, one or more other columns of the Post table might form a natural key, such as title and authorName.

If we need to retrieve an entity frequently by its natural key, the second-level cache cannot be used directly, as it looks up entities by primary key. In this case a Criteria query can be used to first look up the primary key corresponding to the given natural key, which the second-level cache can then use to look up the desired entity. The following code snippet shows this natural-id cache optimization using Criteria.

@Entity 
public class Post { 
  @Id 
  @GeneratedValue 
  private long id; 

  @NaturalId 
  private String title; 
  @NaturalId 
  private String authorName; 
 
  // ....  
} 

session.createCriteria(Post.class) 
    .add( Restrictions.naturalId() 
        .set("title", "Introduction to Hibernate") 
        .set("authorName", "Atif") 
    ).setCacheable(true) 
    .uniqueResult();  
 
The following definition from Hibernate's reference manual explains why Hibernate enforces unique and non-null constraints on the table columns that form a natural key.

"A natural key is a property or combination of properties that is unique and non-null. It is also immutable. Map the properties of the natural key as @NaturalId or map them inside the <natural-id> element. Hibernate will generate the necessary unique key and nullability constraints and, as a result, your mapping will be more self-documenting."

By default, natural identifier properties are assumed to be immutable (constant) unless we explicitly specify @NaturalId(mutable = true). In the above example, therefore, modifications to the Post table will never change the natural-key-to-primary-key mappings. For this reason the check against the timestamp cache is skipped, and performance improves because the invalidation problem is avoided and the query cache gets more hits.

Some weaknesses of Query Cache
  • Cached query results are frequently invalidated by table modifications, even when a modification is unrelated to the cached results, because any modification to a table updates the timestamp cache, related or not.
  • The query string may comprise hundreds of characters and may repeat with different parameters; since the query together with its parameters forms the key stored by the Query Cache, there is always the likelihood of high memory usage.
  • Operations such as updating a table's timestamp, or a lookup through the query cache, acquire a lock for that table, which can easily become a performance bottleneck when another thread tries to run queries on the same table simultaneously.
Note: Timestamp cache entries should always be evicted after the corresponding query cache entries, so that the query cache can still find the last update timestamp in the timestamp cache.


Second Level Cache

The second-level cache stores data across sessions; its scope is the SessionFactory. Data is stored as key-value pairs, where the key is the Serializable identifier and the value is the state of the entity, so cached entities are always looked up by identifier value. The second-level cache is most suitable for entities that are frequently read, infrequently updated, and not critical if slightly stale.
Note: For queries that are frequently executed with the same parameters against tables that are rarely updated, we can use the query cache to store the query results along with the second-level cache.

Cache Level 2 - Choosing the right implementation

To select from the available second-level cache implementations, we first need to identify the concurrency strategies our application requires.
The primary goal of a concurrency strategy is to define how items of data are stored in and retrieved from the cache. Hibernate supports several concurrency strategies, which are discussed below.

Transactional
The transactional strategy is synchronous: the cache is updated within the transaction itself. It is used for data that is often read and rarely updated, where it is important to avoid stale data in concurrent transactions.

Read/Write
In this strategy an entity is soft-locked whenever it is updated, so any simultaneous access is sent to the database. This strategy can be used in the same scenario: data that is often read and rarely updated, where it is important to avoid stale data.

Nonstrict – Read/Write
This strategy never locks an entity, so there is no guarantee of consistency between the cache and the database. It is used when data hardly ever changes and a small likelihood of stale data is of no great concern. With this strategy we should define an appropriate cache timeout to reduce the possibility of stale data. It is slower than the read-only strategy yet faster than the read/write strategy.

Read Only
This strategy is used for data that is often read but never updated, so there is no possibility of stale data. It is the simplest and fastest strategy.

Note: Nonstrict read/write and read/write are both asynchronous strategies, meaning the cache is updated after the transaction completes.
The following table compares some available second-level cache implementations:

Cache            | Read-Only | Nonstrict Read/Write | Read/Write | Transactional
EHCache          | Yes       | Yes                  | Yes        | No
OSCache          | Yes       | Yes                  | Yes        | No
SwarmCache       | Yes       | Yes                  | No         | No
JBoss TreeCache  | Yes       | No                   | No         | Yes

Enabling Second Level Cache & Configuring EHCache

Now let's configure EHCache, the most commonly used cache provider. Please refer to the EhCache documentation for its configuration, which is available here.
Note: The default Hibernate cache provider is NoCacheProvider. This means that if we don't specify and configure a cache provider, there will be no caching at all, and attempts to cache entities will simply be ignored.

The following information is extracted from the user guide documentation available here.

If you are enabling both second-level caching and query caching, then your hibernate config file should contain the following:

<property name="hibernate.cache.use_second_level_cache">true</property> 
<property name="hibernate.cache.use_query_cache">true</property> 
<property name="hibernate.cache.region.factory_class">net.sf.ehcache.hibernate.EhCacheRegionFactory</property>  
... 
In addition to configuring the Hibernate second-level cache provider, Hibernate must also be told to enable caching for entities, collections, and queries. For example, to enable cache entries for the domain object com.somecompany.someproject.domain.Country there would be a mapping file something like the following:  

<hibernate-mapping> 
<class name="com.somecompany.someproject.domain.Country" table="ut_Countries" dynamic-update="false" dynamic-insert="false" > ... </class> 
</hibernate-mapping>
To enable caching, add the following element.
<cache usage="read-write|nonstrict-read-write|read-only" />
For example:
<hibernate-mapping>  
<class name="com.somecompany.someproject.domain.Country" table="ut_Countries" dynamic-update="false" dynamic-insert="false" > 
  <cache usage="read-write" /> ...  
</class> 
</hibernate-mapping>
This can also be achieved using the @Cache annotation, e.g.
@Entity 
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE) 
 public class Country {
 ...  
}




Sunday, 13 January 2013

Hibernate as Persistence Provider and ORM Solution - VI




[N + 1] Selects Problem


Suppose there is a blog with several posts, and each post has its own set of comments. In this case there is a one-to-many association between post and comments. Let's say we have N posts and we want to read the data of all the posts along with their associated comments:

/* Run the below query only once */
SELECT * FROM post;

/* Run the below query once for each of the N posts */
SELECT * FROM comment WHERE post_id = ?

Obviously, in the above situation we need to run N + 1 queries to fetch the desired data. This can be a serious performance problem if there are a large number of posts, as the number of posts (N) determines how many round trips are needed to fetch the complete data. If the database server is running on a different machine, the extra round trips are even more expensive.
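The query count above can be expressed in a trivial sketch (the class name is invented for illustration): one query for the posts, plus one per post for its comments.

```java
// Counts the queries issued in the N + 1 selects scenario:
// one SELECT for all posts, then one SELECT per post for its comments.
public class NPlusOneDemo {
    public static int queryCount(int numberOfPosts) {
        int queries = 1;                  // SELECT * FROM post
        for (int i = 0; i < numberOfPosts; i++) {
            queries++;                    // SELECT * FROM comment WHERE post_id = ?
        }
        return queries;
    }

    public static void main(String[] args) {
        System.out.println(queryCount(200)); // prints 201
    }
}
```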

Lazy fetching vs Eager fetching

Lazy fetching is desirable most of the time, as it avoids unnecessarily loading the data of every entity and collection associated with the entity being fetched, which helps use memory efficiently.
Suppose we want to write a program that prints the title of each post. For just printing post titles we obviously do not need the comments associated with each post, so lazy fetching is the best option here: loading the comments data unnecessarily would waste memory.
Now change the scenario: the program lists, for each post, its title along with the names of the visitors who commented on it. In this situation we need information about both the posts and their associated comments. Here lazy fetching is a bad choice, because it will not fetch the associated comments while fetching the posts, which brings a performance penalty. Eager fetching is helpful in such situations, ensuring that the comments data is fetched whenever a post is loaded into memory.
Note: We should use lazy fetching by default and only selectively enable eager fetching where needed.

Fetching Strategies in Hibernate

Fetching strategies have a very significant role in improving the performance of an ORM framework, but at the same time they can have disastrous impacts if used inappropriately. The following fetching strategies are available in Hibernate:
  1. Select Fetching
  2. Batch Fetching
  3. Sub-select Fetching
  4. Join Fetching
Note: Annotations used for choosing the fetch strategies are not part of JPA specification.

 

SELECT Fetching

By default, Hibernate uses the lazy select fetching strategy. This strategy is helpful if we do not iterate through our results and access the association of each of them. It is the strategy most vulnerable to the [N + 1] queries problem: in the extreme case, where we have a large number of records and the data of one or more associated entities/collections must also be loaded, the select strategy runs into the [N + 1] queries problem.
@OneToMany(fetch=FetchType.LAZY)
@Fetch(FetchMode.SELECT)
private Set<Comment> comments;


Suppose there are 200 posts in our post-comments scenario, with 5 comments per post on average. In this case 201 queries will run in the background.
/* 1 query to fetch all posts */
SELECT * FROM post;

/* 200 queries to fetch all associated comments */ 
SELECT * FROM comment WHERE post_id =  1
SELECT * FROM comment WHERE post_id =  2 
  .                                    .  
  .                                    .  
  .                                    .  
SELECT * FROM comment WHERE post_id =  200
Note: Accessing a lazy association after the Hibernate Session has been closed will result in an exception.
Note: Hibernate3 supports the lazy fetching of individual properties (i.e table columns). This optimization technique is also known as fetch groups.

 

BATCH Fetching

The select strategy can be tuned further by setting a batch size, which reduces the number of queries required to fetch the data. Let's say we set the batch size to 10. For the 200 posts in the above scenario, the number of queries is then reduced to 200/10 + 1 = 21. In general, the reduced number of queries with batch fetching can be expressed as:
N/B + 1, where B is the batch size.
Therefore, for a large number of records, a batch size of 100 can reduce the number of queries considerably.
@OneToMany(fetch=FetchType.LAZY)
@BatchSize(size=10)
private Set<Comment> comments; 
With the batch size set to 10, the following queries will be generated:
/* 1 query to fetch all posts */
SELECT * FROM post;

/* 20 queries to fetch all associated comments */ 
SELECT * FROM comment WHERE post_id IN (1,2,3...,10)
SELECT * FROM comment WHERE post_id IN (11,12,13...,20) 
  .                                          .  
  .                                          .  
  .                                          .  
SELECT * FROM comment WHERE post_id IN (191,192,193...,200)
Note: We may optionally set the hibernate.default_batch_fetch_size property to configure a default batch size.
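As a quick sanity check of the N/B + 1 formula (rounding up when N is not a multiple of B, since the last IN-query carries a partial batch), here is a tiny sketch; the class name is invented for illustration.

```java
// Number of queries with batch fetching: 1 for the parent table plus
// ceil(N / B) IN-queries for the associated collections.
public class BatchFetchCount {
    public static int queries(int n, int batchSize) {
        return 1 + (n + batchSize - 1) / batchSize; // integer ceiling of n / batchSize
    }

    public static void main(String[] args) {
        System.out.println(queries(200, 10));  // prints 21
        System.out.println(queries(205, 10));  // prints 22 (last batch holds only 5 ids)
    }
}
```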

 

SUBSELECT Fetching

The sub-select fetching strategy requires only two queries to fetch all the data. It is helpful if we iterate through our results and access the association of each of them; otherwise it may cause performance issues. In the above post-comments scenario, the first query fetches all post records and the second query fetches all associated comment records. Although this looks very attractive, it can still cause significant performance problems. One convincing reason to avoid this fetching strategy is a Hibernate bug reported by Gavin King, which can be found here.
The following simple example illustrates the sub-select fetching strategy:
@OneToMany(fetch=FetchType.LAZY)
@Fetch(FetchMode.SUBSELECT)
private Set<Comment> comments;

/* Query to fetch all posts */
SELECT * FROM post;

/* Query to fetch all associated comments */ 
SELECT * FROM comment WHERE post_id IN (SELECT id FROM post)


When maxResults is set for a query, the Hibernate sub-select fetching strategy ignores it, which is an existing bug in Hibernate. The following simple example illustrates the issue:
 
List posts = session.createCriteria(Post.class)
     .addOrder(Order.desc("postId"))
     .setMaxResults(10)
     .list();

/* Query to fetch all posts */
SELECT * FROM post ORDER BY id DESC LIMIT 10;

/* Query to fetch all associated comments */ 
SELECT * FROM comment WHERE post_id IN (SELECT id FROM post)

 

JOIN Fetching

In the join strategy a single query (containing joins across the tables) fetches all the data. Since the join strategy avoids the second SELECT query, it effectively disables lazy fetching for that association. With only one query, this strategy is ideal especially when the database server is running on a different machine, as it requires only one database round trip.
The following example illustrates this strategy.
@Fetch(FetchMode.JOIN)
private Set<Comment> comments;


SELECT * FROM post pt INNER JOIN comment ct ON pt.id = ct.post_id


The above query illustrates how Hibernate uses a JOIN to fetch data from the post and comment tables in one round trip.
One scenario in which we should avoid the join strategy is when it is used for more than one collection of a particular entity instance. Please refer to this Hibernate article, also quoted below, which explains why join fetching should be avoided in such cases.
With fetch="join" on a collection or single-valued association mapping, you will actually avoid the second SELECT (hence making the association or collection non-lazy), by using just one "bigger" outer (for nullable many-to-one foreign keys and collections) or inner (for not-null many-to-one foreign keys) join SELECT to get both the owning entity and the referenced entity or collection. If you use fetch="join" for more than one collection role for a particular entity instance (in "parallel"), you create a Cartesian product (also called cross join) and two (lazy or non-lazy) SELECT would probably be faster.