
Performance of Apache Solr can be boosted with persistent caching

Apache Solr is a standard search engine suitable for Web applications. To handle high traffic it has two built-in in-memory caches, one for search results and one for documents. They are typically limited to several thousand entries each. A higher limit would increase memory consumption, with the well-known effects on heap usage and garbage collection.

The Europeana portal offers search over 16 million documents, and its users produce several hundred thousand different queries a week. It has to sustain traffic of one million page downloads per day. The built-in caches are not efficient at this scale, so we looked for a persistent caching solution able to hold millions of entries.

File-based caching

File-based caching of web requests is common practice. Loading a file takes much less time than a response needs to travel over the network and be rendered as an HTML page. Typically, it takes 100 ms for a web server and a network to serve a page, and then a whole second for a browser to render it.

Why not use a memory cache? First, the cache tends to grow to hundreds of gigabytes. Second, we are effectively caching HTTP responses, for which a response time of 100 ms is more than adequate.

Open source caching solutions

We looked at two leading Java caching solutions:

  • Ehcache, which is functional and still being improved, and
  • OSCache, which is simple, robust, and frozen in 2007.

We wanted a cache that stores each cached document in a separate file. This is the default OSCache behaviour, while Ehcache writes to a single large file. That made us choose OSCache.

Repackaging Solr with OSCache

We download solr.war from the standard Solr distribution. Then we

  • unzip it,
  • add oscache.jar to WEB-INF/lib,
  • add oscache.properties to WEB-INF/classes,
  • modify web.xml,
  • modify solrconfig.xml,
  • zip it again.

Well, in practice, we unzipped it, created a maven project, added oscache.properties to main/resources, pulled oscache-2.4.1.jar as a maven dependency, and used the standard maven install to create a war file, to the same effect.

oscache.properties

We create a configuration with no in-memory cache, but with a disk cache limited to 200,000 entries. OSCache is designed in such a way that you cannot have both an in-memory and a persistent cache; you should choose one of them.

cache.persistence.class=com.opensymphony.oscache.plugins.diskpersistence.HashDiskPersistenceListener
cache.path=/path-to-cache-files
cache.memory=false
cache.capacity=200000
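A small check can catch a mistyped property before the war file is rebuilt. The sketch below is only an illustration: the class OscacheConfigCheck and its methods are hypothetical names, not part of OSCache, and it inspects only the four properties listed above using java.util.Properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of OSCache: sanity-checks the settings shown above.
class OscacheConfigCheck {

    // Parses the text of an oscache.properties file.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // True when the memory cache is switched off and a disk persistence
    // class and a cache path are both configured.
    static boolean isDiskOnly(Properties props) {
        return "false".equals(props.getProperty("cache.memory", "").trim())
                && props.getProperty("cache.persistence.class") != null
                && props.getProperty("cache.path") != null;
    }

    // Returns the configured capacity, or -1 when the property is absent.
    static int capacity(Properties props) {
        return Integer.parseInt(props.getProperty("cache.capacity", "-1").trim());
    }
}
```

Calling isDiskOnly on the parsed file before packaging is enough to notice a missing cache.path or a cache.memory line that was left at true.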

web.xml

In the standard Solr web.xml we add a caching filter at the top:

<web-app>
  <filter>
    <filter-name>SimplePageCachingFilter</filter-name>
    <filter-class>com.opensymphony.oscache.web.filter.CacheFilter</filter-class>
    <init-param>
      <param-name>disableCacheOnMethods</param-name>
      <param-value>POST,PUT,DELETE</param-value>
    </init-param>
    <init-param>
      <param-name>time</param-name>
      <param-value>-1</param-value>
    </init-param>
    <init-param>
      <param-name>expires</param-name>
      <param-value>off</param-value>
    </init-param>
    <init-param>
      <param-name>lastModified</param-name>
      <param-value>off</param-value>
    </init-param>
    <init-param>
      <param-name>max-age</param-name>
      <param-value>0</param-value>
    </init-param>
  </filter>

  <filter-mapping>
    <filter-name>SimplePageCachingFilter</filter-name>
    <url-pattern>/select/*</url-pattern>
  </filter-mapping>
   
   ...

This sets up a never-expiring cache that handles Solr select requests and is bypassed for all other requests, such as update.
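The effect of the filter mapping and of disableCacheOnMethods can be summarised as a simple decision. The sketch below is an illustration of that decision only: SelectCachePolicy and isCacheable are hypothetical names, and the code is not taken from OSCache, whose CacheFilter makes this choice internally.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper mirroring the web.xml above; it is not part of OSCache.
class SelectCachePolicy {

    // Same methods as the disableCacheOnMethods init-param.
    private static final Set<String> DISABLED_METHODS =
            new HashSet<>(Arrays.asList("POST", "PUT", "DELETE"));

    // True when a request would pass through the cache: its path falls under
    // the url-pattern /select/* and its HTTP method is not disabled.
    static boolean isCacheable(String method, String servletPath) {
        boolean mapped = servletPath.equals("/select") || servletPath.startsWith("/select/");
        return mapped && !DISABLED_METHODS.contains(method.toUpperCase(Locale.ROOT));
    }
}
```

With this policy a GET on /select is cacheable, while a POST on /select or any request to /update is not.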

solrconfig.xml

Solr is configured in the file solrconfig.xml. By default, it is not friendly to external caches: it tends to assume that the requester has already cached the response and answers with 304 (Not Modified) instead of the full content.

We need to stop that to ensure that Solr responds with a complete response on each request. Uncomment this:

  <httpCaching never304="true"> 

Operation

On each request the filter first checks the cache, and either serves the response from the cache, or runs the Solr query and creates a new cache file with the response.
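The lookup described above is a plain get-or-compute over files. The following is a minimal sketch of that idea, assuming one file per cached response and a SHA-256 hash of the request key as the file name; FilePerEntryCache is a hypothetical class, and the file naming and layout that OSCache actually uses may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Function;

// Hypothetical illustration only; OSCache's own disk layout and key hashing may differ.
class FilePerEntryCache {
    private final Path dir;

    FilePerEntryCache(Path dir) {
        this.dir = dir;
    }

    // Maps a request key (for example the query string) to a file name via SHA-256.
    static String fileNameFor(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, digest)) + ".cache";
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every Java platform
        }
    }

    // Returns the cached response if its file exists; otherwise calls the
    // backend (standing in for Solr) and stores the response in a new file.
    String get(String key, Function<String, String> backend) {
        Path file = dir.resolve(fileNameFor(key));
        try {
            if (Files.exists(file)) {
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            }
            String response = backend.apply(key);
            Files.createDirectories(dir);
            Files.write(file, response.getBytes(StandardCharsets.UTF_8));
            return response;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The second call with the same key never reaches the backend, which is the property that takes load off Solr. The sketch has no eviction and no locking, both of which a real cache needs.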

Filling an empty cache is expensive, because every first request still has to go through Solr. In the Europeana case a clean cache would keep Solr fully busy:

  • for several hours to populate it with 200,000 queries, and
  • for several months to populate it with 16 million pages.

To work around it, we pre-populate the cache on a non-production machine, and then copy it over to production servers.
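Pre-population amounts to replaying known queries against the caching Solr instance so that each response lands in a cache file. The sketch below assumes a text file with one query per line and a plain q parameter; CacheWarmer, its methods, and the command-line arguments are hypothetical, and real portal requests would likely carry more parameters than shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical warm-up client; names and arguments are assumptions for illustration.
class CacheWarmer {

    // Builds a select URL for one query, URL-encoding the query text.
    static String selectUrl(String solrBase, String query) {
        try {
            return solrBase + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Issues a GET and drains the body so that the server completes the response.
    static int fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();
        try (InputStream in = status < 400 ? conn.getInputStream() : conn.getErrorStream()) {
            byte[] buf = new byte[8192];
            while (in != null && in.read(buf) != -1) {
                // discard the body; only the server-side cache file matters
            }
        }
        return status;
    }

    // args[0]: Solr base URL, args[1]: file with one query per line.
    public static void main(String[] args) throws IOException {
        String solrBase = args[0];
        List<String> queries = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
        for (String q : queries) {
            if (!q.trim().isEmpty()) {
                fetch(selectUrl(solrBase, q.trim()));
            }
        }
    }
}
```

Run on the non-production machine, a loop like this fills the directory named in cache.path, which can then be copied to the production servers.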

Final performance

The persistent cache gives a tremendous performance boost. The vast majority of the million daily page downloads are served from the cache within 100 ms each. Because they are not served by Solr, Solr's resources are freed for new, uncached queries.

Contact

Please contact the author if you have questions or comments.

(c) 2011, Borys Omelayenko,