Apache Solr is a standard search engine suitable for Web applications. To handle high traffic it has two built-in in-memory caches, one for search results and one for documents. They are typically limited to several thousand entries each. A higher limit would increase memory consumption, with the well-known effects.
The Europeana portal offers search over 16 million documents, and its users issue several hundred thousand different queries a week. It has to sustain traffic of one million page downloads per day. The built-in caches are not efficient at this scale, so we looked for a persistent caching solution able to hold millions of entries.
File-based caching of web requests is common practice. Loading a file takes much less time than a response needs to travel over the network and be rendered as an HTML page. Typically, a web server and the network take 100 ms to serve a page, and the browser then takes a whole second to render it.
Why not use a memory cache? First, the cache tends to grow to hundreds of gigabytes. Second, we are effectively caching HTTP responses, where a response time of 100 ms is more than adequate.
We looked at two leading Java caching solutions: OSCache and Ehcache.
We wanted a cache that stores each cached document in a separate file. This was the default OSCache behaviour, while Ehcache wrote to a single large file. That made us choose OSCache.
We download solr.war from the standard Solr distribution. Then we add the OSCache jar and an oscache.properties file to it.
Well, in practice, we unzipped it, created a Maven project, added oscache.properties to main/resources, pulled in oscache-2.4.1.jar as a Maven dependency, and used the standard Maven install to build a war file, to the same effect.
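As an illustration of the Maven step, the dependency entry in pom.xml might look like the sketch below. The groupId and artifactId shown are an assumption based on how OSCache 2.4.1 is commonly published; check them against your own repository before relying on them.

```xml
<!-- Sketch only: the coordinates are assumed, not taken from the original project -->
<dependency>
  <groupId>opensymphony</groupId>
  <artifactId>oscache</artifactId>
  <version>2.4.1</version>
</dependency>
```

If the build pulls in transitive dependencies you do not need, they can be excluded in the same dependency entry.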
We create a configuration (oscache.properties) with no in-memory cache, but with a disk cache limited to 200,000 entries. OSCache is designed in such a way that you cannot have both an in-memory and a persistent cache; you have to choose one of them.
cache.persistence.class=com.opensymphony.oscache.plugins.diskpersistence.HashDiskPersistenceListener
cache.path=/path-to-cache-files
cache.memory=false
cache.capacity=200000
In the standard Solr web.xml we add a caching filter at the top:
<web-app>
  <filter>
    <filter-name>SimplePageCachingFilter</filter-name>
    <filter-class>com.opensymphony.oscache.web.filter.CacheFilter</filter-class>
    <init-param>
      <param-name>disableCacheOnMethods</param-name>
      <param-value>POST,PUT,DELETE</param-value>
    </init-param>
    <init-param>
      <param-name>time</param-name>
      <param-value>-1</param-value>
    </init-param>
    <init-param>
      <param-name>expires</param-name>
      <param-value>off</param-value>
    </init-param>
    <init-param>
      <param-name>lastModified</param-name>
      <param-value>off</param-value>
    </init-param>
    <init-param>
      <param-name>max-age</param-name>
      <param-value>0</param-value>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>SimplePageCachingFilter</filter-name>
    <url-pattern>/select/*</url-pattern>
  </filter-mapping>
  ...
This sets up a never-expiring cache that answers Solr select requests and is bypassed for all other requests, such as update.
Solr is configured in the file solrconfig.xml. By default, it is not friendly to external caches: it tends to assume that the requester has cached the response and answers with 304 (Not Modified).
We need to stop that, to ensure that Solr returns a complete response to each request. Uncomment this:
<httpCaching never304="true">
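For orientation, here is a sketch of where this setting sits in solrconfig.xml. The enclosing requestDispatcher element and its handleSelect attribute follow the stock example configuration and may differ in your Solr version, so treat them as assumptions. If the file also contains another active httpCaching element (for example one with lastModifiedFrom and etagSeed attributes), it would presumably have to be commented out so that only one remains.

```xml
<!-- Sketch only: surrounding element and attributes may differ by Solr version -->
<requestDispatcher handleSelect="true">
  <!-- never304="true": always send the full response, never 304 Not Modified -->
  <httpCaching never304="true"/>
</requestDispatcher>
```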
On each request the Solr web application first checks the cache, and either serves the response from the cache, or runs a Solr query and creates a new cache file with the response.
Creating cache files reduces performance a lot while the cache is still empty. In the Europeana case, a clean cache would keep the server fully busy.
To work around this, we pre-populate the cache on a non-production machine, and then copy it over to the production servers.
A persistent cache gives a tremendous performance boost. The vast majority of the million page downloads are served from the cache within 100 ms each. Since these requests never reach Solr, its resources are freed for new, uncached queries.
Please contact the author if you have questions or comments.
(c) 2011, Borys Omelayenko, 