I wanted to go through the exercise of contributing to open source with a project of my own. After thinking about it for probably 15 minutes, I decided I wanted to try to build my own caching system in Java. Too bad I knew next to nothing about caching. I went off and did some research.
There are certain well-known algorithms that have become popular for implementing caches. Given that a cache has a finite size (eventually you run out of space or memory), cache algorithms are used to manage it: they determine things like how long an item remains in the cache and what gets booted out when the cache reaches its maximum size. According to Wikipedia, the most efficient caching algorithm “would be to always discard the information that will not be needed for the longest time in the future”. You need to take a look at the data you want to cache before deciding on a caching strategy. Do you need to support random access (access to the data is uniformly distributed) or sequential access (you’re interested in large chunks of data at a time)? Is certain data accessed more often than other data?
Here are a few common algorithms:
- Least Recently Used (LRU) – the items that haven’t been accessed for the longest time get the boot first. This is implemented by tracking when each item was last accessed, either with a timestamp or with an access-ordered structure that makes explicit timestamps unnecessary. Check out the simple LRU sketch after this list.
- Least Frequently Used (LFU) – the items that are sitting in the cache but have been accessed the least get booted out first. This is implemented with a counter that tracks how often each item is accessed.
- First In First Out (FIFO) – the item that entered the cache first is the first to go when the cache fills up. This can easily be implemented with a queue.
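To make LRU concrete, here is a minimal sketch in Java (my own illustration, not taken from any of the projects mentioned below). It leans on LinkedHashMap’s access-order mode, which tracks recency with an internal linked list rather than explicit timestamps:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: with accessOrder=true, LinkedHashMap keeps entries
// ordered from least to most recently accessed, so the eldest entry is
// always the least recently used one.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // third argument: order by access, not insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the LRU entry once we're over capacity
    }
}
```

Flipping the constructor’s accessOrder flag to false turns the same class into a FIFO cache, since eviction then follows insertion order instead.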
Of course, there are projects like EHCache and OSCache out there that have addressed this issue.
OSCache provides FIFO and LRU implementations of a cache. In addition to FIFO and LRU, EHCache also provides an LFU implementation.
Thinking about how these algorithms work, it is easy to see that there are cases where one has a clear advantage over the others. Take LRU, which seems to be the most widely accepted and most used caching algorithm: it works great when the majority of accesses go to a small, concentrated group of items, because most hits, if not all, are then served from the cache. However, a large scan over all the data defeats it. Once the cache reaches its maximum size, LRU evicts an item on every access. If the cache holds a maximum of 50 items and you have 100 records, then as you iterate over the 100 records, the cache evicts the first 50 records to make room for the second 50, resulting in constant adding and removing and zero cache hits. Algorithms that prevent this, like LFU, are known as scan-resistant.
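To see this in action, here is a small simulation of the example above: a 50-entry LRU cache scanned twice over 100 records gets zero hits, because each pass evicts exactly the entries the next pass needs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ScanDemo {
    public static void main(String[] args) {
        final int capacity = 50;
        // Access-ordered LinkedHashMap that evicts its eldest entry past 50 entries (LRU).
        Map<Integer, Integer> cache = new LinkedHashMap<Integer, Integer>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
                return size() > capacity;
            }
        };
        int hits = 0;
        for (int pass = 0; pass < 2; pass++) {    // two full scans...
            for (int key = 0; key < 100; key++) { // ...over 100 records
                if (cache.get(key) != null) {
                    hits++;
                } else {
                    cache.put(key, key);          // miss: "load" the record
                }
            }
        }
        // Prints "hits = 0": by the time we loop back to record 0,
        // it has already been evicted to make room for records 50-99.
        System.out.println("hits = " + hits);
    }
}
```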
I was interested in finding out whether there was some middle ground that gave me the best of both worlds, LRU and LFU. It turns out there is.
The algorithm is known as Adaptive Replacement Cache (ARC). It gives you the benefits of LRU while also doing a balancing act to prevent data scans from polluting the cache. It does this by keeping track of two lists: one for recently referenced items and another for frequently referenced items. If you read about it, it’s a pretty cool algorithm.
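A faithful ARC implementation also keeps two “ghost” lists of recently evicted keys and adaptively tunes how much of the cache each list gets, so the following is only my own rough sketch of the two-list idea, not ARC itself. Items seen once live in t1, items seen at least twice live in t2, and eviction hits t1 first, so a scan churns t1 while frequently used items survive in t2:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A rough two-list sketch of the idea behind ARC (NOT the full algorithm):
// t1 holds items seen once recently, t2 holds items seen at least twice.
public class TwoListCache<K, V> {
    private final int capacity;
    // Access-ordered maps, so each map's eldest entry is its LRU candidate.
    private final LinkedHashMap<K, V> t1 = new LinkedHashMap<>(16, 0.75f, true);
    private final LinkedHashMap<K, V> t2 = new LinkedHashMap<>(16, 0.75f, true);

    public TwoListCache(int capacity) {
        this.capacity = capacity;
    }

    public V get(K key) {
        V value = t2.get(key);               // hit in t2 just refreshes its position
        if (value != null) return value;
        value = t1.remove(key);
        if (value != null) t2.put(key, value); // second touch: promote to the "frequent" list
        return value;
    }

    public void put(K key, V value) {
        if (t2.containsKey(key)) { t2.put(key, value); return; }
        t1.put(key, value);
        // Real ARC adapts the split between the lists; here we simply evict
        // from the once-seen list (t1) first, falling back to t2.
        while (t1.size() + t2.size() > capacity) {
            Map<K, V> victimList = t1.isEmpty() ? t2 : t1;
            K eldest = victimList.keySet().iterator().next();
            victimList.remove(eldest);
        }
    }
}
```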
I was excited when I came across this algorithm because I thought it would make such a fine addition as an open source project. And then I discovered it was patented. Apparently, PostgreSQL already went through this exercise and deemed it safer not to use it.
So, now I’m thinking I need a new idea for a project.
Shameless plug, but if you care about this kind of stuff, maybe you’d like distributed computing?
At Terracotta we have a ton of “opportunities” just waiting to happen – how about clustering JCS? Or writing a distributed Master/Worker pattern?
Check out the possibilities on our Project Ideas page.
Maybe you can come up with something we haven’t thought of yet?
Thanks for the link. While I’ve only briefly read about Terracotta, I’ll be sure to take a closer look at the project.
Thanks, nice post! I worked on implementing a couple of eviction policies for ehcache. Now I’d be interested in trying out ARC and seeing how it compares with the other policies.
In your LRU example, there’s no timestamp used in the implementation… so, why do you say that we need a timestamp?
The Indexed Disk Cache in JCS is a fast, reliable, and highly configurable disk swap for cached data.