The Resplendent Developer

Software Development and Software Quality Assurance

Reading the Cassandra Database with Java

The company I currently work at is predominantly a Linux shop working with Java/Mysql/php/Cassandra. As I become more fluent in these technologies, having come from a strictly Windows/.Net world, I find myself tripping over a number of things that people who work with MS take for granted. The main thing I ran into recently was trying to figure out how to query the Cassandra database with Java. The main problem I ran into was a great lack of really useful or up to date documentation. This post is in hopes of helping muddle through and get started. Hopefully, someone will find this useful rather than having to “read the code” to infer functionality.

Cassandra queries are in the form of “row keys”. As a non-relational database, each row can have a different number/name of columns from the previous/next row. A row key will allow you to pull back one or more rows in a result set to work with.

In this article, we will access Cassandra with the software package called “Hector”. https://github.com/hector-client/hector
One other detail you will notice is the reference “Thrift” while working with Java/Cassandra. Thrift is the legacy API for older clients. It is recommended that going forward, projects use CQL for accessing Cassandra. We are working with Hector which will handle much of the heavy lifting for us.

My example will assume a fairly simple Composite row key and result set. I’ve been working with this type of thing recently, so I have the most familiarity with it.

Row Key

Name Type
varname1 String
varname2 String
varname3 String

Column Names

Name Type
varname1 String
varname2 String
varname3 String

First off, you will need the Hector library. You can obtain this via Maven (I eventually want to talk about the configuration nightmares I’ve had with Eclipse/Spring Tool Suite, but that is a future article).

The libraries I’ve imported are as follows

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.*;
import me.prettyprint.cassandra.service.*;
import me.prettyprint.hector.api.*;
import me.prettyprint.hector.api.beans.*;
import me.prettyprint.hector.api.factory.*;

I could probably get away with fewer imports, but I tend to over import when I prototype knowing that the optimizer SHOULD take care of most of the unused code.

The first thing I’m going to do is put together my various settings that I will need to form the query.

ConfigurableConsistencyLevel consistency = new ConfigurableConsistencyLevel();
consistency.setDefaultReadConsistencyLevel(HConsistencyLevel.valueOf("ONE"));
consistency.setDefaultWriteConsistencyLevel(HConsistencyLevel.valueOf("ONE"));

For those not familiar with Cassandra, consistency levels relate to the configuration of the Cassandra cluster. A cluster can have one or more servers that mirror each other. When a query is executed, all servers have the opportunity to answer (it’s a tad more complex than this, I don’t want to get any further into the weeds). The read consistency is “How many responses do I need that agree until I can return?”. In this case? One.

WriteConsistency is similar but it’s “How many servers have to be written to until I cam move on?”. In this case? One.

Next, we will create a basic configuration object. This is not technically necessary, but when dealing with timeouts, it’s makes life easier. Remember, the Hector default is that if you don’t set a timeout, there isn’t one.
CassandraHostConfigurator cassandraHostConfigurator = new CassandraHostConfigurator("localhost:9160");
cassandraHostConfigurator.setMaxActive(20);
cassandraHostConfigurator.setCassandraThriftSocketTimeout(5000);
cassandraHostConfigurator.setMaxWaitTimeWhenExhausted(5000);

My Cassandra database lives on my local system. If I opened up a SSH tunnel, the “localhost” would still work. Cassandra talks on port 9160.

‘setMaxActive’ method sets the maximum number of active connections to Cassandra.
‘setCassandraThriftSocketTimeout’ method takes the number, in milliseconds, to pass to java.net.Socket#setSoTimeout. (http://en.wikipedia.org/wiki/Unix_domain_socket)
‘setMaxWaitTimeWhenExhausted’ is the global timeout for the Cassandra query.

Now, lets create a cluster interface to talk to!

Cluster myCluster = HFactory.getOrCreateCluster("TestCluster", cassandraHostConfigurator);

Now, when I say “TestCluster” there? You can put anything there. It doesn’t matter. As far as I can tell, it doesn’t get used outside of this line. The second argument is the configurator object I created previously.

Next up, we point the query at the Keyspace. In Cassandra, a keyspace is like a “schema” in a SQL database.

Keyspace ksp = HFactory.createKeyspace("KeyspaceName", myCluster, consistency);

Obviously, you’d replace “KeyspaceName” with whatever keyspace you are working with. In the code I was writing, the keyspace I worked with was called “Analytics”. The second argument, is Cluster object we just created. Finally, we have the consistency settings object.

Now we can start building our query. Unlike working with SQL, you need to know exactly what your data looks like in Cassandra. For example, in MySql/SQLServer when you query, you can specify the datatype when you parse the results – e.g. a field may be a number, but you could manipulate it as a string if you so chose. In Cassandra, you cannot do this. If the value is an Int, you need to receive an Int. We start seeing this here.

MultigetSliceQuery<Composite, Composite, BigDecimal>
multigetSliceQuery = HFactory.createMultigetSliceQuery(ksp,
new CompositeSerializer(),
new CompositeSerializer(),
me.prettyprint.cassandra.serializers.BigDecimalSerializer.get());

OK, You are using the MultiGetSliceQuery class. This is the class that will actually be running the query you specify. You use pass in your keyspace and then your serializers. My row key is of “Composite” type, so that is my first argument. The second field represents the column name structure expected. The last one is the serializer to read the value of the column. You no doubt notice that I had to specifically reference the class. I ran into an issue where there was a conflict with java.math that it tried to pull instead. So, I was forced to dereference in my prototype code.

Now, I can add the column family I want to query. For SQLers, Column Family is like a table.

multigetSliceQuery.setColumnFamily("MyValueTable");

As before, use the name of the column you are querying.

Finally, we can actually start building the parameters of our query. For my cassandra-cli query, I could have a query that looks like this:

get MyValueTable["arg1:arg2:arg3"];

When you are querying Cassandra, and your row key is a Composite (that is a key that is formed by multiple arguments, it could like the above with the “:” character separating them when you type the query. In that case, my Composite row key would be built like this.

List<Composite> keys = new ArrayList<Composite>(); Composite rk = new Composite();
rk.addComponent("arg1", StringSerializer.get());
rk.addComponent("arg2", StringSerializer.get());
rk.addComponent("arg3", StringSerializer.get());
keys.add(rk);

We could actually create a large number of rows keys to query and return multiple result sets. However, we will just stick with the one and

multigetSliceQuery.setKeys(keys);

add it to our query object. Finally, we set the range we want to return. We want to pull the entire dataset, but we could choose smaller values.

Composite startRange = new Composite();
startRange.add(0, new Long(0));
Composite endRange = new Composite();
endRange.add(0, Long.MAX_VALUE);
multigetSliceQuery.setRange(startRange, endRange, false, Integer.MAX_VALUE);

We set start and end points (format Composite). The third argument indicates whether we want the data to be sent to us in reverse. If we did, we set that to true, but, we’d also reverse the startRange and endRange arguments. Note that for end range, I gave it an arbitrarily large number. I’ve stolen this idea from a coworker for the point of this demo, because it greatly simplifies things. Same for the “count” which is the last item.

We can finally run the query!

QueryResult<Rows<Composite, Composite, BigDecimal>> result = multigetSliceQuery.execute();
Rows<Composite, Composite, BigDecimal> rows = result.get();

“rows” is now a container of Rows of Columns. All that is left, is to pull the data out and show it!

for (Row<Composite, Composite, BigDecimal> row : rows)
{
ColumnSlice<Composite, BigDecimal> cs = row.getColumnSlice();
for(HColumn<Composite, BigDecimal> col : cs.getColumns())
{
BigDecimal value = col.getValue();
System.out.print(col.getName().get(0,StringSerializer.get()));
System.out.print(":");
System.out.print(col.getName().get(1,StringSerializer.get()));
System.out.print(":");
System.out.print(col.getName().get(2,StringSerializer.get()));
System.out.print(" = ");
System.out.print(col.getValue().toString());
System.out.print("\tRecorded At\t"+(new Date(col.getClock())).toString());
System.out.println();
}
}

And we turn to our handy nested, “foreach” loop. Note the structure of the Row container is consistent with the above structure.

Inside the loop, our first command is to break the Row down into a list of columns using the “getColumns” method. I pull each column, one by one, into a “HColumn” object to work with.  In this example, my columns have a Composite name. It’s similar to the row key shown above, 3 string variables. I grab the value and print it out.

All in all, it’s pretty easy. The only difficult part is the uphill battle of figuring out what/why you are supposed to do.

Recommended Reading:
http://www.amazon.com/Cassandra-Definitive-Guide-Eben-Hewitt-ebook/dp/B004FGMTZY/
http://www.datastax.com/sites/default/files/hector-v2-client-doc.pdf
http://hector-client.github.io/hector/source/content/API/core/1.0-1/index.html
http://wiki.apache.org/cassandra/GettingStarted

Comments are closed.