Perform a full table scan with Apache Cassandra
16 Feb 2017In this post I’ll talk about a technique for performing the equivalent of the following query with Apache Cassandra:
select * from mytable;
Now, in general, this is a bad idea. Apache Cassandra is an amazing data store, allowing you to persist billions or trillions of rows in a single table, all while still guaranteeing constant* time performance. But if you try to execute this query blindly it generally won’t work; the command may never return, and likely, crush your cluster in the interim.
If you need to scan through a large dataset like this, you should consider using something like Apache Spark. Spark has tight integration with Cassandra and can be deployed alongside your Cassandra nodes for efficiency and performance.
In this post, I’m not going to try and duplicate what you can do with Spark. This post will however, provide you a mixin interface you can use with just the DataStax Java Driver to scan the full table in a fairly sensible way that will complete.
Technique
This post describes the use of the token function:
Scanning the entire cassandra column family with CQL
The token function allows us to interact with the partitioning done by Cassandra. To satisfy our goal of observing every row, we can perform a series of limited sub-queries by token ranges.
These sub-queries look like:
select id/*primary key*/, ..., token(id)
from mytable
where token(id) >= -9223372036854775808
limit 10000;
We take the token(id)
value from the last row in the result set and run the query again, using that value + 1
, until we get no more results.
The results will always be returned in ascending order by token - that’s just how Cassandra’s partitioning works.
Each of these sub-queries then can (most often) get be satisfied from a single partition/node.
Code
The code for the mixin can be found at:
There are 3 methods you will have to implement with this interface:
/**
*
* @return the name of the table (column family) to query, must not be null
*/
String table();
/**
*
* @return a list containing the names of the columns that compose the partition key; must not be empty
*/
List<String> partitionKeys();
/**
*
* @param row a single row in the result set
* @return a corresponding object that can be hydrated from the row, must not be null
*/
T mapRow(Row row);
Additional behavior:
- If your mapRow method needs to inspect additional columns other than what are provided by partitionKeys, override
List<String> columns()
. - The tableScan method will default to using the current keyspace for the provided
Session
. You can overrideString keyspace()
to change this behavior. - The tableScan will limit the token sub-queries to 10,000 rows by default; override
int limit()
to change this. Use with caution. - The default
CONSISTENCY
for the query isONE
, which is the lowest consistency but highest availability. Trade off for performance; trying to do this with high consistency with large data sets is probably a worse idea, use with caution.
The containing module for this class contains a reference use case and an integration test to experiment.
References
Other references:
Addendum
* Inserts yes; Selects yes if you include the partition key in your where clause.