Time Series Data and Druid

Imagine we decide to add an electronic chip to every vehicle in our city. We would set up drones or receiver towers that continuously receive messages from these chips. Each chip would continuously send data about its vehicle, such as speed, geo-coordinates, and so on. Considering the number of vehicles we have on the road, imagine the amount of data that would be flowing to these drones and receivers. It is going to be humongous!

Why do I need this data?

This data is very useful for answering several time-based questions:

  • How many vehicles flow from point A to point B in a particular time period?
  • How many vehicles are in a particular area of the city at a particular time?
  • Retention: how many vehicles of the same type travel in or to a particular area?
  • Which traffic signals halt vehicles the longest, and at what time of day?
  • Which are the busiest times of day, or the busiest days of the week, in terms of the number of vehicles on the roads?

If we observe closely, all these questions have one factor in common: time.

This is where time series databases (TSDBs) come in: systems built for handling data indexed on time.

In the example we just discussed, gigabytes or terabytes of data are generated every day. The generated data has the following things in common:

  • The data arrives with an event time
  • The data points are ordered in time; time is the main axis

Unlike conventional database systems, TSDBs are predominantly "insert only" and focus on tracking how data points fluctuate over the course of time. This gives more flexibility for analysing the data.

One such TSDB is Druid. The tool is as interesting as its name. As Druid's official documentation says, Druid combines the principles of analytics databases, time series databases, and search systems.

In our vehicle tracking system, we need to store the flow of incoming events from various vehicles. There is a huge volume of inserts to handle. We also need to produce analytical reports, which requires aggregation and quick query responses.

How is Druid useful here, and how does it store the data?

One of the important factors for timestamped data is how it is stored: the more intelligently the data is stored, the more efficient the access. Druid stores data in datasources, which are equivalent to tables in RDBMS systems.

At a high level, Druid stores the data based on the schema provided for each datasource. A Druid schema needs information on three important column families (a minimal spec is sketched after the list):

  • Dimension columns: These columns hold the values we want to filter or group by. Their data type is string (or arrays of strings).
  • Metric columns: These are the fields that can be aggregated. They are often stored as numbers (integers or floats).
  • Timestamp column: The event time. The data is partitioned on this column.
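
Here is a minimal sketch of how these three column families map onto the dataSchema of a Druid ingestion spec, written as a Python dict. The field names (timestampSpec, dimensionsSpec, metricsSpec) come from Druid's ingestion spec format, while the datasource and column names (vehicle_events, vehicleid, event-type, speed) are assumptions taken from our example:

```python
# Sketch of a Druid dataSchema for the vehicle tracking example.
# The datasource and column names are hypothetical.
data_schema = {
    "dataSource": "vehicle_events",
    # Timestamp column: the event time; data is partitioned on it.
    "timestampSpec": {"column": "timestamp", "format": "iso"},
    # Dimension columns: string values we filter or group by.
    "dimensionsSpec": {"dimensions": ["vehicleid", "event-type"]},
    # Metric columns: numeric fields that can be aggregated.
    "metricsSpec": [
        {"type": "count", "name": "count"},
        {"type": "doubleSum", "name": "total_speed", "fieldName": "speed"},
    ],
}
```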

Druid partitions the data on the event timestamp. Druid divides the data into time chunks, and further divides each chunk into segments. While designing your data schema, you can define the segment granularity. Suppose you define it to be 'hour': Druid will store all the incoming data within an hour in a single segment. When the hour changes, Druid will create another segment and start storing the incoming data in the newly created segment.
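
Segment granularity lives in the granularitySpec of the same ingestion spec. A minimal sketch, using the value from our example:

```python
# 'hour' means all events whose timestamps fall within the same hour
# of event time end up in the same segment.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "hour",
}
```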

You can imagine chunks and segments like:

[Figure: a time chunk divided into hourly segments]

Here is how our vehicle traffic system's datasource would look:

[Figure: sample rows of the vehicle traffic datasource]

Querying the data:

This is where segments play an important role: when querying for data, Druid also requires the time interval it should look at.

Suppose we want to know how many vehicles were on the street on 27th July 2018 between 17:00 and 21:00. Druid will look for the chunk designated for 27th July and search for the data in four segments (assuming we have a segment granularity of 'hour'):

Segment 1 (17:00 – 18:00)

Segment 2 (18:00 – 19:00)

Segment 3 (19:00 – 20:00) and

Segment 4 (20:00 – 21:00).

Druid now has to look for data only in these four segments. The fewer segments Druid has to search, the faster the query response will be.
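
That four-hour window maps directly onto the 'intervals' field of a native Druid query, which is what lets Druid prune segments. A sketch of posting such a query to the broker with Python (the broker URL, the datasource name vehicle_events, and the metric column count are assumptions from our example):

```python
import json

import requests  # third-party HTTP client: pip install requests

BROKER_URL = "http://localhost:8082/druid/v2"  # hypothetical broker address

query = {
    "queryType": "timeseries",
    "dataSource": "vehicle_events",
    "granularity": "hour",
    # Druid prunes segments using this interval: only the four hourly
    # segments of 27th July between 17:00 and 21:00 are scanned.
    "intervals": ["2018-07-27T17:00:00/2018-07-27T21:00:00"],
    "aggregations": [
        {"type": "longSum", "name": "vehicles", "fieldName": "count"}
    ],
}

response = requests.post(BROKER_URL, json=query)
print(json.dumps(response.json(), indent=2))
```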

Suppose you want the data for the entire week from 23rd July to 29th July 2018 with a day-wise distribution. Druid provides the 'timeseries' query, which returns data against each day, as follows:

[Figure: day-wise timeseries query results for 23rd–29th July 2018]
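
The day-wise query differs from the previous one only in its granularity and interval. A sketch (the response shape follows Druid's documented timeseries format; the actual numbers depend on the ingested data):

```python
weekly_query = {
    "queryType": "timeseries",
    "dataSource": "vehicle_events",          # hypothetical datasource name
    "granularity": "day",                    # one result entry per day
    "intervals": ["2018-07-23/2018-07-30"],  # end of the interval is exclusive
    "aggregations": [
        {"type": "longSum", "name": "vehicles", "fieldName": "count"}
    ],
}

# The broker answers with one entry per day, shaped like:
# [
#   {"timestamp": "2018-07-23T00:00:00.000Z", "result": {"vehicles": ...}},
#   {"timestamp": "2018-07-24T00:00:00.000Z", "result": {"vehicles": ...}},
#   ...
# ]
```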

The vehicle counts in the figure are in thousands. We can then plot the data on a timeline graph, as follows:

[Figure: timeline graph of daily vehicle counts for the week]
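
A sketch of such a plot with matplotlib, using made-up day-wise counts purely for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical day-wise counts (in thousands), standing in for values
# parsed from the timeseries response above.
days = ["23 Jul", "24 Jul", "25 Jul", "26 Jul", "27 Jul", "28 Jul", "29 Jul"]
vehicles = [310, 295, 320, 305, 340, 280, 260]

plt.plot(days, vehicles, marker="o")
plt.xlabel("Day (July 2018)")
plt.ylabel("Vehicles (thousands)")
plt.title("Vehicles on the road per day")
plt.show()
```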

With the historical data gathered in the TSDB, one can study how data points fluctuate over a time period and come up with solutions for heavy traffic. Another possibility is to predict the number of vehicles expected next week by analysing the last 'n' weeks of data.

One of the most important features of Druid is 'roll-up'. It pre-aggregates data based on a selected set of columns, which reduces the size of a segment significantly.

Suppose the following sample events are generated for our traffic tracking system:

[Table: sample raw events]

We have a metric column 'count'. By default, Druid will add up all the events that come in. Druid allows you to set a 'queryGranularity', where you can define the granularity as 'minute', 'hour', 'day', 'week', and so on.
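
In the ingestion spec, this corresponds to the queryGranularity and rollup fields of the granularitySpec. A sketch extending the earlier one:

```python
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "hour",  # one segment per hour, as before
    "queryGranularity": "minute",  # truncate event timestamps to the minute
    "rollup": True,                # pre-aggregate rows at ingestion time
}
```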

Suppose in our example we set queryGranularity = 'minute'; the resultant dataset would be:

[Table: events rolled up to minute granularity]

If we observe closely, Druid has now reduced the size of the data set and increased the "count" values.

The input rows have been grouped by the timestamp and dimension columns {timestamp, vehicleid, event-type}, with a sum aggregation on the metric column "count".
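
The effect is the same as a group-by-and-sum over those columns. A minimal sketch in pandas with made-up rows (timestamps are already truncated to the minute, which is what queryGranularity = 'minute' does at ingestion time):

```python
import pandas as pd

# Hypothetical raw events; 'count' is 1 per event before roll-up.
events = pd.DataFrame([
    {"timestamp": "2018-07-27T17:00", "vehicleid": "V1", "event-type": "moving", "count": 1},
    {"timestamp": "2018-07-27T17:00", "vehicleid": "V1", "event-type": "moving", "count": 1},
    {"timestamp": "2018-07-27T17:00", "vehicleid": "V2", "event-type": "halted", "count": 1},
    {"timestamp": "2018-07-27T17:01", "vehicleid": "V1", "event-type": "moving", "count": 1},
])

# Roll-up: group by timestamp + dimensions, sum the metric column.
rolled_up = (
    events.groupby(["timestamp", "vehicleid", "event-type"], as_index=False)["count"]
    .sum()
)
print(rolled_up)  # four input rows collapse to three; the duplicate gets count 2
```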

In a real-life dataset this reduces the data size significantly.

If you have timestamp-based data, want to generate reports from historical data, and want to analyse it and predict outcomes over time, then Druid is an excellent tool. In the end, the schema design, column types, segment granularity, and query granularity all play a very important role.

Time series data accumulates at a high pace, and a conventional RDBMS is not designed to handle such large amounts of data. You can always store the same data and achieve the same effect in an RDBMS, but at the cost of scalability and efficiency. The possibilities are endless with TSDBs. One has to check the requirements and see whether this is the right type of database.