I want to build a system similar to Google Analytics (for internal use only, with less traffic and fewer features), focused mainly on:
- Real-time counting of unique URI visits/PVs by different dimensions of user demographic information, e.g. User Agent, OS, country, etc.
- Real-time calculation of average user session length (where two requests from the same IP belong to the same session if they are less than 1 minute apart)
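For example, I'd expect the session rule to behave roughly like this (a sketch in Python, with made-up data; the exact grouping is up for discussion):

```python
from datetime import datetime, timedelta

# Requests as (ip, timestamp) pairs; a new session starts whenever the gap
# between two consecutive requests from the same IP exceeds one minute.
requests = [
    ("1.2.3.4", datetime(2012, 1, 1, 10, 0, 0)),
    ("1.2.3.4", datetime(2012, 1, 1, 10, 0, 30)),
    ("1.2.3.4", datetime(2012, 1, 1, 10, 5, 0)),   # >1 min gap: new session
    ("5.6.7.8", datetime(2012, 1, 1, 10, 0, 10)),
]

def session_lengths(reqs, gap=timedelta(minutes=1)):
    by_ip = {}
    for ip, ts in sorted(reqs, key=lambda r: (r[0], r[1])):
        by_ip.setdefault(ip, []).append(ts)
    lengths = []
    for ip, times in by_ip.items():
        start = prev = times[0]
        for ts in times[1:]:
            if ts - prev > gap:          # gap too large: close the session
                lengths.append(prev - start)
                start = ts
            prev = ts
        lengths.append(prev - start)     # close the final session
    return lengths

lengths = session_lengths(requests)
avg = sum(lengths, timedelta()) / len(lengths)
```

The data above yields three sessions (two for the first IP, one for the second), and the average is taken over all of them.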
Are there any good database stores that enable this kind of query in real time?
p.s. I am currently testing InfiniDB.
There is a trick to building fast realtime analytics, regardless of the platform. I’ve done this with Microsoft Analysis Services, but you can use similar techniques with other platforms as well.
The trick is to have a leading partition that can be populated with near-realtime data and a historical partition (or partitions) that are optimised for fast queries. If you keep the leading partition small enough it will be quick to query as well.
To manage this, your ETL processing populates the leading partition and you build a supplementary process that periodically converts the partitions to the fast query optimised format. The exact nature of this process will vary with your platform.
On MS Analysis Services the leading partition is done as a ROLAP partition that reads directly off the table. Trailing partitions are converted to MOLAP with aggregates. Other OLAP systems will work similarly. On Oracle you can create bitmap indexes and materialised view partitions on your trailing partitions to speed up queries. Some other systems have this type of feature as well, although I’m not aware of MySQL supporting it.
At a guess, I’d say the cheapest mainstream platform that would do this is MS Analysis Services, which is only available bundled with SQL Server and cannot be purchased separately. For the partitioning with 2008 R2 you will need Enterprise Edition of SQL Server, which runs to about £22,000 per CPU socket in the UK and a bit less on the other side of the pond. Microsoft are shipping a new ‘Business Intelligence’ edition of SQL Server with 2012. Once this hits RTM, the B.I. edition will support partitioned cubes and will be substantially cheaper than Enterprise Edition. Depending on your budget and time constraints you may be able to use that instead.
Another aspect of the problem you will have to tackle is change data capture (CDC) – efficiently identifying and pushing new or changed data rows into the ETL process. Most DBMS vendors’ CDC features only work with their own databases, so if you want a cross-platform CDC solution you may have to go to a third-party application or use triggers on the source.
Various third parties sell CDC applications that will work across database platforms. A list of CDC products can be seen on the Wikipedia entry on the subject. Note that you may still have integration issues. For example, IBM InfoSphere CDC can only trigger external processes on a per-row basis, rather than per batch, which could cause efficiency problems on large data volumes.
You can create a set of triggers on the source tables that push out the changes into a staging area. This would require you to have sufficient access to the source database to do this, so it may not be an option on vendor-supported databases.
If the data is from a file (for example a web server log) you would have to write a client-side process that monitors the tail of the files for new records.
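A minimal sketch of such a tail-follower (assuming a plain-text log file; the path in the usage comment is made up, and a production version would also need to handle log rotation):

```python
import time

def follow(path, poll_interval=1.0, from_start=False):
    """Yield lines appended to a log file, like `tail -f`."""
    with open(path) as f:
        if not from_start:
            f.seek(0, 2)             # skip existing content, start at the end
        while True:
            line = f.readline()
            if not line:
                time.sleep(poll_interval)   # nothing new yet; wait and retry
                continue
            yield line.rstrip("\n")

# Usage (runs forever, feeding each record to the ETL process):
# for record in follow("/var/log/nginx/access.log"):
#     process(record)
```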
It is quite likely you will end up having to implement a pull process that polls the data sources. In this case you have to work out the tolerable latency and make your change detection efficient enough that the process can run at that frequency. There is an old saying, sometimes found in embedded systems circles, to the effect of: ‘You know they’re getting serious about reliability when they start polling.’
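One common way to make such polling cheap is a high-water mark: each poll fetches only rows past the last key seen, so the cost per poll is proportional to the new data rather than the whole table. A sketch, with the table and column names invented:

```python
# A stand-in for the source table; in practice this would be a SQL query
# like "SELECT ... WHERE id > :last_seen_id ORDER BY id".
source_table = [
    {"id": 1, "uri": "/home"},
    {"id": 2, "uri": "/about"},
]
last_seen_id = 0

def poll_once():
    """One polling cycle: pull only the rows past the high-water mark."""
    global last_seen_id
    new_rows = [r for r in source_table if r["id"] > last_seen_id]
    if new_rows:
        last_seen_id = max(r["id"] for r in new_rows)
    return new_rows

first = poll_once()                                # picks up existing rows
source_table.append({"id": 3, "uri": "/pricing"})  # new data arrives
second = poll_once()                               # picks up only the new row
```

This assumes a monotonically increasing key (or a reliable modified-timestamp column); without one, detecting changes efficiently is much harder.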
It’s possible to build something that matches your basic requirements in PostgreSQL. You can build materialized views that are kept up to date with triggers, and use partitioning to cut down the overhead of maintaining older data. But all of that is still a fairly manual process in PostgreSQL, so you’d need custom coding for several components, and the performance of the result won’t be great compared with commercial solutions optimized for this specific purpose.
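The trigger-maintained view amounts to incrementally updating an aggregate on every insert. As a toy model of the idea (in Python rather than PL/pgSQL; the table, view, and column names are invented):

```python
from collections import Counter

# The "base table" of raw rows and the "materialized view" of counts
# per (uri, country) dimension.
pageviews = []
view = Counter()

def insert_pageview(uri, country):
    """Insert into the base table; the 'trigger' updates the view in step."""
    pageviews.append({"uri": uri, "country": country})
    view[(uri, country)] += 1   # what an AFTER INSERT trigger would maintain

insert_pageview("/home", "GB")
insert_pageview("/home", "GB")
insert_pageview("/home", "US")
```

Because the view is updated on every write, reads are O(1) lookups; the trade-off is extra work on the insert path, which is exactly the cost the trigger approach accepts.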
There is some innovation in this area from commercial forks of PostgreSQL aimed at the analytics market, such as those from Greenplum and Truviso. Plain, free PostgreSQL isn’t yet a great solution for this sort of problem at larger scales.