Last Thursday, I gave a talk – at Google Offices in Sydney for a monthly meeting of the Sydney Python Users Group SYPY – on an emerging architecture for aligning a range of components that would enable statistically powered web-sites to operate easily. I gave an earlier version of the talk at SURF (Sydney Users R Forum) in December. Unlike having a database powered web-site or a real-time web-site mashing up web services, a data-powered web site needs to ‘crunch’ and combine a large amount of internal and external data to deliver results to users. The crunching typically takes time. Statistical algorithms can hog lots of computing resources. Their results may even need to be rendered to the reader as interactive graphs, so a reader can ‘model’ different options, seeing the graphs change accordingly.
I introduced the example of trying to have a web site that could report your contribution to traffic congestion. With moves to have registration fees for cars linked to how much they are used during congestion, some such web site will emerge. For this to occur, the web site not only has to show how much driving you have done but also how many others were on the same roads at the same time.
Another example relates to carbon footprint: suppose a web site shows you not only how much CO2 you have been producing but whether you outperform or under-perform some benchmark you have selected (for instance some peer group). Similarly, exercise groups might compete with each other, or individuals might compete within groups. In most of these cases, the results do not need to be instantaneous but they can become quite complex, particularly when you begin to ‘weight’ for, or adjust performance by, various characteristics. For instance, if your car is particularly small, then its contribution to congestion might be adjusted down; or, as an older person, your projected weight loss trajectory is adjusted to make it comparable to younger trajectories. Note that in all such cases some kind of oversight group will typically exist that needs summary overviews of the whole situation (traffic, fitness, weight, or whatever).
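The ‘weighting’ idea can be made concrete with a small calculation. This is only a sketch: the size factors, the field names and the choice of the peer-group mean as benchmark are all assumptions invented for the illustration, not figures from the talk.

```python
from statistics import mean

# Hypothetical adjustment factors: a smaller car counts for less congestion.
SIZE_FACTOR = {"small": 0.8, "medium": 1.0, "large": 1.25}


def adjusted_contribution(peak_km, car_size):
    """Scale kilometres driven in congested periods by an assumed size factor."""
    return peak_km * SIZE_FACTOR[car_size]


def compare_to_peers(own_score, peer_scores):
    """Return a driver's score minus the peer-group mean.

    A negative result means the driver contributes less than the benchmark.
    """
    return own_score - mean(peer_scores)


# Made-up drivers, purely for illustration.
drivers = [
    {"name": "A", "peak_km": 100.0, "car_size": "small"},
    {"name": "B", "peak_km": 100.0, "car_size": "large"},
    {"name": "C", "peak_km": 50.0, "car_size": "medium"},
]
scores = {
    d["name"]: adjusted_contribution(d["peak_km"], d["car_size"]) for d in drivers
}
difference_for_a = compare_to_peers(scores["A"], list(scores.values()))
```

Drivers A and B cover the same distance, yet A ends up below the group benchmark and B above it, purely because of the assumed size adjustment. An oversight group would look at the whole `scores` table rather than a single difference.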
To serve both the user and any oversight entity, a statistical ‘engine’ must come into play. The R programming language provides a powerful framework for transforming data into information a web site could deliver: data may need to be cleaned, transformed, analytically dissected, missing data interpolated, predictions extrapolated and error bands (of uncertainty) defined.
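To give a feel for two of those steps (interpolating missing data and defining an error band), here is a minimal sketch. It is written in Python rather than R only to keep the examples in this post in one language; the readings are invented, and the band of two standard deviations around the mean is an assumed convention, not a recommendation.

```python
from statistics import mean, stdev


def interpolate_missing(values):
    """Fill None entries by linear interpolation between known neighbours.

    Leading or trailing gaps stay as None, since there is nothing to
    interpolate between.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


def error_band(values, width=2.0):
    """Return (lower, upper) as the mean plus or minus `width` standard deviations."""
    observed = [v for v in values if v is not None]
    centre = mean(observed)
    spread = stdev(observed)
    return centre - width * spread, centre + width * spread


readings = [10.0, None, 14.0, 15.0, None]  # hypothetical data with gaps
cleaned = interpolate_missing(readings)
lower, upper = error_band(cleaned)
```

The trailing gap is deliberately left unfilled: filling it would be extrapolation, which needs a model rather than a straight line between two neighbours.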
The talk produced a lot of discussion and, for the first time for me, someone blogged the talk as I was giving it! See Net Traveller. Thanks to all who turned up for useful questions and a prompt to write this post-in-review.
The title of the talk gives a clue to the architecture and the actual deployments possible: Making R RESTfully enterprising with Django. Because R focuses on statistical operations, and the Python language has such a rich library of mathematical packages, an ‘elective affinity’ exists between these two languages. In this architecture, Django, a web application framework written in Python – pitched as a framework for perfectionists with deadlines – ‘exposes’ R to the web (or an intranet) as a ‘service’. This means other software programs, and not simply individuals, can invoke R to do some statistical magic, consuming the R output as required. The other side of this, of course, means protecting R from an onslaught of requests. R gobbles up a lot of memory and processing. Queuing the requests provides one pathway (see below). Django needs to ‘talk’ to R and other components and of course needs its own database. The following criteria determined the selection of these components: ease of use, a vibrant developer community, open source, robust software, a ‘plug & play’ framework structure, low maintenance and easy access to ‘script-savvy’ personnel:
I’ll describe the basic configuration by starting at the bottom and working up – as a subterranean spring might be followed to where it finally gushes forth.
Vast amounts of data already exist on the internet. We are witnessing a tsunami of data. See Freebase for a specific instance and Open Data for the ‘movement’ involved and Gapminder for fantastic videos and discussion on data. Google of course continues to make data available through GData (eg Book search results).
Django can connect to and ingest such data. It keeps such data in its own MySQL database (maybe after cleaning – perhaps via Google’s new Refine 2.0). MySQL-Python links Django to MySQL nicely and transparently. Django itself separates out the ‘model’ (the objects it deals with, typically mapped from a database), a controller (which corresponds to the framework as a whole) and views (which define what data are to be presented). Note: this so-called MVC design pattern in Django distinguishes between a View (which data are delivered) and a Template (how the data are presented) which, in this diagram, gets addressed ‘higher up’ the chain. In effect, if there are work-flows to be followed in processing data, the Django framework can control all this. Often, many human interventions nudge data processing along, particularly when complex problems need to be addressed urgently (as I routinely find).
At various points, if data require multivariate analyses, Django invokes R via a nice mapping system called PypeR, which opens ‘pipes’ between the Python objects in Django and corresponding objects in R. This makes for elegance and clarity. Within a serious application of R, the services it provides would be delivered as R packages. In this diagram, note the package itself has three ‘environments’ (or nested contexts that define the objects that ‘live’ there). Functions transform data; with environments, however, these functions can be made generic and then ‘morph’ to suit the specific environment within which they are invoked.
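To show what ‘piping’ from Python to R amounts to, here is a rough stand-in that does not use the mapping library’s own interface at all: it simply hands a one-line program to a separate R process through the standard `Rscript -e` command and reads back what R prints. The helper names (`to_r_vector`, `build_r_script`, `mean_via_r`) are invented for this sketch, and it assumes an `Rscript` executable is on the PATH.

```python
import subprocess


def to_r_vector(numbers):
    """Render a Python list of numbers as an R vector literal such as c(1.0, 2.5)."""
    return "c(" + ", ".join(repr(float(n)) for n in numbers) + ")"


def build_r_script(numbers):
    """Build a one-line R program that prints the mean of the given numbers."""
    return "cat(mean(" + to_r_vector(numbers) + "))"


def mean_via_r(numbers, rscript="Rscript"):
    """Run the script in a separate R process and parse its printed result.

    Assumption: `rscript` names an Rscript executable that can be found.
    """
    completed = subprocess.run(
        [rscript, "-e", build_r_script(numbers)],
        capture_output=True,
        text=True,
        check=True,  # raise if R exits with an error
    )
    return float(completed.stdout.strip())
```

A Django view would call something like `mean_via_r([3, 4, 8])` and pass the result on. Starting a fresh R process per call is slow; a mapping layer that keeps one R session open avoids paying that cost on every request.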
Managing Requests: Messaging Systems and the REST protocol
The description above does not cover the interaction between other systems and this R service. It happens that other systems may have their own data and wish to invoke the R service (which in turn may grab additional data from other sources). If you have the resources of Google, all could be made available as a service. But even Google draws the line at computing resources. Google cleverly designs its services to involve mostly retrieving data, not processing data. Typically the above use cases do not require real-time responsiveness. Rather they involve some kind of ‘accounting period’ – within which each activity or user could have their own specified date for ‘re-calibration’.
Thus a first line of adaptive defense against swamping R involves using messaging systems. These work particularly well within organisations (ie intranets, rather than the internet). So if a Roads Authority did issue registrations based on traffic congestion, and drivers could look up and model their ‘congestion surcharge’ and off-peak discounts, the system would need to queue (and indeed cue) the requests on R to suit their priority within a wider schedule of routine runs. Messaging and scheduling systems really ‘queue cues’ that trigger processes. AMQP (Advanced Message Queuing Protocol) cleverly delivers queued cues of processes within an organisation. A good overview can be found here. The implementation suggested here, RabbitMQ, which is written in Erlang, not Python, can be driven from Django through Celery, a Python task queue that uses it as a message broker. It works well. The more complex the workflow and the need to meld distinct results from different processes, the more AMQP systems matter. But they presuppose a central authority or organisation, rather than the wilds of the internet.
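The idea of ‘queuing cues’ by priority can be shown without any messaging software at all. The sketch below is an in-process stand-in built on Python’s standard `heapq` module. It is not how RabbitMQ or Celery are used, and the class name, job names and priority numbers are all invented for the illustration.

```python
import heapq
import itertools


class RequestQueue:
    """In-process stand-in for a message queue sitting in front of R."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority jobs keep arrival order.
        self._counter = itertools.count()

    def submit(self, priority, job_name):
        """Queue a job; lower priority numbers are served first."""
        heapq.heappush(self._heap, (priority, next(self._counter), job_name))

    def drain(self, worker):
        """Run every queued job one at a time and collect results in run order."""
        results = []
        while self._heap:
            _, _, job_name = heapq.heappop(self._heap)
            results.append(worker(job_name))
        return results


def fake_r_worker(job_name):
    """Placeholder for the expensive call into R."""
    return "done: " + job_name


queue = RequestQueue()
queue.submit(5, "driver lookup 1")
queue.submit(1, "nightly congestion run")
queue.submit(5, "driver lookup 2")
order = queue.drain(fake_r_worker)
```

The routine run jumps ahead of the two driver lookups even though it arrived second, and R only ever sees one job at a time. A real broker adds what this sketch lacks: persistence, several workers, and delivery across machines.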
The web itself functions on HTTP (Hypertext Transfer Protocol). Despite functioning well, it took several years before reflective minds discerned, within it, a deeper architecture, ie REST (Representational State Transfer). The entire web, with all its servers, proxies, caches and look-up directories (eg DNS), can be seen as a system that moves information about resources around. This thinking has now affected communication between computers. For instance, in seeking a result from Google, instead of going through a browser, a piece of software can access the same functionality ‘RESTfully’ by sending a request akin to a browser URL, except that there are additional rules about responses (so that machines can re-direct themselves intelligently to other resources if their first request fails). Here django-piston exposes Django as a RESTful service (and through it, R).
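The ‘additional rules about responses’ are easiest to see from the machine client’s side. The following sketch fakes both ends in plain Python, with no network and no Django, so it shows only the shape of the exchange. The URL layout, the stored record and the function names are hypothetical, and none of this is the API of any particular REST package.

```python
import json

# Hypothetical stored result, standing in for output produced by R.
RESULTS = {"42": {"driver": "42", "surcharge": 12.5}}


def handle(method, path):
    """Map an HTTP-style request onto a resource; return (status, headers, body)."""
    parts = [p for p in path.split("/") if p]
    if len(parts) == 2 and parts[0] == "surcharge":
        # An old address: tell the client where the resource lives now.
        return 301, {"Location": "/drivers/" + parts[1] + "/surcharge"}, ""
    if len(parts) == 3 and parts[0] == "drivers" and parts[2] == "surcharge":
        if method != "GET":
            return 405, {"Allow": "GET"}, ""
        record = RESULTS.get(parts[1])
        if record is None:
            return 404, {}, ""
        return 200, {"Content-Type": "application/json"}, json.dumps(record)
    return 404, {}, ""


def fetch(path, max_redirects=3):
    """A machine client: follow redirects itself, then decode the representation."""
    for _ in range(max_redirects + 1):
        status, headers, body = handle("GET", path)
        if status in (301, 302, 307, 308):
            path = headers["Location"]
            continue
        return status, (json.loads(body) if body else None)
    raise RuntimeError("too many redirects")
```

Calling `fetch("/surcharge/42")` is redirected once and then returns the stored record, while `fetch("/drivers/7/surcharge")` comes back with a 404 and no body. The status codes, not a human reading an error page, tell the calling program what to do next.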
Rounding Out the Service Bundle – RESTful Services
What I have described above I have managed to make work at least in test settings. If an organisation were to go-for-broke on this logic, several other support packages would be relevant, though I have yet to try these. These are on the right side of the above diagram.
First, any system that involves points, monetization, pricing, or harvesting behavioural profiles can be ‘gamed’. This has less chance of occurring if organisations are layered, sluggish, bureaucratic, differentiated etc: the obstacles to coordinating ‘gaming’ or fraud become too great. With internet-exposed activity, particularly where calculations are based on streams of data, gaming becomes possible. In some cases, one finds recommendations in favour of it – eg SEO: Search Engine Optimization. But in more mundane and focused settings, new kinds of fraud and anomaly detection become appropriate, and a nice example of a framework developed for this is Picalo. Much of this work could be done in R, but the author of this Python package has drawn on considerable experience with accounting processes to develop a plug-in framework for data analytic audits. I have yet to get this going, but it looks interesting.
Second, systems of the kind that model aggregations of behaviours will need to simulate future scenarios. This involves more than simple predictions (or even complex statistical predictions). Organisations often need future scenarios in which diverse options are contrasted using available knowledge. Even Excel can do simple versions of this. Another Python package, SimPy, has become a mature tool for simulation. Thus, to use the traffic congestion example, if alternate usage rules, infrastructure builds (widened roads etc) and pricing mechanisms were to be combined in an optimum way (to maximize flow and minimize congestion), then this package has the necessary analytical and programmatic power.
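To show what ‘contrasting scenarios’ means at its very simplest, here is a toy model in plain Python. It does not use any simulation package, the arrival numbers and road capacities are invented, and a real study would need random arrivals and far more detail than a fixed backlog calculation.

```python
def simulate_backlog(arrivals, capacity):
    """Step through time; in each interval the road clears at most `capacity` cars.

    Returns the number of cars still queued at the end of each interval.
    """
    queued = 0
    history = []
    for arriving in arrivals:
        queued += arriving
        queued -= min(queued, capacity)
        history.append(queued)
    return history


peak_hour = [2, 6, 8, 5, 1, 0]  # hypothetical cars arriving per interval
scenarios = {"current road": 4, "widened road": 6}  # assumed capacities
worst = {
    name: max(simulate_backlog(peak_hour, capacity))
    for name, capacity in scenarios.items()
}
```

With these made-up numbers the worst backlog drops from 7 cars to 2 when capacity rises from 4 to 6. Pricing rules or usage rules would enter the same way, by changing the `arrivals` list rather than the capacity.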
Third, individual participants in whatever processes the above system delivers may need tracking, with the nature and extent of their engagement monitored. Here one finds so-called Customer Relationship Management (CRM) software. At one level CRMs try to track how sales work. Django could do this directly. However, when the individual has a relationship in which they aim to be transformed (for example, lose weight, become green, become creative, develop a family, start up a business) and this very transformation requires support, then CRM becomes critical, and SugarCRM (community edition) provides an interesting open source framework for this kind of work. Not shown here, but of even greater long-term significance, are VRM (Vendor Relationship Management) systems – these turn the CRM on its head: thus a web site that paced and paired training (calibrating and showing progress against various benchmarks) would naturally be of interest to the vendors of training materials, guides, etc. Rather than ‘advertise’ on the site, the very site could enable users to make requests (put options in effect) and vendor systems would aim to deliver – in competition with each other – what had been requested.
Finally, in so far as web content has to be managed, ranging from blogs, activity organisation, discussion, papers and forms to document management etc, WordPress combined with BuddyPress (to super-charge it with social media) provides a powerful framework in its own right.
The above four packages can be accessed via their own interfaces (API) or RESTfully or both. As yet these have not been bundled as suggested here. But I am keeping a watch on them in terms of the above architecture.
Enough Self-Clarification for now……