Best design practices for large-scale analytics projects

By Julien Happich

Away from the hype, modern search technologies have radically altered the speed and scale of what is possible. Analysis applied to larger volumes, higher velocities and wider varieties of heterogeneous data reveals new patterns that will never become apparent on a smaller scale.

More recent developments, combining search, graph technologies, machine learning and behavioural analytics, put an array of algorithmic assistants to work for the end-user. In the hands of experts, these tools are the beginnings of AI.


AI for electronic engineering

The ability to analyse and learn from vast quantities of data can deliver value in nearly all areas of electronic engineering. Integrating these powerful search technologies with time series analysis offers massive benefits in areas as diverse as silicon fabrication, process monitoring, weather station sensor networks, data networking infrastructure, voice communications design, radio signal processing and electrical grid usage.

Within industrial control systems, the impact of moving from arduous monitoring towards intelligent anomaly detection, systemic behavioural analysis and more accurate prediction can’t be overstated. As we enter an era where almost all electronic devices will deliver sensor data and receive instructions across the Internet, well-engineered, intelligent software platforms will form a major part of the value of all electronic devices.


Platform implementation

Infrastructure decisions will largely be determined by an organisation’s existing IT strategy. Nonetheless, high-level scoping is essential. Key considerations include how data will be captured from its source, the data format, volumes and speeds, and the acceptable latency.

Most companies start small with pilot projects before growing into broader, mission-critical usage. The overriding rule for infrastructure is that organisations will benefit from more open, flexible software platforms that allow them to move implementations between different environments as needs change. Popular open source distributions with active user communities will offer the best mix of customisation, innovation and differentiation, allowing companies to focus on their core business.

Acute skills shortages within data science will also be a determining factor in the final choice of software. While Big Data and machine learning are often associated with Spark, Hadoop and MapR, the skillsets required to build applications in R and Python are both scarce and in high demand from sectors such as finance and pharma. It is worth exploring more accessible technologies that can be used by subject specialists rather than dedicated data scientists.

Making sense of data: a four-step process.


Best practice 1: data capture

The source and nature of the data being captured form a vital starting point. The source may be a factor in determining platform and software choice. Source-target connectivity, the availability of open APIs and suitability of the analytics platform for specific types of data will all play a role.

Within Elastic’s own platform, the Beats project offers ready-built open source solutions for time series data, packet monitoring, metrics and timestamped sensor data, while the Community Beats project is a community-driven initiative to create, share and maintain a rapidly growing range of connectors to specific applications and environments.


Best practice 2: data ingestion

If analysis is being applied to large existing datasets uploaded as batch jobs, the initial ingest process is straightforward. However, in the increasing number of cases where users want to ingest live data, the system architecture needs to take account of the likely flow of traffic. Within large-scale analytics platforms, data ingestion is almost inevitably subject to latency, so where live data arrives in sporadic peaks or surges, additional queueing or buffering may be vital.
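To make the buffering point concrete, here is a minimal Python sketch that simulates a bounded buffer sitting between a bursty source and an indexer with a fixed throughput. It is an illustration only, not how any particular platform queues data: the function name, the tick-based model and the traffic figures are all assumptions made for the example.

```python
from collections import deque

def simulate_buffered_ingest(arrivals_per_tick, index_rate, buffer_capacity):
    """Simulate a bounded buffer between a bursty source and a steady indexer.

    arrivals_per_tick: number of events arriving in each time step
    index_rate:        events the indexer can process per time step
    buffer_capacity:   maximum number of events the buffer can hold
    Returns (indexed, dropped, peak_depth).
    """
    buffer = deque()
    indexed = dropped = peak_depth = 0
    for tick, arrivals in enumerate(arrivals_per_tick):
        # Accept new events until the buffer is full; the rest are lost.
        for n in range(arrivals):
            if len(buffer) < buffer_capacity:
                buffer.append((tick, n))
            else:
                dropped += 1
        peak_depth = max(peak_depth, len(buffer))
        # The indexer drains at most index_rate events per tick.
        for _ in range(min(index_rate, len(buffer))):
            buffer.popleft()
            indexed += 1
    # Anything still buffered once traffic stops is indexed eventually.
    indexed += len(buffer)
    return indexed, dropped, peak_depth

# Hypothetical traffic: a quiet stream with one surge of 20 events.
traffic = [2, 2, 20, 2, 2, 0, 0, 0]
print(simulate_buffered_ingest(traffic, index_rate=5, buffer_capacity=50))  # (28, 0, 20)
print(simulate_buffered_ingest(traffic, index_rate=5, buffer_capacity=8))   # (16, 12, 8)
```

With a generous buffer the surge is absorbed and indexed a few ticks late; with a small one, 12 of the 28 events are lost. Running this kind of model against expected peak traffic is one way to size a queue before committing to an architecture.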

In live data environments, the connectivity of the analytics platform to popular real-time streaming environments such as Apache Kafka, Redis and ZeroMQ, or to other buffering solutions, may also be an important consideration. The ingestion process also gives organisations a one-time opportunity to clean, enrich and optimise data before indexing. This stage allows users to transform data from the format determined by its source into a format better suited to efficient search and analysis. A common example of data augmentation is geolocation. More broadly, data that has been optimised for efficient transport across a restricted fabric (mobile/wireless) may contain key values that unlock far richer levels of context and information in other systems and applications.
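As a rough illustration of ingest-time enrichment, the Python sketch below expands a compact, transport-optimised record into a document better suited to search, adding geolocation from a lookup keyed on the sensor ID. The record layout, the field names and the `SENSOR_LOCATIONS` table are hypothetical, chosen only for the example.

```python
from datetime import datetime, timezone

# Hypothetical lookup table: sensor ID -> (latitude, longitude).
SENSOR_LOCATIONS = {
    "ws-001": (51.5074, -0.1278),
    "ws-002": (48.8566, 2.3522),
}

def enrich(raw_line):
    """Turn a compact 'sensor_id,epoch_seconds,temp_in_tenths_c' record
    into a richer document ready for indexing."""
    sensor_id, epoch, temp_tenths = raw_line.strip().split(",")
    doc = {
        "sensor_id": sensor_id,
        # Convert the epoch value to an explicit UTC ISO 8601 timestamp.
        "@timestamp": datetime.fromtimestamp(int(epoch), tz=timezone.utc).isoformat(),
        # Convert tenths of a degree to degrees Celsius.
        "temperature_c": int(temp_tenths) / 10.0,
    }
    # The key value (sensor ID) unlocks extra context: here, geolocation.
    location = SENSOR_LOCATIONS.get(sensor_id)
    if location is not None:
        doc["location"] = {"lat": location[0], "lon": location[1]}
    return doc

print(enrich("ws-001,1500000000,215"))
```

Unknown sensors are passed through without a `location` field rather than rejected, which is one possible design choice; a stricter pipeline might route them to a separate queue for review.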

Best practice 3: indexing

A database index significantly improves the speed of data retrieval operations. Indices are used to quickly locate data without having to search through everything. When and how indexing is carried out within the system architecture can be a critical factor in determining the choice of the system. There are two essential models:

Schema-on-write. Here, the schema is defined before data ingestion. Data is indexed according to a predefined schema, and when results are presented they take their format from the same schema. Historically, enriching data under this model often involved complex operations and time-consuming re-indexing. The main advantage of this approach is high-performance queries; the drawback is potentially slower ingestion and indexing rates. Proper schema definition in advance is key.

Schema-on-read. In this model no schema is predefined. Data is simply ingested into the data store as it arrives, and the schema and any enrichment are applied at query time, when the stored data is read. This approach avoids the need for careful schema definition in advance and can achieve high ingestion rates. It sometimes results in slower query performance, though, due to the lack of pre-calculated indices. Most schema-on-read data stores therefore have the capability to optimise common query types, once they’ve been learned, to improve query performance.
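The trade-off between the two models can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the internals of any real data store: both classes, the event layout and the in-memory index are invented for the example.

```python
import json
from collections import defaultdict

# Hypothetical raw events, as they might arrive from a source.
RAW_EVENTS = [
    '{"device": "pump-1", "status": "ok", "temp": "41"}',
    '{"device": "pump-2", "status": "fault", "temp": "78"}',
    '{"device": "pump-1", "status": "fault", "temp": "80"}',
]

class SchemaOnWriteStore:
    """Parse, type and index every event at ingest time using a fixed schema."""
    SCHEMA = {"device": str, "status": str, "temp": int}

    def __init__(self):
        self.docs = []
        self.index = defaultdict(list)  # (field, value) -> positions in self.docs

    def ingest(self, raw):
        record = json.loads(raw)
        doc = {field: cast(record[field]) for field, cast in self.SCHEMA.items()}
        position = len(self.docs)
        self.docs.append(doc)
        for field, value in doc.items():
            self.index[(field, value)].append(position)  # the work is paid here

    def query(self, field, value):
        # Fast: a single index lookup, no scan of the stored documents.
        return [self.docs[p] for p in self.index.get((field, value), [])]

class SchemaOnReadStore:
    """Store raw events untouched; parse and interpret them at query time."""

    def __init__(self):
        self.raw = []

    def ingest(self, raw):
        self.raw.append(raw)  # cheap: no parsing, no indexing

    def query(self, field, value, cast=str):
        # Slower: every stored event is parsed and checked on each query.
        matches = []
        for raw in self.raw:
            record = json.loads(raw)
            if field in record and cast(record[field]) == value:
                matches.append(record)
        return matches

write_store, read_store = SchemaOnWriteStore(), SchemaOnReadStore()
for event in RAW_EVENTS:
    write_store.ingest(event)
    read_store.ingest(event)

print(write_store.query("temp", 80))           # typed at ingest
print(read_store.query("temp", 80, cast=int))  # typed at query time
```

The schema-on-write store does its parsing and indexing once, at ingest, and answers queries with a lookup. The schema-on-read store ingests almost for free but rescans its raw data on every query, and the caller decides how each field should be interpreted.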

Some data stores employ hybrids of the two models, offering the advantages of one approach without its common disadvantages. Elasticsearch can operate in a schema-less mode, or with automatically created schemas, so new queries and searches devised to uncover new relationships can run without a schema having been designed in advance. Elasticsearch can also parallelise the ingestion and indexing process to overcome the slow indexing traditionally associated with schema-on-write models.

Best practice 4: automated analytics and visualizations

The results that the right platform choice can deliver are remarkable. The value of combining simple search with other emerging technologies, such as machine learning, anomaly detection and graph analysis, is hard to overestimate. And when the platform is open source, and therefore free to evaluate, it’s not a hard sell. Processing is where the fun starts.


Machine learning

Almost all automated analytics utilising unsupervised machine learning have skill sets based on modern data science. Sometimes referred to as “algorithmic assistants,” they baseline normal behaviour by accurately modelling time series data, identify anomalous data points or “outliers”, and score how anomalous each outlier is. This set of skills is often packaged up under the term “machine learning anomaly detection.”
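The three skills described above (baselining, identifying outliers and scoring them) can be illustrated with a deliberately simple statistical stand-in. The Python sketch below uses a rolling mean and standard deviation rather than a learned model, so it should be read as an assumption-laden illustration of the idea, not as the method any commercial product uses; the function name, window size and threshold are all hypothetical.

```python
from statistics import mean, stdev

def score_anomalies(values, window=5, threshold=3.0):
    """Flag points that deviate strongly from a rolling baseline.

    Each point is compared with the mean and standard deviation of the
    `window` points before it. Returns a list of (index, value, z_score)
    for every point whose z-score exceeds `threshold`.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]   # 1. baseline "normal" behaviour
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            continue                      # flat baseline: z-score undefined
        z = abs(values[i] - mu) / sigma   # 3. score the anomalousness
        if z > threshold:                 # 2. identify the outlier
            anomalies.append((i, values[i], round(z, 2)))
    return anomalies

# Hypothetical sensor readings with one obvious spike at index 6.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0, 9.8, 10.2]
for index, value, z in score_anomalies(readings):
    print(f"index {index}: value {value} scored z={z}")
```

Note one weakness of this naive approach: once the spike enters the rolling window it inflates the baseline's spread, which can mask later anomalies. More robust modelling of the time series is exactly where real machine learning techniques earn their keep.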

Time series analysis using Elastic’s Kibana

Recent developments in machine learning-based analytics add further capabilities. Think of these as “senior algorithmic assistants” that take the work of their subordinate assistants and perform advanced functions such as influencer analysis, correlation, causation and forecasting, providing even more context for engineers.


Graph analysis

If your anticipated analysis focuses heavily on relationships between entities, such as computing optimal paths between nodes in a physical topology or determining communities of users in a social graph, then a graph database might be an appropriate platform. Technically another form of NoSQL database, graph databases store entities and the relationships between them in structures called nodes, edges and properties, and allow efficient multi-level searches of stored data.
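For readers unfamiliar with graph queries, the following Python sketch shows the kind of question involved: finding a fewest-hops path between two nodes in a small network topology using breadth-first search. It is an in-memory illustration only, and the node names and edge list are hypothetical; a graph database would answer the same question over far larger, persistent data.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search for a fewest-hops path in an undirected graph.

    edges: iterable of (node_a, node_b) pairs
    Returns the path as a list of nodes, or None if goal is unreachable.
    """
    # Build an adjacency list from the edge pairs.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    previous = {start: None}  # also serves as the "visited" set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical physical topology.
topology = [
    ("router-a", "switch-1"),
    ("switch-1", "switch-2"),
    ("switch-2", "server-x"),
    ("router-a", "switch-3"),
    ("switch-3", "server-x"),
]
print(shortest_path(topology, "router-a", "server-x"))
# ['router-a', 'switch-3', 'server-x']
```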

Depending on your requirements, you may be able to gain some of the benefits of graph exploration without the need for a dedicated graph database such as Neo4j, for example by using Elastic’s own Graph analytics feature.

Some platforms can provide graph exploration of data based on relationships determined from relevance data maintained by their search functions. These functions provide the ability to explore potential relationships living amongst the data stored in the platform; linkages between people, places, preferences, products, you name it, can provide tremendous insight.


While the automated analytics described above can relieve human analysts of repetitive searching and pivoting through data, the data still needs to be visualized for teams to gain the insight they were seeking in the first place.

There are dedicated visualization platforms, Kibana for example, that retrieve data from your data store, creating reports to gain and communicate insights. Other tools provide a rich set of visualizations from your data store which can be shared across analysts.

Analysing time series data such as the types outlined above may require tighter integration between your data store and your visualization platform. Visualizations such as simple metrics, data tables, line charts, time series charts, bar charts, pie charts and geographical tile maps should represent a minimum set of capabilities.

Geopoint analysis performed by Elastic’s Kibana visualization platform.


Tackling a large-scale analytics project can be daunting. However, if engineering discipline is applied to the project, carefully modelling the process of converting data into insight before making larger investments of time or money, success is increasingly achievable.

Consider and model the format, volumes, speeds and variability of the input data, the acceptable latency between data creation and results, the need for high availability, the data retention period and the anticipated query volume.
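Even a back-of-the-envelope model helps here. The Python sketch below estimates raw storage from event rate, average event size, retention period and the number of copies kept for availability. It is a rough first approximation under stated assumptions: the function name and the example figures are hypothetical, and it ignores index overhead and compression, both of which vary by platform.

```python
SECONDS_PER_DAY = 86_400

def estimate_storage_gb(events_per_second, avg_event_bytes, retention_days, copies=2):
    """Estimate raw storage needs in decimal gigabytes (1 GB = 1e9 bytes).

    copies: total copies of the data kept, e.g. 2 = primary plus one replica.
    Returns (daily_gb, retained_gb). Index overhead and compression are
    deliberately ignored in this simple model.
    """
    daily_bytes = events_per_second * avg_event_bytes * SECONDS_PER_DAY
    daily_gb = daily_bytes / 1e9
    retained_gb = daily_gb * retention_days * copies
    return daily_gb, retained_gb

# Hypothetical workload: 5,000 events/s of 400 bytes each, kept for 30 days.
daily, retained = estimate_storage_gb(5_000, 400, retention_days=30, copies=2)
print(f"{daily:.1f} GB per day, {retained:.0f} GB retained")
# 172.8 GB per day, 10368 GB retained
```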

Finally, remember that moving data from one data store to another (aka overcoming “data gravity”) can be costly, time-consuming, and complex. Consider a platform that offers a mashup of capabilities in a single data store that meets your specific needs.


About the author:

Mike Paquette is Director of Security Market Solutions at Elastic.
