GPU-based database analytics platform maps data in milliseconds

By Rich Pell

GPUs were, and are, primarily designed for image processing. Developed for video games in the 1990s, modern GPUs are specialized circuits with thousands of small, efficient processing units, or “cores,” that work simultaneously to rapidly render graphics on screen. They are perhaps the most widespread application of parallel processing, with arrays of (usually) identical processing units applying the same transformations to many elements or blocks of elements, simultaneously.

Starting in the late 1990s, engineers and researchers noted that GPUs could also be viewed as highly parallel general-purpose computing units. The term GPGPU (general-purpose computing with GPUs) appeared and led to a variety of derivatives, not least programming language variants such as OpenCL. Proponents used descriptions such as “a supercomputer on a desktop,” which was not entirely fanciful. Because of their parallel-computing speeds and high-performance memory, GPUs are today used for advanced lab simulations and deep-learning programming, among other things.

Todd Mostak is now founder and CEO of “data exploration” company MapD, whose main product has the same name. MapD is essentially a commonly used type of database-management system, modified to run on GPUs instead of the central processing units (CPUs) that power most traditional database-management systems.

By doing so, MapD can process billions of data points in milliseconds, making it 100 times faster than traditional systems. Moreover, MapD visualizes all processed data points nearly instantaneously — such as, say, plotting tweets on a world map — and parameters can be modified on the fly to adjust the visualized display.

With its first product launched last March, MapD’s clients already include Verizon and other big-name telecommunications companies, a social media giant, and financial and advertising firms. In October, the investment arm of the U.S. Central Intelligence Agency, In-Q-Tel, announced that it had invested in MapD’s latest funding round to accelerate the development of certain features for the U.S. intelligence community.

“[The CIA has] a lot of geospatial data, and they need to be able to form, visualize, and query that data in real time. It’s a real need across the intelligence community,” Mostak says.

GPUs are designed specifically for parallel computing, with thousands of energy-efficient cores that can, for example, simultaneously determine the colour of each pixel on a computer screen to render an image. GPUs also use high-bandwidth memory (RAM) that’s about an order of magnitude faster than the memory CPUs use.
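
The per-pixel idea can be illustrated with a small Python sketch. It is a stand-in only: the function names and the gradient formula are hypothetical, and ordinary loops take the place of the thousands of cores that would each handle a different pixel at the same moment on a real GPU. The point it shows is that each pixel's colour depends only on that pixel's own coordinates, which is what makes the work easy to split up.

```python
def pixel_colour(x, y, width, height):
    """Colour of one pixel, computed from its own coordinates alone."""
    red = 255 * x // (width - 1)     # fades left to right
    green = 255 * y // (height - 1)  # fades top to bottom
    return (red, green, 128)


def render(width, height):
    # On a GPU, each core would evaluate pixel_colour for a different
    # (x, y) simultaneously. Here nested loops stand in for that.
    return [
        [pixel_colour(x, y, width, height) for x in range(width)]
        for y in range(height)
    ]


image = render(4, 3)  # a tiny 4x3 image, stored as rows of (r, g, b) tuples
```

Because no pixel needs the result of any other pixel, the calls could run in any order, or all at once.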

Today, some databases are being powered by GPUs. But these systems suffer from a major design flaw, Mostak says: “In most implementations, the data is initially stored on a CPU, moved to the GPU for a query, and results are moved back to the CPU for storage. Even if you speed up the computation time of a query [by using a GPU], you lose most of the speed by transferring from CPU to GPU and back.”
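
The trade-off Mostak describes can be put in rough numbers. The sketch below is a back-of-the-envelope model, not a measurement of MapD or any other system: the function name, the scan rates, the link speed and the data size are all illustrative assumptions, chosen only to show how copy time can swamp the gain from a faster scan.

```python
def query_time_s(data_gb, scan_gb_per_s, link_gb_per_s=None):
    """Estimated seconds to scan data_gb, plus an optional copy over a link."""
    compute = data_gb / scan_gb_per_s
    transfer = data_gb / link_gb_per_s if link_gb_per_s else 0.0
    return compute + transfer


# Hypothetical figures, for illustration only.
DATA_GB = 10.0
CPU_SCAN_GB_S = 20.0    # assumed CPU scan rate from its own memory
GPU_SCAN_GB_S = 200.0   # assumed GPU scan rate, ten times the CPU's
LINK_GB_S = 10.0        # assumed CPU-to-GPU transfer rate

cpu_only = query_time_s(DATA_GB, CPU_SCAN_GB_S)
gpu_with_copy = query_time_s(DATA_GB, GPU_SCAN_GB_S, LINK_GB_S)
gpu_resident = query_time_s(DATA_GB, GPU_SCAN_GB_S)

print(f"CPU only:           {cpu_only:.2f} s")
print(f"GPU, copy per query: {gpu_with_copy:.2f} s")
print(f"GPU, data resident:  {gpu_resident:.2f} s")
```

Under these assumed figures, the GPU that has to receive the data for every query ends up slower than the CPU, while the GPU that already holds the data is ten times faster.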

But with MapD, Mostak says, the goal “is making GPUs first-class citizens.”

Instead of storing the data in CPU memory, MapD caches as much data as possible across multiple GPUs, so there is no moving back and forth between the different circuits and no pulling from the hard drive, which saves a lot of time.

The trick, Mostak says, is giving each GPU its own buffer pool, a portion of database memory that temporarily caches the most recent data pulled from the hard drive. If a database then needs to query the same data point over and over, which is quite common, it reads that data point from the GPU’s ultrafast RAM instead of pulling it from the CPU or hard drive.
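
A buffer pool of this kind can be sketched in a few lines of Python. This is a toy model and not MapD's implementation: the class name, the chunk identifiers and the least-recently-used eviction rule are all assumptions made for illustration, and a plain dictionary stands in for GPU memory.

```python
from collections import OrderedDict


class BufferPool:
    """Toy per-device buffer pool that keeps recently used chunks in fast memory."""

    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity              # maximum number of chunks held
        self.load_from_disk = load_from_disk  # slow fallback, called on a miss
        self.chunks = OrderedDict()           # chunk_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, chunk_id):
        if chunk_id in self.chunks:
            # Fast path: the chunk is already cached.
            self.hits += 1
            self.chunks.move_to_end(chunk_id)  # mark as most recently used
            return self.chunks[chunk_id]
        # Slow path: fetch from disk, then cache it.
        self.misses += 1
        data = self.load_from_disk(chunk_id)
        self.chunks[chunk_id] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)    # evict the least recently used
        return data


# Hypothetical usage: the same chunk is queried repeatedly.
pool = BufferPool(capacity=2, load_from_disk=lambda cid: f"rows of {cid}")
for cid in ["tweets_jan", "tweets_jan", "tweets_feb", "tweets_jan"]:
    pool.get(cid)
print(pool.hits, pool.misses)
```

Repeated requests for the same chunk are served from the pool, so the slow loader runs only once per chunk until that chunk is evicted.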

By carefully managing the memory on the GPU, MapD can deliver performance that is two to three orders of magnitude faster than CPU-powered database systems, Mostak says.

In one example of what MapD can do, the system analyzed a dataset that’s considered the benchmark for large-scale analytics — a 1.2 billion-record New York City taxi dataset. In a test by an independent big-data consultant, MapD ran 74 times faster than numerous advanced CPU database systems, completing several queries in milliseconds.

In another example, Verizon used MapD to analyze SIM card update activity on each of its 85 million subscribers’ phones on a weekly basis. With other database systems, the query would take hours to run and hours to evaluate, so the company only did so periodically. Using MapD, Verizon found a glitch in its system that was causing upward of a million SIM card updates per year, which used a lot of server power and was a nuisance for subscribers.

The idea for MapD came to Mostak when he was at Harvard University in 2012, writing his political-science master’s thesis on the Arab Spring, and analyzing hundreds of millions of Egyptian tweets sent out during the uprisings.

Using CPU-based database-management systems to analyze the data was a time-waster. Often he would run queries overnight and wake up to find an error, meaning the long process would need to be repeated. “It was a frustrating experience,” Mostak says.

At the time, Mostak was also taking a CSAIL database course taught by the co-directors of the MIT Database Group: Michael Stonebraker, an adjunct professor in computer science who founded the pioneering database-management company Vertica; and Sam Madden, a professor of electrical engineering and computer science who serves as a MapD advisor.

As a personal project to speed up his thesis research, Mostak invented an early MapD prototype. The professors were impressed. After Mostak completed his thesis, they asked him to join CSAIL as a researcher and build out the prototype, which he did in 2013.

With Madden’s encouragement, Mostak also began showcasing the speedy system around MIT’s Industrial Liaison Program (ILP), which connects MIT community members with corporations around the world. Companies started asking Mostak where they could buy it. “At the time, I said it was purely an academic project,” Mostak says. “But it got me thinking that this was a widespread problem — getting real-time insights out of big data.”

In January 2014, Mostak officially launched MapD. Joining ILP’s Startup Exchange, an online community for MIT-affiliated startups to connect with each other and with other companies, “put [MapD] on the map with commercial entities,” Mostak says.

From there, the startup, then headquartered in Cambridge, Massachusetts, hit the ground running. In March 2014, it won a $100,000 prize from an early startup contest put on by Nvidia, a prominent GPU manufacturer and current MapD partner. That fall, the startup landed $2 million in seed funding from Nvidia and Google, followed by a $10 million Series A funding round the following year.

Today, MapD is expanding in its new San Francisco headquarters. It’s also looking to capitalize on an increased user base, as more companies start launching GPU programming platforms in the cloud. “That’ll give us more access to customers,” Mostak says, adding, “I feel like we’re just getting started.”
