MongoDB — Case Study
McAfee Global Threat Intelligence (GTI) is a cloud-based intelligence service that correlates data from millions of sensors around the globe. A critical element of McAfee’s ability to protect customers from cyberthreats, GTI “connects the dots” between malicious web sites and associated malware, viruses and more, and delivers real-time threat information to McAfee end client products.
In 2010, it became clear that McAfee’s existing database solutions would not be able to handle the demands of exponential data growth. The team spent a significant amount of time investigating workarounds and fixes, which created new cracks in the system. McAfee turned to MongoDB to achieve the scale, performance and flexibility required for big data analysis.
McAfee GTI analyzes cyberthreats from all angles, identifying threat relationships, such as malware used in network intrusions, websites hosting malware, botnet associations, and more. Threat information is extremely time sensitive; knowing about a threat from weeks ago is useless.
To provide up-to-date, comprehensive threat information, McAfee needs to quickly process terabytes of different data types (such as IP addresses or domains) into meaningful relationships. For example: Is this web site good or bad? What other sites have been interacting with it? The success of the cloud-based system also depends on a bidirectional data flow: GTI gathers data from millions of client sensors and provides real-time intelligence back to these end products, at a rate of 100 billion queries per month.
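To put the monthly figure in perspective, the short calculation below converts it to an average per-second rate. It is only a back-of-the-envelope sketch: the 30-day month is an assumption, and the result is an average that says nothing about peak load.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_queries_per_second(queries_per_month: float, days_in_month: int = 30) -> float:
    """Convert a monthly query volume into an average per-second rate."""
    return queries_per_month / (days_in_month * SECONDS_PER_DAY)


if __name__ == "__main__":
    # 100 billion queries per month is the figure quoted in the case study.
    rate = average_queries_per_second(100e9)
    print(f"about {rate:,.0f} queries per second on average")
```

Under that assumption, 100 billion queries per month works out to roughly 38,600 queries per second.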
McAfee was unable to address these needs and effectively scale out to millions of records with their existing solutions. For example, the HBase / Hadoop setup made it difficult to run interesting, complex queries, and the team hit bugs in which the Java garbage collector ran out of memory. Sharding and syncing were another problem: Lucene was able to index in interesting ways, but required too much customization. McAfee compensated for all the rebuilding and redeploying of Katta shards with “the usual scripting duct tape,” but what they really needed was a solution that could seamlessly handle the sharding and updating on its own.
“We were spending more time building solutions in-house rather than focusing on threat research,” said McAfee IT Architect Wes Widner. “We needed a database engine to take care of itself and let us do our jobs — find interesting bits in the data, figure out who’s being naughty on the web at any given moment, and report that up the chain for whoever wants to use it.”
McAfee selected MongoDB, which had excellent documentation and a growing community that was “on fire.”
The authoritative source for McAfee threat information, MongoDB enables big data analytics and supports the real-time flow of cyberthreat data between GTI’s cloud-based system and end client products. It currently stores 4 billion documents — terabytes of data.
Easy to Increase Storage Capacity by Orders of Magnitude
Auto-sharding makes it easy to add more servers at any time to handle GTI’s increasing data needs. They’ve seen a two-fold increase in data over the last two years, and expect that trend to continue. “Putting MongoDB in place was like opening up a water spigot,” said Widner. With the capacity to store more data, McAfee gains more visibility into threats and is able to perform more interesting data analysis.
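As a rough illustration of what adding capacity involves, the sketch below builds the admin commands that register a new shard and shard a collection. The replica set name, host names, namespace and shard key are hypothetical, and nothing here reflects McAfee's actual configuration. The `admin_db` argument is assumed to behave like the `admin` database object of a PyMongo client connected to a `mongos` router.

```python
def add_shard_command(replica_set: str, hosts: list) -> dict:
    """Build the addShard command for a replica set, e.g. 'rs7/host-a:27017,host-b:27017'."""
    return {"addShard": f"{replica_set}/{','.join(hosts)}"}


def shard_collection_command(namespace: str, key_field: str, hashed: bool = True) -> dict:
    """Build the shardCollection command for 'database.collection' on a single field."""
    return {"shardCollection": namespace, "key": {key_field: "hashed" if hashed else 1}}


def run_admin_commands(admin_db, commands: list) -> list:
    """Send each command document to the cluster and collect the replies.

    `admin_db` is assumed to expose PyMongo's Database.command(document).
    """
    return [admin_db.command(command) for command in commands]


# Hypothetical usage against a running cluster (requires pymongo and a mongos router):
#   from pymongo import MongoClient
#   admin = MongoClient("mongodb://mongos.example.internal:27017").admin
#   run_admin_commands(admin, [
#       add_shard_command("rs7", ["db7a.example.internal:27017"]),
#       shard_collection_command("gti.ip_records", "ip"),
#   ])
```

Once a shard is registered, the cluster's balancer migrates chunks to it in the background, which is why no application changes are needed when servers are added.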
Lowered Latency, Easy to Interact with JSON
GTI receives queries of its data as JSON objects, which it can pass with minimal transformation into MongoDB. This greatly simplifies the query workflow, and MongoDB’s tremendous speed and indexing capability obviate the need for a separate search engine solution such as Lucene / Katta. MongoDB is “orders of magnitude faster”: queries on the user-facing McAfee.com site, for example, now complete in about 150 ms, down from 500 ms.
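The sketch below shows one way an incoming JSON query could be turned into a MongoDB filter with very little transformation. The field names and the whitelist are hypothetical and are not taken from GTI. The `collection` argument is assumed to behave like a PyMongo collection, whose `find` method accepts a filter document and a `limit`.

```python
import json

# Hypothetical set of fields that clients are allowed to query on.
ALLOWED_FIELDS = {"ip", "domain", "reputation"}


def json_to_filter(raw_query: str) -> dict:
    """Parse a JSON query string and return it as a MongoDB filter document.

    Only flat equality matches on whitelisted fields are accepted, so a client
    cannot smuggle in operators such as $where.
    """
    query = json.loads(raw_query)
    if not isinstance(query, dict):
        raise ValueError("query must be a JSON object")
    unknown = set(query) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    for field, value in query.items():
        if isinstance(value, (dict, list)):
            raise ValueError(f"field {field!r} must be a plain value")
    return query


def lookup(collection, raw_query: str, limit: int = 10) -> list:
    """Run the parsed query; `collection` is assumed to expose PyMongo's find()."""
    return list(collection.find(json_to_filter(raw_query), limit=limit))
```

Because the parsed JSON object is already a valid filter document, the only work left to the application is validation. An index on the queried fields (for example `collection.create_index("ip")` in PyMongo) is what keeps such lookups fast.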
GridFS for High Availability & CDN
McAfee also optimizes delivery of content to end users by using MongoDB’s GridFS as a homegrown CDN. Analytics and incremental updates are packaged and stored in GridFS, then sent to endpoint security systems. McAfee benefits from high availability, since GridFS files are available on all of their data servers across the country without any additional work. Plus, with tag-aware sharding (which McAfee plans to use in the near future), they can keep data geographically close to the systems that use it, which speeds things up for end users who are, for example, pulling down software updates.
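A minimal sketch of the store-then-fetch pattern described above follows. The file name, the `region` metadata field and the helper names are hypothetical. The `fs` argument is assumed to behave like PyMongo's `gridfs.GridFS`, whose `put` stores bytes together with extra fields and whose `get_last_version` returns the newest file with a given name.

```python
def publish_update(fs, name: str, payload: bytes, region: str):
    """Store an update package and return the id that GridFS assigns to it.

    `fs` is assumed to expose gridfs.GridFS.put(data, **fields).
    The `region` metadata field is a hypothetical label for this sketch.
    """
    return fs.put(payload, filename=name, metadata={"region": region})


def fetch_latest(fs, name: str) -> bytes:
    """Return the bytes of the most recently stored file called `name`."""
    return fs.get_last_version(filename=name).read()


# Hypothetical usage (requires pymongo and a reachable deployment):
#   import gridfs
#   from pymongo import MongoClient
#   fs = gridfs.GridFS(MongoClient("mongodb://db.example.internal:27017")["updates"])
#   publish_update(fs, "signatures.pkg", b"...package bytes...", region="us-east")
#   package = fetch_latest(fs, "signatures.pkg")
```

Since GridFS keeps files in ordinary collections, they are replicated and sharded like any other data, which is where the availability benefit comes from.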
Flexibility to ‘Decorate’ Base Data
MongoDB offers the flexibility to store different types of documents in a single collection. McAfee may start with a base record of an IP address, then ‘decorate’ it at will with various information, such as the domain the IP address is associated with, who it has talked to, and whether those peers are known to be good or bad. The result is a “forest of information” that enables GTI to develop meaningful relationships across several different schemas.
“Instead of fitting the problem to the tool, MongoDB is able to morph to any problem, so thinking about the problem is a lot simpler,” said Widner. Developers can change the database’s schema at any time, shortening development cycles and dramatically increasing productivity.
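The ‘decorate’ idea can be sketched as an upsert that adds fields to a base IP record only when they are known. The field names (`domain`, `reputation`, `contacts`) are hypothetical. The `collection` argument is assumed to behave like a PyMongo collection with an `update_one` method.

```python
def decoration_update(domain=None, reputation=None, contacted_ip=None) -> dict:
    """Build an update document that adds only the attributes supplied."""
    update = {}
    set_fields = {}
    if domain is not None:
        set_fields["domain"] = domain
    if reputation is not None:
        set_fields["reputation"] = reputation
    if set_fields:
        update["$set"] = set_fields
    if contacted_ip is not None:
        # $addToSet appends to the array only if the value is not already present.
        update["$addToSet"] = {"contacts": contacted_ip}
    return update


def decorate(collection, ip: str, **attributes):
    """Create the base record if it is missing, then attach the new attributes."""
    update = decoration_update(**attributes)
    if not update:
        return None  # nothing to add; an empty update document would be rejected
    return collection.update_one({"ip": ip}, update, upsert=True)
```

No schema migration is needed when a new attribute appears: documents that lack the field simply stay as they are until they are decorated.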
Atomic Updates and Full Consistency
MongoDB’s atomic document updates — together with the richness of the document model — make it easy for McAfee developers to change chunks of data as a unit instead of potentially losing data or having inconsistencies in analytics due to incomplete updates. In addition, MongoDB’s support of fully-consistent reads prevents old data from being acted upon. For example, if a new IP address is now sending out a virus, GTI can immediately flag as suspicious anyone who’s talked to the IP address in the last 24 hours, and reliably provide this data to end-users an instant later.
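The 24-hour example can be sketched as a single filter and update pair. The data model is an assumption made for illustration: each record is taken to hold a `contacts` array of `{"ip", "seen_at"}` entries, and the `suspicious`, `flagged_at` and `flag_reasons` fields are hypothetical. The `collection` argument is assumed to behave like a PyMongo collection with an `update_many` method.

```python
from datetime import datetime, timedelta, timezone


def recent_contact_filter(bad_ip: str, now: datetime, window: timedelta) -> dict:
    """Match records that talked to `bad_ip` within the time window."""
    cutoff = now - window
    return {"contacts": {"$elemMatch": {"ip": bad_ip, "seen_at": {"$gte": cutoff}}}}


def suspicious_update(bad_ip: str, now: datetime) -> dict:
    """Set the flag, the timestamp and the reason together in one update document."""
    return {
        "$set": {"suspicious": True, "flagged_at": now},
        "$addToSet": {"flag_reasons": f"contact with {bad_ip}"},
    }


def flag_recent_contacts(collection, bad_ip: str, now=None, window=timedelta(hours=24)):
    """Flag every record that contacted `bad_ip` inside the window."""
    now = now or datetime.now(timezone.utc)
    return collection.update_many(
        recent_contact_filter(bad_ip, now, window),
        suspicious_update(bad_ip, now),
    )
```

Each matched document receives all three field changes atomically, so a reader never sees a record that is flagged but has no reason attached. The atomicity applies per document, not across the whole batch that `update_many` touches.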
Language & Driver Diversity
McAfee’s approach is to use the best tool for the job, whether it’s PHP, node.js, Python or any other language. Fortunately, the community maintains MongoDB drivers for almost 50 languages, over a dozen of which are officially supported by MongoDB. This means that no matter what application language McAfee chooses to use for a project, they know interfacing with MongoDB will be a snap. “That’s helped us lean on MongoDB to store all of our data,” said Widner.
MongoDB’s geospatial indexing enables GTI to create heat maps of hot spots where malicious activity is being targeted. “I’ve built geo-location services on top of MySQL, but it’s horribly slow,” said Widner.
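As an illustration of the kind of query behind such a heat map, the sketch below builds a `$geoWithin` filter for a circular area and then buckets the resulting coordinates into grid cells. The `location` field name and the cell size are hypothetical, and the grid bucketing is a generic technique rather than anything the case study attributes to GTI.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6378.1  # used to convert a distance in km into radians


def within_radius_filter(longitude: float, latitude: float, radius_km: float,
                         field: str = "location") -> dict:
    """Build a filter matching points inside a spherical circle.

    $centerSphere takes [longitude, latitude] and a radius expressed in radians.
    """
    return {field: {"$geoWithin": {
        "$centerSphere": [[longitude, latitude], radius_km / EARTH_RADIUS_KM],
    }}}


def heat_map_counts(points, cell_degrees: float = 1.0) -> Counter:
    """Count (longitude, latitude) pairs per grid cell of `cell_degrees` per side."""
    counts = Counter()
    for longitude, latitude in points:
        cell = (math.floor(longitude / cell_degrees), math.floor(latitude / cell_degrees))
        counts[cell] += 1
    return counts


# Hypothetical usage with PyMongo, assuming documents store a GeoJSON point in "location":
#   collection.create_index([("location", "2dsphere")])
#   docs = collection.find(within_radius_filter(-73.97, 40.77, 50))
#   cells = heat_map_counts(d["location"]["coordinates"] for d in docs)
```

The cells with the highest counts are the hot spots. A `2dsphere` index on the location field lets the database narrow down candidates without scanning every document.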
- OS: CentOS
- Deployment platform: own hardware
- Server hardware configuration: 24 servers running MongoDB, with at least 96 GB of RAM per server, and SSDs
- Sharding and Replication: 6 shards in largest cluster, each with a primary and two secondaries. Other clusters have 2 shards each, each with a primary and secondary
- Application Languages: PHP, node.js, Java
- Database size: 4 billion documents representing terabytes of data
- Other database technologies: RabbitMQ
- Monitoring systems: looking at using MongoDB Management Service for monitoring internal systems
- 20,000 writes per second on SSDs
MongoDB enables McAfee to develop quickly on a platform that can scale, delivering time to market advantages. “Writing proof of concept applications has become fun to do in MongoDB,” said Widner. “If we have a great POC, we can polish the code and transition directly into a full-scale project — without changing the back-end. It’s easy to hash new things in as we scale up.”
Plus, the ability to change document schema on the fly boosts productivity and even morale. “Dealing with the set-up and boilerplate of traditional databases can be discouraging. With MongoDB, you’re closer to working on the specific problem,” said Widner.