Cloud-Native Logging Architectures: Where the Multi-Use-Case 'Single Pane of Glass' Becomes Just Pain - and How to Avoid It

September 13, 2021 | 8 min read

BlueVoyant

We wrote this piece because our team kept hearing the same theme across our client base from organizations undergoing a major cloud migration:

“We’ve invested heavily into a single-pane-of-glass SIEM solution with a data pipeline that collects everything we want for the security team, the networking team, the application teams, etc., but it’s getting really expensive and now that we’re migrating to cloud, it’s just overwhelming. What can we do?”

First, don’t stop the cloud migration; go faster. But don’t bring your on-prem security program or data architectures with you. Large on-prem networks may generate vast amounts of security-relevant data (NetFlow, Windows event logs, large-scale firewalls, etc.), but they operate at a fundamentally different scale than IaaS and PaaS logs. BlueVoyant has studied hundreds of on-prem environments and done enough statistical analysis to conclude that, in large part, the amount of security data an environment generates correlates with the number of operating systems in it (production and development servers, the DMZ, the corporate endpoint environment, etc.).

Cloud infrastructure and platforms completely change the game. Now you don’t just have instances; you have instances, containers, and ephemeral assets that are spun up just long enough to run a massive compute job and then torn down, the entire lifecycle of which can be less than an hour. As a result, where a 400-person company might have generated around 50GB of security-relevant data per day in its traditional on-prem network, that same organization in an all-cloud environment with a massive big-data analytics application could spin up 5TB of meaningful security telemetry - with just one application (and just in the production environment!).

This means the centerpiece of your security program, your SIEM - where you aggregate so much of that meaningful data that not only does your security department have visibility, but your network and app teams also have a great source of visibility for their own use cases - will either collapse under the strain of scale or become a licensing pain point.

So let’s cut forward to this new, all-cloud world and throw out the old ways. App devs and DevOps folks still want to log everything (we can debate the verbosity, but let’s just settle on “it’s going to be a lot”). We need a framework to prioritize how we manage the security-relevant subsets of this data, including specific fields within raw log sources. We suggest the following prioritization framework, with a brief classification sketch after the list:

Priority 1: Detection and Response Data
These are the log sources, and the specific fields within those log sources (forget the other fields), that drive security alerts for continuous, active threat detection (correlations, UEBA analytics, etc.).

Priority 2: Investigation Data
This data is often a bit more raw, can be used to provide additional context to an alert, and is appropriate for “ad hoc” searches of the kind that come up during a deep investigation at the Tier 3 SOC level, or even at the threat hunter level.

Priority 3: Raw Security Data
These are the full, pristine, raw log sources that contain all the data above, plus a lot of data that may have no security relevance. It’s often kept for two important reasons:
1. Compliance: regulators like NYDFS require raw data to be retained for upwards of three years.
2. Forensics: when law enforcement and/or legal are involved, pristine raw data may be necessary to preserve the forensic integrity of an investigation into a major breach or other cyber event.

Priority 4: Non-security Data
This is data from your non-security sources: app debug logs and the like - high volume, high variety, no known security value. It is best routed to an alternative data platform, optimized for lower cost at higher volumes.
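To make the framework concrete, here’s a minimal sketch of how a pipeline might tag incoming sources against these four categories. The source names, field lists, and the classify() helper are hypothetical illustrations, not any particular product’s schema:

```python
# Hypothetical mapping of log sources (and the fields worth keeping) to the
# four priority categories above; every name here is illustrative only.
PRIORITY_MAP = {
    # 1. Detection and response data: keep only the fields that drive alerts
    "cloud_audit":     {"priority": 1, "fields": ["eventName", "principal", "sourceIp"]},
    "idp_auth":        {"priority": 1, "fields": ["actor", "outcome", "clientIp"]},
    # 2. Investigation data: rawer context for ad hoc Tier 3 / hunt searches
    "vpc_flow":        {"priority": 2, "fields": None},  # None = keep all fields
    "dns_queries":     {"priority": 2, "fields": None},
    # 3. Raw security data: pristine copies kept for compliance and forensics
    "cloud_audit_raw": {"priority": 3, "fields": None},
    # 4. Non-security data: app debug logs and similar high-volume feeds
    "app_debug":       {"priority": 4, "fields": None},
}

def classify(source: str) -> int:
    """Return the priority category for a source, defaulting to 4 (non-security)."""
    entry = PRIORITY_MAP.get(source)
    return entry["priority"] if entry else 4
```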

Data from the top-priority category is best routed to a SIEM on a daily basis, but data from categories two through four need not be. Note that SIEMs are operational data platforms optimized for either rapid search or real-time correlation. They are not “data lakes”; in general, they rely on either tabular structures or indexes to drive search and/or alert generation. Adding data that isn’t analyzed on a regular basis not only adds to ingestion-based license costs but bogs down the engines that run search/correlation with extraneous data that “might come in handy at some point.”
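To see what’s at stake, a rough back-of-envelope comparison - using an assumed, purely illustrative per-GB ingest price - shows the gap between ingesting everything and ingesting only category one:

```python
# Back-of-envelope only: the price and the detection-subset size are assumptions.
PRICE_PER_GB = 1.50          # assumed ingestion-based license cost, $/GB/day
total_telemetry_gb = 5_000   # the 5TB/day all-cloud figure from above
detection_subset_gb = 250    # assume ~5% of sources/fields actually drive detections

everything = total_telemetry_gb * PRICE_PER_GB
tiered = detection_subset_gb * PRICE_PER_GB
print(f"Ingest everything: ${everything:,.0f}/day; tier it: ${tiered:,.0f}/day")
# Ingest everything: $7,500/day; tier it: $375/day
```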

So what is the alternative? Since roughly 2016, many large global enterprises with all-cloud or heavy-cloud footprints have embarked on a “two-tier” model, where they ingest everything, but do so through a “smart pipeline” that routes data to different destinations based on the nature of the source. Data of type one above goes to a SIEM, while data of types two through four goes to an alternative platform with its own search capabilities - sometimes a native IaaS tool or an open-source stack, sometimes an alternative offering from the same SIEM vendor.
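As a sketch of what such a smart pipeline’s routing step could look like - the destination functions and event shape here are hypothetical stand-ins for your SIEM’s ingest API and your cheaper storage tier:

```python
from typing import Callable

def send_to_siem(event: dict) -> None:
    """Placeholder for the SIEM's ingest call (HTTP event collector, agent, etc.)."""

def send_to_data_platform(event: dict) -> None:
    """Placeholder for a write to the cheaper tier (object storage, data lake table, etc.)."""

# Categories one through four from the framework above: only category one
# lands in the SIEM; everything else goes to the lower-cost platform.
ROUTES: dict[int, Callable[[dict], None]] = {
    1: send_to_siem,
    2: send_to_data_platform,
    3: send_to_data_platform,
    4: send_to_data_platform,
}

def route(event: dict) -> None:
    """Route a parsed event by the priority tag assigned at classification time."""
    ROUTES.get(event.get("priority", 4), send_to_data_platform)(event)
```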

From there, organizations take one of three options to leverage these disparate systems to create “a single pane of glass.” We outline them, along with their tradeoffs, below:

Option 1: “Search-Bridge”
Description: This option typically involves using an existing, or building a new, data-connection “add-on” for the SIEM. These can be as complex as a custom REST API endpoint or as simple as a custom search command. In either case, it’s bolted onto the SIEM and used to directly query the “secondary” data platform.
Considerations: This is code! It needs to be updated and maintained in tandem with BOTH the tier one and tier two data platforms.

Option 2: “Rehydration”
Description: Rehydration is a cover term implying that data is “moved” from a low-cost, low-performance tier to a higher-cost, higher-performance tier. In the case of a SIEM and cheaper storage, the implication is that data is copied from the cheaper storage into the index/tables of the more expensive platform on demand.
Considerations: Time and money - data needs to “move” (not necessarily physically, but logically) between tiers, and the movement from tier two to tier one will incur any ingest- or capacity-based costs.

Option 3: “Search in Place”
Description: This is where we “leave the data sets where they are” and simply leverage the native query languages of the platforms in which they reside. For example, run SQL in your SIEM, and if the rest of the data is in a system like Hive, use HQL to query in place.
Considerations: Cross-dataset correlation and search management. First, in our described security use case, data is segmented logically by value and presumably by source (e.g., put DNS in tier two). If the goal is to correlate a disparate data source in tier one with something in tier two, there is no guarantee of success. Second, this may create the need to manage “translations” of queries between platforms (e.g., a proprietary query language for tier one and SQL for tier two). While translation engines exist for many such search languages, edge cases often exist, so ensuring consistency requires continuous human review to avoid translation issues.

Option three leaves a lot to be desired from a security investigation standpoint. Incident Responders and Security Operators will want to query all of the data “in one place” to connect the dots during their investigations. One could use this method for alert generation, but correlation-based alerts will be limited due to the lack of integration between data platforms.

Option one is optimal for security, although it relies on custom code development and management, along with tight coordination with both platform vendors. It also assumes that the bridge between platforms is robust enough to support the arbitrary, ad hoc queries that come up during an investigation; most such bridge connections are limited to data/query subsets and/or require the analyst to have a deep understanding of both the tier one and tier two platforms (we find this to be rare; most security operations teams have normalized around one platform and query language of choice and built their search/investigation competencies there). It’s also fairly common to incur ingest charges at the tier one level when writing a “bridge search,” so bear that in mind during your analysis.
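For a feel of what a search-bridge amounts to under the hood, here’s a minimal sketch from the SIEM side, assuming a hypothetical tier two platform that exposes a REST search API (the endpoint, parameters, and response shape are all illustrative):

```python
import requests  # standard HTTP client; pip install requests

# Hypothetical tier two search endpoint; real bridges (custom search commands,
# federated search add-ons) wrap exactly this kind of call.
TIER_TWO_SEARCH_URL = "https://tier2.example.internal/api/v1/search"

def bridge_search(query: str, earliest: str, latest: str, token: str) -> list[dict]:
    """Run an ad hoc query against the secondary platform and return rows
    the SIEM can render alongside its own results."""
    resp = requests.post(
        TIER_TWO_SEARCH_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={"query": query, "earliest": earliest, "latest": latest},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]  # assumed response shape for this sketch
```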

Option two may involve some sacrifice of both time and money, although both can be mitigated with the appropriate balance of technical architecture and risk management. An organization may say, “we accept the risk of pulling DNS into the platform only when needed,” knowing that rehydration may take hours, because any investigation involving correlation with DNS is rarely one where time is of the essence. Costs can be mitigated with the same analysis (i.e., “if we need it, that means it’s worth it”), as well as through the right application of technologies to balance ingestion times and costs.
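As a sketch of the on-demand rehydration step itself, assuming raw logs parked in S3-style object storage and a hypothetical SIEM HTTP ingest endpoint (the bucket layout and endpoint are illustrative; the boto3 and requests calls are standard):

```python
import boto3
import requests

s3 = boto3.client("s3")

def rehydrate(bucket: str, prefix: str, siem_ingest_url: str, token: str) -> int:
    """Copy raw objects under a time-bounded prefix (e.g. 'dns/2021/09/12/')
    from cheap storage into the SIEM's ingest endpoint on demand. Returns the
    number of objects moved; each one incurs tier one ingest cost."""
    moved = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            requests.post(
                siem_ingest_url,
                headers={"Authorization": f"Bearer {token}"},
                data=body,
                timeout=120,
            ).raise_for_status()
            moved += 1
    return moved
```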

The choice of platform(s) for your first and second tiers will ultimately drive your decision between options one and two. For example, Microsoft has excellent options for bridging Azure Sentinel SIEM search with Azure Log Analytics, and Splunk has developed search bridges to Hadoop, ELK, etc., and is moving toward bridging with even more platform types (AWS S3, etc.).

A recent model that we’ve adopted is a more modular architecture that leverages the strengths of multiple platforms and enables maximum flexibility in data management. For example, consider decoupling data collection and management from the analytics platforms themselves. In this model, data can be collected and routed to the appropriate destinations - at the appropriate time and with the appropriate context - as needed for each use case. We typically visualize this in the context of how we deliver our security capabilities to our clients in a similarly modular fashion.

However, the concept of an analytics platform can span multiple tools and multiple cloud environments, or live within the same vendor ecosystem (such as Splunk Cloud with SmartStore and Azure Log Analytics). In the end, the decision comes down to how you choose to balance risk appetite, budget, and resource availability.