Analyzing network data can be a daunting task. Some days you end up feeling like Sisyphus the Network Engineer. I’d like to talk about an approach to network analytics that we have been exploring at ESnet based on using cloud services. The cloud isn’t a magical solution, of course, but it does provide quite a bit of leverage. It’s been a bit of a journey for us and if you go this route it will definitely take some work. I think the effort is worth it – the view at the end is really quite nice.

This post is the companion to the talk I gave at CHI-NOG 07. The slides are here and I will add a link to the video once it is available.

Hard Realities

In principle, network telemetry and analytics seem quite simple. In fact, at first glance, it seems more the domain of an accountant than a network engineer. Appearances are deceiving, however. There are four hard realities we need to discuss which will set the stage for understanding our approach.

Reality #1: It’s a mess

The first hard reality is that nothing is as clean and simple as we hope it would be. In fact, it’s a mess. Telemetry often seems to be the last thing a vendor adds support for. I suppose this makes sense (the features need to work if they’re going to sell boxes), but it’s frustrating for those of us who have to measure things. We get stuck with proprietary MIBs that don’t have clear semantics, and some things can’t easily be measured through SNMP at all, which means screen scraping. Even if we get clear descriptions from the vendors and don’t have to scrape, the data still isn’t neatly bundled into evenly spaced measurements. There’s the annoying reality that we lose some measurements. And once in a while there is the joy of discovering that a counter which is guaranteed to increase monotonically sometimes decreases a little bit and then jumps back up.

If we build our tools assuming these kinds of problems don’t happen, we end up hacking in quick fixes in an ad-hoc way, which quickly leads to unmaintainable code.

The world is not neat and tidy. We know this and more or less accept it as true. When it comes to our data, however, we sometimes make the mistake of assuming that it will be neat and tidy. It won’t be. Your data (and my data) are not neat and tidy. This is something we need to accept as a reality and incorporate into our design.

Reality #2: Things change

What you needed yesterday probably isn’t what you’ll need tomorrow, yet you have to build something today that will work tomorrow. The telemetry and analysis of today often continue to be useful, but just as often new kinds of telemetry and analysis become necessary as time progresses.

The second hard reality is that things change. To make things worse, you have no real way of knowing what you’ll need tomorrow. Sure, you can make some good guesses, but there are always surprises.

Don’t make decisions today that you can’t unmake tomorrow. Save your raw data, all of it if possible. If you don’t save the raw data, you only have the processed data, which may or may not include the details you need for future analysis. Resist the urge to format the data in a way that seems convenient today, especially if that formatting removes parts of the raw data.

There are, of course, two major caveats here. The first is that if saving the raw data presents a security risk, it may be better to remove the sensitive parts. The second is that you may be subject to a data retention policy that limits how much data you can save. Even with these two caveats in mind, the general idea still applies.

Finally, saving raw data is a great hedge against mistakes. If you have the raw data you can always recompute some analysis if you find that the original implementation of the analysis has a bug. If you discard the raw data, you’re out of luck.

Reality #3: There’s always more

Great, you’ve now built a system that is flexible enough to take on future needs as well as current ones. The problem is that it runs on hardware you ordered a few years ago. It chugs along fine until one month you need to add collection of a new set of measurements from all of your devices, or a whole bunch more devices come online that you need to measure. It doesn’t scale.

The third hard reality is that there’s always more. Demand will always increase: there are always new devices coming online and more telemetry you can collect. Meanwhile, your storage and compute will be the same size tomorrow as they are today.

If you run your own infrastructure you’ll have to upgrade it eventually. If you over-provision it, you’ll pay power and maintenance on equipment you aren’t fully using. And most importantly: are you a network engineer or a sysadmin? Is running this kind of infrastructure really your strength? By using cloud infrastructure you spend far less time worrying about your pet machines; instead you have access to a huge pool of livestock.

The cloud is elastic: it allows you to grow as demand increases and to pay only for what you use. Perhaps you need to run a large job to answer a question for the CTO by the end of the week; you can simply fire up more instances, run your job, and shut them off once you have the analysis you need. The cloud is effectively infinitely large, in the sense that it will always have more capacity than you can consume.

Reality #4: It’s never really done

Our aspirations often exceed our resources, and sometimes our abilities. This is just part of life. Technology has greatly increased our reach, but we’re still limited. Further, the situation on the ground changes, and that often causes changes in priorities.

The fourth hard reality is that nothing is ever really finished. Even if it is finished for a short time, something will change and it will need to be amended. Resources are always limited. As a result, I think it’s important to focus on the “what” (what you want to extract from your telemetry) rather than the “how” (the details needed to run your analysis). The “how” is a distraction, sometimes a fun one, but the more time you spend worrying about the “how” the less time you have for the “what”. After some initial setup, the cloud gives you the opportunity to focus more on the “what”: the analysis you need to run your network more efficiently.

Coping Strategies

I’ve talked about these four hard realities and proposed some coping strategies. To recap:

  • It’s a mess: don’t assume it will be anything but a mess. Design your system to embrace that fact, and don’t assume things will be uniform or simple.

  • Things constantly change: don’t make decisions today that you can put off until tomorrow. In particular, keep raw data so that you can look for different things tomorrow than you are looking for today.

  • There will always be more: more data to collect and analyze, more questions to be answered. The cloud provides elasticity, pay-for-what-you-use pricing, and near-infinite scalability.

  • It’s never really done: there are always resource limits. Maximize the amount of time you spend on the “what” and minimize the time you spend on the “how”. Prefer declarative solutions over imperative solutions.

A programming model brings solace

The second part of our journey towards doing network analytics in the cloud comes in the form of a programming model: Apache Beam. Beam incorporates three components:

  • A programming model
  • SDKs for writing Beam pipelines
  • Runners which work on top of existing distributed processing backends

The goal of Beam is to provide a common abstraction for describing distributed processing jobs while remaining agnostic about the underlying execution engine. You can run it on premises using one of several runners that work with other Apache projects such as Spark, Flink and others. You can run it on top of Google’s Cloud Dataflow. You can even run it locally using the DirectRunner, which is primarily useful for development and testing.

Beam gives you control over how data is processed and provides semantics similar to both traditional batch systems and streaming systems. In particular, it provides ways to answer each of these questions:

  • What results are calculated?
  • Where in event time are results calculated?
  • When in processing time are results materialized?
  • How do refinements of results relate?

The first question is the most fundamental: what is it that you want to extract from your data? The second brings to the forefront that results are computed according to the timestamp of the event rather than the point in the incoming data stream where that event is encountered. The third gives you control over when, during the processing run, results are emitted, which lets you tune your results towards completeness or towards lower latency. Finally, the fourth question covers how Beam updates results that have already been emitted.
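
To make these questions concrete, here is a minimal sketch of how the Beam Python SDK expresses them through windows, triggers and accumulation modes. Everything in it is illustrative: the interface names, rates and timestamps are synthetic, and a reasonably recent version of the Python SDK is assumed.

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

pipeline = beam.Pipeline('DirectRunner')

(pipeline
 # Synthetic sample: (interface, bits per second, unix timestamp) readings.
 | 'create' >> beam.Create([
       ('xe-0/0/0', 1.0e9, 1500000000),
       ('xe-0/0/0', 2.0e9, 1500000030),
       ('xe-1/0/0', 5.0e8, 1500000060),
   ])
 # Attach each reading's event timestamp to the element.
 | 'stamp'  >> beam.Map(lambda t: window.TimestampedValue((t[0], t[1]), t[2]))
 # Where in event time: five-minute fixed windows based on those timestamps.
 | 'window' >> beam.WindowInto(
       window.FixedWindows(300),
       # When in processing time: emit once the watermark passes the window end.
       trigger=AfterWatermark(),
       # How refinements relate: later firings for a window replace earlier ones.
       accumulation_mode=AccumulationMode.DISCARDING)
 # What is calculated: total rate per interface within each window.
 | 'sum per interface' >> beam.CombinePerKey(sum)
 | 'save'   >> beam.io.WriteToText('./windowed_totals'))

pipeline.run()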

Making the ascent with code

I mentioned that this wouldn’t be a magic solution to all your problems. This is where some exertion can pay off. It may take a little mental effort to understand the tooling and build something with it, but doing so provides a great deal of leverage and an improved point of view.

Immutable data store of raw data

I’ve said a lot about keeping raw data, and I’m not quite done. The fundamental message here is to do as little processing of your data as you can before you save the first copy. Get it written to a database or disk as soon as you can. Avoid the urge to mess with it, and don’t even think about changing it after the fact. It’s called the immutable data store because it should form the unchanging basis for all your analysis. Changing it is an unnecessary risk.

The two big benefits here are:

  1. You can change your mind later: because the raw data is no different now than it was when you collected it, you can choose to look at different parts of it than you did before.

  2. If you make a mistake in your analysis you still have the raw data, so you can rerun the analysis with the bugs fixed and get an improved result. Sure, this might incur some cost, but at least you still have the option, and compute costs really aren’t very high when compared to developer time.

Views: how to turn raw data into usable data

Raw data is great for machines but it isn’t so great for humans, and ultimately it’s humans who want to derive insights from all this data. The way to bridge that gap is a view.

A view takes raw data and creates a new dataset with more detail or presented in a more usable way. For example, you rarely care about the total number of bytes that have gone through an interface; instead you want to know at what rate bits are going through it. You don’t want to remember that ethernet 0 is ifIndex 37; you just want to see xe-0/0/0.

Views take the raw data and produce something that provides more insight.

These views can be stacked on top of each other. Say you have a view with the counters converted to rates and the interface names in place of the ifIndex; you could use this view to create further views that aggregate interfaces together by peer or customer.

The examples I’ve given are SNMP based because that is our current development focus, but this is also applicable to flow analysis – in fact it will really shine for flow. You can imagine making a view that takes a set of customer or peer prefixes and creates a traffic matrix. The nice thing is that since you have the raw data you can go back in time and create new views that use older data.
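
To make the stacking idea concrete, here is a hedged sketch of a second-level view built on top of a hypothetical per-interface rate view. The CSV layout, file names and the interface-to-customer mapping are all invented for illustration; it simply rolls per-interface rates up into per-customer rates.

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

# Hypothetical mapping from (device, interface) to customer; in practice this
# might come from a provisioning system or yet another view.
CUSTOMER_MAP = {
    ('router1', 'xe-0/0/0'): 'customer-a',
    ('router1', 'xe-1/0/0'): 'customer-b',
}

def to_customer_rate(line):
    # Assumed layout of the per-interface rate view: timestamp,device,ifName,rateIn
    timestamp, device, if_name, rate_in = line.split(',')
    customer = CUSTOMER_MAP.get((device, if_name), 'other')
    return ((customer, int(timestamp)), float(rate_in))

pipeline = beam.Pipeline('DirectRunner')

(pipeline
 | 'read rate view'   >> ReadFromText('./rates_by_interface.csv')
 | 'customer key'     >> beam.Map(to_customer_rate)
 | 'sum per customer' >> beam.CombinePerKey(sum)
 | 'format'           >> beam.Map(lambda kv: '{},{},{}'.format(kv[0][0], kv[0][1], kv[1]))
 | 'save'             >> WriteToText('./rates_by_customer'))

pipeline.run()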

Core Beam Abstractions

In order to understand how we use Beam to analyze network data we’ll need to understand four key abstractions from Beam.

  1. PCollections A PCollection is a distributed multi-element dataset. PCollections are how Beam presents data as input and output and in between every stage in a pipeline.

  2. Transforms Transforms are the code that make up each stage of the pipeline. You will write your own Transforms to accomplish your analysis. There are also many predefined Transforms. Transforms take a PCollection as input and produce a PCollection as output.

  3. Pipeline I/O Pipeline I/O is how you access data and how you save results. Pipeline I/O will read in data and produce a PCollection or will take a PCollection and write output.

  4. Pipelines Pipelines are the composition of I/O, PCollections and Transforms. These are similar in many ways to Unix shell pipelines. Pipelines can be linear or can have an arbitrary number of branches.
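
As a quick sketch of how these four pieces fit together, and of a pipeline that branches rather than being purely linear, here is a minimal hypothetical example (the file names are made up): Pipeline I/O reads a file into a PCollection of lines, and that single PCollection feeds two branches, each applying its own Transform and writing its own output.

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.transforms.combiners import Count

pipeline = beam.Pipeline('DirectRunner')

# Pipeline I/O: read a (hypothetical) input file into a PCollection of lines.
lines = pipeline | 'read' >> ReadFromText('./small_example.csv')

# Branch 1: a Transform that counts the lines, written to its own output.
(lines
 | 'count lines' >> Count.Globally()
 | 'save count'  >> WriteToText('./line_count'))

# Branch 2: a different Transform applied to the same PCollection.
(lines
 | 'uppercase'   >> beam.Map(lambda line: line.upper())
 | 'save upper'  >> WriteToText('./uppercased'))

pipeline.run()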

f(allOfMyData): Focus on “what”, not “how”

You’ve decided you need to run a new kind of analysis on your network telemetry. The first step is to define some function f and apply it to all of your data. This is exactly the task Beam was designed for. You define what you want to do and Beam works out how to push all of your data through your code. For clarity, you’ll define a set of functions which you will compose into a pipeline and execute the pipeline over your data. A pipeline is a kind of functional composition.

When building a pipeline you define the what:

  • I/O: What data do you want to analyze? Specify where it lives.
  • PCollections: What is the best shape for the data during each step of the analysis?
  • Transforms: What analysis do you want to perform? Implement the proper Transforms.

Let Beam handle the how:

  • I/O: Once given a data source and a sink, Beam will fire up enough workers to load the data and create a PCollection of the input. Beam also takes care of writing the output.
  • PCollections: By using the PCollection abstraction to model your data Beam can determine how to deliver your data to the appropriate workers.
  • Transforms: This code is distributed to the appropriate workers and executed on the PCollections.

Example: Computing Total Traffic

Here’s a short pipeline in Beam. Don’t worry if you don’t get all of the details on the first pass; the goal is to understand, at a high level, how a pipeline is constructed.

This pipeline will sum all the inbound traffic from all the interfaces given to it. Perhaps the list of interfaces is all external-facing interfaces, or perhaps just those belonging to one customer or peer.


import logging

import apache_beam as beam
from apache_beam.io import ReadFromText


class FormatCSVDoFn(beam.DoFn):
    """
    Parse each CSV line into a dictionary of typed fields.
    """
    def __init__(self):
        super(FormatCSVDoFn, self).__init__()

    def process(self, element):
        fields = element.split(',')

        try:
            ret = dict(
                timestamp=int(fields[0]),
                device=fields[1],
                ifName=fields[2],
                ifHCInOctets=int(fields[3]),
                ifHCOutOctets=int(fields[4]),
            )

            yield ret
        except ValueError:
            logging.info('filtered out %s', fields)

def compute_rate(data):
    # data is a (key, measurements) pair from GroupByKey; sort this
    # interface's measurements by timestamp before computing deltas.
    rows = sorted(data[1], key=lambda r: r['timestamp'])
    out = []
    for i in range(1, len(rows)):
        current = rows[i]
        previous = rows[i-1]
        delta_t = float(current['timestamp'] - previous['timestamp'])
        current = dict(
            timestamp=current['timestamp'],
            device=current['device'],
            ifName=current['ifName'],
            rateIn=(current['ifHCInOctets'] - previous['ifHCInOctets']) / delta_t,
            rateOut=(current['ifHCOutOctets'] - previous['ifHCOutOctets']) / delta_t
        )
        out.append(current)

    return out

def group_by_device_interface(row):
    return ("{} {}".format(row['device'], row['ifName']), row)

def main():
    pipeline = beam.Pipeline('DirectRunner')

    (pipeline
     | 'read'               >> ReadFromText('./small_example.csv')
     | 'csv'                >> beam.ParDo(FormatCSVDoFn())
     | 'ifName key'         >> beam.Map(group_by_device_interface)
     | 'group by iface'     >> beam.GroupByKey()
     | 'compute rate'       >> beam.FlatMap(compute_rate)
     | 'timestamp key'      >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
     | 'group by timestamp' >> beam.GroupByKey()
     | 'sum by timestamp'   >> beam.Map(lambda deltas: (deltas[0], sum(deltas[1])))
     | 'format'             >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
     | 'save'               >> beam.io.WriteToText('./total_by_timestamp'))

    result = pipeline.run()
    result.wait_until_finish()


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    main()

Let’s take a look at the main() function. The first thing to note is that the text between the | and the >> is a description of what that pipeline stage does. The part after the >> is the code that defines that stage. The description helps readers understand what the pipeline does, and in some runners it is used to provide per-stage statistics.

Let’s look at each step of the pipeline:

  1. read We read data, in this case it’s from a CSV but this would really be your immutable data store / data warehouse.

  2. csv Takes each line of text from the input PCollection and converts it into an object with the appropriate fields.

  3. ifName key This produces a key value pair which we use in the next step. The key is the combination of the device name and the interface name.

  4. group by iface We use the built-in beam.GroupByKey() to group the measurements by device name and interface name.

  5. compute rate We get the raw counter measurements, compute deltas and divide by the collection interval. This produces a PCollection of Python dictionaries that hold the details for each datapoint.

  6. timestamp key Once we have the rates, we produce a new PCollection keyed by timestamp, with each interface’s inbound rate as a value.

  7. group by timestamp We once again use beam.GroupByKey() to group everything by timestamp so that we can sum up across each interface at each timestamp.

  8. sum by timestamp Sum all the interfaces at each timestamp.

  9. format Prepare the data for writing.

  10. save Write the output.

For just a few interfaces this isn’t particularly impressive, but if you needed to handle thousands of interfaces Beam would automatically scale to an appropriate number of workers and execute this code across all of them for you.

Note This code will only work properly when executing locally using DirectRunner. The issue is that compute_rate will run on multiple nodes in the cloud, so the data won’t all be available to every instance of the function. This can be solved by using a Combine operation. I’ll post a more detailed look at the code with the solution for that problem in the next few days. I’m contradicting myself here, I suppose, by ignoring the hard truth of messiness for the sake of a clear example.
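
That corrected version isn’t reproduced here, but to give a flavor of what a Combine operation looks like, the sketch below shows how the final 'group by timestamp' and 'sum by timestamp' stages of the example could be collapsed into a single beam.CombinePerKey(sum), which runners can partially pre-aggregate on each worker before shuffling. The sample values are made up, and this is only an illustration of Combine; it does not by itself fix the compute_rate issue described above.

import apache_beam as beam

pipeline = beam.Pipeline('DirectRunner')

(pipeline
 # Hypothetical (timestamp, rateIn) pairs, standing in for the output of the
 # 'timestamp key' stage in the example above.
 | 'create' >> beam.Create([(1500000000, 10.0), (1500000000, 20.0), (1500000030, 5.0)])
 # A single Combine replaces the 'group by timestamp' + 'sum by timestamp' pair;
 # runners can partially combine values on each worker before shuffling.
 | 'sum per timestamp' >> beam.CombinePerKey(sum)
 | 'save' >> beam.io.WriteToText('./total_by_timestamp_combined'))

pipeline.run()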

Good. Fast. Cheap. Pick two.

Now that we’ve seen an example of a pipeline, let’s talk a little bit about the tradeoffs involved in using Beam pipelines. There are always tradeoffs; they are a fundamental feature of engineering.

The three main tradeoffs to consider with Beam pipelines are:

  1. Completeness
  2. Latency
  3. Cost

If you need a complete answer you need all of the data. However, sometimes quick access to results is more important than completely correct answers. Finally, you can make some gains by spending more money. The specific balance between these three depends very much on the needs of your organization.

Usually completeness is achieved by running jobs in batch mode: you wait until you are reasonably sure that you have all of your data, then run a job that loads all the data at once and analyzes it. Low latency, on the other hand, is usually achieved by running jobs in streaming mode: instead of waiting until you have all of the data, you perform your computations as each data point arrives. Often this means that your answers aren’t completely correct, but they are available much more quickly.

Beam provides abstractions for dealing with both of these data processing styles. In fact it doesn’t really view the world in terms of batch and stream; instead it views the world in terms of bounded and unbounded data sets. Bounded data sets have a known size and can be processed all at once, whereas unbounded data sets have an unknown size and may continue indefinitely. The Beam model generalizes over these two kinds of data sets.

Stream or Batch?

Streaming for real time insight

  • Current network load
  • Operational awareness
  • Threat detection / Anomaly Detection
  • Tend to be more expensive (VMs always running)

Batch for precision or results that can wait

  • Billing
  • Precise traffic reports
  • Capacity planning
  • Tend to be cheaper (VMs used in bursts)

A scenic vista

The view from here

We can see how this approach helps to address each of the hard realities I described at the beginning:

  • It’s a mess Having the raw data means that we don’t try to sanitize it at collection time but instead do that as part of a transformation to a view. If we find new messes in the data we can update our view code and rerun it.

  • Things change We can unmake today’s decisions tomorrow because we have raw data. This means we can create new views to get new perspectives.

  • There’s always more We can collect more telemetry by adding new immutable data sets to gain more dimensions of data (SNMP, Flow/sFlow, syslog, …). The cloud allows us to scale up or down as needed and we don’t have to spend our time worrying about the details of storing the data.

  • It’s never really done Because we have a clearly defined approach and access to resources to analyze data on demand it is possible to try out new ideas quickly and cheaply. We can also teach others to use these tools to write their own analysis.

Our experiences so far

  1. There is a learning curve. Beam provides a lot of leverage, but you need to take the time to learn it.
  2. The documentation isn’t amazing, but it’s getting better all the time. The community is very enthusiastic and helpful.
  3. You may have to adjust your thinking: you need to understand the model to know what will work at scale. For example, the Combine issue I mentioned in the code example was a definite head scratcher the first time we ran into it.
  4. The cloud providers have several choices when it comes to databases. It’s easy to spend a lot of time investigating.
  5. Cost is manageable but it’s good to keep an eye on it.
  6. In the interest of vendor neutrality, details about our specific vendor haven’t been covered. In particular, I’ve been rather vague about how we build our immutable data store and which execution engine we use. I hope to cover those details in a future post.

Further Reading

The Beam web site is a great place to find more information. They have a lot of additional learning resources. I particularly recommend two articles written by one of the core Beam developers.

Our Beam code is not open source at this point. You can learn more about ESnet’s other open source software and see it in use on our network visualization portal.

Thanks

Thanks to Peter Murphy and David Mitchell for the many discussions and refinements to these ideas over time. Thanks to Melissa Stockman, Sowmya Balasubramanian and Monte Goode, who worked hard on the prototype implementation of these ideas. Thanks to Michael Dopheide and Becky Lomax, who proofread this post. Many of the ideas presented here were first introduced to me by the Lambda Architecture and the Big Data book by Nathan Marz and James Warren.