Seattle, WA

Amazon is a company operating a marketplace for consumers, sellers, and content creators.

Senior Research Scientist for Network Anomaly Detection

We are part of the AWS Networking organization. Our mission is to process network telemetry messages and interpret them so that we can monitor the network effectively. Our goal is to detect impact to customer traffic and fix the root cause within seconds. The AWS network is the largest and fastest-growing network in the world. The customer traffic we monitor is your traffic, because thousands of the apps and websites you use are built on AWS.

Our traditional monitoring services are critical to the smooth running of the network, and they are truly large-scale, processing over 30 million observations per second. The services are predominantly written in Java on Linux and they are large, even by Amazon standards. They are distributed over thousands of hosts in hundreds of global locations and operate at higher than "five nines" availability. In 2018 we began to incorporate anomaly detection techniques into our suite. We are using Data Science and Machine Learning (ML) approaches such as Exponential Smoothing, Distribution Modelling, Clustering, and Spatial Cosine Similarity. We have put these techniques into production and can now detect issues that were previously undetectable, for example by dynamically choosing the right threshold for an alarm covering a million ports, forecasting the traffic level of an internet exchange, or finding a rare natural-language log message among a corpus of billions. We do all of this on live time series data.

With the success of anomaly detection in 2018, we are doubling down. In 2019 we will finish implementing six separate anomaly detector services and plug them into our "fire hose" of metric observations. We will build a supervised machine learning system that will ingest an expected one million anomalies per minute and make sense of them for operators. We will use statistical techniques to learn associations between anomalies, alerts, and external factors. These associations will become rules in an expert system that we will build, and it will increasingly assist humans in making associations and decisions about the relationship between alerts and anomalies. We will apply unsupervised machine learning algorithms to cluster this data into incidents. Those incidents will then largely be managed by our autonomous response system. Where necessary, a small number will be escalated to humans, and the system will continue to learn from their actions, which label the data so it can be modeled better.

We are looking for a Senior Research Scientist to join us in Seattle to work on this mission.

