Ebrahim & Jisheng - Automated Pipeline for Large-Scale Neural Network Training and Inference

  • Published on Sep 6, 2024
  • Speakers:
    * Ebrahim Safavi - Senior Data Scientist, Mist, a Juniper company
    * Jisheng Wang - Senior Director of Data Science, Mist, a Juniper company
    Abstract: Anomaly detection models are essential to running data-driven businesses intelligently. To manage tens of thousands of anomaly detection models at Mist, we have built a cloud-native, scalable ML training pipeline that automates all steps of ML operations, including data collection, model training, model validation, model deployment, and version control. The inference workflow is decoupled from the training process to increase agility and minimize model-serving delay.
    Motivated by the recent impressive performance of recurrent neural networks (RNNs) on a wide spectrum of tasks, we have developed confident deep bidirectional long short-term memory (BiLSTM) models that leverage a large amount of data across numerous dimensions to capture trends, catch anomalies across thousands of Wi-Fi networks, and address issues in real time. The proposed BiLSTM models can predict the uncertainty of their own detections, which is essential for anomaly detection.
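To illustrate why a predicted uncertainty matters for detection, here is a minimal sketch in Python. It is not the speakers' BiLSTM model: it assumes, purely for illustration, that a model outputs a predictive mean and standard deviation for each metric and that an observation is flagged when it falls more than a chosen number of standard deviations from the mean. The function names and the default threshold of 3.0 are hypothetical.

```python
def anomaly_score(observed, predicted_mean, predicted_std):
    """Absolute z-score of an observation under the predicted distribution.

    Assumption: the model supplies a mean and a positive standard deviation.
    """
    if predicted_std <= 0:
        raise ValueError("predicted_std must be positive")
    return abs(observed - predicted_mean) / predicted_std


def is_anomaly(observed, predicted_mean, predicted_std, threshold=3.0):
    """Flag the observation when its z-score exceeds the (hypothetical) threshold."""
    return anomaly_score(observed, predicted_mean, predicted_std) > threshold


# The same 20-unit deviation is flagged only when the model is confident.
print(is_anomaly(120.0, 100.0, 5.0))   # z-score 4.0 -> True
print(is_anomaly(120.0, 100.0, 10.0))  # z-score 2.0 -> False
```

The point of the sketch is that a fixed deviation can be normal on a noisy network and anomalous on a stable one, so the decision depends on the model's own uncertainty estimate rather than on the raw error alone.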
    In addition, to address the challenges that the stochastic nature of unsupervised anomaly detection imposes on the workflow pipeline, we have developed novel statistical models for the training workflow that leverage historical data to automate model validation, deployment, and version control.
    The anomaly detection service runs hourly and the training jobs run weekly through the pipeline, which consists of several steps, including managing the training and serving data streams, versioning models for prediction, and training and serving each network's model. The workflow pipeline uses several technologies, including Secor, Amazon S3, Apache Spark on Amazon EMR clusters, Apache Kafka, and Elasticsearch.
    In this talk, we first briefly discuss the details of the unsupervised confident deep multivariate models we have built to automatically detect Wi-Fi network issues. Then, we dive deeper into the details of our cloud-based pipeline and how we use relative entropy to automate the training workflow. Finally, we share lessons learned and insights, specifically how to productize and monitor thousands of ML models to automate anomaly detection.
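The abstract does not say which distributions are compared with relative entropy, so the following is only a sketch under stated assumptions. It assumes that a newly trained model's anomaly scores are binned into a histogram, compared against a histogram of scores from historical models using the Kullback-Leibler divergence, and deployed only if the divergence stays under a limit. The helper names, the smoothing constant, and the `max_divergence` value are all hypothetical.

```python
import math


def histogram(values, bin_edges):
    """Count values per bin; the last bin includes its right edge.

    Values outside the edges are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (i == last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


def relative_entropy(p_counts, q_counts, smoothing=1e-6):
    """KL divergence D(P || Q) in nats between two smoothed histograms."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    n = len(p_counts)
    p_total = sum(p_counts) + smoothing * n
    q_total = sum(q_counts) + smoothing * n
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + smoothing) / p_total  # smoothing avoids log(0) on empty bins
        q = (qc + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


def should_deploy(candidate_scores, historical_scores, bin_edges, max_divergence=0.1):
    """Accept the candidate model if its score distribution stays near history."""
    p = histogram(candidate_scores, bin_edges)
    q = histogram(historical_scores, bin_edges)
    return relative_entropy(p, q) <= max_divergence


edges = [0.0, 0.5, 1.0]
history = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
print(should_deploy(history, history, edges))    # identical distributions
print(should_deploy([0.9] * 8, history, edges))  # scores shifted to one bin
```

A check of this kind needs no labels, which fits an unsupervised setting: a candidate whose score distribution drifts far from history is held back for review instead of being promoted automatically.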
