Master's thesis: Automated Parallelization to Improve Usability and Efficiency of Distributed Neural Network Training
--Not Yet Published--
In recent years the amount of data generated has grown exponentially, while computational power continues to increase only linearly. This leaves a large gap between the amount of data being generated and the ability to make use of it. Parallel systems that help put this data to use can fall prey to high implementation complexity (usability), which affects two critical facets: work hours and bugs. The more complex the implementation, the longer a developer takes to program it and the more software bugs are introduced. We improve the usability and efficiency of distributed neural network training by providing automated parallelization support. Experiments were conducted using both CNN and MLP networks to perform image classification on the CIFAR-10 and MNIST datasets. Hardware consisted of an embedded four-node NVIDIA Jetson TX1 cluster. Our main contributions are reducing the implementation complexity of data-parallel neural network training by more than 90%, and providing components, with near-zero implementation complexity, that parallelize all or only select linear-based layers of a neural network.
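To make the underlying pattern concrete, the following is a minimal sketch of data-parallel training: each worker computes gradients on its own shard of a batch, the gradients are summed and averaged, and every replica applies the same update. OpenMP threads stand in for cluster nodes, and the loss, shard layout, and all names are illustrative placeholders; this does not reproduce the thesis framework or its API.

// Minimal data-parallel training sketch: OpenMP threads play the role of
// the four cluster nodes. Everything here is a placeholder for illustration.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int num_workers = 4;   // stand-ins for the four Jetson TX1 nodes
    const int dim = 8;           // toy parameter vector
    const double lr = 0.01;
    std::vector<double> weights(dim, 0.0);
    std::vector<double> grad_sum(dim, 0.0);

    // Toy "dataset": each worker owns one shard (here, a target vector).
    std::vector<std::vector<double>> shards(num_workers,
                                            std::vector<double>(dim, 1.0));

    for (int step = 0; step < 100; ++step) {
        std::fill(grad_sum.begin(), grad_sum.end(), 0.0);

        // Each worker computes a local gradient on its own shard.
        #pragma omp parallel num_threads(num_workers)
        {
            int w = omp_get_thread_num();
            std::vector<double> local(dim);
            for (int i = 0; i < dim; ++i)
                // Placeholder gradient of the quadratic loss (w - target)^2.
                local[i] = 2.0 * (weights[i] - shards[w][i]);

            // Sum gradients across workers -- the "all-reduce" step that a
            // real cluster performs over the network.
            #pragma omp critical
            for (int i = 0; i < dim; ++i) grad_sum[i] += local[i];
        }

        // Every replica applies the same averaged update.
        for (int i = 0; i < dim; ++i)
            weights[i] -= lr * grad_sum[i] / num_workers;
    }
    std::printf("weights[0] after training: %f\n", weights[0]);
    return 0;
}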
Publication: Anomaly Detection from Kepler Satellite Time-Series Data
Kepler satellite data is analyzed to detect anomalies within the short-cadence light curve using traditional statistical algorithms and neural networks. Windowed mean division normalization is presented as a method to transform non-linear data into linear data. Modified Z-score, generalized extreme studentized deviate, and percentile rank algorithms were applied to initially detect anomalies. A refined windowed modified Z-score algorithm was then used to determine “true anomalies,” which were used to train both a Pattern Neural Network and a Recurrent Neural Network to detect anomalies. For speed of detection, trained neural networks have the clear advantage. However, given the additional tuning and complexity of training, traditional statistical methods are easier to use and equally effective at detection unless speed is the primary concern.
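As a reference sketch, the code below illustrates the two statistical pieces named above. The modified Z-score follows the standard Iglewicz and Hoaglin form, M = 0.6745 (x - median) / MAD with a 3.5 threshold; the windowed mean division step is one plausible reading of the normalization (dividing each point by its local window mean), not the published implementation, and the function names and sample data are invented for illustration.

// Windowed mean division normalization plus modified Z-score outlier
// detection. The normalization detail is an assumption; the Z-score
// formula and 3.5 cutoff are the standard Iglewicz and Hoaglin values.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Divide each sample by the mean of the window centered on it, flattening
// slow trends in the light curve so point anomalies stand out.
std::vector<double> window_mean_divide(const std::vector<double>& x, int half) {
    std::vector<double> out(x.size());
    for (int i = 0; i < (int)x.size(); ++i) {
        int lo = std::max(0, i - half);
        int hi = std::min((int)x.size() - 1, i + half);
        double mean = 0.0;
        for (int j = lo; j <= hi; ++j) mean += x[j];
        mean /= (hi - lo + 1);
        out[i] = x[i] / mean;
    }
    return out;
}

// Flag points whose modified Z-score exceeds the usual 3.5 threshold.
std::vector<int> modified_zscore_anomalies(const std::vector<double>& x) {
    double med = median(x);
    std::vector<double> dev(x.size());
    for (size_t i = 0; i < x.size(); ++i) dev[i] = std::fabs(x[i] - med);
    double mad = median(dev);
    std::vector<int> idx;
    for (size_t i = 0; i < x.size(); ++i)
        if (std::fabs(0.6745 * (x[i] - med) / mad) > 3.5) idx.push_back((int)i);
    return idx;
}

int main() {
    // Toy flux series with one obvious dip at index 4.
    std::vector<double> flux = {1.0, 1.01, 0.99, 1.02, 0.4, 1.0, 1.01, 0.98};
    for (int i : modified_zscore_anomalies(window_mean_divide(flux, 2)))
        std::printf("anomaly at index %d\n", i);
    return 0;
}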
Additional Research: Solving the Traveling Salesman Problem using Genetic Algorithms and OpenMP
Introduction
I had two initial goals: solving the Traveling Salesman Problem (TSP) using Genetic Algorithms (GA) and optimizing the solver for a Parallel Environment (PE) using OpenMP. All examples use a problem instance of 36 cities and a population of 50,000 chromosomes per generation. Each generation, I take the fittest 25,000 chromosomes and use them to create an additional 25,000 offspring. This is repeated for 150 generations.
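As a sketch of that loop's shape under the stated parameters (36 cities, 50,000 chromosomes, 150 generations), the code below uses random city coordinates and a trivial placeholder breeding operator; the greedy crossover itself is sketched in the next section, and none of this is the original program, so route lengths are not comparable to the results reported here.

// Generational GA skeleton: rank the population, keep the fittest half,
// breed the other half. Coordinates and the breeding operator are placeholders.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

constexpr int kCities = 36;
constexpr int kPop = 50000;
constexpr int kGenerations = 150;
using Tour = std::vector<int>;   // a chromosome: an ordering of city indices

std::vector<std::pair<double, double>> coords(kCities);

double tour_length(const Tour& t) {
    double d = 0.0;
    for (int i = 0; i < kCities; ++i) {
        auto [x1, y1] = coords[t[i]];
        auto [x2, y2] = coords[t[(i + 1) % kCities]];
        d += std::hypot(x2 - x1, y2 - y1);
    }
    return d;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 100.0);
    for (auto& c : coords) c = {u(rng), u(rng)};

    // Random initial population of 50,000 tours.
    Tour base(kCities);
    for (int i = 0; i < kCities; ++i) base[i] = i;
    std::vector<Tour> pop(kPop, base);
    for (auto& t : pop) std::shuffle(t.begin(), t.end(), rng);

    std::uniform_int_distribution<int> parent(0, kPop / 2 - 1);
    std::uniform_int_distribution<int> city(0, kCities - 1);
    for (int gen = 0; gen < kGenerations; ++gen) {
        // Selection: rank by route length and keep the 25,000 shortest tours.
        std::vector<std::pair<double, int>> ranked(kPop);
        for (int i = 0; i < kPop; ++i) ranked[i] = {tour_length(pop[i]), i};
        std::sort(ranked.begin(), ranked.end());
        std::vector<Tour> next;
        next.reserve(kPop);
        for (int i = 0; i < kPop / 2; ++i) next.push_back(pop[ranked[i].second]);

        // Breeding: 25,000 offspring from random surviving parents. The
        // operator here (copy a parent, swap two cities) is a stand-in for
        // the greedy crossover sketched in the next section.
        for (int i = 0; i < kPop / 2; ++i) {
            Tour child = next[parent(rng)];
            std::swap(child[city(rng)], child[city(rng)]);
            next.push_back(child);
        }
        pop.swap(next);
    }
    std::printf("best route length after %d generations: %f\n",
                kGenerations, tour_length(pop[0]));
    return 0;
}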
First, I implemented a Greedy Crossover (GX), described in section 2. This method proved acceptable, achieving a speed-up of roughly 7x and a fastest route of under 449. Next, I moved on to an implementation of the Greedy Crossover using Nearest Neighbor (GXNN); the results can be seen in section 3. In section 4, I discuss the methods and strategy used to increase performance through parallelization with OpenMP; a sketch of the crossover and the parallel breeding loop follows below. In section 5, I discuss some of the challenges I had to overcome, and in the conclusion I discuss the overall results achieved.
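The sketch below is one reconstruction of the greedy crossover with a nearest-neighbor fallback (the GXNN idea) together with the OpenMP pattern that parallelizes breeding: offspring are independent, so the loop splits across threads with a single pragma. The distance matrix, parent tours, and helper names are invented for the demo; this is not the original code.

// Greedy crossover with nearest-neighbor fallback, plus parallel breeding.
// Compile with -fopenmp. Distances are random for demonstration only.
#include <cstdio>
#include <random>
#include <vector>
#include <omp.h>

constexpr int kCities = 36;
using Tour = std::vector<int>;
double dist[kCities][kCities];   // symmetric random distances for the demo

// Build a child: starting from parent a's first city, repeatedly take the
// shorter of the two parents' next edges; if both successors are already
// used, fall back to the nearest unvisited city (the nearest-neighbor step).
Tour greedy_crossover(const Tour& a, const Tour& b) {
    std::vector<int> next_a(kCities), next_b(kCities);
    for (int i = 0; i < kCities; ++i) {
        next_a[a[i]] = a[(i + 1) % kCities];
        next_b[b[i]] = b[(i + 1) % kCities];
    }
    std::vector<bool> used(kCities, false);
    Tour child{a[0]};
    used[a[0]] = true;
    while ((int)child.size() < kCities) {
        int cur = child.back(), na = next_a[cur], nb = next_b[cur], pick = -1;
        if (!used[na] && (used[nb] || dist[cur][na] <= dist[cur][nb]))
            pick = na;                          // parent a's edge is usable
        else if (!used[nb])
            pick = nb;                          // parent b's edge is usable
        else                                    // both blocked: nearest unvisited
            for (int c = 0; c < kCities; ++c)
                if (!used[c] && (pick < 0 || dist[cur][c] < dist[cur][pick]))
                    pick = c;
        used[pick] = true;
        child.push_back(pick);
    }
    return child;
}

int main() {
    std::mt19937 rng(7);
    std::uniform_real_distribution<double> u(1.0, 100.0);
    for (int i = 0; i < kCities; ++i)
        for (int j = i + 1; j < kCities; ++j)
            dist[i][j] = dist[j][i] = u(rng);

    // Two arbitrary parents: identity order and reversed order.
    Tour p1(kCities), p2(kCities);
    for (int i = 0; i < kCities; ++i) { p1[i] = i; p2[i] = kCities - 1 - i; }

    // Offspring are independent, so breeding parallelizes with one pragma;
    // each thread writes only its own preallocated slots.
    std::vector<Tour> offspring(1000);
    #pragma omp parallel for
    for (int i = 0; i < (int)offspring.size(); ++i)
        offspring[i] = greedy_crossover(p1, p2);

    std::printf("bred %zu offspring in parallel\n", offspring.size());
    return 0;
}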