Document Type

Dissertation

Date of Award

12-31-2021

Degree Name

Doctor of Philosophy in Information Systems - (Ph.D.)

Department

Informatics

First Advisor

Hai Nhat Phan

Second Advisor

Frank Biocca

Third Advisor

Cristian Borcea

Fourth Advisor

Dejing Dou

Fifth Advisor

Ruoming Jin

Abstract

Over the past decade, drug abuse has continued to accelerate toward becoming the most severe public health problem in the United States. The ability to detect drug-abuse risk behavior at a population scale, such as among the population of Twitter users, can help to monitor the trend of drug-abuse incidents. However, traditional methods do not effectively detect drug-abuse risk behavior in tweets, mainly because such tweets are sparse and tweets in general are noisy. In the first part of this dissertation work, the task of classifying tweets as containing drug-abuse risk behavior or not is studied. Millions of public tweets were collected through the Twitter API, and a large human-labeled dataset with both expert labels and crowd-sourced labels (obtained through Amazon Mechanical Turk) was built. Three papers on this topic were published. The first work leveraged large quantities of unlabeled tweets with self-taught deep learning (DL). In the second work, a method for mitigating the class imbalance of tweets through an ensemble of DL models was proposed. Results on the testing dataset showed improved performance over traditional and recent methods. Statistical analysis of the results of applying the model to 3 million tweets also yielded interesting and meaningful findings. Based on the detection model, a demo system was built, which allows geographical and other statistical information about tweets indicating drug abuse to be viewed on live interactive maps.

The development of the drug-abuse detection models revealed the importance of privacy preservation in DL. Related works have demonstrated that the privacy of the training data of a DL model can be compromised through either reconstruction attacks or membership-inference attacks. Thus, due to the sensitive nature of the drug-abuse detection model, the privacy of the training data has to be rigorously protected before the model can be made public. The goal of the first work in this direction was to develop a novel mechanism for preserving differential privacy (DP), such that the privacy budget consumption is independent of the number of training steps and noise can be adaptively injected according to the importance of features to improve model utility. Then, in the second work, the aim was to develop a scalable DP-preserving algorithm for deep neural networks, with certified robustness to adversarial examples. The robustness bound was strengthened by a novel adversarial objective function and by injecting noise into both the input and the latent space. A novel stochastic batch training method, which for the first time allows the training of DP models to be parallelized, was also proposed. In the third work along this line, the goal was to preserve DP in the setting of lifelong learning (L2M), given the more challenging privacy risk that L2M poses. A scalable and heterogeneous algorithm was proposed and implemented, which allows the efficient training and continuous release of new versions of L2M models without affecting the DP protection.

Although DP-DL can provide provable privacy protection, another aspect of privacy protection is protecting the data itself. In the foreseeable future, more rigorous data privacy regulations will be widely implemented, which will promote the use of federated learning (FL). In the third part of the dissertation work, FLSys, a prototype mobile-cloud federated deep learning system, was designed and implemented. By utilizing modern cloud architecture, FLSys is designed to achieve energy efficiency, fault tolerance, and scalability. To demonstrate the capability of FLSys, the task of mobile human activity recognition, which aims at predicting human activities from smartphone sensor data, was selected. For model development, two data collection campaigns were launched to collect human activity data in the wild through smartphone sensors from hundreds of volunteers. A simple yet effective data augmentation method was proposed to combat the non-I.I.D. (non-independent and identically distributed) data issue that plagues FL.
