Document Type
Dissertation
Date of Award
12-31-2021
Degree Name
Doctor of Philosophy in Information Systems - (Ph.D.)
Department
Informatics
First Advisor
Hai Nhat Phan
Second Advisor
Frank Biocca
Third Advisor
Cristian Borcea
Fourth Advisor
Dejing Dou
Fifth Advisor
Ruoming Jin
Abstract
During the past decade, drug abuse has continued to accelerate toward becoming the most severe public health problem in the United States. The ability to detect drug abuse risk behavior at a population scale, such as among the population of Twitter users, can help to monitor the trend of drug abuse incidents. However, traditional methods do not effectively detect drug abuse risk behavior in tweets, mainly due to the sparsity of such tweets and their noisy nature. In the first part of this dissertation work, the task of classifying tweets as containing drug abuse risk behavior or not is studied. Millions of public tweets were collected through the Twitter API, and a large human-labeled dataset with both expert labels and crowdsourced labels (through Amazon Mechanical Turk) was built. Three papers on this topic were published: the first work leveraged large quantities of unlabeled tweets with self-taught deep learning (DL); in the second work, a method of mitigating the imbalance of tweet classes through an ensemble of DL models was proposed. Results on the testing dataset showed improved performance over traditional and recent methods. Statistical analysis of the results of applying the model to 3 million tweets also yielded interesting and meaningful results. Based on the detection model, a demo system was built, which allows the geographical and various statistical information of drug abuse indication tweets to be viewed on live interactive maps.
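To make the class-imbalance idea concrete, the sketch below shows one common ensemble pattern: each member is trained on all minority-class examples plus an equally sized random subsample of the majority class, and the members' predictions are averaged. This is an illustrative assumption, using scikit-learn's MLPClassifier as a stand-in for the dissertation's deep models, not the published method itself.

    # Illustrative sketch (not the dissertation's exact method): an ensemble
    # over balanced subsamples to mitigate class imbalance in tweet classification.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_balanced_ensemble(X, y, n_members=5, seed=0):
        """Each member sees all minority examples (label 1, assumed to be the
        rare drug-abuse-risk class) plus an equal-sized random subsample of
        the majority class (label 0)."""
        rng = np.random.default_rng(seed)
        minority = np.where(y == 1)[0]
        majority = np.where(y == 0)[0]
        members = []
        for _ in range(n_members):
            maj_sample = rng.choice(majority, size=len(minority), replace=False)
            idx = np.concatenate([minority, maj_sample])
            model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
            model.fit(X[idx], y[idx])
            members.append(model)
        return members

    def predict_ensemble(members, X):
        # Average the members' predicted probabilities, then threshold.
        probs = np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
        return (probs >= 0.5).astype(int)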
The development of the drug abuse detection models revealed the importance of privacy preservation in DL. Related works have demonstrated that the privacy of the training data of a DL model can be exploited through either reconstruction attacks or membership inference attacks. Thus, due to the sensitive nature of the drug abuse detection model, the privacy of the training data has to be rigorously protected before the model can be made public. The goal of the first work in this direction was to develop a novel mechanism for preserving differential privacy (DP), such that the privacy budget consumption is independent of the number of training steps and noise can be adaptively injected according to the importance of features to improve model utility. Then, in the second work, the aim was to develop a scalable DP-preserving algorithm for deep neural networks, with certified robustness to adversarial examples. The robustness bound was strengthened by a novel adversarial objective function and by injecting noise into both the input and the latent space. For the first time, a novel stochastic batch training method that allows the training of DP models to be parallelized was proposed. In the third work along this line, the goal was to preserve DP in the setting of lifelong learning (L2M), given the more challenging privacy risks that L2M poses. A scalable and heterogeneous algorithm was proposed and implemented, which allows the efficient training and the continual release of new versions of L2M models without compromising the DP protection.
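For readers unfamiliar with DP training, the sketch below illustrates the standard DP-SGD-style update that such work builds on: per-example gradient clipping to bound sensitivity, followed by Gaussian noise. The adaptive, feature-importance-based noise injection and the stochastic batch training described above are specific to the published works and are not reproduced here; the logistic-regression model and parameter values are illustrative assumptions.

    # Illustrative DP-SGD-style update (standard mechanism, not the
    # dissertation's adaptive noise injection): clip each per-example
    # gradient, sum, add Gaussian noise, then take a gradient step.
    import numpy as np

    def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip_norm=1.0,
                    noise_multiplier=1.1, rng=None):
        rng = rng or np.random.default_rng()
        clipped = []
        for x, y in zip(X_batch, y_batch):
            p = 1.0 / (1.0 + np.exp(-x @ w))                 # sigmoid prediction
            g = (p - y) * x                                  # per-example gradient
            g = g / max(1.0, np.linalg.norm(g) / clip_norm)  # clip to bound sensitivity
            clipped.append(g)
        g_sum = np.sum(clipped, axis=0)
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
        return w - lr * (g_sum + noise) / len(X_batch)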
Although DP-preserving DL can provide provable privacy protection, another aspect of privacy protection is to protect the data itself. In the foreseeable future, more rigorous data privacy regulations will be widely implemented, which promotes the use of federated learning (FL). In the third part of the dissertation work, FLSys, a prototype mobile-cloud federated deep learning system, was designed and implemented. By utilizing modern cloud architecture, FLSys is designed to achieve energy efficiency, fault tolerance, and scalability. To demonstrate the capability of FLSys, the task of mobile human activity recognition, which aims at predicting human activities from smartphone sensors, was selected. For model development purposes, two data collection campaigns were launched to collect human activity data through smartphone sensors in the wild from hundreds of volunteers. A simple yet effective data augmentation method to combat the non-IID (Independent and Identically Distributed) data issue that plagues FL was proposed.
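The sketch below outlines the federated averaging (FedAvg) pattern that a mobile-cloud FL system such as FLSys typically builds on: clients train locally on their own data, and the cloud aggregates the resulting models weighted by local dataset size. It is a minimal illustration with a toy logistic-regression model; FLSys's actual architecture, scheduling, and data augmentation are not represented.

    # Minimal FedAvg-style round (illustrative only; not FLSys itself).
    import numpy as np

    def local_update(w_global, X, y, lr=0.05, epochs=1):
        """One client's local training on its own (possibly non-IID) data."""
        w = w_global.copy()
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-X @ w))   # toy logistic-regression model
            w -= lr * X.T @ (p - y) / len(y)
        return w

    def fedavg_round(w_global, client_data):
        """Aggregate client models, weighted by local dataset size."""
        updates, sizes = [], []
        for X, y in client_data:
            updates.append(local_update(w_global, X, y))
            sizes.append(len(y))
        return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))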
Recommended Citation
Hu, Han, "Private and federated deep learning: system, theory, and applications for social good" (2021). Dissertations. 1567.
https://digitalcommons.njit.edu/dissertations/1567
Included in
Artificial Intelligence and Robotics Commons, Data Science Commons, Social Media Commons, Substance Abuse and Addiction Commons