Summer Research Opportunity: Machine Learning
Project Summary: Students will learn the basics of neural networks as well as some standard algorithms for turning documents into vectors, all while working on the project described below:
The arXiv is a preprint repository for academic articles in several disciplines, including mathematics. When submitting a preprint, the author is asked to classify the article into one of the fields and subfields represented on the arXiv. The arXiv recently released an API that attempts to determine this classification automatically. The goal of the project is to see whether we can produce a model based on neural networks that beats their in-house classification.
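To get a feel for the data we will be working with, here is a minimal sketch of pulling titles, abstracts, and category labels from the arXiv's public export API (http://export.arxiv.org/api/query). This is purely illustrative: the query parameters are the standard ones, but the way the feed fields are accessed assumes the third-party feedparser package and my guess at how it exposes the Atom feed, so it may need adjusting.

```python
# A minimal sketch of pulling arXiv metadata (title, abstract, categories)
# from the public export API. Assumes the third-party `feedparser` package
# (pip install feedparser); the exact feed field names are assumptions.
import urllib.parse

import feedparser

# Ask the export API for a handful of recent math.GT preprints.
params = urllib.parse.urlencode({
    "search_query": "cat:math.GT",
    "start": 0,
    "max_results": 5,
})
feed = feedparser.parse("http://export.arxiv.org/api/query?" + params)

for entry in feed.entries:
    title = entry.title
    abstract = entry.summary                           # the abstract lives in the Atom <summary> field
    categories = [tag["term"] for tag in entry.tags]   # e.g. ['math.GT', 'math.DG']
    print(title)
    print(categories)
    print(abstract[:200], "...")
    print("-" * 40)
```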
Team Members
Nicholas Vlamis (nicholas.vlamis@qc.cuny.edu), Faculty Mentor
Kathy He
FangFang (Daisy) Lu
Tao Wu
References
arXiv info. You should familiarize yourself with what the arXiv is and the metadata it provides. You should also read up on its methods for automatic classification.
Neural networks and deep learning. This is a nice online book introducing the basics of neural networks. Please read.
tf-idf. A very common and straightforward way of assigning vectors to documents using the frequency of keywords; the arXiv is using this in its automatic classification. (A small worked sketch appears after this reference list.)
word2vec. In order to use neural networks to classify documents, we will need to convert documents into vectors. word2vec is an algorithm introduced by Google to convert words into vectors in such a way that semantic relations between words are captured. You should understand the basic mathematical idea of the algorithm.
doc2vec. doc2vec is an algorithm, built on top of word2vec and also introduced by Google, to convert documents into vectors. The goal is to produce vector embeddings such that the embeddings of two documents have large cosine similarity when the documents themselves are similar. (The link is for a Medium article that shows up at the top of Google searches, so I don't claim it is the best.)
Deep Learning: An Introduction for Applied Mathematicians by C.F. Higham and D.J. Higham.
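To make the tf-idf entry above concrete, here is a toy, from-scratch computation using the standard tf × log(N/df) weighting; real implementations (in Gensim or scikit-learn, say) use smoothed variants, so treat this only as a sketch of the idea.

```python
# A toy, from-scratch tf-idf computation (plain tf * log(N/df) weighting;
# real libraries use smoothed variants of this formula).
import math
from collections import Counter

corpus = [
    "hyperbolic surfaces and mapping class groups".split(),
    "neural networks for document classification".split(),
    "training neural networks with gradient descent".split(),
]

# Document frequency: in how many documents does each word appear?
df = Counter()
for doc in corpus:
    df.update(set(doc))

vocab = sorted(df)   # fix an ordering of the vocabulary, so each document becomes a vector
N = len(corpus)

def tfidf_vector(doc):
    """Return the tf-idf vector of `doc` with respect to the toy corpus."""
    tf = Counter(doc)
    # Words appearing in every document get idf = log(1) = 0, i.e. no weight.
    return [tf[w] / len(doc) * math.log(N / df[w]) for w in vocab]

for doc in corpus:
    print(tfidf_vector(doc))
```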
Definitions
A vector embedding (see word embedding) is a way of taking some object and assigning it a vector.
The cosine similarity of two vectors is their dot product divided by the product of their lengths; equivalently, it is the cosine of the angle between the two vectors. This is the most common measure of similarity between vector embeddings, and it has the distinct advantage of being fast to calculate. Note that the cosine similarity of two unit vectors determines their Euclidean distance and vice versa. (A short numerical check appears at the end of this section.)
Feedforward is lingo for passing an input through the layers of a network; computationally it is essentially a sequence of matrix multiplications (each followed by a nonlinear activation).
Back propagation is lingo for the way the gradients of a network's weights are computed; it is essentially a systematic application of the chain rule. (A small sketch of both appears at the end of this section.)
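As a quick numerical check of the cosine similarity definition and the remark about unit vectors, here is a short NumPy snippet; the vectors are made up purely for illustration.

```python
# Cosine similarity = dot product of the normalized vectors = cos(angle between them).
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 1.0])
print(cosine_similarity(u, v))

# For unit vectors, cosine similarity and Euclidean distance determine each other:
# ||u - v||^2 = 2 - 2 <u, v>.
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(np.linalg.norm(u_hat - v_hat) ** 2)          # 2 - 2 * (cosine similarity)
print(2 - 2 * cosine_similarity(u_hat, v_hat))     # same number
```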
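And to make "feedforward = matrix multiplication (plus activations)" and "back propagation = the chain rule" concrete, here is a one-hidden-layer network written out by hand in NumPy. The layer sizes, sigmoid activation, and squared-error loss are arbitrary choices for illustration, not a proposal for the project.

```python
# A tiny one-hidden-layer network, written out by hand to show that the
# feedforward pass is matrix multiplication (plus an activation) and that
# back propagation is the chain rule applied layer by layer.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input vector
y = np.array([1.0, 0.0])          # target output
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)   # layer 1 weights/biases
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)   # layer 2 weights/biases

sigma = lambda z: 1 / (1 + np.exp(-z))               # sigmoid activation

# Feedforward: matrix multiply, add bias, apply activation, repeat.
z1 = W1 @ x + b1
a1 = sigma(z1)
z2 = W2 @ a1 + b2
a2 = sigma(z2)
loss = 0.5 * np.sum((a2 - y) ** 2)

# Back propagation: the chain rule, peeled off one layer at a time.
delta2 = (a2 - y) * a2 * (1 - a2)           # dLoss/dz2 (sigmoid' = a(1-a))
grad_W2 = np.outer(delta2, a1)              # dLoss/dW2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # dLoss/dz1
grad_W1 = np.outer(delta1, x)               # dLoss/dW1
print(loss, grad_W1.shape, grad_W2.shape)
```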
Tools
Gensim. A library that is commonly used for implementing word2vec and doc2vec as well as other natural language processing (NLP) algorithms. We will be getting familiar with this library, and I'm sure we will have to dig into what is available. The library stems from the author's PhD thesis and implements many standard algorithms in efficient ways. (A small doc2vec example appears at the end of this section.)
PyCharm. One of the standard Python IDEs. With your college email, you can obtain a free license. I've been using this a bit, but if people prefer, we can also consider VSCode.
GitHub. We will want to collaborate via GitHub. We also have the goal of publishing some of what we do on GitHub so you can build a public profile for employers to look at. I've never used GitHub, so you will have to teach me.
NumPy. The standard linear algebra library for Python, geared towards scientific computing.
PyTorch. A more powerful linear algebra library geared towards modern machine learning in Python and designed to take advantage of GPUs. (See the classifier sketch at the end of this section.)
QC VPN. We will use my office computer to run code and build models. So, if you are working from home, you will want to be able to VPN into the QC network so you can ssh into my computer.
Office computer. I have a Mac Studio M2 Ultra with 24 CPU cores, 60 GPU cores, 128GB of RAM, and a 1TB SSD. The machine was purchased with this project in mind, and I imagine you will want to build/run/store models (remotely) on this computer.
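The doc2vec example promised in the Gensim entry: training a model on a toy corpus looks roughly like this. The corpus, vector_size, and epochs below are placeholders rather than recommendations, and the code assumes Gensim 4.x (older versions use model.docvecs instead of model.dv).

```python
# A minimal Gensim doc2vec sketch on a toy corpus; vector_size and epochs are
# arbitrary placeholder values, not recommendations.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

abstracts = [
    "we study the mapping class group of an infinite type surface",
    "convolutional neural networks for image classification",
    "gradient descent methods for training deep neural networks",
]

# Gensim expects each document as a TaggedDocument (list of tokens plus a tag).
corpus = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(abstracts)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Embed a new document and find the most similar training document.
# `model.dv` is the Gensim 4.x name for the document vectors.
new_vec = model.infer_vector("neural networks and deep learning".split())
print(model.dv.most_similar([new_vec], topn=1))
```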
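And the PyTorch sketch promised above: a small feedforward classifier that takes a document vector (say, from doc2vec) and outputs a score for each arXiv category. All dimensions and hyperparameters here are made up for illustration; this is not a proposed architecture.

```python
# A placeholder PyTorch classifier: document vector in, arXiv-category scores out.
# All sizes and hyperparameters below are made up for illustration.
import torch
from torch import nn

VECTOR_SIZE = 50      # dimension of the document embedding (e.g. from doc2vec)
NUM_CLASSES = 32      # number of arXiv subfields we decide to predict

model = nn.Sequential(
    nn.Linear(VECTOR_SIZE, 128),
    nn.ReLU(),
    nn.Linear(128, NUM_CLASSES),   # raw scores; CrossEntropyLoss applies softmax internally
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a fake batch of 8 document vectors with fake labels.
doc_vectors = torch.randn(8, VECTOR_SIZE)
labels = torch.randint(0, NUM_CLASSES, (8,))

optimizer.zero_grad()
loss = loss_fn(model(doc_vectors), labels)
loss.backward()
optimizer.step()
print(loss.item())
```

On the office Mac, recent PyTorch releases can use the Apple GPU through the "mps" backend (e.g. torch.device("mps")); whether that actually speeds up our models is something we would have to test.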