Generating Insights from LinkedIn Profile Data

Keywords: LinkedIn, web scraping, BeautifulSoup4, Selenium, AWS Lambda, data cleaning, logistic regression

The LinkedIn profile scraper is hosted on GitHub. The project was completed as part of the requirements for the invite-only URECA undergraduate research programme at NTU, under the supervision of Assistant Professor Leong Kaiwen.

In this project, I ran LinkedIn profile data through the full data science pipeline, with particular focus on the data engineering stage. The first step was data mining: I developed a scraper in Python using the BeautifulSoup4 and Selenium libraries and used it to collect profile data from LinkedIn, working around its extensive anti-scraping protocols. To speed up the process, the scraping was carried out in the cloud using Amazon Web Services’ Lambda, and the data was stored in an Amazon S3 bucket. I then cleaned the harvested data in RStudio by categorising it into common features, clearing excess whitespace, and standardising the qualitative text fields. Finally, I performed supervised machine learning: I ran logistic regression with one-hot encoding, using the scikit-learn Python library, to build models that predict job outcomes from past experience, education and skills. Comparing the relative accuracy of these models, I found that skills are the best predictor, followed by past work roles.

A majority of my time on this project was spent on data collection. I studied LinkedIn’s anti-scraping protocols and looked for ways to work around them. My scraper used the Selenium library to drive the Firefox webdriver, logging into LinkedIn with dummy accounts and preparing the web pages, and the BeautifulSoup4 library to scrape the contents of those pages. I tried many different methods to avoid detection: modifying the scraper’s behaviour by adding random scrolling and taking random breaks between profiles, using residential proxies, adding checkpoints so that scraping could resume from wherever it was blocked, and finally, using AWS Lambda.
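To give a concrete picture of how these pieces fit together, here is a minimal sketch of the Selenium-plus-BeautifulSoup4 pattern with randomised scrolling and breaks. The login field selectors, dummy credentials, profile URLs and timing ranges are placeholders for illustration, not the exact values used in the project.

```python
import random
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_profile(driver, url):
    """Load a profile, mimic human browsing, and parse the rendered HTML."""
    driver.get(url)
    # Scroll down in a few random increments so the behaviour looks less
    # bot-like and lazy-loaded sections of the page get rendered.
    for _ in range(random.randint(3, 6)):
        driver.execute_script("window.scrollBy(0, arguments[0]);",
                              random.randint(400, 900))
        time.sleep(random.uniform(0.5, 2.0))
    return BeautifulSoup(driver.page_source, "html.parser")

driver = webdriver.Firefox()
driver.get("https://www.linkedin.com/login")
# Field IDs and credentials below are placeholders for a dummy account.
driver.find_element(By.ID, "username").send_keys("dummy@example.com")
driver.find_element(By.ID, "password").send_keys("not-a-real-password")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# Placeholder list of target profiles, collected beforehand.
profile_urls = ["https://www.linkedin.com/in/example-profile/"]

for url in profile_urls:
    soup = scrape_profile(driver, url)
    # ...extract name, experience, education and skills from `soup` here...
    time.sleep(random.uniform(20, 60))  # random break between profiles
```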

To solve the problem of my dummy accounts getting flagged through browser fingerprinting, I used AWS Lambda and Amazon S3. Lambda lets you run code in the cloud, and S3 stores files in the cloud; used together, the two can carry out web scraping end to end. A Lambda function is “deployed” with a deployment package: a zip file containing the Python code to be executed, the required Python dependencies (for example, BS4 and Selenium), and other miscellaneous files such as the webdrivers. I used Docker to build the deployment package and uploaded it to Lambda, and attached an S3 bucket to the Lambda function so that the output and data files could be stored in the bucket.
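As a rough illustration of how the Lambda function and the attached S3 bucket interact, a handler might look like the sketch below. The bucket name, the shape of the invocation event, and the stubbed-out scraping helper are assumptions made for this example rather than the exact code in the repository.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "linkedin-scraper-output"  # placeholder name of the attached S3 bucket

def scrape_profile_to_dict(url):
    # Stand-in for the Selenium/BeautifulSoup logic sketched earlier; inside
    # Lambda this would drive the Firefox webdriver bundled in the deployment package.
    return {"url": url}

def lambda_handler(event, context):
    """Entry point that Lambda invokes: scrape the URLs passed in the event
    and write the results to the S3 bucket as a JSON file."""
    urls = event.get("profile_urls", [])
    results = [scrape_profile_to_dict(url) for url in urls]

    s3.put_object(
        Bucket=BUCKET,
        Key=f"profiles/{context.aws_request_id}.json",
        Body=json.dumps(results),
    )
    return {"statusCode": 200, "scraped": len(urls)}
```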

After scraping around 1,000 profiles (due to time constraints), I moved on to data cleaning in R. The cleaned data was then run through various logistic regression models in Python, where I tried predicting job outcomes using different predictors from the profile data.
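For the modelling step, the general scikit-learn pattern of one-hot encoding the categorical profile features and fitting a logistic regression looks roughly like the sketch below. The file name, column names and train/test split are hypothetical; in practice, separate models were fitted on different predictor sets (education, past roles, skills) so their accuracies could be compared.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical schema; the real cleaned dataset has its own column names.
df = pd.read_csv("cleaned_profiles.csv")
features = ["education", "past_role", "skills"]
X, y = df[features], df["job_outcome"]

# One-hot encode the categorical predictors, then fit a logistic regression.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), features)])),
    ("classify", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))  # compared across predictor sets
```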
