As an engineering senior at the University of Michigan, I was fortunate enough to consult for a Fortune 100 company. I was asked to work on a project for Dow Chemicals (Dow Performance Silicones) that applied state-of-the-art machine learning techniques to uncover business intelligence insights. Dow Chemicals is a leading manufacturer and supplier of industrial-grade materials to tens of thousands of customers worldwide. The Customer Experience team at Dow considers customer experience and satisfaction to be central to the company’s growth and future success. For this reason, Dow sends out annual customer satisfaction surveys that ask both scaled and open-ended questions. Currently, analysts at Dow aggregate the quantitative response data and measure success against these satisfaction metrics.
The free-form responses to the open-ended questions are analyzed manually to uncover customer sentiment regarding key categories of customer experience. This process is time-consuming, so the Dow Chemicals team asked our student team to research and evaluate automated methods of assessing the sentiment of the free-form survey responses.
To deliver satisfactory results to Dow, I led research into current machine learning methods for analyzing the text data.
Preprocessing and Cleaning
The data provided by the team at Dow Chemicals was an Excel file containing one row for each customer survey. Because of an NDA between Dow and me, I cannot share the actual survey data, which contained personal information such as the emails, names, and addresses of Dow customers. However, I will illustrate below what the dataset looked like.
As seen above, each respondent has an ID and some additional identifying attributes. Notice also that each row holds several comments, which required some preprocessing before building the model. In the first comment, the response was, “Your too expensive, but great support!” The goal of this project was to assign a sentiment to each comment, but we often found that a single response contained several sentiments. Additional preprocessing was done to split these apart as well.
The model I planned to build relied on supervised machine learning. In supervised learning, you provide the algorithm with data that already has a tag (label) assigned to it. By showing the model examples of Positive responses versus Negative responses, it can learn patterns in your data and use them to predict the tags of future data.
When I received the data from Dow, there was no indication of whether a response was negative or positive. I was responsible for tagging approximately 4,000 comments. To speed up this process, I built a web portal that loaded the comments from a SQL database and let me grade each comment with the click of a button. The interface was built with basic HTML and CSS and served by Node.js, which connected to my MySQL database. A screenshot of the interface is shown below.
After tagging each comment as Positive or Negative, I moved on to building the machine learning model.
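To make the two preprocessing steps concrete, here is a minimal sketch of how one survey row with several comment columns could be flattened and how a mixed comment could be split into single-sentiment fragments. It is not the code used in the project: the column names (`respondent_id`, `comment_1`, `comment_2`) and the splitting rule (sentence punctuation plus a few contrastive words such as "but") are assumptions made for illustration.

```python
import re

# Assumed splitting rule: break on sentence punctuation and on a few
# contrastive conjunctions. The real project may have used different rules.
_SPLIT_PATTERN = re.compile(
    r"\s*(?:[.;!?]+|,?\s*\b(?:but|however|although)\b,?)\s*",
    re.IGNORECASE,
)


def split_comment(comment):
    """Split one free-form comment into fragments, dropping empty pieces."""
    parts = _SPLIT_PATTERN.split(comment)
    return [part.strip() for part in parts if part and part.strip()]


def flatten_rows(rows, comment_fields):
    """Turn one row per survey into one record per comment fragment."""
    records = []
    for row in rows:
        for field in comment_fields:
            text = row.get(field) or ""  # treat missing answers as empty
            for fragment in split_comment(text):
                records.append(
                    {
                        "respondent_id": row["respondent_id"],
                        "question": field,
                        "text": fragment,
                    }
                )
    return records


# Hypothetical survey row; the field names are illustrative only.
survey_rows = [
    {
        "respondent_id": 1,
        "comment_1": "Your too expensive, but great support!",
        "comment_2": None,
    }
]
records = flatten_rows(survey_rows, ["comment_1", "comment_2"])
```

A rule this simple will split in the wrong place on abbreviations such as "e.g.", so real survey text would need a more careful tokenizer.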
Python and Scikit-Learn
I used the amazing Python package Scikit-Learn to apply several machine learning techniques to uncover sentiment in the responses. I compared three of the classifiers provided by Scikit-Learn: Logistic Regression, Multinomial Naive Bayes, and Bernoulli Naive Bayes. The goal was to develop a model that could accurately assign each comment a Positive, Negative, or Neutral rating. The results of these models are shown below.
From left to right: Logistic Regression, Multinomial Naive Bayes, Bernoulli Naive Bayes. The confusion tables above show the performance of the three classifiers on our data set. Each table compares the sentiment predicted by a model with the manually assigned score, showing how often the model predicted each sentiment correctly.
I investigated these models further by constructing Receiver Operating Characteristic (ROC) curves from the true positive and false positive rates.
The ROC curves shown above let you compare how each model trades off true positives against false positives as the cutoff for accepting a prediction changes. These visuals reinforce our previous conclusion that Logistic Regression is the best model for predicting future Dow survey responses.
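As a rough illustration of this comparison, the sketch below trains the three Scikit-Learn classifiers on a handful of made-up comments. The bag-of-words features (`CountVectorizer`), the toy sentences, and the settings such as `max_iter=1000` are assumptions; the post does not say how the text was turned into features or how the models were configured.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for the hand-tagged survey comments.
texts = [
    "great support",
    "very helpful team",
    "fast delivery and friendly service",
    "too expensive",
    "late delivery again",
    "poor communication",
    "no comment",
    "nothing to add",
    "same as last year",
]
labels = ["Positive"] * 3 + ["Negative"] * 3 + ["Neutral"] * 3


def build_models():
    """Return the three candidate classifiers, each with its own vectorizer."""
    return {
        "logistic_regression": make_pipeline(
            CountVectorizer(), LogisticRegression(max_iter=1000)
        ),
        "multinomial_nb": make_pipeline(CountVectorizer(), MultinomialNB()),
        # Bernoulli Naive Bayes models word presence/absence, so use binary counts.
        "bernoulli_nb": make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
    }


models = build_models()
for model in models.values():
    model.fit(texts, labels)

# Predict the sentiment of a new comment with each fitted model.
predictions = {
    name: str(model.predict(["great support from the team"])[0])
    for name, model in models.items()
}
```

With real data, the tagged comments would be split into training and test sets so that the models are compared on comments they have not seen.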
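For readers unfamiliar with confusion tables, here is a small worked example using Scikit-Learn's `confusion_matrix`. The manual and predicted labels are invented for illustration and are not Dow results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented example: manual tags versus a model's predictions for six comments.
manual = ["Positive", "Negative", "Positive", "Neutral", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Neutral", "Negative", "Positive"]

label_order = ["Negative", "Neutral", "Positive"]

# Rows are the manual (true) labels, columns are the predicted labels,
# both in the order given by label_order.
matrix = confusion_matrix(manual, predicted, labels=label_order)

# Fraction of comments where the prediction matches the manual tag.
accuracy = accuracy_score(manual, predicted)
```

Here five of the six predictions sit on the diagonal of the matrix; the one off-diagonal entry is the Positive comment that the model tagged as Negative.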
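The sketch below shows how such a curve can be computed with Scikit-Learn's `roc_curve` for a two-class (Positive versus Negative) case. The four labels and scores are a tiny invented example, not project data.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented example: 1 = Positive, 0 = Negative, with the model's
# predicted probability of "Positive" for each comment.
true_labels = [0, 0, 1, 1]
positive_scores = [0.1, 0.4, 0.35, 0.8]

# Each point on the curve is the (false positive rate, true positive rate)
# obtained at one probability cutoff.
false_positive_rate, true_positive_rate, cutoffs = roc_curve(
    true_labels, positive_scores
)

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is chance level.
area_under_curve = roc_auc_score(true_labels, positive_scores)
```

In this example the model ranks three of the four Positive/Negative pairs correctly, so the area under the curve is 0.75. A model whose curve lies closer to the top-left corner has a larger area and separates the two classes better.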
Final Prediction Tool
After modeling the survey data using machine learning techniques, I wanted to make a tool that would allow analysts at Dow to easily replicate these results on future years’ surveys. I turned to the Python package Tkinter to build a user interface that allows the model to be run on new data. The first page of the application is shown below. It allows you to import an original survey data file, choose one of the three models from above, and specify the name and directory of the final output Excel file.
The final tool allows Dow to input their data exactly as they provided it to me, before any preprocessing is done. Because the preprocessing was quite involved, I wanted to build it into the tool. The tool reads the first row as headers and allows the analysts at Dow to select the columns they wish to grade against the predictive model.
After the columns to grade are selected, the tool automatically preprocesses the data and applies the chosen machine learning algorithm. The final result is a new Excel file that contains the probabilities of a Positive, Negative, and Neutral response.
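The grading step can be sketched as follows, assuming a fitted Scikit-Learn pipeline and its `predict_proba` method. The training sentences, the `grade` helper, and the output column names are hypothetical, and writing the rows to an Excel file is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for the hand-tagged comments.
train_texts = [
    "great support",
    "very helpful team",
    "too expensive",
    "late delivery again",
    "no comment",
    "nothing to add",
]
train_labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)


def grade(sentences):
    """Return one output row per sentence: text, prediction, class probabilities."""
    probabilities = model.predict_proba(sentences)
    classes = [str(label) for label in model.classes_]
    rows = []
    for sentence, sentence_probs in zip(sentences, probabilities):
        row = {
            "sentence": sentence,
            # The prediction is the class with the highest probability.
            "prediction": classes[sentence_probs.argmax()],
        }
        for label, probability in zip(classes, sentence_probs):
            row[f"p_{label}"] = float(probability)
        rows.append(row)
    return rows


graded = grade(["great support", "too expensive"])
```

Each row carries one probability per class, and the three probabilities sum to 1 for every sentence.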
As you can see, the final output provides you with the original sentence, the final prediction, and the probabilities of each individual prediction.
Dow analysts can now conduct further analysis on which Customers or Distributors are contributing the most negative sentiment.
I delivered the final results to Dow in December of 2016, along with a final report detailing every step taken along the way to create this model. Analysts at Dow called this “Excellent work” and “Extraordinary results.”
I am very proud to have accomplished so much in a semester’s worth of time. I enjoyed my time consulting for the Dow analysts and look forward to a future in applied machine learning technologies.