DataHacks 21-22
Shark Tank - How Pitch Descriptions Make or Break the Deals
Thy Nguyen

Section 1: Introduction

Shark Tank is a famous TV show in which aspiring entrepreneurs pitch their companies/products to a panel of investors (aka "sharks") in the hope of securing an investment.
In this year's DataHacks, we are asked to predict whether or not a deal was made between a pitched business and the investors, using the description of the pitched business.
To accomplish that goal, we are given two datasets: a train set with 530 rows and 5 columns, and a test set with 176 rows and 3 columns.

In particular, the columns of the train set are:
  • Season_Epi_code (data type: int64, categorical-ordinal data) is a code denoting the season-episode pair in which the pitch occurred. The datasets cover 8 seasons of Shark Tank (US), and Season_Epi_code follows the format SEE. For instance, 115 means 1st season - 15th episode, and 803 means 8th season - 3rd episode.
  • Pitched_Business_Identifier (data type: str, categorical-nominal data) is an abbreviation of the pitched business
  • Pitched_Business_Desc (data type: str, categorical-nominal data) briefly describes the business and their products
  • Deal_Status (data type: int64, categorical-ordinal data) is the status of whether the pitched business got a deal or not, with two outcomes: 1 = "YES, the deal was made", 0 = "NO, the deal was not made"
  • Deal_Shark (data type: str except for null values, categorical-nominal data) indicates which of the most common sharks agreed to a deal with the presenters in that episode. The abbreviations used in this column are: BC - Barbara Corcoran, DJ - Daymond John, KOL - Kevin O'Leary, LG - Lori Greiner, MC - Mark Cuban, RH - Robert Herjavec.

    Figure 1: Train Set Header



    On the other hand, the test set has only 3 columns, whose descriptions are given above.

    Figure 2: Test Set Header

    Section 2: Data Cleaning / Pre-processing

    When conducting data cleaning and pre-processing, we have to take care of duplicates, irrelevant observations, missing data, inconsistent ways of recording data, and unexpected outliers.
  • For our training data set, all 530 observations are distinct, so we need not handle duplicates in this case.
  • We also observe that, except for the last column ("Deal_Shark"), there are no null values in the first 4 columns of the train set. Furthermore, as specified in the prompt, the Deal_Shark column will not be used to predict Deal_Status due to collinearity. Hence, we choose to drop this column from our train set.
  • Similarly, we notice that the input variables of the test set are unique and non-null. Hence, we find no duplicates and no missing data there either.
  • Additionally, both datasets are recorded in a consistent format: In particular, the data type in each column is consistent throughout the datasets.
  • However, the Pitched_Business_Desc column contains some special characters, such as the non-breaking space \xa0 (often rendered as Â), so we use a regex to remove those non-alphanumeric, non-space characters.
    Overall, to clean our data, we drop the "Deal_Shark" column and preprocess the Pitched_Business_Desc text; a sketch of this cleaning step is shown below.
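
    To make this concrete, below is a minimal sketch of the cleaning step in Python (pandas); the CSV file name is only a placeholder, and the exact regex we used may differ slightly.

```python
import re

import pandas as pd

# File name is a placeholder; the given train set has 530 rows and 5 columns.
train = pd.read_csv("train.csv")

# Drop the Deal_Shark column, which we do not use for prediction.
train = train.drop(columns=["Deal_Shark"])

def clean_text(text: str) -> str:
    """Keep only alphanumeric characters and spaces, then collapse whitespace."""
    # Removes special characters such as non-breaking spaces (\xa0).
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

train["Pitched_Business_Desc"] = train["Pitched_Business_Desc"].apply(clean_text)
```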

    In a later section, we will discuss how we split the given train data into train and test subsets.
    Figure 3: Train Data Header after Preprocessing

    Section 3: Exploratory Data Analysis

    In general, each season-episode pair features up to 5 pitched businesses.
    The most common number of pitched businesses per season-episode is 3, followed by 4, with 5 being the least common.
    Only Season 1 - Episode 6 had 5 pitched businesses in the same episode.
    Figure 4: Histogram of the Number of Pitched Businesses per Season-Episode
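
    As a rough sketch, the counts behind Figure 4 can be obtained with a simple groupby on the cleaned train DataFrame from the earlier sketch:

```python
# Number of pitched businesses in each season-episode, and the distribution
# of those per-episode counts (what the histogram in Figure 4 visualizes).
pitches_per_episode = train.groupby("Season_Epi_code").size()
print(pitches_per_episode.value_counts().sort_index())
```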

    Besides, since the "Deal_Status" column is our label, we should pay close attention to it.
    Specifically, in the train set, 283 of the 530 pitches resulted in a deal, compared with 247 that did not.
    Such a difference is not drastically large; however, it might still bias the classification model we build in a later section.
    Thus, we need to handle the data with care when splitting the train data into sub-train and sub-test sets.
    Figure 5: Proportion of Deal=YES over Deal=NO

    One might also wonder whether the episode number being even or odd affects the investors' decisions.
    Unfortunately, this factor does not contribute much to classifying the deal status.
    In fact, the proportions of deals made / not made in even-numbered episodes are very similar to those in odd-numbered episodes.
    On average, the proportion of "YES" deals is approximately 53%, and the proportion of "NO" deals is 47%.
    Figure 6: The Number of Deals Made and No Deals in Even and Odd Episodes

    Now, we take a look at the p-values of the chi-square test of independence applied to each pair of categorical variables.
    Figure 7: P-values of the Chi-square Test of Correlation between Each Pair of Categorical Variables

    Before we jump to conclusions from this chi-square test, let's define the null and alternative hypotheses.
    H0: There is no correlation between the two variables.
    H1: There is a correlation between the two variables.
    Let the level of significance be alpha = 0.05. We reject the null hypothesis if the p-value < 0.05 and conclude that the test favors the alternative hypothesis.
    As shown in the table above, all the non-null entries are above 0.05. Hence, for every pair of variables, we fail to reject the null hypothesis:
    the train data does not provide sufficient evidence of a correlation between any two of these categorical variables.
    Figure 8: Heatmap of the P-values of the Chi-square Test of Correlation
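
    To illustrate how such a p-value can be computed, here is a small sketch using scipy's chi-square test of independence on one example pair (episode parity vs. Deal_Status); it assumes the train DataFrame from the earlier sketches, and the exact pairs and encodings behind Figures 7-8 may differ.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Example pair: episode parity (even/odd Season_Epi_code) vs. Deal_Status.
# The same call can be repeated for any other pair of categorical columns.
parity = (train["Season_Epi_code"] % 2 == 0).map({True: "even", False: "odd"})
contingency = pd.crosstab(parity, train["Deal_Status"])

chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3f}")
# With alpha = 0.05, we fail to reject H0 whenever p_value >= 0.05.
```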

    Overall, from the above EDA, we conclude that Season_Epi_code is not a good predictor of Deal_Status.
    As a result, we are left with "Pitched_Business_Identifier" and "Pitched_Business_Desc" to classify the deal labels.
    Comparing those two attributes, we see that "Pitched_Business_Identifier" is not descriptive enough,
    and we have little to no clue about how to turn it into a good predictive variable.
    Thus, we finally choose "Pitched_Business_Desc" as the input variable of our classification model.

    Section 4: Feature Engineering

  • In the feature engineering process, we want to remove the stopwords from the "Pitched_Business_Desc" column.
    The reason is that stopwords are so common that they do not achieve a high TF-IDF value and are poor indicators of the importance or relevance of a string.
    Some stopwords we frequently see in daily life are "a", "the", "of", "in", etc.

  • Note that, before removing stopwords, we should convert all words to lowercase, because the NLTK stopword corpus only includes the lowercase version of each word. For instance, the corpus contains "the" but not "The".

  • Next, we normalize the corpus with a technique called lemmatization.
    In particular, related word forms such as "use", "uses", "used", and "using" are reduced to the single base form "use" as far as the classifier is concerned. Keeping a dozen variations of the same word adds redundancy that does not help our classifier, so we keep a single form (see the sketch below).
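
    A minimal sketch of these normalization steps with NLTK is shown below; it assumes the cleaned train DataFrame from Section 2 and uses a simple whitespace split, which may differ from the exact tokenization we used.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize_text(text: str) -> str:
    """Lowercase, drop stopwords, and lemmatize the remaining tokens."""
    tokens = text.lower().split()
    kept = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)

train["Pitched_Business_Desc"] = train["Pitched_Business_Desc"].apply(normalize_text)
```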

  • Besides, we use CountVectorizer and TfidfTransformer to transform the input data so that it can work with the Sklearn classifiers.
    In particular, CountVectorizer converts the input text into a sparse matrix of token counts. By default, the number of features equals the size of the vocabulary found in the data.
    Figure 9: Sparse Matrix after CountVectorizer on the Whole Train Set

  • On the other hand, TfidfTransformer transforms the count matrix obtained from CountVectorizer into a normalized tf or tf-idf representation.
    In our case, we exclude the use of idf (i.e., set idf(t) = 1, so tf_idf(t, d) = tf(t, d) * idf(t) = tf(t, d)).
    The reason is that we do not want rare terms to be weighted more heavily than the more frequent terms. In fact, the rare terms in Pitched_Business_Desc often come from the company names themselves, and such "rare terms" are not indicators of whether or not the investors will sponsor a pitched business. A sketch of this transformation follows Figure 10 below.
    Figure 10: Sparse Matrix after TfidfTransformer
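
    The following sketch shows this two-step transformation with scikit-learn, assuming the normalized train DataFrame from above; the variable names are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Token counts, then term frequencies without the idf re-weighting
# (use_idf=False), so rare terms such as company names are not boosted.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(train["Pitched_Business_Desc"])

tf_transformer = TfidfTransformer(use_idf=False)
X_train_tf = tf_transformer.fit_transform(counts)

print(X_train_tf.shape)  # (number of descriptions, vocabulary size) - sparse
```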

    Section 5: Model Building

    Since the given test set has no labels we can evaluate against, we split the train set into smaller train and test subsets (70% train / 30% test).
    We train our model using 4 classifiers (Decision Tree, Stochastic Gradient Descent, Multinomial Naive Bayes, and K-Nearest Neighbors), as sketched below.
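
    A sketch of this step is shown below; it assumes the X_train_tf matrix from Section 4, and the stratified split, random seed, and default hyperparameters are assumptions rather than our exact settings.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# 70/30 split of the transformed train data.  Stratifying on the label keeps
# the YES/NO proportions similar in both subsets (an assumption added here).
X_sub_train, X_sub_test, y_sub_train, y_sub_test = train_test_split(
    X_train_tf, train["Deal_Status"],
    test_size=0.3, random_state=42, stratify=train["Deal_Status"])

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "SGD": SGDClassifier(),
    "Multinomial NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_sub_train, y_sub_train)
    preds = model.predict(X_sub_test)
    print(f"{name}: test accuracy = {accuracy_score(y_sub_test, preds):.2f}, "
          f"test F1 = {f1_score(y_sub_test, preds):.2f}")
```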

    After that, we compare those models and pick the optimal choice.
    Figure 11: Model Comparisons

    To summarize:
  • As shown in the table above, the Decision Tree classifier overfits, reaching 100% train accuracy but only 51% test accuracy.
    That is little better than a coin flip, so we drop this model from further consideration.

  • Additionally, the remaining models seem not to suffer much from overfitting. Their train and test accuracies as well as F1-scores do not vary much from one another.

  • Compared to SGD and Multinomial Naive Bayes, K-Nearest Neighbors achieves the best train and test accuracies.
    Its F1-scores are better than those of the SGD classifier, and although they do not beat those of Multinomial Naive Bayes, the differences are not significant.

  • Overall, we pick the KNN classifier over the other three choices.

    Advantages and Disadvantages of KNN:
  • KNN is relatively simple to implement and easy to understand.
    Being an instance-based approach, KNN adapts quickly to new training data, which avoids a long training period.

  • Some drawbacks of KNN are that it does not work well with large datasets or with high-dimensional or noisy data.
    Fortunately, our datasets are not drastically large (530 observations in the original train set and 176 in the test set),
    and we do not consider our data high-dimensional or noisy. Hence, we can use KNN in the testing process.
    Section 6: Model Testing - Evaluation Metrics, Analysis, and Results

    After choosing KNN as our classifier, we retrain the model on the whole given train set and output the results, as sketched below.
    (Note: the predictions are stored in a csv file with 2 columns, "Pitched_Business_Desc" and "predictions".)
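
    A sketch of this final step is shown below; the test file name, the reuse of the earlier preprocessing helpers and fitted vectorizers, and the default KNN settings are assumptions.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Retrain KNN on the full train set, then predict on the given test set.
test = pd.read_csv("test.csv")  # placeholder file name
test_desc = test["Pitched_Business_Desc"].apply(clean_text).apply(normalize_text)
X_test_tf = tf_transformer.transform(count_vectorizer.transform(test_desc))

knn = KNeighborsClassifier()
knn.fit(X_train_tf, train["Deal_Status"])

pd.DataFrame({
    "Pitched_Business_Desc": test["Pitched_Business_Desc"],
    "predictions": knn.predict(X_test_tf),
}).to_csv("predictions.csv", index=False)
```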

    Evaluation Metrics on the Trainset
  • Train Accuracy: approx. 64%
  • F1 Score: approx. 0.692

    Analysis
  • The model performs better than a coin flip, and we expect the final test accuracy to fall in a similar range.
  • Similarly, the F1-score is close to 0.7, and we expect the test F1-score not to vary much from this training score.

    Result
  • Please see the csv file for more information.
    Section 7: Conclusion / Future Improvements

  • Our goal in this study is to classify the label of a pitched business, i.e., whether or not that business was able to make a deal during its
    appearance on the TV show Shark Tank.

  • We started our project by introducing the given datasets, cleaning the data, and conducting EDA and feature engineering,
    then building and testing the classification model.

  • After cleaning the data by dropping the unnecessary column, removing stopwords, and lemmatizing, we utilized CountVectorizer
    and TfidfTransformer to transform our data before building the classifier.

  • The four classifiers we used are Decision Tree, SGD, Multinomial Naive Bayes, and KNN. Among them, KNN gave
    the best result during this intense 24-hour DataHacks, whereas the Decision Tree classifier suffers from overfitting.

  • With that conclusion, we suggest some future improvements for our project. For instance, we could use grid-search cross-validation
    to find the optimal combination of hyperparameters for the KNN model (a sketch is shown below), or we could consider more advanced
    techniques such as deep learning, LSTMs, RNNs, and so on.
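
    For illustration, a grid search over a few KNN hyperparameters might look like the sketch below; the parameter values and the F1 scoring choice are only examples, not our exact setup.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative hyperparameter grid for KNN (values are examples only).
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # Manhattan vs. Euclidean distance
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X_train_tf, train["Deal_Status"])
print(search.best_params_, search.best_score_)
```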

  • Thank you for your time!

    Thy Nguyen