Company growth and potential are conventionally analysed by studying comprehensive financial reports and operating status. For the many people who lack deep business insight or access to sufficient financial figures, is there a simpler yet statistically grounded method that can serve as a baseline reference?
The objective of this project is to use limited information and a small set of features to predict whether a company will eventually go public (IPO) or be acquired by another market player. It is not meant to serve as financial or investment advice; it is an exercise in integrating various modelling trials to support preliminary assessment decisions in a more structured and statistically explainable way.
The outcomes of the data analysis and model prediction provide moderate information on the competitive landscape across various market sectors, while modelling the growth potential of the companies backed by the Top 10 key investors.
Goal
To identify high-growth companies and predict their potential for a successful IPO or acquisition
Success Metrics
Focusing on True Positive Rate [ Sensitivity ]
Company predicted as Acquired/IPO does get acquired/IPO in real world
Assessed using the Receiver Operating Characteristic (ROC) curve, a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied
Target ROC_AUC score > 70%
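As a rough illustration of the two success metrics, the sketch below computes Class 1 recall (sensitivity) and ROC_AUC with scikit-learn. The labels and probabilities are made-up toy values, not project data, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import recall_score, roc_auc_score

# Toy labels: 1 = acquired/IPO, 0 = otherwise (illustrative values only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Hypothetical predicted probabilities of class 1 from some classifier.
y_score = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.1]

# Hard predictions at an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_score]

# Sensitivity: TP / (TP + FN), computed on the positive class.
sensitivity = recall_score(y_true, y_pred, pos_label=1)
# ROC_AUC uses the scores directly, so it does not depend on the threshold.
auc = roc_auc_score(y_true, y_score)

meets_target = auc > 0.70
print(f"recall={sensitivity:.2f}  roc_auc={auc:.3f}  meets_target={meets_target}")
```

Recall depends on the chosen threshold while ROC_AUC summarises all thresholds, which is why the two are reported side by side.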
Data Source
Raw dataset from Crunchbase @ 2015
Missing values handled via web-scraped data
Validation via web-scraped / direct webpage data @ 2018
Analytical Approach
1. Scope Definition to Focus on Market Sectors of High Interest
Since the available dataset dates back to 2015, the start-ups in focus should not have been funded too many years before that.
Criteria 1: Limit scope to companies whose last funding round was obtained in 2010 or later
Criteria 2: Restrict analysis to funding coming only from:
venture capitalist
seed funding
angel investor
private_equity
Criteria 3: Extract data on investments made by the Top 10 investors
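The three criteria can be sketched as a single pandas filter. The column names (`funding_round_type`, `last_funding_at`, `investor`) and the investor names are assumptions for illustration; the actual Crunchbase export may name them differently.

```python
import pandas as pd

# Toy funding-round records; column names are hypothetical.
rounds = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "investor": ["Fund X", "Fund Y", "Fund X", "Fund Z"],
    "funding_round_type": ["venture", "debt_financing", "seed", "angel"],
    "last_funding_at": pd.to_datetime(
        ["2012-05-01", "2013-01-15", "2008-09-30", "2014-03-10"]
    ),
})

ALLOWED_TYPES = {"venture", "seed", "angel", "private_equity"}
top_investors = {"Fund X", "Fund Z"}  # placeholder for the Top 10 list

in_scope = rounds[
    (rounds["last_funding_at"].dt.year >= 2010)          # Criteria 1
    & rounds["funding_round_type"].isin(ALLOWED_TYPES)   # Criteria 2
    & rounds["investor"].isin(top_investors)             # Criteria 3
]
print(in_scope["company"].tolist())
```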
2. Thorough Data Mining & Cleaning
Cleanse and filter the data based on the defined scope, then perform a preliminary analysis to extract the most relevant data for the next, more detailed level of analysis
3. Exploratory Data Analysis
Extraction of Top10 Key Investors based on the Total Investment Amount in USD
Reclassification of Market Sectors, grouping scattered sectors into the closest business fields
Similarity of Investment Portfolio via Network Graph
Missing Values Handling and Replacement via Web-Scraping
Preliminary Statistical Analysis for
Acquired companies
IPO companies
Closed down companies
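The first step in the list above, ranking investors by total investment amount in USD, might look like the following pandas sketch. The column names `investor` and `raised_amount_usd` and the amounts are illustrative assumptions.

```python
import pandas as pd

# Toy investment records; names and amounts are made up.
investments = pd.DataFrame({
    "investor": ["Fund X", "Fund Y", "Fund X", "Fund Z", "Fund Y"],
    "raised_amount_usd": [5e6, 2e6, 1e7, 3e6, 4e6],
})

def top_investors(df, n=10):
    """Return the n investors with the largest summed investment amount."""
    totals = df.groupby("investor")["raised_amount_usd"].sum()
    return totals.sort_values(ascending=False).head(n)

ranking = top_investors(investments, n=2)
print(ranking)
```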
4. Sampling Methods & Predictive Model Selection
Evaluation of various sampling methods for Imbalanced datasets
Evaluation of various classification models for best predictive model selection
5. Data Validation for Predictive Model Improvement
Validation of False Positive data versus latest state in 2018
Recalculation with latest state to check the actual predictive model performance
Analytical Outcomes
1. Finalization of Top10 Key Investors
2. Market Sectors In Focus by Top10 Key Investors
3. Similarity of Investment Portfolio via Network Graph
1st Visualization via networkx library
2nd Visualization via Neo4j Graph Platform
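The source does not state how portfolio similarity was measured. As one possible reading, the sketch below scores each investor pair by the Jaccard overlap of the companies they backed; the metric, the investor names and the portfolios are all assumptions.

```python
from itertools import combinations

# Hypothetical portfolios: investor -> set of companies backed.
portfolios = {
    "Fund X": {"A", "B", "C"},
    "Fund Y": {"B", "C", "D"},
    "Fund Z": {"E"},
}

def jaccard(a, b):
    """Shared companies divided by all companies held by either investor."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# One weighted edge per investor pair with at least one shared company.
edges = [
    (u, v, jaccard(portfolios[u], portfolios[v]))
    for u, v in combinations(sorted(portfolios), 2)
    if portfolios[u] & portfolios[v]
]
print(edges)
```

Triples in this `(u, v, weight)` form can be passed to networkx via `Graph.add_weighted_edges_from` for drawing.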
Analysis on Acquired Companies
Summary of companies under the Top10 investor list that exited via acquisition:
196 / 1674 companies from the Top 10 investors list were acquired
11.70% acquisition success rate
For companies founded since 1994

| Metric | Average | Median |
|---|---|---|
| Years to exit | 5.63 | 5 |
| funding_rounds | 3.09 | 3 |
| Investors from Top10 investors | 1.56 | 1 |
| Total funding | $66.67M | $30M |
For start-ups founded 2010 onwards [88 companies]

| Metric | Average | Median |
|---|---|---|
| Years to exit | 2.95 | 3 |
| funding_rounds | 2.25 | 2 |
| Investors from Top10 investors | 1.42 | 1 |
| Total funding | $35.79M | $8.25M |
Top 3 market sectors with the highest no. of companies acquired
Software\Apps
Content Creation\Entertainment\Curated Web\Design
Internet\Web\Search\Communication\Social Media
Analysis on IPO Companies
The same statistical analysis approach used for Acquired companies was repeated for IPO companies. Below are the graphical analytic outcomes for IPO companies:
Summary of companies under the Top10 investor list that went public successfully:
71 / 1674 companies were successfully listed
4.24% successful IPO rate
From 71 IPO companies:

| Metric | Average | Median |
|---|---|---|
| Total funding | $332.91M | $112.79M |
| funding_rounds | 5.51 | 5 |
| Investors from Top10 investors | 1.94 | 1 |
| years to exit | 2005 | 2006 |
Top 3 Market Sectors having the highest no. of IPO companies:
BioTech\Health Care : 25
Software\Apps : 9
eCommerce\Marketplace : 7
Top 3 IPO companies having the highest funding total:

| Rank | Company | Total_Funding |
|---|---|---|
| 1st | Alibaba | $4.81 Billion |
| 2nd | Facebook | $2.43 Billion |
| 3rd | Twitter | $1.16 Billion |
Analysis on Closed Down Companies
The same statistical analysis approach was repeated again for Closed Down companies. Below are the graphical analytic outcomes for Closed Down companies:
Summary of companies under the Top10 investor list that closed down:
52 / 1674 companies were closed down
3.11% failure rate
From 52 closed down companies:

| Metric | Average | Median |
|---|---|---|
| Total funding | $28.66M | $12.15M |
| funding_rounds | 2.56 | 2 |
| Investors from Top10 investors | 1.42 | 1 |
| years to exit | 2008 | 2009 |

[ 2008 Subprime Mortgage Financial Crisis ]
Top 3 Market Sectors having the highest no. of Closed Down companies:
The current dataset was found unsuitable for a comprehensive time series analysis because the key Date/Time features are incomplete and cannot support continuous time-dependent analysis. Therefore, prior to the core modelling, a new dataframe was set up that excludes companies with Closed Down status and combines companies that were acquired or went public into one category. This restructuring prepares the data for classification modelling.
Restructure 1: Exclude companies with Closed Down status
Restructure 2: Combine acquired companies and IPO companies as one new category
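The two restructuring steps could be expressed in pandas roughly as follows. The `status` column and its values (`operating`, `acquired`, `ipo`, `closed`) are assumed names, and the rows are toy data.

```python
import pandas as pd

# Toy company table; column name and status labels are assumptions.
companies = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "status": ["operating", "acquired", "closed", "ipo", "operating"],
})

# Restructure 1: drop companies with Closed Down status.
model_df = companies[companies["status"] != "closed"].copy()

# Restructure 2: acquired and IPO become one positive class (1),
# everything else that remains becomes the negative class (0).
model_df["target"] = model_df["status"].isin(["acquired", "ipo"]).astype(int)
print(model_df)
```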
1. Dataframe Readiness
2. Model Preprocessing
3. Sampling Method Evaluation
Since the dataset was observed to be imbalanced, an additional step was needed to evaluate sampling methods that balance the classes before modelling. LogisticRegressionCV was chosen as the base model for this evaluation since it was the simplest classification model.
Sampling methods evaluated:
Over Sampling_RandomOverSampler
Over Sampling_SMOTE
Combine Sampling_SMOTEENN
Combine Sampling_SMOTETomek
Under Sampling_RandomUnderSampler
Under Sampling_CondensedNearestNeighbour
Summary of Sampling Method Evaluation
Set up a summary table of the ROC_AUC score and the Class 1 recall score for each evaluated sampling method, as an easy reference for deciding which one to use in the more comprehensive model selection
With LogisticRegressionCV as the base model, the scores of the various sampling methods were compared. From the summary table,
Under Sampling_RandomUnderSampler had the highest ROC_AUC score, and its recall score for Class 1 was also relatively high compared with the other sampling methods.
Therefore, the final sampling method used for the model selection process is < Under Sampling_RandomUnderSampler >
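To make the chosen method concrete, here is a hand-rolled NumPy sketch of random under-sampling: every class is randomly reduced to the size of the smallest class. It is not the library implementation the project presumably used (the method names match imbalanced-learn's samplers); the function name, seed and toy data are assumptions.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep row indices per class, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # keep the surviving rows in their original order
    return X[keep], y[keep]

# Toy imbalanced data: 8 negatives, 2 positives.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_under_sample(X, y)
print(np.bincount(y_bal))
```

Under-sampling is normally applied to the training split only, so that the test set keeps the original class ratio.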
4. Predictive Model Selection
With the sampling method finalized as RandomUnderSampler, the data is preprocessed using StandardScaler and then under-sampled to achieve a balanced dataset for subsequent modelling
Classification models evaluated:
Logistic Regression CV
KNeighbors Classifier
SGD Classifier
Gradient Boosting Classifier
Support Vector Classifier
Random Forest Classifier
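A comparison of this kind can be sketched with scikit-learn's `GridSearchCV`, scoring each candidate on ROC_AUC. Only three of the six models are shown; the synthetic data, the tiny parameter grids and the use of plain `LogisticRegression` in place of `LogisticRegressionCV` are assumptions made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the already balanced training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate models with deliberately small, illustrative grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0]}),
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [3, 7]}),
    "rf": (
        RandomForestClassifier(random_state=0),
        {"clf__n_estimators": [50], "clf__max_depth": [3, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # Scale inside the pipeline so each CV fold is scaled on its own train part.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", model)])
    search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=3)
    search.fit(X, y)
    results[name] = search.best_score_

best = max(results, key=results.get)
print(results, "best:", best)
```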
Summary of Predictive Model Selection
Tabulate a summary of model performance to ease model finalization and the conclusion wrap-up
With grid search over various models and parameter trials, the best ROC_AUC scores were in the range of 0.60-0.67. Random Forest Classifier had the highest ROC_AUC score at 0.66, with its recall score for Class 1 the 2nd highest at 0.69.
Predictive Model Best Estimator & Features Importance
Based on the earlier model evaluation of the Random Forest Classifier, the best estimator and its parameters are as below:
Since the precision and recall scores for Class 1 in the confusion matrix were not up to expectation, all False Positives were validated next to check whether any of the companies predicted as IPO/acquired did get listed or acquired after 2015
The latest company status could be validated through a web-scraping script or by directly referencing the company webpage
17.49% of the False Positive companies were actually acquired or listed after 2015. After factoring in the latest company status, the overall recall score for Class 1 increased from 0.69 to 0.79.
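The recall adjustment is simple arithmetic: False Positives that later exited become True Positives, and they also join the pool of actual positives. The sketch below shows the calculation. The count of 32 reclassified companies is the figure reported in the project summary, but the `TP` and `FN` counts are illustrative placeholders, not taken from the source; they were picked only so that the arithmetic lands on the reported 0.69 and 0.79.

```python
def recall(tp, fn):
    """Recall for the positive class: TP / (TP + FN)."""
    return tp / (tp + fn)

def adjusted_recall(tp, fn, reclassified_fp):
    """Recall after moving later-exited False Positives into True Positives."""
    return recall(tp + reclassified_fp, fn)

TP, FN = 46, 21      # illustrative placeholders, NOT reported in the source
RECLASSIFIED = 32    # False Positives found acquired/listed after 2015

before = recall(TP, FN)
after = adjusted_recall(TP, FN, RECLASSIFIED)
print(f"recall before={before:.2f}  after={after:.2f}")
```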
Project Summary
Based on the 2015 dataset, the performance of companies under the Top10 investor list was as follows:
196 / 1674 companies from the Top 10 investors list were acquired
[ 11.70% acquisition success rate ]
71 / 1674 companies were successfully listed
[ 4.24% successful IPO rate ]
52 / 1674 companies were closed down
[ 3.11% failure rate ]
Because the dataset was imbalanced, a sampling method was applied. Based on the highest ROC_AUC score and recall score for True Positives, the final selected predictive model was the Random Forest Classifier coupled with the Random Under Sampler method. The analysis and prediction based on the 2015 dataset yielded a recall score of only 0.69 for True Positives and an AUC score of 0.66. Despite tuning various hyperparameters and evaluating different predictive models, the target metrics stayed between 0.6 and just under 0.7, which suggests the evaluated features limit how far the prediction score can improve.
Subsequent validation of the False Positive data against the latest state as of 2018 showed that an additional 32 companies (~17.5% of the False Positives) were acquired or listed after 2015. However, 5 companies (2.7%) were found to have closed down. Since the original dataset cut-off was 2015, a longer observation/holding period of 3 years for the companies predicted as high potential/high growth meant that around 17% more of them were acquired or listed.
Validating the False Positives against the latest company state increased the True Positive recall score to 0.79 and gave an adjusted ROC_AUC score of 0.72. In this project, the target of >70% was achievable with the evaluated predictive model only when a longer observation period was allowed.
With only 5 key features :
funding_total
funding_rounds
no. of key investors
founding year
market sectors
in addition to various scattered datasets, the model performed at a moderate level and can serve as a preliminary baseline for assessing a company's mid-to-long-term growth potential.
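For readers who want to inspect how much each of the five features contributes, a fitted random forest exposes `feature_importances_`. The sketch below uses random synthetic data, not the project dataset; the feature names are assumed spellings of the five features, and the market sector is treated as an already encoded numeric column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed names for the five features; market sector is pre-encoded here.
feature_names = [
    "funding_total", "funding_rounds", "n_key_investors",
    "founded_year", "market_sector_code",
]

# Synthetic data: the target depends mostly on the first two columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(feature_names)))
noise = rng.normal(scale=0.5, size=300)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its impurity-based importance, highest first.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```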