Lead Data Scientist (Jul 2019 – Oct 2021) at Synechron Technologies Pvt. Ltd.
Responsible for constant interaction with stakeholders to understand business needs; data cleansing and preparation; conducting statistical tests and exploratory analysis; building the best credit scoring model to understand customers' credit risk; forecasting the liquidity ratio; and building a recommendation system for corporate bonds.
Modelled customer credit risk using credit scoring methods such as survival analysis, logistic regression, linear discriminant analysis, Naïve Bayes, decision trees, random forests, SVM, PGM, and neural networks, choosing the best model by validating performance metrics.
Tools used: Python
• Extracted data from multiple sources; cleansed data to merge acquisition, performance, and macro-economic variables.
• Data preprocessing: missing value treatment, dummy encoding of categorical features, and generation of new features.
• Statistical tests: checked for multicollinearity; performed a Granger causality test to measure the impact of macro-economic variables on one of the credit events (default).
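The one-lag Granger test above can be sketched in plain Python by comparing a restricted autoregression of the default series against an unrestricted one that adds the lagged macro variable. The series here are synthetic and the variable names are illustrative, not the project's actual data:

```python
import random

def ols_ssr(rows, y):
    """Fit OLS via the normal equations (Gauss-Jordan) and return the SSR."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    a = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):                      # Gauss-Jordan elimination
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [a[r][j] - f * a[col][j] for j in range(k + 1)]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    resid = [yi - sum(b * xi for b, xi in zip(beta, r)) for r, yi in zip(rows, y)]
    return sum(e * e for e in resid)

def granger_f(y, x):
    """F-statistic: does lagged x improve a one-lag forecast of y?"""
    n = len(y) - 1
    yt, ylag, xlag = y[1:], y[:-1], x[:-1]
    ssr_r = ols_ssr([[1.0, a] for a in ylag], yt)                       # restricted
    ssr_u = ols_ssr([[1.0, a, b] for a, b in zip(ylag, xlag)], yt)      # + lagged x
    q, k = 1, 3  # one restriction tested; three params in the full model
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))

# Toy data where the macro series leads defaults by one period
random.seed(0)
macro = [random.random() for _ in range(50)]
defaults = [0.5] + [0.2 + 0.7 * macro[t] + 0.05 * (random.random() - 0.5)
                    for t in range(49)]
print(granger_f(defaults, macro))   # large F => lagged macro is informative
```

In practice a library routine (e.g. statsmodels' Granger test) would be used; this sketch only shows the restricted-vs-unrestricted SSR comparison behind the statistic.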
• Extracted features using entropy as a measure and using graphical networks.
• Modelled the data using logistic regression, survival analysis, Naïve Bayes, decision trees, random forests, neural networks, LDA, and a probabilistic graphical model.
• Logistic regression: found the appropriate classification threshold by plotting false positive against true positive rates (the ROC curve).
• Survival analysis: fitted a Cox Proportional Hazards model.
• status = 0 indicates the hazard event has not occurred; status = 1 indicates it has (the loan borrower defaulted); time represents the duration of survival of the loan borrower. Fitted a multivariate Cox regression model with macro-economic variables, loan characteristics, and equity as covariates.
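The (time, status) encoding above is exactly the input of a nonparametric survival curve. As a minimal sketch, here is a Kaplan-Meier estimator over invented loan records (the real work fitted a Cox model with covariates; KM is shown only to illustrate the time/status data layout):

```python
def kaplan_meier(records):
    """records: list of (time, status); status 1 = default, 0 = censored.
    Returns [(event_time, survival_prob)] at each default time."""
    records = sorted(records)
    surv, curve, i = 1.0, [], 0
    while i < len(records):
        t = records[i][0]
        deaths = sum(1 for r in records if r[0] == t and r[1] == 1)
        at_risk = sum(1 for r in records if r[0] >= t)
        if deaths:
            surv *= 1.0 - deaths / at_risk   # KM product-limit step
            curve.append((t, surv))
        i = sum(1 for r in records if r[0] <= t)   # skip to next distinct time
    return curve

# Invented loans: (months survived, defaulted?)
loans = [(3, 1), (5, 0), (7, 1), (7, 1), (9, 0), (12, 1), (15, 0)]
for t, s in kaplan_meier(loans):
    print(t, round(s, 3))
```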
• Probabilistic graphical model: implemented a PGM over all the variables and assigned conditional probabilities at each node. Users can query (reason about) any combination of levels of independent or dependent variables with respect to other variable levels using the Bayesian network.
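The node-level conditional probabilities and querying described above can be sketched with a toy two-parent network (Macro and LoanGrade feeding Default). The structure and CPT numbers here are invented for illustration; the project's actual network and probabilities differ:

```python
# Toy CPTs: P(Macro), P(LoanGrade), and P(Default=1 | Macro, LoanGrade)
P_macro = {"good": 0.7, "bad": 0.3}
P_grade = {"A": 0.6, "B": 0.4}
P_default = {
    ("good", "A"): 0.02, ("good", "B"): 0.10,
    ("bad", "A"): 0.08, ("bad", "B"): 0.30,
}

def p_default_given_macro(macro):
    """P(Default=1 | Macro=macro) by enumerating (summing out) LoanGrade."""
    return sum(P_grade[g] * P_default[(macro, g)] for g in P_grade)

print(round(p_default_given_macro("bad"), 3))   # 0.6*0.08 + 0.4*0.30 = 0.168
```

A real implementation would use a Bayesian-network library with proper variable elimination; enumeration is shown only because it makes the "query any combination of levels" idea concrete.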
• Validated the models by K-fold cross-validation.
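The K-fold validation step amounts to partitioning the rows into K disjoint test folds; a minimal index-splitting sketch (each candidate model would be refit on every training fold):

```python
def k_fold_indices(n, k):
    """Split range(n) into k disjoint folds; yields (train_idx, test_idx)."""
    folds = [list(range(i, n, k)) for i in range(k)]   # strided assignment
    for test in folds:
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        yield train, test

for train, test in k_fold_indices(10, 5):
    print(len(train), len(test))   # 8 train rows, 2 test rows per fold
```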
• Evaluated the models with confusion matrices, ROC curves, and the Gini index; also implemented data envelopment analysis (DEA) to choose the best-fitting model.
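Of the metrics above, AUC and Gini have a compact closed form: AUC is the probability that a random positive outscores a random negative, and Gini = 2·AUC − 1. A sketch on toy scores:

```python
def auc(scores, labels):
    """Rank-sum (Mann-Whitney) AUC: P(random positive score > random negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # toy model scores
labels = [1,   1,   0,   1,   0,   0]     # 1 = defaulted
a = auc(scores, labels)
gini = 2 * a - 1
print(round(a, 3), round(gini, 3))   # -> 0.889 0.778
```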
Built a product recommendation system using association rule mining / collaborative filtering techniques to recommend corporate bonds.
• Data: trade date, quantity, price, ticker, coupon, CUSIP, amount outstanding, maturity date, coupon structure, etc.
• Data cleansing and exploration: missing value treatment and imputation, discretisation of continuous variables, cross tabulation, scatter plots.
• Performed various statistical tests to explore the data and determine how the independent variables impact the dependent variable.
• Recommendation system: built association rule mining / collaborative filtering models to recommend corporate bonds to customers/clients.
• Data visualization: heat maps of the rules generated for different combinations of support and confidence.
• Graphical user interface: designed a GUI in R Shiny that lets the user choose different values of support and confidence, limit the number of rules generated, and filter on other variables.
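The support and confidence knobs exposed in the GUI are the core of association rule mining. A minimal sketch over invented bond "baskets" (the tickers are made up) that mines one-antecedent rules:

```python
from itertools import permutations

# Each basket = the set of bonds one client traded together (invented data)
baskets = [
    {"AAPL28", "MSFT30"}, {"AAPL28", "MSFT30", "GE27"},
    {"MSFT30", "GE27"}, {"AAPL28", "MSFT30"}, {"GE27"},
]

def rules(baskets, min_support=0.3, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) rules A -> B."""
    n = len(baskets)
    items = set().union(*baskets)
    out = []
    for a, b in permutations(items, 2):
        support_ab = sum(1 for t in baskets if {a, b} <= t) / n
        support_a = sum(1 for t in baskets if a in t) / n
        if support_ab >= min_support and support_ab / support_a >= min_confidence:
            out.append((a, b, support_ab, support_ab / support_a))
    return out

for a, b, s, c in rules(baskets):
    print(f"{a} -> {b}  support={s:.2f} confidence={c:.2f}")
```

Varying `min_support` and `min_confidence` here mirrors what the Shiny sliders do; a production system would use an Apriori/FP-growth implementation rather than brute-force pair enumeration.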
• Stock market analysis: data collection, visualization to understand patterns, beta calculation of individual stocks with respect to the market, and portfolio optimization.
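The beta calculation mentioned above is cov(stock, market) / var(market); a sketch on made-up daily returns:

```python
def beta(stock_returns, market_returns):
    """Stock beta: sample covariance with the market over market variance."""
    n = len(market_returns)
    ms = sum(stock_returns) / n
    mm = sum(market_returns) / n
    cov = sum((s - ms) * (m - mm)
              for s, m in zip(stock_returns, market_returns)) / (n - 1)
    var = sum((m - mm) ** 2 for m in market_returns) / (n - 1)
    return cov / var

market = [0.01, -0.02, 0.015, 0.005, -0.01]   # invented market returns
stock = [2 * r for r in market]               # perfectly levered: beta = 2
print(round(beta(stock, market), 2))   # -> 2.0
```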