• Offshore Analytics COE

    Offshore Analytics COE – cracking the code

    What is an ACOE?

    Increasingly, companies rely on their information systems to provide critical data on their markets, customers and business performance in order to understand what has happened, what is happening – and to predict what might happen. They are often challenged, however, by the lack of common analytics knowledge, standards and methods across the organization. To solve this problem, some leading organizations are extending the concept of Centers of Expertise (COE) to enterprise analytics.

    With these COEs, they have realized benefits such as reduced costs, enhanced performance, more timely service delivery and a streamlining of processes and policies. An Analytics COE (ACOE) brings together a community of highly skilled analysts and supporting functions, to engage in complex problem solving vis-à-vis analytics challenges facing the organization. The analytics COE fosters enterprise-wide knowledge sharing and supports C-level decision making with consistent, detailed and multifaceted analysis functionality.

    The eternal debate – in-house versus outsource

    A scan of the market shows that the in-house and outsourced models are about equally prevalent, at least among India-based ACOEs. Most financial institutions, such as Citicorp, HSBC and Barclays, have chosen to go in-house, primarily because of data sensitivity concerns. Firms in industries where data security concerns are lower, such as CPG and pharma, typically engage third-party specialized analytics shops to set up an ACOE for them. When deciding between in-house and outsourced models, some points to keep in mind are:

    1. External consultants can be used for the heavy lifting, i.e. the data cleansing and harmonisation, modeling and reporting work. Internal resources, with their better understanding of the competitive scenario, internal business realities and management goals, can concentrate on using the insights generated from the analysis and reporting to formulate winning strategies and tactics.
    2. External consultants give you the flexibility to ramp up or down at short notice based on fluctuations in demand.
    3. Analytics resources span a wide variety of skill sets across data warehousing, BI, modeling and strategy. It is difficult to find people with skills and interests across all of these areas, and a given skill set is often not needed full time: a modeler, for example, might be needed only 50% of the time. If you hire internally, you have to utilize him or her sub-optimally for the remaining 50%; an external team gives you the flexibility to alter the skill mix with demand while keeping headcount constant, e.g. swapping a modeler for a DW/BI resource if the need arises.
    4. External consultants offer the possibility of leveraging experience across clients and domains.

    Initiating the engagement

    As with any outsourcing arrangement, setting up an ACOE is a three-step process:



    Ongoing governance of the relationship


    At TEG we recommend a three-tier governance structure, as described in the figure above, for all ACOE relationships:

    1. An execution-level relationship between analysts on both sides, which takes decisions on day-to-day deliverables
    2. A Project Manager – Client Team Lead relationship, which provides prioritization and resolves any execution issues
    3. A Client Sponsor – Consultant Senior Management relationship, which handles relationship issues, contractual matters and account expansion

    Projects executed under ACOE

    Typically, any project or process that needs to be done on a regular, repeated basis is ideal for an ACOE. Building out an ACOE ensures a high level of data and business understanding, as the same analysts work across multiple projects. This set-up is not suitable for situations where the analytical work happens in spurts, with periods of inactivity in between.

    TEG runs ACOEs for several Fortune 500 clients, whose analysts are engaged in a variety of tasks:

    1. Apparel & Sports goods retailer
      • Maintain an analytical data mart of all sales, sell-through, sell-in and pricing data across multiple franchisee stores and accounts
      • Maintain the entire suite of sell-through reporting for the retail operations, merchandising and sales teams. This set of reports includes sales and inventory tracking, SKU performance and promotion tracking at various levels
      • Formulate promotion pricing strategy for factory outlet stores using sell-through data
    2. Beauty products major
      • Survey analytics: identifying key trends from survey results and driver analysis
      • Market basket analysis: analysing past purchase history to identify product combinations that have a natural affinity for each other. Insights based on this analysis are used for cross-promotions, brochure layout, discount plans, promotions and inventory management
      • ETL on the sales and marketing data to create an analytical data mart that can be used as a DSS tool for strategic pricing and product management decisions
      • Online competitor price tracking: a link extractor scrapes price aggregator and competitor websites and builds a database of competitor product prices, which our client uses for price comparison studies and strategic pricing decisions
      • Generation of executive management workbooks to track the market share of the Top 100 products and provide analytical insights
    3. Credit card and personal finance firm
      • Creation of basic customer marketing, risk and collections reports with multiple slicers for extensive deep-dive analysis of customer transaction data
      • Collection queue analysis, ensuring equitable distribution of collection calls amongst collections agents
      • Customer lifetime value analysis
      • Customer product switching analysis
      • Acquisition and active customer model scoring and refresh
    4. Nutritional & consumer products MLM firm
      • Campaign management using SAS, SQL and Siebel: complete campaign management including propensity model creation, audience selection for specific campaigns, campaign design using DOE methodology, control group creation, campaign loading in the CRM system and post-campaign analysis
      • Customer segmentation
      • Distributor profitability analysis
      • Customer segment migration analysis using Markov chain based models
    5. CPG major in household cleaning products
      • Creation of a digital analytics data mart using data from 18+ sources across 11 marketing channels
      • Creation and maintenance of the complete reporting and dashboard suite for digital marketing analysis and reporting
      • Price and promotion analysis, price elasticity modeling, and a pricing tool to determine the revenue and profitability impact of key pricing decisions
      • Market share reporting across 25 countries in LATAM & APAC
      • Creation of data feeds for MMX modeling
      • Shipment, inventory and consumption analysis with a view to optimizing inventory and shipping costs
      • SharePoint dashboard creation to track usage of corporate help resources
    6. Consulting company focused on automobile sector
      • Demand forecasting of automotive sales based on variations in marketing spend across DMAs
      • Propensity modeling to determine the ideal prospects for direct sale of customized electric vehicle
      • Customer segmentation to determine the ideal customer profile for relaunch of a key model

    Key takeaways

    The ACOE model has been successfully deployed by clients across a variety of industries to beef up their analytical capabilities.

    In some cases the requirement is tactical, for a limited period of time, but mostly clients use the model strategically to harness best-of-breed capabilities that are difficult to build in-house. The critical success factors in an ACOE relationship are:

    1. Strong understanding of the client's business processes by the consultant team. This is usually achieved by posting key resources onsite, either permanently or on a rotational basis
    2. Strong governance at multiple levels
    3. Tight adherence to business and communication processes by both parties
    4. Well defined scope of services for the consultant teams
  • Improving Marketing Effectiveness (Using Performance Pointer)

    Improving Marketing Effectiveness (Using Performance Pointer)

    Retail and consumer goods companies run multiple campaigns, promotions and incentives to lure customers into buying more, and a great deal of time, energy and resources goes into executing these promotion programs. In most organisations, however, the allocation of funds to the various programs is based on gut feel or past experience. If a promotion does not pan out as intended, or performs better than expected, the decision maker is unable to explain the phenomenon or repeat the performance. Promotions may also be targeted at a macro segment when they are effective only for a micro segment, reducing the overall effectiveness of the program.


    Using analytics, gut-based decisions can be supported by facts, helping the business make better decisions and stay ahead of competitors who rely purely on gut feel or past experience. We have chosen an MLM (Multi Level Marketing) company as an example because the problem of allocating funds across incentives is amplified by the large sales forces MLM companies engage. Retail and consumer goods companies can draw a parallel between these incentives and the promotions they run through the year, targeted at various segments to improve sales.

    According to Philip Kotler, one of the distribution channels through which marketers deliver products and services to their target market is Direct Selling where companies use their own sales force to identify prospects and develop them into customers, and grow the business. Most Direct Selling companies employ a multi-level compensation plan where the sales personnel, who are not company employees, are paid not only for sales they personally generate, but also for the sales of other promoters they introduce to the company, creating a down line of distributors and a hierarchy of multiple levels of compensation in the form of a pyramid. This is what we commonly refer to as Multi-Level Marketing (MLM) or network marketing. Myriad companies like Amway, Oriflame, Herbalife, etc. have successfully centered their selling operations on it. As part of Sales Promotion activities MLM companies run Incentive programs for their Sales Representatives who are rewarded for their superior sales performance and introducing other people to the company as Sales Representatives.


    Business Challenge
    Although incentives play a major role in sales lift, many other environmental factors, such as advertising spend, the economic cycle, seasonality, company policies and competitor policies, also affect sales, and it becomes increasingly difficult to isolate the impact of incentives. MLM companies usually run multiple, overlapping incentive programs, i.e. at any given time more than one incentive program is running. See figure below. Rewards can be monetary or non-monetary (jewelry, electronic items, travel, cars etc.) and are offered on a market-by-market basis. A key question that arises is: how do we understand the effectiveness of these multiple incentive programs?

    Because the success of any MLM company is largely dependent on the performance of its Sales Representatives, incentive programs are of paramount importance for realising the company's marketing objectives and form a vital component of its marketing mix. Some of the large Direct Selling corporations spend millions of dollars on incentive programs in every market. It is important for incentive managers to understand the effectiveness of these programs, often measured in terms of lift in sales, in order to drive a higher Return on Incentive Investment rather than make gut-based investment decisions.

    To summarise, it is not easy to measure the ROI of incentive programs for two broad reasons. First, it is difficult to separate the lift in sales due to incentive programs from that due to other concurrent communication and marketing mix actions. Second, in most cases there is no “silence period”, i.e. one or more incentive programs are running at all points in time, which makes baseline sales estimation virtually impossible by conventional methods.
    Analytics – the science of making data-driven decisions – becomes indispensable for addressing these constraints while statistically quantifying the ROI of individual incentives and making sales forecasts with sound accuracy. In doing so, a systematic approach based on best practices is followed in order to obtain reliable results in a consistent and predictable manner.
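    To make the idea concrete, here is a minimal sketch (entirely synthetic numbers, written in Python for illustration) of how regression can separate the lifts of overlapping incentive programs from baseline sales even when there is no silence period, provided the programs' on/off schedules differ:

```python
import numpy as np

# Two overlapping incentive programs with no common "silence period".
# Because their schedules differ, regression can still attribute lift.
rng = np.random.default_rng(0)
weeks = 104
incentive_a = (np.arange(weeks) % 8 < 5).astype(float)  # program A schedule
incentive_b = (np.arange(weeks) % 6 < 4).astype(float)  # program B overlaps A
true_baseline, true_lift_a, true_lift_b = 100.0, 20.0, 12.0
sales = (true_baseline + true_lift_a * incentive_a
         + true_lift_b * incentive_b + rng.normal(0, 2, weeks))

# Intercept column recovers the baseline; each indicator recovers a lift
X = np.column_stack([np.ones(weeks), incentive_a, incentive_b])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
est_baseline, est_lift_a, est_lift_b = coef
```

    In practice the design matrix would also carry the control variables (seasonality, media spend etc.) discussed later, but the decomposition principle is the same.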

    It is very tempting to jump straight into the data exploration exercise. However, a structured approach ensures the outcome will be aligned with the business objectives and the process is repeatable.

    Objective Setting
    The first step towards building an analytics-based solution is to list the desired outcomes of the endeavor before analysing any data. This means thoroughly understanding, from a business perspective, what the company really wants to accomplish. For instance, one MLM company may want to evaluate the relative effectiveness of the various components of its marketing mix, which includes incentives along with pricing, distribution and advertising, while another may be interested in tracking the ROI of its past incentive programs in order to plan future ones. In addition to the primary business objectives, there are typically other related business questions the incentive manager would like to address. For example, while the primary goal could be ROI estimation of incentive programs, the incentive manager may also want to know which segment of Representatives is more responsive to a particular type of incentive, or whether incentives are more effective in driving the sales of a particular product category. It may also be prudent to design the process so that it can be deployed repeatedly across multiple countries, rapidly and cost-effectively; this is feasible where a direct selling company has uniform data encapsulation practices across all markets. A good practice while setting objectives is to identify potential challenges at the outset. The biggest is the sheer volume of operational data: for retail and direct sellers it runs into billions of observations over a period of a few years.

    Data Study
    Most Direct Sellers maintain sales data at granular levels, aggregated over time and across categories, along with incentive attributes, measures and performance indicators. We can therefore safely assume that most companies will have industry-specific attributes such as a multi-level compensation system. However, every MLM company will also have its own specific set of attributes that differentiate it from its competitors. It is therefore vital to develop a sound understanding of the historical data in the given business context before using it for model building. This exercise also entails accurately understanding the semantics of the various data fields. For instance, every MLM company will associate a leadership title with its Representatives, but the meaning of a title, and the business logic used to arrive at a Representative's leadership status, will vary from company to company. A good understanding of other Representative population attributes, such as age, duration of association with the company, activity levels and down line counts, also leads to robust population segmentation. Depending on the business objectives, other data on media spend, competitor activity, macroeconomic variables etc. should also be used. A potential issue of data fragmentation can arise here, as voluminous data is broken up into smaller parts for ease of storage and needs to be recombined logically and accurately when reading the raw data.
    Moreover, raw data coming from a data warehouse usually contains errors and missing values, so it is important to identify them through a comprehensive data review exercise and correct them once the findings have been validated by the client. In extreme cases the errors and inconsistencies may warrant a fresh extraction of data, leading to an iterative data review that also validates the entire data understanding process. Any lapse in data understanding before preparing data for modeling can introduce bias and estimation errors later. A data report card helps clients understand the gaps in their data and establish procedures to fill them. A sample scorecard is shown for reference.
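    The report card idea can be sketched in a few lines of pandas; the column names and the validity rule below are hypothetical, standing in for whatever field-level rules the client agrees to:

```python
import numpy as np
import pandas as pd

# Hypothetical raw extract: per-field missing and suspect-value counts,
# the kind of summary a client validates before correction or re-extraction.
df = pd.DataFrame({
    "rep_id": [1, 2, 3, 4, 5],
    "sales": [120.0, np.nan, 95.0, -10.0, 210.0],  # a negative sale is suspect
    "title": ["Leader", "Member", None, "Member", "Leader"],
})

report = pd.DataFrame({
    "missing": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(1),
})
# Field-specific validity rules are appended column by column
report["suspect_negative"] = 0
report.loc["sales", "suspect_negative"] = int((df["sales"] < 0).sum())
```

    Each row of `report` then becomes one line of the scorecard shown to the client.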


    Data Preparation
    The modeling phase requires clean data, with specific information in a suitable format; data received from the company cannot be used as modeling input as-is. Data preparation transforms the input data from various sources into the desired shape. Not all of the information in the raw data may be needed: variables like the name of an incentive program, the order placement channel or the educational qualifications of Representatives may not be needed for model building. The key variables that must go into the model are identified and redundancies are removed. Observations with incorrect data are deleted, and missing values are either ignored or suitably estimated. Using the derived Representative attributes, the population is segmented into logically separate strata. Data from the different sources is combined into a single table, and new variables are derived to enrich it. In the final step of data preparation, the measures are aggregated across periods, segments, geographies and product categories; this aggregated data is the input for modeling.
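    The final aggregation step might look like the following pandas sketch; the segment, period and measure names here are hypothetical:

```python
import pandas as pd

# Order-level measures rolled up to segment x period, the level fed to modeling
orders = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "A"],
    "period": ["2015-Q1", "2015-Q1", "2015-Q1", "2015-Q2", "2015-Q2"],
    "sales": [100, 150, 80, 120, 90],
    "incentive_cost": [10, 10, 5, 8, 12],
})
model_input = (orders
               .groupby(["segment", "period"], as_index=False)
               .agg(sales=("sales", "sum"),
                    incentive_cost=("incentive_cost", "sum")))
```

    The same roll-up extends to geography and product category as additional grouping keys.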



    Data Modeling
    A statistical model is a set of equations and assumptions that form a conceptual representation of a real-world situation. For an MLM company, a model could be a relationship between sales and variables such as incentive costs, the number of Representatives, media spend, incentive attributes and Representative attributes. Before commencing the modeling exercise, the level at which the model should be built needs to be ascertained. A Top Down approach builds the model at the highest level and proportionally disseminates the results down to the population segment level and then to the individual level; the resulting model may fail to account for the variation in the dependent variable and may bias the estimates, because the existence of separate strata in the population is ignored during model building. A Bottom Up approach, on the other hand, builds the model at the individual level and aggregates the results up to the segment level and then to the top level. This is exhaustive but very tedious, as sales data usually runs into millions of observations and not all Representatives contribute actively to sales at all times; moreover, if the project objective revolves around estimating national figures, this effort may be redundant. The Middle Out approach is often the best: the model is built at the segment level, and depending on the requirement the results are aggregated up to the top level or proportionately disseminated down to the individual level.

    The first step of the model building exercise is to specify the model equation. This requires determining the dependent variable, the independent variables and the control variables. Control variables are variables that condition the relationship between the dependent and independent variables.
    In a baseline estimation scenario, the sales measure is the dependent variable; incentive cost and other incentive attributes form the independent variables; and segmentation variables, the time series, geography, inflation and other variables like media spend act as control variables. Usually the model is non-linear, i.e. the dependent variable is not directly proportional to one or more independent variables; a non-linear model may be transformed into a linear one by appropriate data transformations. For example, the relationship between sales and incentives is non-linear: Representative incentives behave like consumer coupons, with an initial spike in sales at the start of the incentive followed by a rapid decline, and a renewed impact at the end of the incentive as Representatives try to beat the deadline. Applying a coupon transformation to the incentive variables therefore produces a linear relationship between sales and incentives. Model coefficients are then estimated using statistical techniques such as factor analysis, regression and Unobserved Component Modeling. The common industry practice is to use regression analysis to explain the relationship between the dependent and independent variables and to separately employ time series ARIMA (Auto Regressive Integrated Moving Average) models for forecasting, since the data invariably has a time component. To fit regression models with all incentive attributes accounted for, the attributes are first condensed into a few underlying factors that account for most of the variance; these factors then enter the regression along with the control variables, and the final coefficients are a combination of factor loadings and model coefficients. The resulting regression equation lets us understand how the expected value of the dependent variable changes when any one independent variable is varied while the others are held fixed.
    The time series model is developed by reducing the non-stationary data to stationary data, removing the seasonal and cyclic components, and estimating the coefficients of the ARIMA model. This approach can lead to discordant answers from the regression and ARIMA models, as regression analysis will miss the trend while ARIMA forecasting may fail to account for causal effects. Unobserved Component Modeling may be employed if very high accuracy is desired. It leverages the core idea of time series analysis, that observations close together tend to behave similarly while patterns of correlation (and regression errors) break down as observations get farther apart in time, and therefore allows the regression coefficients to vary over time. The usual observed components representing the regression variables are estimated alongside unobserved components such as trend, seasonality and cycles; these components capture the salient features of the data series that are useful in both explaining and predicting its behavior. Once the model coefficients are determined, it is essential to validate the model before using it for forecasting.
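    A toy sketch of the coupon-style transformation and subsequent linear fit described above; the functional form, decay rate and rebound size are illustrative assumptions, not the actual transformation used in practice:

```python
import numpy as np

def coupon_transform(weeks, start, end, decay=0.5, rebound=0.8):
    """Coupon-shaped response for one incentive window: a launch spike
    that decays, plus a rebound in the final week (deadline rush)."""
    x = np.zeros(weeks)
    for t in range(start, end):
        x[t] = np.exp(-decay * (t - start))  # launch spike with decay
    x[end - 1] += rebound                    # Representatives beat the deadline
    return x

weeks = 52
z = coupon_transform(weeks, start=10, end=16)

# Sales are linear in the *transformed* variable, so OLS recovers the effect
rng = np.random.default_rng(1)
sales = 100 + 30 * z + rng.normal(0, 1, weeks)
X = np.column_stack([np.ones(weeks), z])
(base_est, effect_est), *_ = np.linalg.lstsq(X, sales, rcond=None)
```

    Fitting sales against the raw on/off indicator instead of `z` would misattribute the spike-and-rebound shape to noise; the transformation is what makes the linear model appropriate.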


    Model Validation
    The validity of the model is contingent on certain assumptions that must be met. First, the prediction errors should be normally distributed about the predicted values, with a mean of zero. If the errors have unequal variances, a condition called heteroscedasticity, the Weighted Least Squares method should be used in place of Ordinary Least Squares regression; a plot of residuals against the predicted values of the dependent variable, any independent variable or time can detect a violation of this assumption. Another assumption made for time series data is that the prediction errors should not be correlated through time, i.e. the errors should not be autocorrelated. This may be checked using the Durbin-Watson test; if the errors are found to be autocorrelated, Generalized Least Squares regression should be used.

    It is also important to check for correlation among the independent variables, a condition called multicollinearity. It can induce errors in the coefficient estimates and inflate their observed variances, as indicated by a variable's Variance Inflation Factor (VIF). Multicollinearity can be easily detected in a multiple regression model using a correlation matrix of all the independent variables: high correlation coefficients indicate multicollinearity. The simplest remedy is to remove collinear variables from the model equation, but this is not always feasible. For example, the cost of an incentive program is an important variable that cannot be removed even if it is found to have a high VIF; in such cases Ridge regression may be used in place of OLS regression, at the cost of some bias in the coefficient estimates.

    The goodness of the model fit may be judged by the value of R², the coefficient of determination, which ranges between 0 and 1; a good fit will have an R² value greater than 0.9.
However, an R² very close to 1 should be treated with caution, as it may indicate over-fitting, and an over-fitted model gives inaccurate forecasts. A low Mean Absolute Percentage Error (MAPE) on the predicted values is also indicative of a good fit. Once the model assumptions are validated and goodness of fit is established, the model equation can be used for reporting and deployment.
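    Two of the checks above, VIF for multicollinearity and the Durbin-Watson statistic for autocorrelated errors, can be sketched on synthetic data as follows (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)
y = 2 + x1 + x3 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns, VIF = 1/(1 - R^2)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vif_x1, vif_x3 = vif(X, 0), vif(X, 2)  # collinear pair inflates, x3 does not

# Durbin-Watson on the full model's residuals: values near 2 mean no
# autocorrelation; values near 0 or 4 indicate positive/negative correlation
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
```

    Here `vif_x1` blows up because x1 and x2 carry nearly the same information, while `vif_x3` stays near 1; the residuals are independent by construction, so `dw` lands near 2.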


    Reporting & Deployment
    Depending on the dependent variable chosen for the incentive modeling exercise, baseline measures such as sales, volume and Representative count can be estimated using the model equation. These estimates, along with other variables and derived values, can be used to obtain insights about incentive performance through dashboards with KPIs and other pre-defined reports, such as annual lift in sales vs. incentive cost or baseline sales vs. Sales Representative count. The key to realizing the business objectives and deriving value from the modeling outcomes is to capture and present the findings in a form that enables the end user to understand the business implications and to flexibly slice and dice the data, without costly investments in acquiring and maintaining system resources. For example, an incentive manager could view the average ROI of a particular type of incentive program as a pre-built report, and also compare the cost of that incentive with another type, on an online hosted analytics platform offering pre-canned reports, user-customizable reports and multi-dimensional data analysis. Such a system gives the end user the freedom to access the reports and analyse the data anytime, anywhere, using an internet browser. Once deployed, it can be refreshed with additional data and reused across multiple markets with minor region-specific customisations.




    This insight-based approach significantly increases the confidence of incentive managers when planning incentive programs for MLM activities. They can identify the incentive programs that deliver high, medium and low paybacks, and optimize investment accordingly. It also helps the Direct Seller check whether some product categories are more responsive to incentives than others. The endeavor can make a significant impact where counter-intuitive facts surface: a particular event or holiday that influences the design of incentive programs at a certain time of year may actually turn out to be an insignificant contributor to company sales. Incentive managers can simulate various scenarios by assigning different values to the contributors and macroeconomic variables and forecast the ROI of upcoming incentive programs. This enables regional incentive managers to drive efficiency and effectiveness in incentive planning and realise the company objective of enhanced sales and ROI. The share of incentive programs in the marketing budgets of most Direct Sellers has been progressively increasing, and the expenditure incurred keeps rising in the face of competition in emerging markets like India and China, which are fast becoming the engines of growth for global Direct Sellers. Investment in analytics-based decision support systems will prove to be the difference maker for Direct Sellers.


  • Random Forest in Tableau using R

    Random Forest in Tableau using R

    I have been using Tableau for some time to explore and visualize data in a beautiful and meaningful way. Quite recently, I learned that there is a way to connect Tableau with R, an open source environment for advanced statistical analysis. Marrying the data mining and analytical capabilities of R with the user-friendly visualizations of Tableau gives us the ability to view and optimize models in real time with a few clicks.

    As soon as I discovered this, I tried to run the machine learning algorithm Random forest from Tableau. Random forest is a machine learning technique to identify features (independent variables) that are more discerning than others in explaining changes in a dependent variable. It achieves that by ensembling multiple decision trees that are constructed by randomizing the combination and order of variables used.

    The prediction accuracy of Random forest depends on the set of explanatory variables used in the formula. To arrive at the set of variables that makes the best prediction, one often needs to try multiple combinations of explanatory variables and then analyze the results to assess the accuracy of the model. Connecting R with Tableau will help you save a lot of time that would have otherwise gone into the tedious task of importing the data into Tableau every time you add/remove a variable.
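    The subset-search workflow can be sketched as follows. This is a model-agnostic illustration in Python with a linear fit standing in for the random forest (so the sketch needs only numpy; the article itself does this with randomForest in R), and all variable names and data are hypothetical:

```python
import itertools
import numpy as np

# Score every combination of candidate variables on a holdout set
# and keep the one with the lowest prediction error.
rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 4))                          # candidate features 0..3
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)   # only 0 and 2 matter

train, test = slice(0, 200), slice(200, n)

def holdout_rmse(cols):
    cols = list(cols)
    A = np.column_stack([np.ones(200), X[train][:, cols]])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    B = np.column_stack([np.ones(n - 200), X[test][:, cols]])
    return float(np.sqrt(np.mean((y[test] - B @ coef) ** 2)))

scores = {cols: holdout_rmse(cols)
          for r in range(1, 5)
          for cols in itertools.combinations(range(4), r)}
best_subset = min(scores, key=scores.get)  # should pick up features 0 and 2
```

    The Tableau/R connection described next automates exactly this loop: instead of editing the feature list in code for every iteration, the variables are swapped from the Tableau side.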

    Tableau has a function, SCRIPT_REAL(), that lets you run R scripts from Tableau. To use this function in a calculated field, you first need to set up the connection by following these steps:

    1. Open R Studio and install the package ‘Rserve’


    2. Run the function Rserve()


    3. Once you see the message “Starting Rserve…”, open Tableau and follow the steps below to set up the connection


    When you click on “Manage External Service Connection” or “Manage R Connection” depending on the version of Tableau, you’ll see the following window.


    Click OK to complete the connection between Tableau and R on your machine.

    Let’s take a simple example to understand how to leverage the connection with R to run Random Forest. In this example, I need to predict the enrollments for an insurance plan based on its features (say costs and benefits) and the past performance of similar plans.

    After importing the dataset into Tableau, we need to create a calculated field that uses SCRIPT_REAL() to run the random forest script. As a standalone R script, the model looks like this:

    library(randomForest)
    Data <- read.csv("C:/Tableau/Test 1.csv")
    Data15 <- Data[Data$Year == 2015, ]  # training years
    Data16 <- Data[Data$Year == 2016, ]  # plans to score
    attach(Data15)
    rf <- randomForest(Enrollments ~ ., data = Data15, ntree = 1000, importance = TRUE, do.trace = 100)
    yhat <- predict(rf, Data16)
    Data16$Enrollments <- yhat

    To run the same script in Tableau using SCRIPT_REAL(), we need to create a dataframe using only the required columns of the imported dataset. This must be done using the arguments .arg1 through .arg6 instead of the actual column names, since R can only access the data referred to through these arguments.

    The values for these arguments are passed after the R script string, in order: .arg1 takes the values of the first field listed, .arg2 the values of the second, and so on.

    After making these changes, the calculated field looks like the following (again, the year-based split is illustrative):

    SCRIPT_REAL('
    library(randomForest)
    Data <- data.frame(.arg1, .arg2, .arg3, .arg4, .arg5, .arg6)
    Data15 <- Data[Data$.arg1 == 2015, ]
    Data16 <- Data[Data$.arg1 == 2016, ]
    formula <- .arg2 ~ .arg3 + .arg4 + .arg5 + .arg6
    rf <- randomForest(formula, data = Data15,
                       ntree = 1000, importance = TRUE, do.trace = 100, na.action = na.omit)
    yhat <- predict(rf, Data16)
    Data16$.arg2 <- yhat
    testdata <- rbind(Data15, Data16)
    testdata$.arg2
    ', ATTR([Year]), SUM([Enrollments]), SUM([Plan feature 1]), SUM([Plan feature 2]), SUM([Plan feature 3]), SUM([Plan feature 4]))

    The calculation must be set to “Plan ID” level to get the predictions for each plan ID.

    Although this approach achieves the objective of predicting enrollments for each plan, it doesn’t offer the flexibility to run multiple iterations without changing the code manually. To make running the model easier, we can create parameters, as shown below, to choose the variables that go into the model.


    Then, we can create calculated fields (as shown below) whose values change based on the variables selected in the parameters.

    CASE [Parameter1]
    WHEN "Plan Feature 1" THEN [Plan feature 1]
    WHEN "Plan Feature 2" THEN [Plan feature 2]
    WHEN "Plan Feature 3" THEN [Plan feature 3]
    WHEN "Plan Feature 4" THEN [Plan feature 4]
    ELSE 0
    END

    After replacing the fields in the code with these parameter-driven calculated fields, the calculation looks like the following:

    SCRIPT_REAL('
    library(randomForest)
    Data <- data.frame(.arg1, .arg2, .arg3, .arg4, .arg5, .arg6)
    Data15 <- Data[Data$.arg1 == 2015, ]
    Data16 <- Data[Data$.arg1 == 2016, ]
    formula <- .arg2 ~ .arg3 + .arg4 + .arg5 + .arg6
    rf <- randomForest(formula, data = Data15,
                       ntree = 1000, importance = TRUE, do.trace = 100, na.action = na.omit)
    yhat <- predict(rf, Data16)
    Data16$.arg2 <- yhat
    testdata <- rbind(Data15, Data16)
    testdata$.arg2
    ', ATTR([Year]), SUM([Enrollments]), SUM([var 1]), SUM([var 2]), SUM([var 3]), SUM([var 4]))

    This lets us run multiple iterations of random forest far more easily than manually adding and deleting variables in the R code for every iteration. But, as you might have observed, this code takes exactly four variables. That can be a problem, since a fixed number of variables in the model is a luxury you rarely (read: never) have.

    To keep the number of variables dynamic, a simple workaround is to select “None” in a parameter, which sets the corresponding variable to 0 in the data. A constant, all-zero column offers the model no useful splits, so the random forest effectively ignores it.

    As long as the number of variables is not too high, you can create as many parameters as you need and select “None” in a parameter whenever you don’t want to include any more variables.

  • What happens when the big boys of the insurance industry meet under one roof?

    What happens when the big boys of the insurance industry meet under one roof?

    Apart from assessing how deep everyone’s pockets are, they discuss what they can do to make them deeper. The last 10 years have been incredibly profitable for the insurance industry as a whole – Personal, Commercial, Property, Life, Annuity, Healthcare – you name it! However, the landscape is changing: social networking is shifting the balance of power to consumers, environmental pressures need to be addressed, economic power is rising in emerging markets, geo-political issues loom and, last but not least, data and technology are exploding. All of these pose great risks to the insurance industry. And, typically, insurers take risk seriously!


    It was a pleasant summer morning in a big conference hall in Chicago where the leaders of the insurance industry descended. All of them were Chief Data/Insurance/Integration/Digital/Analytics/Innovation/Customer/Data Science Officers – basically anyone who had anything remotely to do with data in insurance. A key theme of the day was how insurance firms can move from simply being data centric to a data centric approach that actually solves business problems. The discussions covered a range of issues: leveraging big data technologies and abolishing legacy systems, creating a culture of analytics in the organization, hiring the right people to work with data, using advanced machine learning and exploring what can be done with the Internet of Things. As an analytics-as-a-service company, TEG Analytics fueled passionate discussions on how insurers can leverage advanced analytical techniques to drive business value from data. Here’s a summary of what was discussed.

    As users of technology, insurers are typically laggards compared to technologically progressive industries. In an environment of data proliferation and inexpensive computing and storage, as one CTO put it, there is little excuse for not embracing a technology ecosystem that can drive gains from automation and operational efficiencies while improving the customer experience. Historically, insurers have used structured data to make tactical and operational decisions around customer targeting, risk pricing, loss estimation and the like. With the advent of the Internet of Things, however, massive volumes of unstructured and sensor data are becoming available. According to one CDO, this is giving rise to a new generation of consumers who demand speed, transparency and convenience, reversing the age-old wisdom that ‘insurance is sold and not bought’. Choices are becoming harder to comprehend in a digital world of multiple interactions, as they come to depend on trust defined by social networks rather than agents and intermediaries. To harness this ‘big data’ trend and complement structured data with unstructured data, financial and intellectual investments are being made that allow insurers to take strategic, forward-looking decisions from data. The unanimous view was to add new types of information, integrate external data sources and incorporate more granularity into the data. With the NAIC – the standard-setting, regulatory support body – in the room, the importance of adhering to data governance policies was underscored.


    Some gasps were let out when a head of data science said that 75% of models never see the light of implementation. A theme that pervaded the entire conference was developing a culture of analytics within an organization. Data has always been a key ingredient in the different functions of the insurance industry: Risk, Underwriting, Pricing, Campaigns, Claims and others all use data and key metrics in some way. The problem arises when executives are unwilling to operationalize insights from data when making decisions. This is where leaders in data, information and analytics must work together to inculcate a data-driven decision-making process in the organization. Seamless integration of these three divisions’ processes can transform the business without losing sight of feasibility and risk. Harmonizing the analytics and business functions is imperative to capitalizing on the tactical and strategic benefits of data. With new competitive pressures, risks and opportunities in the market, the CAO must build a case for change with other business leaders. TEG Analytics believes that analytics folks must work collaboratively with business leaders to define a clear, well-defined goal rooted completely in business strategy. Undertaking an analytics project with a business sponsor, driven by a desired outcome, with insights delivered at the speed of business, can create immediate, implementable value for the business function. Arvind (CEO of TEG Analytics) said that data science teams are sometimes infamous for interacting with business teams in a language only they understand. They should engage project owners in a more holistic way and take them on a journey from start to end, finishing with a plan or recommendation that is implementable.

    Sophisticated analytics matures to a point where no more useful information can be extracted and all key decision-making has been automated to provide sharp, quick insights. Different functions in the insurance domain have long used data. However, there is a big gap between using data and using data to make decisions swiftly.

    Underwriting in insurance can be automated and made intelligent by combining structured data and sensor/IoT information with unstructured data. Using process mining techniques, NLP and deep learning algorithms, insurers can build personalized underwriting systems that take unique behaviors and circumstances into account.

    With the rise of internet, mobile and social channels, the way consumers interact has changed. This has led to the disappearance of two things: distributor sales channels and the concept of ‘advice’ before buying an insurance product. Insurers must track the entire consumer journey to understand customers’ needs and sentiments and design personalized products. Advanced machine learning techniques can be leveraged to infer customer behavior from this data, and this machine-advisor evolution will offer intelligence based on customer needs through recommender systems that advise on products.
    Analytics will also help improve profitability through operational efficiency. Multiple staffing models can be built and tested to increase resource utilization while improving underwriting throughput and sales performance. A machine-learning-based claims insights platform can accurately model and update the frequency and severity of losses over different economic and insurance cycles, and insurers can apply those claims insights to product design, distribution and marketing to improve the overall lifetime profitability of customers. To determine repair costs, deep learning techniques can automatically categorize the severity of damage to vehicles involved in accidents. Decision trees, SVMs and Bayesian networks can be used to build claims predictive models on telematics data, and graph or social-network analysis can identify patterns of fraud in claims. These predictive models can improve effectiveness by identifying the best customers, refining risk assessments and enhancing claim adjustments.

    All in all, the Chief Data Officer conference was an insightful discussion on the current state of the insurance industry, its evolution in a world of massive data propagation and how firms must evolve with the changing landscape of the industry. Various players from different domains within the insurance vertical discussed key themes like abolishing legacy systems, moving to technologically advanced ecosystems capable of handling data from every sphere and leveraging advanced analytical techniques to derive business value for various functions of the industry.

  • Digital Marketing

    Impact Measurement & Key Performance Indicators

    What is Digital Marketing? What are its key components? How do we know whether our marketing works? This paper talks about the important KPIs that every marketer should measure, why these are important & what they say about your digital marketing performance.

    Digital marketing – Impact Measurement & Key Performance Indicators  

    The importance of advertising online!
    “The Internet is becoming the town square for the global village of tomorrow”
    – Bill Gates

    It’s become a cliché to say we live in a 24×7 networked world, but some clichés are true. We spend more and more of our lives online, using the ‘net’ to book plane tickets, move money across bank accounts and read restaurant reviews – to the extent that we would be severely handicapped if our broadband stopped working tomorrow.

    As audiences spend more and more time online, just as predators follow prey on a migratory route, advertisers have started allocating greater parts of their budgets to digital media. According to research by Zenith Optimedia, while all advertising is likely to increase at a rate of ~5% between 2009 and 2013, online advertising will rise three times as fast.

    The charts below show how the internet is grabbing a larger share of the advertising pie over time, at the cost of traditional media. If the projected growth rates continue, internet advertising will overtake print in 2016 and TV in 2025 to become the single largest advertising channel.


    Understanding the online advertising beast

    “Rather than seeing digital marketing as an “add on”, marketers need to view it as a discipline that complements the communication mix and should be used to generate leads, get registrations or drive sales, rather than simply generating awareness.”
    – Charisse Tabak, of Acceleration Media

    Digital advertising is increasing in importance, even for heavy users of traditional media like CPG firms. Among TEG clients, we are seeing up to 1% of revenue being allocated to digital advertising. Consequently, management is asking the digital marketing group questions about the tangible value being generated. These are still early days, and most companies are years away from true ROI numbers and optimized digital spend that take cross-channel impacts into account. Mostly, our clients are deciding on the KPIs that are relevant and meaningful, and on the mechanisms that need to be set up to track and measure them.

    Before moving into the details, it is important to understand the key components of online marketing. Broadly, all online marketing channels fall into three overarching categories: Paid, Owned and Earned. The definition first entered the public domain in March 2009, when Dan Goodall of Nokia wrote in his blog about Nokia’s digital marketing strategy.


    At a high level, paid is media you buy – you get total control over messaging, reach and frequency, as much as your budget allows; earned is what others say about your brand – you get no control but you can influence outcomes if you’re smart; and owned is content you create – you control the messaging, but not so much whether anyone reads/views it.
    The three media types are best suited to different marketing objectives and have their own pros and cons. Forrester Research’s Sean Corcoran has summarized the advantages and drawbacks of all three types of media very succinctly, as illustrated below.



    Evaluating the impact of your digital strategy

    Evaluation of the ‘true impact’ of your digital marketing strategy & spend, is a multi-phase journey. At TEG Analytics, we break the journey into five distinct phases

    1. Data warehousing & Reporting: Get all your digital data under one roof
    2. Dashboarding: Identify key KPIs and interactions and create meaningful dashboards
    3. Statistical Analysis: Determine the past impact of your marketing inputs on business KPIs like Sales and Profit, using market & media mix modelling techniques
    4. Predictive Analysis: Use the historical analysis to identify likely future scenarios for your business
    5. Optimization: Use inputs from the predictive analysis to run an optimal marketing strategy, maximising ROI subject to budgetary constraints


    To begin with, one needs to decide what impact is desired from each digital channel – e.g. e-mail campaigns should get new customers to sign up on my site. This leads directly to the KPI that needs to be tracked on an ongoing basis, and industry research can provide the relevant benchmarks. The rest of this article focuses on the metrics that TEG Analytics believes need to be captured and tracked to get a holistic and detailed understanding of digital marketing performance. These metrics have been arrived at through numerous digital analytics projects that TEG Analytics has completed for clients across the globe.

    Paid Media

    Display Banner Advertising

    Display banners, or banner ads, are online display ads delivered by an ad server on a web page. They are intended to attract traffic to a website by linking to the advertiser’s site. Viewers can click on these ads to either interact with them on the page itself or be routed to the advertiser’s website.

    Display Banners typically account for a lion’s share (40-55%) of digital advertising budget, based on TEG Analytics’ experience.
    There are 2 types of display banner advertising

    1. Flash/Static: These are simple banner ads with one or two frames. Approx 85% of all impressions served in FY11 were flash/static ads.
    2. Rich media: These are rich ads that allow people to expand and interact with further content within the banner itself. About 15% of impressions served in FY11 were rich media impressions

    The intention of display banner advertising is to drive traffic to our own website & also to create brand awareness. To determine if the strategy is working to deliver these goals, TEG Analytics recommends that all advertisers should track the following metrics

    • Impressions: This is an exposure metric that counts how many times an ad was shown
    • Clicks: Response to an ad measured through clicks
    • Click-rate: Clicks as % of impressions. Click-rate is declining in the industry as consumers tend to prefer getting all the relevant information within the banner itself
    • Rich media interactions: Is a counter of all interactions that take place within the rich media unit (e.g. expanding, clicking within the multiple parts of banner etc.)
    • Floodlight Metrics: This is a DoubleClick-specific term used to track actions that visitors take once they arrive on the website. Each campaign and brand may track specific on-site actions, so there could be many floodlight metrics across all brands.
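    The click-rate arithmetic behind these metrics is worth pinning down. Here is a minimal sketch in Python (function names and the sample numbers are ours, purely illustrative, and not DoubleClick terminology):

    ```python
    def click_rate(impressions: int, clicks: int) -> float:
        """Clicks as a percentage of impressions served; 0 when nothing was served."""
        return 100.0 * clicks / impressions if impressions else 0.0

    # Illustrative campaign-level roll-up (numbers are made up)
    banner_impressions = 1_200_000
    banner_clicks = 1_800
    ctr = click_rate(banner_impressions, banner_clicks)  # 0.15 (%)
    ```

    A declining click-rate, as noted above, need not mean the banner failed: rich media interactions inside the banner should be counted alongside clicks.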

    NOTE: All figures for the proportion of digital ad spend by channel are approximations based on TEG Analytics’ digital analytics project experience and on transaction data analysis performed by TEG Analytics on client data.

    Paid Search

    Paid Search also known as Search Engine Marketing is used by advertisers to show relevant ads on Search Engines. For instance, a search for “Bleach” on a search engine, would throw up sponsored links on the top and on the right hand side of the page. These are Paid Search advertising links. Paid Search typically accounts for around 10 – 15% of the total digital advertising budget.

    Typically, there are two types of Paid Search or Search Engine Marketing ads:

    • Paid Search Ads: Text links that show up on a search engine
    • Content Targeting or Content network buy: Provided by Google and Yahoo. The ad is shown not on the search engine itself but within a network of advertisers that the client has connections with. An example would be if in Gmail there are a lot of conversations around travel to Africa, the relevant travel agency ad would show within the Gmail environment as a text link.

    Paid Search advertising is generally bought on a cost-per-click basis: Google or Bing is paid only if someone clicks on the ad. This means the impressions on Paid Search advertising are technically free. Cost-per-click is decided through a bid engine: many companies bid on the same keyword, and those with the highest bids get the top positions within the search engine.

    Content Targeting is bought on a Cost per thousand Impressions (CPM) basis.

    NOTE: DoubleClick is a subsidiary of Google that develops and provides internet ad serving services. Its clients include agencies and marketers (Universal McCann Interactive, AKQA etc.) and publishers who serve customers like Microsoft, General Motors, Coca-Cola, Motorola, L’Oréal, Palm, Inc., Apple Inc., Visa USA, Nike and Carlsberg, among others. DoubleClick’s headquarters is in New York City, United States.

    Paid Search is primarily intended to drive traffic to owned websites, and e-commerce links to induce purchase. TEG Analytics recommends that the following KPIs should be tracked and measured to evaluate paid search impact.

    • Impressions: This is an exposure metric that counts how many times an ad was seen/shown
    • Clicks: Response to an ad measured through clicks
    • Click-rate: Clicks as % of impressions
    • Cost per Click: The total spend on Paid Search divided by the total Clicks that the advertiser has got. It is factored in through the bid management tool that the agency handles
    • Average Position: Specific to search advertising. It shows where the ad shows up within the paid search results of a search engine result page. Industry best practice is to be on the top 3 spots. Anything after the top 3 means that the ad shows up on the right hand side of the page, where both visibility of and response to the ad are minimal.
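    Because paid search is bought on a cost-per-click basis, spend and response tie together neatly. A hedged sketch of the roll-up (names and numbers are illustrative, not from any bid management tool):

    ```python
    def paid_search_kpis(spend: float, impressions: int, clicks: int) -> dict:
        """Derive the rate metrics listed above from the raw counts."""
        return {
            "click_rate_pct": 100.0 * clicks / impressions if impressions else 0.0,
            # Impressions are technically free; the advertiser pays per click
            "cost_per_click": spend / clicks if clicks else 0.0,
        }

    kpis = paid_search_kpis(spend=500.0, impressions=10_000, clicks=250)
    # click_rate_pct = 2.5, cost_per_click = 2.0
    ```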

    Streaming TV

    Streaming TV is streaming video execution done on a digital platform. When advertising videos are shown on websites such as abc.com, google.com or others it’s called Streaming TV. Simplistically speaking it is an extension of television viewership. As television viewership starts moving to the digital channels, advertisers are moving money from television to digital media.

    Streaming TV accounts for around 15 – 20% of the total digital advertising budget.
    Every Streaming TV ad is accompanied by a free companion banner in the same environment. For instance, if a streaming video is shown on www.abc.com, right next to it would be a companion banner, similar to a display banner, for the period the streaming video is playing. The online video ad stream together with the companion banner is termed a Streaming TV execution.

    Streaming TV executions are typically bought on a CPM or Cost per Thousand Impressions basis which implies the payment is based on the exposure that the advertiser gets in the Streaming Video space. In some rare cases, Streaming TV can also be bought on Cost per Video View.

    The desired actions from Streaming TV ads are very similar to TV advertising. TEG Analytics recommends that clients track the metrics that most closely approximate TRPs.

    • Video Impressions: This is an exposure metric that counts how many times the video ad was seen/shown
    • Video Clicks: Response to the video ad measured through clicks
    • Companion Banner Impressions: This is an exposure metric that counts how many times a companion banner ad was seen/shown
    • Companion Banner Clicks: Response to the companion banner ad measured through clicks
    • Video Midpoint or 50% Completion Metric: People have viewed at least 50% of the video ad

    Digital Coupons

    Digital coupons are the online counterpart of regular print coupons and are heavily used by CPG and other Consumer Products marketers, as a price discounting medium. Advertisers want to initiate trial & re-purchase by enticing consumers through price discounting. They account for approximately 2-5% of all digital advertising.

    • Digital Coupons are of the following types.
      • Print – Printing the coupon onto paper. The majority of redemptions come from this type of coupon.
      • Save 2 Card – These have emerged in the last 2 years. They allow customers to save the coupon onto their loyalty card (such as a Kroger or CVS loyalty card). When customers go into the store and scan the loyalty card, the coupons are redeemed at the point of sale. Redemption volumes for this type are quite low.
      • Print and Mail – Coupons are printed and mailed back to the advertiser along with the product purchase bill for redemption. Redemption volumes for this type are also quite low.
    • Distribution of coupons happens in the following ways.
      • DFSI Network – This is very similar to FSI. Multiple coupons of different companies and products are available to the customers and all these can be clicked and downloaded at one go by the customers. There is high volume of coupon prints but low volume of redemptions on this network
      • Banner Advertising – Display banners have coupons within the banners that can be clicked on and printed off the banner advertising itself
      • E-Mail Program – Coupons are included in some of the regular e-mails sent by the advertiser to the database of loyal customers. Customers can click on these coupons and download them
      • Websites – People can come to the advertiser websites and download the coupons available

    Coupons are essentially an extremely direct method of marketing, with the straightforward purpose of redemption by the customer. To ensure the redemption number being tracked is normalised for the size of the campaign itself, TEG Analytics recommends that redemptions be tracked as a proportion of prints. The KPIs that all coupon distributors should track are

    • Prints – Total # of coupons printed or saved to card from the digital environment
    • Redemptions – Number of printed Coupons Redeemed in store
    • Redemption Rate – Number of redemptions divided by prints. All coupons have an expiration date; however, due to the lag between actual redemptions and retailers reporting the redemption numbers, redemptions will be seen trickling in even after the expiration date
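    The normalisation described above is a one-liner. Sketched in Python (names and numbers are illustrative; since redemptions trickle in past the expiration date, the rate is worth recomputing as late retailer data arrives):

    ```python
    def redemption_rate(prints: int, redemptions: int) -> float:
        """Redemptions as a share of prints, so campaigns of different sizes compare fairly."""
        return redemptions / prints if prints else 0.0

    # Illustrative: 300 in-store redemptions against 20,000 printed coupons
    rate = redemption_rate(20_000, 300)  # 0.015, i.e. 1.5%
    ```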

    Owned Media

    Company Websites

    Company websites are one of the most frequently used media for communicating the company and brand vision to the customer. A website is also a tool to enable e-commerce for the advertiser’s products. Most digital marketing ultimately induces the viewer to visit the company website, hence it is probably THE most important part of your digital armoury. Websites typically consume 10-15% of the digital marketing budget.

    Google Analytics is typically used to track website performance, and is highly recommended by TEG Analytics, as it has a very solid back end as well as an evolved reporting interface. It helps track where visitors come from and all the actions they perform once they arrive on the website – the time, date and day of the visit, whether the person is a first-time or return visitor, and all content consumed and actions taken all the way through to the point of exit. Google Analytics is a free online tool, so there is no cost associated with implementing and using it. On the flip side, there is no direct customer support, and little customization is possible.

    TEG Analytics recommends that the advertiser extract and maintain multiple metrics for tracking website performance.

    • Visits: A visit is a session that typically lasts up to 30 minutes. After 30 minutes the session renews, and further activity counts as a new visit. If a person visits the site 4 times in a day, with each visit lasting less than 30 minutes, it counts as 4 visits.
    • Average time spent: Gives the time spent in seconds per visit to the website. Due to some tracking challenges, the time spent on the last page before exit does not get captured. For this reason, this metric should be used as a relative but not as an absolute metric. This implies that it is a good measure for comparison between websites but by itself can be used only for directional purposes.
    • Average pages per visit: Shows how many pages were consumed per visit. This is an engagement metric for most CPG firms. In other cases such as for e-commerce sites the objective would be to push people down the funnel as quickly as possible.
    • Return Visitors: This metric refers to visits from a browser that has already been exposed to the website. If a visit to www.example.com has already happened from a browser in a certain system, and the same browser is exposed again to this site, it is counted as a return visit. Return visit is true for the time period for which the report is selected. If it is for 1 month, then it is a return visit for that 1 month.
    • Unique Visitors: Unique visitors refer to the unique number of cookies on the browser that was exposed to the site. If the browser was exposed once, it would be 1 unique visitor. The unique visitor is true by default for a period of 2 years, unlike the return visitor which would be true for the period for which the report is being generated such as a month, a quarter or a year.
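    The 30-minute visit definition above amounts to a simple sessionization rule. Below is a simplified Python sketch under the assumption that a gap of 30 or more minutes between page views starts a new visit; real Google Analytics session handling has more edge cases (midnight boundaries, campaign changes), so treat this as illustrative only:

    ```python
    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)

    def count_visits(page_view_times: list) -> int:
        """Count visits for one visitor: a 30+ minute gap of inactivity starts a new visit."""
        if not page_view_times:
            return 0
        ordered = sorted(page_view_times)
        visits = 1
        for prev, curr in zip(ordered, ordered[1:]):
            if curr - prev >= SESSION_TIMEOUT:
                visits += 1
        return visits

    # Four page views: the 50-minute and 2-hour gaps each start a new visit -> 3 visits
    views = [datetime(2013, 1, 1, 9, 0), datetime(2013, 1, 1, 9, 10),
             datetime(2013, 1, 1, 10, 0), datetime(2013, 1, 1, 12, 0)]
    ```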

    Social Media

    Social Media is part of owned digital media marketing. TEG Analytics has primarily worked with Facebook data. Overall, clients spend about 6 – 10% of their digital media budget on Facebook and other social media sites.
    As Facebook is an owned vehicle, execution is typically handled by the community managers in the Marketing Communications team within the advertiser itself. There are some PR agencies that help the community managers handle pages as well, but TEG Analytics recommends that the Marketing Department build this up as an internal core strength as its importance is likely to increase over time. There is no cost associated with building any of the Facebook pages other than the cost of having the community managers on board.
    The cost associated with increasing exposure and driving fans to the Facebook page should be measured as a part of paid media and not as a part of the owned social media bucket.
    The main aim of Facebook pages is to drive loyal customers to become advocates and to engage customers so that the advertiser’s brands stay top of mind. Keeping this in mind, the key metrics captured and measured from Facebook should be:

    • Fans to Date – This is a magnitude metric that gives the cumulative lifetime fans of the page. It is derived as previous day fans + total likes – total unsubscribes. The unsubscribes are the number of fans who have decided to unlike the pages
    • Monthly Interactions per Fan – This metric captures the quality of the fans on the Facebook page. As the objective is to drive engagement the hope is to have higher interactions per fan for each of the pages. Interactions is calculated as Likes + Comments + Discussion Posts + Wall Posts + Videos
    • Monthly Impressions per Fan – This captures the social reach of Facebook: for every fan of a brand page, what additional reach do they provide for the Facebook content?
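    The fan and engagement formulas above translate directly into code. A minimal sketch (function names and the sample figures are ours, purely illustrative):

    ```python
    def fans_to_date(previous_day_fans: int, total_likes: int, total_unsubscribes: int) -> int:
        """Cumulative lifetime fans: previous day fans + total likes - total unsubscribes."""
        return previous_day_fans + total_likes - total_unsubscribes

    def interactions_per_fan(likes: int, comments: int, discussion_posts: int,
                             wall_posts: int, videos: int, fans: int) -> float:
        """Monthly interactions (likes + comments + discussion posts + wall posts + videos) per fan."""
        interactions = likes + comments + discussion_posts + wall_posts + videos
        return interactions / fans if fans else 0.0
    ```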

    Relationship Marketing

    A lot of our clients have been capturing and maintaining a database of loyal customers over time. A partner of choice has been a company called Merkle, whose system talks to another system that allows the advertiser to activate and communicate with the loyal customer database. The most popular vendor of the email communication system is a company called Responsys. Merkle captures the personally identifiable information (PII) on customers, scrubs it for validity and ensures the privacy of each member coming into the loyalty database. Responsys sends out emails to the customers in the Merkle database of loyal customers; the execution is primarily around sending emails to the email addresses in that database.

    Typically, we have seen that clients spend about 3-5% of their total budget on relationship marketing.

    The key metrics captured from this data source are primarily around the effectiveness of the email program.

    • Sent – The total number of email addresses to which the email was sent.
    • Delivered – The total number of email addresses to which the email was delivered. The difference between Sent and Delivered is called the Bounces and is a separate metric.
    • Open Rate – Out of the total email addresses the email was sent to, how many were actually opened.
    • Click Rate – A derived metric that tells how many clicks on the email happened as a proportion of the total delivered emails. It is calculated as Clicks/Delivered
    • Effective Rate – A derived metric that tells, of the total number of email addresses that opened the email, how many clicked within it. It is calculated as Clicks/Opens
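    These email metrics chain together from four raw counts. A minimal sketch (names and numbers are illustrative; the open-rate denominator follows the definition given above, i.e. Opens against Sent):

    ```python
    def email_kpis(sent: int, delivered: int, opens: int, clicks: int) -> dict:
        """Derive the email program metrics from the raw counts."""
        return {
            "bounces": sent - delivered,                              # Sent minus Delivered
            "open_rate": opens / sent if sent else 0.0,               # Opens / Sent
            "click_rate": clicks / delivered if delivered else 0.0,   # Clicks / Delivered
            "effective_rate": clicks / opens if opens else 0.0,       # Clicks / Opens
        }

    k = email_kpis(sent=10_000, delivered=9_500, opens=1_900, clicks=380)
    # bounces = 500, open_rate = 0.19, click_rate = 0.04, effective_rate = 0.2
    ```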

    Earned Media

    Buzz Marketing

    Buzz monitoring, or social media monitoring, is a fairly new discipline for most of our clients. Most companies use an outside tool/service like Sysomos, Radian6, peerFluence, Scout Labs or Artiklz. These typically provide information on brand- and competitor-related chatter in the social media space. Buzz technology can capture this chatter across blogs, news feeds, Twitter and video scraping, and builds holistic tracking systems.

    The tools need to be trained to capture relevant “keywords”. Companies monitor buzz and chatter on both their and competition brands, as it gives them a fairly good idea about their comparative positioning in the social media universe.

    There are also tools like Map from Sysomos that can be used by the brand team to understand broader consumer trends, to develop communication and get new product ideas and innovations.
    The key information from buzz that is critical for your brand is the extent of chatter about your brands, and the extent to which that chatter is ‘positive’. The KPIs that TEG Analytics recommends tracking for this channel are:

    • Mentions – A magnitude metric that captures the level of chatter about a particular tag defined by the community manager. For example, if information is being captured for a tag such as “iPad”, this metric tells how many mentions of that tag the buzz tracking tool found in the overall social universe.
    • %Positive/Neutral – This tries to get at the sentiment behind the chatter. Based on an algorithm developed by the buzz tracking tool, every single post in the social environment is categorized as positive, negative or neutral. These algorithms are based on Natural Language Processing and are learning programs. This is a derived metric, calculated as (total number of positive + neutral mentions)/(total number of mentions).
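    As a minimal sketch (assuming the buzz tool exports one sentiment label per post, which is an assumption; each tool uses its own schema), the two KPIs reduce to a count and a ratio:

```python
def buzz_kpis(labels):
    """Compute Mentions and %Positive/Neutral from a list of
    per-post sentiment labels ("positive"/"neutral"/"negative")."""
    mentions = len(labels)
    pos_neu = sum(1 for s in labels if s in ("positive", "neutral"))
    return {
        "mentions": mentions,
        "pct_positive_neutral": pos_neu / mentions if mentions else 0.0,
    }

kpis = buzz_kpis(["positive", "neutral", "negative", "positive"])
```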

    Life after metric tracking

    Tracking the relevant metrics is the first, and very important, step in creating a completely data-driven decision system. It is the basic building block, and your analytics suite cannot proceed without it. However, in order to make the best of your data and answer the “why” and “optimization” questions, we need to go further.

    During the creation of dashboards and reports, a lot of data is collected, curated, harmonised and passed through ETL activities. This data can be used for a variety of advanced analytics, which will help the company truly determine the ultimate impact of digital marketing on sales or brand equity. TEG Analytics uses proprietary methodologies to calculate the “True Impact” of digital marketing by marrying digital data with traditional marketing data such as TV, Print etc. and calculating the Cross Channel Impact on Sales & Revenue. We have developed models using Bayesian hierarchical techniques, which eliminate the noise and narrow down on the true impact of marketing.

    TEG Analytics also has a product for Digital Analytics called DigitalWorks™ that provides clients end-to-end digital analytics services. It has modules that address all the phases in analytics, as shown below.


    End Note
    To conclude, the world of digital marketing is new and exciting, and a lot of the spend on digital marketing is currently done to ‘keep up with the Joneses’; the discipline that goes into creating traditional media marketing plans is largely absent. However, this need not be the case, as most of the data, as well as the tools to extract actionable insights from it, are available with consultancy firms like TEG Analytics. Once the power of this data is harnessed, companies will see a vast improvement in the efficacy and ROI of their digital marketing programs.

  • Text Mining & its Applications

    Unearthing the intelligence hidden in free form data

    Text Mining – What does it add to transaction data

    Text mining refers to the extraction and assembly of textual information into quantitative forms in order to derive information and garner insights from it.

    There are many industry applications of text mining-

    • Market research surveys use text mining to make sense of open ended questions in surveys.
    • CRM data analysis uses text mining for adding value to customer churn modeling using customer feedback data with transaction data.
    • The entertainment business uses text mining as ‘sentiment analysis’ to gauge whether new movie releases garner favorable or unfavorable word-of-mouth reviews.
    • Publishers use text mining to get access to information in large databases via indexing and retrieval.

    In retail, text mining or text analytics in conjunction with transaction data analytics helps retailers-

    • Look deeper at real customer, product and service issues
    • Enhance value from market research, and may even help cut the costs of large-scale market research studies
    • Improve customer service by cutting lead times to address common issues
    • Create better products

    The process by which retailers can extract value from text data is:

    1. Identifying where text data is collected

    The three sources where text mining data is available and can be leveraged are:

    • Surveys – These are usually customer satisfaction surveys that a retailer initiates with a customer. A lot of open ended information provided in these surveys contains valuable text information that should be mined for a deeper look at customer issues.
    • Contact centre data – This data consists of e-mails, phone in transcripts and web chat or submissions by customers who are communicating an issue. Analysis of this data can yield a lot of very valuable information.
    • Internet data – Data on the internet in blogs, product review sites and expert groups contains a wealth of information that is not gleaned by satisfaction surveys or customer feedback via phone.

    2. Changing text data to structured form
    The next step in the process is to change unstructured data into a more manageable form of structured data. This involves several small steps:

    • Identification of the sources from where text data needs to be extracted
    • Decision on which unstructured data to analyze i.e. product related, sentiment related, time period related, particular promotion related etc.
    • Use of software that can extract the relevant information from various places
    • Creation of theme or concept buckets to be able to take a closer look at extracted information and link it to transaction data
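    The theme-bucketing step above can be sketched in Python. This is a minimal illustration, assuming simple keyword lists define each theme bucket (the bucket names and keywords are made up); production tools use far richer extraction logic:

```python
# Hypothetical theme buckets: each maps a theme to keywords that signal it.
BUCKETS = {
    "delivery": ["late", "delayed", "shipping"],
    "quality": ["broken", "defective", "poor quality"],
    "price": ["expensive", "overpriced", "discount"],
}

def bucket_comment(text):
    """Assign a free-form customer comment to theme buckets,
    turning unstructured text into a structured record."""
    text = text.lower()
    return {bucket: any(kw in text for kw in kws)
            for bucket, kws in BUCKETS.items()}

record = bucket_comment("The parcel arrived late and the item was broken.")
```

    The structured record can then be stored alongside the transaction data for the analysis steps that follow.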

    3. Analyzing text data
    Once the unstructured data has been made manageable, reports can be generated from it. These help the retailer focus on key metrics as they come up and resolve the relevant issues. Keeping cleaned text data as a separate entity thus allows retailers to focus on data which would otherwise not be looked at.

    4. Integrating text data with transaction data
    A lot of actionable insights can be generated if the cleaned-up text data is then integrated into the larger transaction data warehouse. Linking these two complementary data sets generates added value for retail organizations. It helps answer questions like:

    • What is the reason for higher returns in a particular town/city/region?
    • Why are customers calling in regarding a particular SKU?
    • Which offer will a customer be most likely to accept?
    • Why did a particular promotion not do well?
    • What are the real reasons why customers have lapsed?
    • Which competitor is doing better in terms of product and quality and price?
    • Is a certain customer group adopting a new product more than others?

    While most retailers have the text information they need to improve their knowledge of their customers, products and service, very few presently mine this information. Retailers thus need to unlock the value lying in unstructured data with a clear vision of how they will clean and integrate this data into larger quantitative data sets. They can then start to use the insights generated from this data to improve customer experience through better service, products, quality and processes.

    Using text data to capture and add value to voice of customer



  • Python in Data Science

    “The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death” – Guido van Rossum (Creator of Python).

    Data Science is an emerging and extremely popular function in companies. Since the volume of data generated has increased significantly, a new array of tools and techniques is deployed to turn raw big data into decisions. Python is among the most popular tools used by Data Analysts and Data Scientists. It is a very powerful programming language with custom libraries for Data Science.

    Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale.

    Python has long been one of the premier general scripting languages and a major web development language. Numerical analysis, data analysis and scientific programming developed through the packages NumPy and SciPy, which, along with the visualization package Matplotlib, formed the basis for an open-source alternative to MATLAB. NumPy provides array objects, cross-language integration, linear algebra and other functionality. SciPy builds on this and provides optimization, statistics and basic image analysis capabilities (scipy.ndimage).

    “One Python to Rule Them All”

    Beyond tapping into a ready-made Python developer pool, however, one of the biggest benefits of doing data science in Python is added efficiency of using one programming language across different applications.

    It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch cost between languages and analysis.

    Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. All of this overhead vanishes as soon as you move to a single language.

    Powerful statistical and numerical packages of python are:

    • NumPy and pandas allow you to read and manipulate data efficiently and easily
    • Matplotlib allows you to create useful and powerful data visualizations
    • scikit-learn allows you to train machine learning algorithms on your data and make predictions
    • Cython allows you to compile your code to C to largely reduce the runtime
    • PyMySQL allows you to easily connect to a MySQL database, execute queries and extract data
    • Beautiful Soup allows you to easily parse XML and HTML data, which is quite common nowadays
    • IPython enables interactive programming

    Python as Part of Data Science


    Python as a part of the eco-system, can be broadly divided into 4 parts:
    1) DATA
    2) ETL
    3) Analysis and Presentation
    4) Technologies and Utilities

    Data, as the word suggests, can come in any form: structured or unstructured. Structured data follows a standard way of annotating content so machines can understand it; it can live in a SQL database, a CSV file etc. Structured data is always a piece of cake in the data science industry.

    The actual problem starts when we see unstructured data. Unstructured data is a generic label for data that is not contained in a database or some other type of data structure. Unstructured data can be textual or non-textual. Textual unstructured data is generated in media like email messages, PowerPoint presentations and instant messages. Python is very useful for reading all kinds of data formats and bringing them into a structured form.

    Extraction, Transformation and Loading is the most costly part of data science. A data scientist typically spends about 80% of their time on data exploration, summarization, extraction and transformation, 8% on modeling and 12% on visualization, though this can vary from project to project.

    Extraction: the desired data is identified and extracted from many different sources, including database systems and applications.
    Transformation: The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.
    Loading: It is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database.

    Let’s take an example: we need Twitter data for a social media sentiment analysis.

    We need to follow a few basic steps to get clean, structured data.

    1) Read all the tweets in one encoding (UTF-8)
    2) Expand apostrophes, e.g. “’re” should be replaced by “ are”, etc.
    3) Remove punctuation from each sentence, e.g. !()-[]{}’”,.^&*_~
    4) Remove hyperlinks
    5) Remove repeated characters from the sentence, e.g. “I’m happppyyyyy!!!” should become “I am happy” after applying steps 1 to 4.
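    The cleaning steps above can be sketched with Python's re module. This is a minimal illustration (the contraction map is a tiny made-up sample); note that fully recovering dictionary words like "happy" from "happppyyyyy" needs a spell-check pass beyond simple character collapsing:

```python
import re

# Illustrative contraction map; a real one would be much larger.
CONTRACTIONS = {"'re": " are", "'m": " am", "n't": " not"}

def clean_tweet(text):
    """Apply the basic cleaning steps: expand contractions,
    strip hyperlinks and punctuation, collapse repeated characters."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", "", text)           # remove hyperlinks
    text = re.sub(r"[!()\[\]{}'\",.^&*_~]", "", text)  # remove punctuation
    # Collapse runs of 3+ identical characters to two; a dictionary
    # check would be needed to finish the normalization to "happy".
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return " ".join(text.split())

cleaned = clean_tweet("I'm happppyyyyy!!! http://t.co/xyz")
```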

    Analysis and Presentation: Analysis with Python is built largely on packages like pandas.

    Package Highlights of Pandas:

    • A fast and efficient DataFrame object for data manipulation with integrated indexing.
    • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
    • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form.
    • Flexible reshaping and pivoting of data sets.
    • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
    • Columns can be inserted and deleted from data structures for size mutability.
    • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets.
    • High performance merging and joining of data sets.
    • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure.
    • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data.
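    As a small sketch of the split-apply-combine idea listed above (the data and column names are made up):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [100, 80, 150, 120],
})

# Split by region, apply a sum, combine back into one result.
totals = sales.groupby("region")["revenue"].sum()
```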

    Scikit-learn : scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
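    A minimal scikit-learn sketch, clustering a toy data set with k-means (the points are illustrative):

```python
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points.
X = [[0, 0], [0, 1], [10, 10], [10, 11]]

# Fit k-means with two clusters; labels_ gives each point's cluster id.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
```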

    For plotting in Python, we can use packages like Matplotlib and its pyplot interface.

    Matplotlib: is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB. SciPy makes use of matplotlib.

    Technologies and Utilities: by Technologies and Utilities we mean all the repeatable work that has been done in the past to get a result.

    NumPy plays an important role in automation.
    NumPy is the fundamental package for scientific computing with Python. It contains, among other things:

    • a powerful N-dimensional array object
    • sophisticated (broadcasting) functions
    • tools for integrating C/C++ and Fortran code
    • useful linear algebra, Fourier transform, and random number capabilities

    Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
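    A short sketch of NumPy's array object and broadcasting (the arrays are illustrative):

```python
import numpy as np

# A 2-D array and a 1-D array; broadcasting stretches the smaller
# array across the larger one without copying data.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
row_means = matrix.mean(axis=1)          # per-row means
centered = matrix - row_means[:, None]   # broadcast subtract per row
```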

    IPython Notebook : The IPython Notebook is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media, as shown in this example session:


    The IPython notebook with embedded rich text, code, mathematics and figures.
    It aims to be an agile tool for both exploratory computation and data analysis, and provides a platform to support reproducible research, since all inputs and outputs may be stored in a one-to-one way in notebook documents.

    There are two components:
    1) The IPython Notebook web application, for interactive authoring of literate computations, in which explanatory text, mathematics, computations and rich media output may be combined. Input and output are stored in persistent cells that may be edited in-place.
    2) Plain text documents, called notebooks, for recording and distributing the results of the rich computations.
    The Notebook app automatically saves the current state of the computation in the web browser to the corresponding notebook, which is just a standard text file with the extension .ipynb, stored in a working directory on your computer. This file can be easily put under version control and shared with colleagues.

    Despite the fact that the notebook documents are plain text files, they use the JSON format in order to store a complete, reproducible copy of the current state of the computation inside the Notebook app.

    Thus, Python has a great future in the data science industry. There is a large community of developers who continually build new functionality into Python. A good rule of thumb is: if you are thinking about implementing a numerical routine in your code, check the documentation website first and you will likely find your model ready in Python code. Happy Learning!

  • Data Visualisation

    Today we are trapped amidst tons of cryptic data. We continuously strive to understand and draw inferences from these data. Data mining is the order of the day, but the perception of the data is, we believe, the end result.

    Why are weather reports more appealing to us when presented on a map than in bland tables? Why do we find the infographic images in news articles more captivating? Be it Sensex points, stakeholders’ shares or earnings/turnovers, we inherently focus on the graphs and the charts. We need to admit that all those images are outcomes of tons of data, yet they are highly attractive to us.


    The secret behind this is the power of visualization. Visualization can be called the art and science of data, by the way it captures our attention and projects the data in a simplified way. Right from childhood, we have been taught to perceive alphabets as visual images; we remember people when we see them rather than when we hear about them. Such is the power of visualization that there is no doubt those infographic images appeal to us more!

    This projection of data into pictorial or graphical form, for the ease of understanding of the common man, is what we call the Data Visualisation technique. Data Visualisation is making life easier in more ways than one. Let us understand its vitality by citing some critical scenarios.

    A Sales Manager of a company works across umpteen sales figures on a weekly/monthly/quarterly/yearly basis. His past sales tracks guide him towards his future sales projections. So important are these vast data for him that projecting them in chart form simply eases his life. The data can be presented across any timeline, inferences can be drawn from the past or for future projections and, more importantly, the totality of the available data can be viewed in one go.

    At other instances, we might come across scenarios where limited data needs to be mined for several hidden inferences. Marketing Managers need to work their way through three-dimensional data, say Market Share, Share of Voice and YoY Profit for their brands. The raw data alone would just drive them crazy. On the contrary, a simple bubble chart with Market Share and Share of Voice on the axes and Profit as the size of the bubble would work wonders for them! The entire data can thus be projected on a common platform and hidden inferences can be drawn. With these visuals in place, brand equity, the competitiveness of the brands and much more can be derived easily and effectively.
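    A bubble chart of this kind can be sketched with Matplotlib (the brand names and figures are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical brand data: market share (%), share of voice (%), profit.
brands = ["Brand A", "Brand B", "Brand C"]
market_share = [25, 40, 15]
share_of_voice = [30, 35, 20]
profit = [12, 30, 6]

sizes = [p * 40 for p in profit]  # scale profit into bubble areas
fig, ax = plt.subplots()
ax.scatter(market_share, share_of_voice, s=sizes, alpha=0.5)
for name, x, y in zip(brands, market_share, share_of_voice):
    ax.annotate(name, (x, y))
ax.set_xlabel("Market Share (%)")
ax.set_ylabel("Share of Voice (%)")
fig.savefig("bubble_chart.png")
```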

    There are challenging occasions when managers have to work with tons of data and arrive at concise and compelling findings. Working with such cumbersome data and projecting it in a presentable way would not be possible but for data visualisation. With its aid, data can be distilled into charts and graphs; these can be well integrated to take viewers on an interactive journey and grab insights out of the data.

    Thus, by projecting data in visual forms, we not only draw the attention of the viewers but also gain their confidence. As all the data is available in one place, authenticity and credibility are established on either side.

    Data Visualisation has a lot of scope for the future. As vast data becomes presentable and readable, Data Visualisation paves the way for further research on the data. Based on present trends or emerging patterns, several new insights can be drawn. The more complex the data, the more scope there is to ponder it with these handy visualisation tools.
    Thus, Data Visualisation is the universal language of Data Science. It is easily comprehensible, concise and a vital tool for the analysis of the unexplored!

    Being factually accurate, visualization helps viewers draw conclusions from the data by offering important context for understanding the underlying information.


    “Formal education will make you a living. Self-education will make you a fortune”.
    -Jim Rohn

    Well, let’s face it: we always tend to learn more when we’re thoroughly involved in a task than when we are given a lecture on it by a third person. For example, I could either give you lecture after lecture on how to make a sandwich, or I could drop you into a kitchen and say “It’s all yours. Make me a sandwich.” In an age when the internet can give us information about anything and everything under the sun, when learning about the surface of Pluto has become easier than finding your lost bike keys, it should not be too difficult to make your first sandwich, or your first dashboard for that matter. This is the idea that gave birth to the concept of the ‘Hackathon’ at TEG Analytics.

    The Casus belli:

    The plan is to encourage self-learning and competition, while another benefit is that it initiates inter-team communication and knowledge sharing, besides providing a great opportunity to the participants to showcase their talent in front of the biggest brains of the company. That way, the company is also able to identify its employees’ talents and weaknesses. Needless to say, you end up learning a lot in the entire process. So yes, it is a win-win situation for all!

    It is important to mention here that hackathons at TEG are not like regular hackathons as per the dictionary definition of the word. They are actually even better! They are not limited to the coding and logical thinking skills of the person alone, but involve data visualization and business understanding.


    How to Train Your Dragon?

    I mean… employees.

    Well let me give you a brief idea about how hackathons are conducted here and how they support the concept of self-learning. First, the organisers make sure that all the employees have had at least one official training on the basics of the particular skill that they are going to be tested on. Then, they are divided into teams of two. These teams are built in such a way, that the most skilled person is partnered with a lesser skilled person and so on. The organisers then provide them with a common business problem that needs to be solved using a certain soft skill and presented before a panel of judges within a specified time frame, which is generally 15-20 days. The business problem is created in a way that gives the participants the feel of a real-life client handling process. They can Google as much as they need to and learn all about the problem or the tool, besides taking help from the organisers to clear their doubts. To motivate the participants further, incentives in the form of monetary benefits are provided.


    The Battles of Tableau and Excel:

    The first ever such competition held in TEG Analytics was the Tableau hackathon. The second was the Excel hackathon, which concluded recently. In both events, the enthusiasm of the participants was extraordinary and the competition tough. It is worth mentioning that the second hackathon witnessed more than twice the number of participants as the first. The competitive spirit among the teams was incredible, with each team leaving no stone unturned to prove it was better than all the others. A week before the final day, one could find participants spending late nights in office and even working on weekends to make sure they had used every last fragment of grey matter available to make their dashboards absolutely perfect. On the day of the presentation, their morale was sky-high and the passion almost contagious, as the teams, armed with their codes, calculations and charts, battled for the title of the ‘Best Dashboard’.


    And the Victor is..?

    Everybody! Because everybody wins. To conclude, I can say that conducting such events within the organization is a brilliant idea to encourage learning, team-spirit, healthy competition and improvement of one’s own soft skills. You could say it’s like pushing a bird off a tree and leaving it with two options: learning to fly or preparing to fall. And at the end of it, whether you fly or fall, you definitely learn the use of your wings and will probably be confident enough to flap them the next time you have to save yourself.

    Adwitiya Borah
    Data Analyst, TEG Analytics

  • Internet of Things Analytics


    Tony Stark has J.A.R.V.I.S; we have IoTA

    Most of us have seen the movie Terminator, in which the artificially intelligent system SKYNET becomes far more intelligent and decides to take over the world by spreading itself into all the systems across the globe. Or we have seen our favorite billionaire brainiac Tony Stark working with his faithful companion J.A.R.V.I.S., who does everything for him, from making a cup of coffee to saving his life. In both these examples, one notices the significance of what an artificially intelligent system could do and the scale of revolution it can bring to our lives.

    Similar is the case with IoT – the Internet of Things. IoT is a network of physical objects or “things” embedded with electronics, software, sensors and connectivity that enables objects to exchange data with the manufacturer, operator and/or other connected devices, based on the infrastructure of the International Telecommunication Union’s Global Standards Initiative. The Internet of Things allows objects to be sensed and controlled remotely across existing network infrastructure, creating opportunities for more direct integration between the physical world and computer-based systems, and resulting in improved efficiency, accuracy and economic benefit. In simple words, if you want something done, tell your device and it will do it for you! Sounds quite stereotypical, doesn’t it? One could argue that our mobile devices are already 50% voice operated; what difference would IoT make? Well, they would be surprised to know that the ramifications of the global application of IoT would spark the beginning of a new era, especially when it comes to Analytics.

    The crux of the discussion is IoT-enabled devices. They would collect data from all over the world, so your data would be geographically vast and demographically omniscient. With such data, IoT analytics tools will improve real-time decision making and customer experience. A coffee maker with a bunch of buttons to make a good coffee is just a simple coffee maker. But one that is network-connected and can be accessed from a mobile phone is advanced or, in better words, a “smart” coffee maker. The manufacturer could gather data on the type of coffee you regularly drink and make changes so that the maker adjusts itself to your preferences. Enterprises benefit from IoT by monetizing their data assets, providing visibility to their customers and understanding their needs in a much better way. IoT analytics can enable:

    • Compelling visualizations, interactive reporting, ad hoc analysis and tailored dashboards can be embedded into applications.
    • Highly customizable web-based user interfaces to match the branding, look and feel.
    • Gathering competitor’s information and getting more insights based on merger, acquisitions, partnerships and pricing strategies.
    • Breaking down of markets into sub-segments to get a more comprehensive picture of the customer activities and buying patterns.

    IoT analytics tools have an unprecedented role in major industries like manufacturing, healthcare, energy and utilities, retail, transportation and logistics.

    Future cities are likely to include smart transport services for journey planning, adapting to travelers’ journey patterns etc., to reduce expenditure and make transport more affordable. Smart buildings will be able to react to information derived from sensor networks across the city, adjusting ventilation and window settings by cross-referencing pollution levels and weather. Imagine how amazing it would be if the system alerted you to an open parking space when you entered a building. Such systems can also be integrated into security systems to monitor the identities of inhabitants and check whether any unauthorized personnel enter the place.

    A world where devices or “things”, connected through networks and servers across the world, perform analysis not only on your business or your competitor’s profits but integrate accurate decision making into people’s daily lives is now palpable. It is analytics at its best, and what it should actually be like. It is just like having J.A.R.V.I.S. or SKYNET everywhere in the world.

    Business leaders are busy thinking about a better future for the world; well I’d say that –

    ‘Internet of Things Analytics (IoTA) is the FUTURE’
