• Python in Data Science

    “The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death” – Guido van Rossum (Creator of Python).

    Data Science is an emerging and extremely popular function in companies. Because the volume of data generated has increased significantly, a new array of tools and techniques is being deployed to make decisions from raw big data. Python is among the most popular tools used by Data Analysts and Data Scientists. It is a powerful programming language with dedicated libraries for Data Science.

    Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale.

    Python has long been one of the premier general scripting languages and a major web development language. Numerical analysis, data analysis and scientific programming developed through the packages NumPy and SciPy, which, along with the visualization package Matplotlib, formed the basis for an open-source alternative to MATLAB. NumPy provides array objects, cross-language integration, linear algebra and other functionality. SciPy builds on this and provides optimization, further linear algebra, statistics and basic image processing capabilities (more extensive image analysis is available in separate packages such as OpenCV).

    “One Python to Rule Them All”

    Beyond tapping into a ready-made Python developer pool, however, one of the biggest benefits of doing data science in Python is the added efficiency of using one programming language across different applications.

    It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to pay the constant cognitive cost of switching between languages in the middle of an analysis.

    Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. All of this overhead vanishes as soon as you move to a single language.

    Powerful statistical and numerical packages for Python include:

    • NumPy and pandas allow you to read and manipulate data efficiently and easily
    • Matplotlib allows you to create useful and powerful data visualizations
    • scikit-learn allows you to train and apply machine learning algorithms to your data and make predictions
    • Cython allows you to compile your code to C, which can greatly reduce the runtime
    • pymysql allows you to easily connect to a MySQL database, execute queries and extract data
    • Beautiful Soup allows you to easily read XML and HTML data, which is quite common nowadays
    • IPython for interactive programming

    Python as Part of Data Science


    Python’s role in the data science ecosystem can be broadly divided into 4 parts:
    1) DATA
    2) ETL
    3) Analysis and Presentation
    4) Technologies and Utilities

    Data can come in any form: structured or unstructured. Structured data follows a fixed, predefined layout that machines can readily interpret; it can live in a SQL database, a CSV file, etc. Structured data is the easy case in the data science industry.

    The real problem starts with unstructured data. Unstructured data is a generic label for data that is not contained in a database or some other type of data structure. Unstructured data can be textual or non-textual. Textual unstructured data is generated in media like email messages, PowerPoint presentations and instant messages. Python is very useful for reading all kinds of data formats and bringing them into a structured form.

    Extraction, Transformation and Loading (ETL) is the most costly part of data science. A data scientist may spend around 80% of the time on data exploration, data summarization, data extraction and transformation, 8% on modeling and 12% on visualization, although this can vary from project to project.

    Extraction: The desired data is identified and extracted from many different sources, including database systems and applications.
    Transformation: The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.
    Loading: It is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database.
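
    The three ETL steps above can be sketched with nothing but the standard library. This is a minimal illustration, not a production pipeline: the sample CSV, the `people` table and the function names `extract`, `transform` and `load` are all hypothetical, and an in-memory SQLite database stands in for the real target.

```python
import csv
import io
import sqlite3

# Hypothetical source data: heights recorded in mixed units.
RAW_CSV = """name,height,unit
alice,170,cm
bob,1.82,m
carol,165,cm
"""

def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert every height to centimetres so rows share one unit."""
    factors = {"cm": 1.0, "m": 100.0}
    return [(r["name"], float(r["height"]) * factors[r["unit"]]) for r in rows]

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, height_cm REAL)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real target database
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, height_cm FROM people ORDER BY name").fetchall()
print(rows)
```

    Keeping each step in its own function makes it easy to swap the source (for example, an API instead of a CSV file) without touching the transform or load logic.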

    Let’s take an example: we need Twitter data for a social media sentiment analysis.

    We need to follow a few basic steps to get clean, structured data.

    1) Read all the tweets in a single encoding (UTF-8)
    2) Expand contractions, e.g. “’re” should be replaced by “ are”, etc.
    3) Remove punctuation from the sentence, e.g. !()-[]{}’”,.^&*_~ should be removed.
    4) Remove hyperlinks
    5) Remove repeated characters from the sentence, e.g. “I’m happppyyyyy!!!” should become “I am happy” once steps 1 to 4 have also been applied to the sentence.
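
    The five cleaning steps can be sketched roughly as follows. This is only one possible approach under stated assumptions: the `CONTRACTIONS` and `VOCAB` tables are tiny hypothetical stand-ins for real lookup resources, the function names are invented for illustration, and links are removed before punctuation (a reordering of the list above) because stripping punctuation first would break the URLs apart.

```python
import re
from itertools import groupby, product

# Assumed, deliberately tiny lookup tables; a real project would use fuller ones.
CONTRACTIONS = {"'re": " are", "'m": " am", "'ll": " will", "'ve": " have"}
VOCAB = {"i", "am", "happy"}

def squeeze_repeats(word, vocab=VOCAB):
    """Shrink each run of a repeated character to 1 or 2 copies, preferring a known word."""
    runs = [(ch, min(len(list(grp)), 2)) for ch, grp in groupby(word)]
    options = [[ch * n for n in range(length, 0, -1)] for ch, length in runs]
    for candidate in map("".join, product(*options)):
        if candidate.lower() in vocab:
            return candidate
    # No known word matched: keep at most two copies of each repeated character.
    return "".join(ch * length for ch, length in runs)

def clean_tweet(raw):
    # Step 1: make sure we are working with UTF-8 decoded text.
    text = raw.decode("utf-8", errors="ignore") if isinstance(raw, bytes) else raw
    text = text.replace("\u2019", "'")            # normalise curly apostrophes
    # Step 4 (done early): remove hyperlinks while they are still intact.
    text = re.sub(r"https?://\S+", " ", text)
    # Step 2: expand contractions.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Step 3: drop punctuation, including the underscore.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Step 5: squeeze repeated characters word by word.
    return " ".join(squeeze_repeats(word) for word in text.split())

print(clean_tweet("I\u2019m happppyyyyy!!! http://example.com/x"))  # I am happy
```

    Note that squeezing repeats only lands on “happy” because that word is in the vocabulary; without a dictionary the sketch falls back to at most two copies of each character (“happyy”), which is why real pipelines usually pair this step with a spell checker.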

    Analysis and Presentation: Analysis with Python largely means analysis with packages like pandas.

    Package Highlights of Pandas:

    • A fast and efficient DataFrame object for data manipulation with integrated indexing.
    • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
    • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form.
    • Flexible reshaping and pivoting of data sets.
    • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
    • Columns can be inserted and deleted from data structures for size mutability.
    • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets.
    • High performance merging and joining of data sets.
    • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure.
    • Time-series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data.
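
    A few of the highlights above (missing-data handling, split-apply-combine and merging) can be illustrated in a handful of lines. The data frames and column names here are made up for the example, and this is a sketch rather than a recommended workflow; in particular, filling missing values with 0 is just one possible policy.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records with one missing value.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, np.nan, 7, 5],
})
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ana", "Raj"]})

# Handle missing data: here we simply treat a missing count as zero.
sales["units"] = sales["units"].fillna(0)

# Split-apply-combine: total units per region.
totals = sales.groupby("region", as_index=False)["units"].sum()

# Merge/join: attach the manager responsible for each region.
report = totals.merge(managers, on="region")
print(report)
```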

    Scikit-learn: scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
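
    As a rough illustration of the train-and-predict workflow, the sketch below fits one of the algorithms mentioned above (a random forest) on the iris data set that ships with scikit-learn. The choice of data set, split size and hyperparameters is arbitrary and only meant to show the shape of the API.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features come back as NumPy arrays, which is how scikit-learn interoperates with NumPy.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)               # train on the training split
predictions = model.predict(X_test)       # predict labels for unseen rows
accuracy = model.score(X_test, y_test)    # fraction of correct predictions
print(f"held-out accuracy: {accuracy:.2f}")
```

    Most scikit-learn estimators share the same `fit` / `predict` / `score` methods, so swapping in a support vector machine or gradient boosting model usually means changing only the line that constructs `model`.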

    For plotting in Python, we can use packages like Matplotlib and its pyplot interface.

    Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB. SciPy makes use of matplotlib.
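
    A minimal sketch of the object-oriented API might look like this. The non-interactive “Agg” backend and the in-memory PNG buffer are choices made here so that the example runs without a display; in a notebook or desktop session you would normally just show the figure.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Object-oriented style: work with explicit Figure and Axes objects.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Render the figure to PNG bytes instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
print(len(png_bytes), "bytes of PNG data")
```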

    Technologies and Utilities: By Technologies and Utilities we mean all the repetitive work that has been done in the past to get a result and is therefore a candidate for automation.

    NumPy plays an important role in automation.
    NumPy is the fundamental package for scientific computing with Python. It contains among other things:

    • a powerful N-dimensional array object
    • sophisticated (broadcasting) functions
    • tools for integrating C/C++ and Fortran code
    • useful linear algebra, Fourier transform, and random number capabilities

    Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
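
    Two of the points above, broadcasting and arbitrary data types, can be shown in a short sketch. The array contents and the field names in `record_type` are invented for illustration.

```python
import numpy as np

# Broadcasting: a (2, 3) array combined with a (3,) array, no explicit loop needed.
prices = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
discount = np.array([0.5, 1.0, 2.0])
scaled = prices * discount  # each row is multiplied element-wise by `discount`

# Arbitrary data types: a structured dtype makes an array behave like a small table.
record_type = np.dtype([("name", "U10"), ("age", "i4"), ("score", "f8")])
people = np.array([("ana", 31, 88.5), ("raj", 27, 92.0)], dtype=record_type)
mean_score = people["score"].mean()  # select a column by field name

print(scaled)
print(people["name"], mean_score)
```

    Structured arrays like `people` map naturally onto database rows, which is one reason NumPy is convenient as a generic data container.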

    IPython Notebook: The IPython Notebook is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media, as shown in this example session:

    The IPython notebook with embedded rich text, code, mathematics and figures.
    It aims to be an agile tool for both exploratory computation and data analysis, and provides a platform to support reproducible research, since all inputs and outputs may be stored in a one-to-one way in notebook documents.

    There are two components:
    1) The IPython Notebook web application, for interactive authoring of literate computations, in which explanatory text, mathematics, computations and rich media output may be combined. Input and output are stored in persistent cells that may be edited in-place.
    2) Plain text documents, called notebooks, for recording and distributing the results of the rich computations.
    The Notebook app automatically saves the current state of the computation in the web browser to the corresponding notebook, which is just a standard text file with the extension .ipynb, stored in a working directory on your computer. This file can be easily put under version control and shared with colleagues.

    Although notebook documents are plain text files, they use the JSON format so that they can store a complete, reproducible copy of the current state of the computation inside the Notebook app.
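
    Because a notebook is just JSON, it can be inspected with the standard library alone. The sketch below builds a tiny hand-written notebook and reads it back; the keys follow the version 4 notebook layout as an assumption, and real .ipynb files carry more metadata than shown here.

```python
import json

# A minimal hand-written notebook; real files contain additional metadata.
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 + 1)"]},
    ],
})

# Parsing the text gives ordinary dictionaries and lists.
notebook = json.loads(notebook_text)
code_cells = [
    "".join(cell["source"])
    for cell in notebook["cells"]
    if cell["cell_type"] == "code"
]
print(code_cells)
```

    In practice you would read the text from a file on disk with `open(...)` instead of building it inline; the parsing step stays the same.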

    Thus, Python has a great future in the data science industry. There is a large community of developers who continually build new functionality into Python. A good rule of thumb is: if you are thinking about implementing a numerical routine in your code, check the documentation website first, and you will likely find it already available in Python. Happy Learning.
