All Blog Posts Tagged 'modeling' - AnalyticBridge2019-08-22T17:00:27Zhttps://www.analyticbridge.datasciencecentral.com/profiles/blog/feed?tag=modeling&xn_auth=noComparing Model Evaluation Techniquestag:www.analyticbridge.datasciencecentral.com,2019-08-08:2004291:BlogPost:3936612019-08-08T16:37:43.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>In my previous posts, I compared model evaluation techniques using statistical tools and tests, as well as commonly used classification and clustering evaluation techniques.</p>
<p>In this post, I'll take a look at how you can compare regression models. Comparing regression models is perhaps one of the trickiest tasks in the "comparing models" arena; the reason is that there are literally dozens of statistics you can calculate to compare regression models, including:</p>
<p><strong>1. Error measures in the estimation period (in-sample testing) or validation period (out-of-sample testing):</strong></p>
<ul>
<li>Mean Absolute Error (MAE),</li>
<li>Mean Absolute Percentage Error (MAPE),</li>
<li>Mean Error,</li>
<li>Root Mean Squared Error (RMSE).</li>
</ul>
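These error measures are simple enough to compute directly; a minimal pure-Python sketch (the actual/predicted values below are made up for illustration):

```python
import math

def mae(actual, predicted):
    # Mean Absolute Error: average magnitude of the errors.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # Mean Absolute Percentage Error: errors relative to the actual values
    # (undefined when an actual value is zero).
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def mean_error(actual, predicted):
    # Mean Error: signed average, reveals systematic over/under-prediction.
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root Mean Squared Error: penalizes large errors more heavily than MAE.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
```

Computing all four on the same validation set is cheap, and comparing them side by side (e.g. RMSE much larger than MAE signals a few large errors) is often more informative than any single number.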
<p><br/><strong>2. Tests on Residuals and Goodness-of-Fit:</strong></p>
<ul>
<li>Plots: actual vs. predicted value; cross correlation; residual autocorrelation; residuals vs. time/predicted values,</li>
<li>Changes in mean or variance,</li>
<li>Tests: normally distributed errors; excessive runs (e.g. of positives or negatives); outliers / extreme values / influential observations.</li>
</ul>
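Two of the residual checks above, autocorrelation and runs of positives or negatives, can be sketched in a few lines. This is a simplified illustration, not a formal test statistic such as Durbin-Watson:

```python
def lag1_autocorrelation(residuals):
    # Lag-1 autocorrelation of residuals; values far from 0 suggest
    # structure the model failed to capture.
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[i] - mean) * (residuals[i + 1] - mean) for i in range(n - 1))
    den = sum((r - mean) ** 2 for r in residuals)
    return num / den

def count_sign_runs(residuals):
    # Number of runs of same-signed residuals; too few runs hints at
    # positive autocorrelation, too many at negative autocorrelation.
    signs = [r >= 0 for r in residuals]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
```

For a well-behaved model you would expect the lag-1 autocorrelation near zero and roughly as many sign runs as a random sequence would produce.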
<p>This list isn't exhaustive--there are many other tools, tests and plots at your disposal. Rather than discuss the statistics in detail, I chose to focus this post on comparing a few of the most popular regression model evaluation techniques and on when you might (or might not) want to use them. The techniques listed below tend to be on the "easier to use and understand" end of the spectrum, so if you're new to model comparison, they're a good place to start.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3414342046?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3414342046?profile=RESIZE_710x" class="align-center"/></a></p>
<p></p>
<p>The above picture (comparing models) was originally posted <a href="https://www.datasciencecentral.com/profiles/blogs/model-evaluation-techniques-in-one-picture" target="_blank" rel="noopener">here</a>. </p>
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/comparing-model-evaluation-techniques-part-3-regression-models" target="_blank" rel="noopener">here</a>. </em></p>Elegant Representation of Forward and Back Propagation in Neural Networkstag:www.analyticbridge.datasciencecentral.com,2019-08-08:2004291:BlogPost:3934122019-08-08T16:29:52.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Sometimes, you see a diagram and it gives you an ‘aha ha’ moment. Here is one representing forward propagation and back propagation in a neural network:<br/><a href="https://storage.ning.com/topology/rest/1.0/file/get/3388408048?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3388408048?profile=RESIZE_710x" class="align-center"/></a></p>
<p>A brief explanation is:</p>
<ul>
<li>Using the input variables x and y, the forward pass (left half of the figure) calculates the output z as a function of x and y, i.e. z = f(x, y).</li>
<li>The right side of the figure shows the backward pass.</li>
<li>Receiving dL/dz (the derivative of the total loss with respect to the output z), we can calculate the gradients of the loss with respect to x and y by applying the chain rule, as shown in the figure.</li>
</ul>
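The figure itself is not reproduced here, but assuming for illustration that the node computes f(x, y) = x * y, the two passes can be written directly:

```python
def forward(x, y):
    # Forward pass: compute the output z = f(x, y).
    z = x * y
    return z

def backward(x, y, dL_dz):
    # Backward pass: given dL/dz, apply the chain rule to get the
    # gradients of the loss with respect to each input:
    #   dL/dx = dL/dz * dz/dx = dL/dz * y
    #   dL/dy = dL/dz * dz/dy = dL/dz * x
    dL_dx = dL_dz * y
    dL_dy = dL_dz * x
    return dL_dx, dL_dy

z = forward(3.0, 4.0)            # z = 12.0
grads = backward(3.0, 4.0, 1.0)  # (4.0, 3.0)
```

A full network chains many such nodes: each one receives the upstream gradient, multiplies by its local derivatives, and passes the results further back.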
<p>I give a more detailed explanation in the full article.</p>
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/an-elegant-way-to-represent-forward-propagation-and-back" target="_blank" rel="noopener">here</a>. </em></p>Decision Tree vs Random Forest vs Gradient Boosting Machinestag:www.analyticbridge.datasciencecentral.com,2019-08-08:2004291:BlogPost:3934102019-08-08T16:25:09.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Decision Trees, Random Forests and Boosting are among the top 16 data science and machine learning tools used by data scientists. The three methods are similar, with a significant amount of overlap. In a nutshell:</p>
<ul>
<li>A decision tree is a simple decision-making diagram.</li>
<li>Random forests are a large number of trees, combined (using averages or "majority rules") at the end of the process.</li>
<li>Gradient boosting machines also combine decision trees, but start the combining process at the beginning, instead of at the end.</li>
</ul>
<p><strong>Decision Trees and Their Problems</strong></p>
<p>Decision trees are a series of sequential steps designed to answer a question and provide probabilities, costs, or other consequences of making a particular decision.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3414325027?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3414325027?profile=RESIZE_710x" class="align-center"/></a></p>
<p>They are simple to understand, providing a clear visual to guide the decision-making process. However, this simplicity comes with a few serious disadvantages, including overfitting, error due to bias, and error due to variance.</p>
<ul>
<li>Overfitting happens for many reasons, including the presence of noise and a lack of representative instances. A single large (deep) tree is particularly prone to overfitting.</li>
<li>Bias error happens when you place too many restrictions on the target function. For example, restricting your model to a simple functional form (e.g. a linear equation) or to a simple binary algorithm (like the true/false choices in the above tree) will often result in bias.</li>
<li>Variance error refers to how much a result changes in response to changes in the training set. Decision trees have high variance, which means that tiny changes in the training data can cause large changes in the final result.</li>
</ul>
<p><strong>Random Forest vs Decision Trees</strong></p>
<p>As noted above, decision trees are fraught with problems. A tree generated from 99 data points might differ significantly from one generated after changing just a single data point. If you could generate a very large number of trees and average out their solutions, you would likely get an answer very close to the true answer.</p>
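A toy simulation makes the averaging argument concrete. The "trees" below are just independent noisy estimators of a true value, not real decision trees, but they show why the spread of the averaged answer shrinks:

```python
import random
import statistics

random.seed(42)

def noisy_estimate(true_value, noise=1.0):
    # One high-variance estimator (a stand-in for a single deep tree).
    return true_value + random.gauss(0, noise)

def forest_estimate(true_value, n_trees, noise=1.0):
    # Average many independent noisy estimators ("trees"): the variance
    # of the average shrinks roughly as 1/n_trees.
    return statistics.mean(noisy_estimate(true_value, noise) for _ in range(n_trees))

single = [noisy_estimate(5.0) for _ in range(2000)]
forest = [forest_estimate(5.0, 100) for _ in range(2000)]
```

Real random forests also decorrelate their trees by bootstrapping rows and sampling feature subsets, so the reduction in practice is smaller than this idealized 1/n, but the principle is the same.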
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/decision-tree-vs-random-forest-vs-boosted-trees-explained" target="_blank" rel="noopener">here</a>. </em></p>The Power of Machine Learning Modelstag:www.analyticbridge.datasciencecentral.com,2019-08-07:2004291:BlogPost:3935052019-08-07T09:30:00.000ZArash Aghlarahttps://www.analyticbridge.datasciencecentral.com/profile/ArashAghlara
<p>Properly implemented Machine Learning (ML) models can have a positive effect on organizational efficiency. It is first necessary to understand how these models are created, how they function, and how they are put into production.</p>
<p><strong>The Definition of a Machine Learning Model</strong></p>
<p>Properly implemented Machine Learning (ML) models can have a positive effect on organizational efficiency. It is first necessary to understand how these models are created, how they function, and how they are put into production.</p>
<p><strong>The Definition of a Machine Learning Model</strong></p>
<p>When a computer is presented with questions within a particular domain, a machine learning model runs an algorithm that enables it to resolve those questions. These algorithms are not necessarily limited to particular scenarios, but can be tuned to a higher degree of accuracy for certain types of questions. Common use cases are listed below.</p>
<ul>
<li>Regression questions, such as ‘how much’ or ‘how many’. For example, how much will my car be worth in two years?</li>
<li>Classification questions. For example, to what class does this object belong?</li>
<li>Clustering or grouping questions. For example, what are the natural clusters in this particular set of objects?</li>
<li>Abnormality detection questions. For example, is this object abnormal relative to what is defined as normal?</li>
</ul>
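As a concrete (and deliberately simplistic) sketch of the last question type, a z-score rule flags a value as abnormal when it sits far from what the "normal" sample defines; real abnormality-detection models are far richer than this:

```python
import statistics

def is_abnormal(value, normal_sample, threshold=3.0):
    # Flag a value as abnormal if it lies more than `threshold`
    # standard deviations from the mean of the "normal" sample.
    mean = statistics.mean(normal_sample)
    stdev = statistics.pstdev(normal_sample)
    return abs(value - mean) > threshold * stdev

# Made-up readings that define "normal" for this illustration.
normal_sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
```

The same shape applies to the other question types: a regression model returns a number, a classifier a label, a clustering method a group assignment.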
<p>These models are built by engineers and data scientists using tools, frameworks, and code, based on what is often a huge amount of data.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3411924717?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3411924717?profile=RESIZE_710x" class="align-full"/></a></p>
<p>To build a really effective machine learning model, massive amounts of data are needed. This data needs to be cleaned and labelled. It is an iterative process, involving trial and error, as well as tests and measures. Fundamentally, there are many steps and processes involved in creating a functional model. Once this model is created, the computer will be able to answer questions for different cases within a particular scenario.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3411925684?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3411925684?profile=RESIZE_710x" class="align-full"/></a></p>
<p>The machine learning model is used to find answers to specific questions regarding different cases. Each model is specific to a particular scenario. For example, is an issue with a product fixable or not, or is this set of symptoms indicative of a particular medical problem, or is this a legitimate bank transaction? In other words, a computer can suggest a solution with a certain degree of accuracy based on the data that is used to create the machine learning model.</p>
<p> </p>
<p><strong>How Can Machine Learning Models Help Us?</strong></p>
<p>The goal of every Machine Learning model is to achieve the following:</p>
<ul>
<li>Integrate workflows and processes that involve multiple participants</li>
<li>Enable information systems to utilize certain algorithms with minimal code revision</li>
<li>Provide analytics as a service by sharing the model between multiple use cases</li>
<li>Integrate the model systematically, whether in batch or on-the-fly use cases</li>
<li>Combine multiple models to answer complex questions requiring multi-step answers</li>
<li>Use models in decision making within the organization or with external customers.</li>
</ul>
<p>The ability to monitor and measure the behavior of the models in a live environment is critical, as it facilitates a cycle of constant improvement. Individualized models are generally not as useful as those that are part of a more sophisticated deployment involving multiple scenarios. In such cases, the solutions suggested by the model feed into a decision model based on a domain expert’s knowledge and implemented through business rules.</p>
<p>Let’s take the example of car insurance. A machine learning model will be designed by an insurance company using their own data sets that detail stolen cars. The model will categorize a car as low, medium or high risk.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3411926906?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3411926906?profile=RESIZE_710x" class="align-full"/></a></p>
<p>As such, calculating an insurance quote for a particular car would involve calling a Machine Learning model, which would identify the likelihood of the car being stolen and send the result to another part of the quotation process to calculate the cost of an insurance policy. In this case, the Machine Learning model is integrated into the Quote Generation Process.</p>
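A sketch of that integration might look as follows. Every name here (`classify_theft_risk`, `generate_quote`, the base premium and risk loadings) is hypothetical, invented only to illustrate how the model's output feeds the quotation process:

```python
# Hypothetical pricing constants, for illustration only.
BASE_PREMIUM = 500.0
RISK_LOADING = {"low": 1.0, "medium": 1.4, "high": 2.0}

def classify_theft_risk(car_features):
    # Stand-in for the trained ML model: score the car and bucket it
    # into low / medium / high theft risk.
    score = 0.5 * car_features["theft_rate_of_model"] + 0.5 * car_features["area_crime_rate"]
    if score < 0.3:
        return "low"
    if score < 0.6:
        return "medium"
    return "high"

def generate_quote(car_features):
    # The quotation process calls the model, then prices the policy
    # from the predicted risk bucket.
    risk = classify_theft_risk(car_features)
    return round(BASE_PREMIUM * RISK_LOADING[risk], 2)
```

The key design point is the separation: the model only answers the risk question, while the surrounding business process decides what that answer means for the premium.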
<p><strong>Conclusion</strong></p>
<p>Machine Learning models are most useful when they are integrated as part of a business decision to deliver <a href="http://www.flexrule.com/archives/what-is-business-value/" target="_blank" rel="noopener">business value</a>. It is crucial that these models are able to execute requests on-the-fly. The performance of these models in a specific context must be monitored, measured, and improved over time.</p>
<p><em>Read more <a href="http://www.flexrule.com" target="_self">here</a>.</em></p>
<p> </p>Why is Python a Top Choice of the data analytics?tag:www.analyticbridge.datasciencecentral.com,2019-07-25:2004291:BlogPost:3933192019-07-25T06:53:37.000ZDivyesh Aegishttps://www.analyticbridge.datasciencecentral.com/profile/DivyeshAegis
<p align="justify">Python is an extremely popular programming language. It is not just apt for generic purposes; it is also extremely easy to read and use. The main reason why Python is used by a majority of people these days is that it allows programmers to save time by writing only a limited number of lines of code. Developers do not have to spend a lot of time on coding to accomplish tasks, unlike with other languages; instead, they can spend that time improving the product and making it better. At the same time, there are many libraries that make Python a preferred choice, such as SciPy and Matplotlib.</p>
<p><br/> <strong>Do data scientists use Python?</strong></p>
<p align="justify">Yes, data analysts around the world are fond of Python. Data scientists come across plenty of data and have to handle a host of client requests. There are certain basic trends and techniques that a data analyst has to master to make sure that the data is analyzed properly and the best results are churned out of it. To use Python for data analysis, a few key things have to be kept in mind, like basic filtering and aggregation of the data. In most cases, the Pandas library is used to analyze the data, so it is very important to install Pandas.</p>
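Basic filtering and aggregation with Pandas look like this (the DataFrame below is made up for illustration):

```python
import pandas as pd

# A small made-up dataset to illustrate basic filtering and aggregation.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "sales":  [100, 150, 200, 50, 250],
})

# Basic filtering: keep only the rows meeting a condition.
big_sales = df[df["sales"] > 100]

# Basic aggregation: total sales per region.
totals = df.groupby("region")["sales"].sum()
```

These two operations (boolean filtering and `groupby` aggregation) cover a surprising share of day-to-day analysis work.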
<p align="justify"></p>
<p align="justify"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3382246514?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3382246514?profile=RESIZE_710x" class="align-center"/></a></p>
<p><br/> <strong>Why is Python a top choice of data scientists?</strong></p>
<p></p>
<p><strong>The pace of the language</strong></p>
<p align="justify">Python is one of the most advanced programming languages in the world today. It offers a large number of advantages that lead to code development at a high pace. Because the language is high-level, it becomes very quick and efficient for data scientists to prototype ideas. But the most important thing is that coding becomes super-fast. The fact that it is very easy to learn makes it all the more preferred by data scientists, so people from different backgrounds also like Python, knowing the language will be easy to learn and that a lot of benefits can later be derived from mastering it. When it comes to using Python, there is also a lot of transparency between the code and its execution. This greater level of transparency smoothens the maintenance of code: things like finding bugs or rewriting code become easy, and if programmers want to add anything to the code base, that becomes possible as well.</p>
<p><br/> <strong>Python and Data Science is a fab combo</strong></p>
<p align="justify">Python and data science are undoubtedly a fantastic combination. The language is used by a majority of companies, irrespective of their size and field of work. Whether we talk about a big company or a small startup, everyone is <strong><a href="https://www.nexsoftsys.com/technologies/python-development-services.html" target="_blank" rel="noopener">using Python development</a>.</strong> The language has become one of the most promising programming languages in the world, and there is unquestionably enough scope for it. Hence, it is regarded as a top-notch language for data science, used not just by big-data firms but by a range of other companies as well. Even machine learning experts find Python a great option.</p>
<p></p>
<p align="justify">Most people who use Python rely on one of two libraries, Pandas or NumPy. They also tend to opt for third-party packages curated specifically for data science. To master Python for data science, one has to know most of the data containers in Python; apart from that, data scientists need enough knowledge about indexing, the power of arrays, and so on. There is a lot of scope for Python in data science, but at the end of the day it is important to master it: only when scientists figure out the best ways to use it will they be able to reap its benefits. Hence, it is important to learn it thoroughly in order to make the most of it.</p>Is Python Completely Object Oriented?tag:www.analyticbridge.datasciencecentral.com,2019-07-16:2004291:BlogPost:3934542019-07-16T06:55:08.000ZDivyesh Aegishttps://www.analyticbridge.datasciencecentral.com/profile/DivyeshAegis
<p align="justify">Python was introduced in 1991 by Guido van Rossum as a high-level, general-purpose language. Even today, it supports multiple programming paradigms, including procedural, object oriented, and functional. It soon became one of the most popular languages in the industry, and is in fact the very language that influenced Ruby and Swift. Even <a href="https://www.tiobe.com/tiobe-index/" target="_blank" rel="noopener">TIOBE Index reports</a> rank Python as the third most popular language in the world today! And mind you, this popularity has less to do with Monty Python and more with the standard libraries that help a programmer do anything and everything in merely a few lines of code.</p>
<p align="justify"></p>
<p align="justify"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3284174563?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3284174563?profile=RESIZE_710x" class="align-full"/></a></p>
<p align="justify"></p>
<p align="justify">This popularity has less to do with Monty Python and more with the standard libraries that <a href="https://www.nexsoftsys.com/hire/python-developers.html" target="_blank" rel="noopener">help a python programmer</a> do anything and everything in merely a few lines of code. However, a large community of programming language enthusiasts has called Python a ‘hybrid’, or ‘partially object oriented’, in nature. While Python’s fame is accredited to its object oriented properties, why do some claim it is not even an OOL/OOP? Before we dive deep into this, let’s take a look at what an Object Oriented Language truly is.</p>
<p></p>
<p><strong>What is an Object Oriented Language?</strong></p>
<p align="justify"></p>
<p align="justify">As the name suggests, an Object Oriented Programming Language (or Object Oriented Language) is all about objects. It models real-world entities in programming, through concepts like inheritance and polymorphism. For this, it uses object instances of a class, where a class is the building block of such a language. The main idea behind introducing OOP was to bind data and functions into one unit so that the outside world cannot access private data – something achieved through encapsulation and abstraction.</p>
<p>The main concepts of an object oriented programming language are:</p>
<ul>
<li><strong>Polymorphism</strong>: Using the same name for multiple functions.</li>
<li><strong>Encapsulation</strong>: Binding data and functions as one unit.</li>
<li><strong>Inheritance</strong>: Using functions of a preceding class.</li>
<li><strong>Abstraction</strong>: Hiding data and sharing functions.</li>
<li><strong>Classes</strong>: Groups of different data types and the functions to access and manipulate them; a user-defined prototype of an object.</li>
<li><strong>Objects</strong>: Instances of a class; things you can operate on.</li>
</ul>
<p align="justify">And this is exactly where the dilemma with Python arose. Python does not enforce encapsulation – a very important component of Object Oriented Programming!</p>
<p><strong>Why is Encapsulation not supported in Python?</strong></p>
<p align="justify">Strange as it may be to hear, the idea of encapsulation not being enforced in Python has philosophical roots. People believe Guido did not think hiding data was necessary; he believed in sharing data just as you would share your functions. Long ago, in an explanation, Guido said, “we are all consenting adults here”.</p>
<p align="justify">It is worth noting that Alan Kay, the creator of the term OOP, said, "Actually I made up the term 'object-oriented', and I can tell you I did not have C++ in mind." So it is safe to assume he did not have Python in mind either. While Python trusts its maker, it also trusts its users: there are no access specifiers in Python, and you can go around poking in the dark.</p>
<p align="justify">However, if it is absolutely necessary to hide data, Python does give you an option: prefixing a name with an underscore marks it as private and signals that one shouldn’t use it.</p>
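A short sketch of that convention: a single leading underscore is purely a signal, while a double leading underscore additionally triggers name mangling, which discourages (but still does not prevent) outside access:

```python
class Account:
    def __init__(self, balance):
        self._balance = balance   # single underscore: "internal, please don't touch"
        self.__audit_log = []     # double underscore: mangled to _Account__audit_log

acct = Account(100)
```

Neither form is real access control: `acct._balance` still works, and the mangled attribute is reachable as `acct._Account__audit_log`. It is etiquette, exactly in the "consenting adults" spirit.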
<p align="justify">Guido may have left out enforced encapsulation, but it is time we appreciate him for introducing indentation. As a programmer, you have probably come to believe that encapsulation gives you more of a ‘sense of security’ than actual security, whereas indentation provides actual readability to the code. He dropped a not-so-necessary feature and introduced hundreds of powerful ones in its place. It is time we appreciate those.</p>
<p><strong>‘Pure’ object oriented language</strong></p>
<p align="justify">One can argue that Smalltalk, the first ever object oriented programming language, is the only truly Object Oriented programming language. However, in all honesty, object orientation is really a continuum. If Smalltalk is the purest of them, others lie at varied points along the scale. Python, for example, scores lower due to its lack of enforced encapsulation. And even if Python is not a 100% pure object oriented language, one can write programs in it that work better – programs that sometimes don’t work in Smalltalk at all.</p>
<p align="justify">So is 100% object orientation actually desirable? That is a question worth contemplating.</p>
<p><strong>Final Verdict</strong></p>
<p align="justify">Looking at facts and figures, we can start with the assumption that Python is an ‘object based language’, as it has proper classes defined. A very simple example would be:</p>
<p style="text-align: center;">a = 10</p>
<p align="justify">Here, 10 is an object and hence belongs to a class; the class in this case is ‘int’. You may be thinking of Java, where ‘int’ is a primitive data type. But, surprisingly, ‘int’ is a class in Python – a commendable approach to object orientation.</p>
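A few lines confirm that 10 really is an object of the class int, with a type and methods like any other object:

```python
a = 10

checks = (
    isinstance(a, int),   # 10 is an instance of the class int
    type(a) is int,       # its type really is the class int
    a.bit_length(),       # it has methods, like any object: 10 = 0b1010 -> 4 bits
    (10).__add__(5),      # even '+' is a method call under the hood
)
```

In Java, `10` would be a primitive with no methods; in Python, every value you touch, including integers, functions, and classes themselves, is an object.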
<p align="justify">Coming back to our definition of an Object Oriented Language: it is a programming language that uses classes and objects to manipulate data and applies real-world concepts like inheritance to them. Since Python is fully capable of doing this, it definitely qualifies.</p>
<p><strong>Yes, Python is an Object Oriented Programming Language</strong></p>
<p align="justify">You can carry out inheritance and polymorphism, and you can make hundreds of objects of a class. Python is a multi-paradigm language that has been object oriented since the day it came into existence. However, its efficiency completely depends on your code.</p>
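A minimal sketch of inheritance and polymorphism in Python (a toy Shape hierarchy, not tied to any particular application):

```python
class Shape:
    def area(self):
        raise NotImplementedError

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):          # polymorphism: same name, class-specific behavior
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    def area(self):
        return 3.14159 * self.radius ** 2

# Both subclasses are Shapes, and each answers area() in its own way.
shapes = [Square(2), Circle(1)]
areas = [s.area() for s in shapes]
```

The caller never checks which concrete class it holds; it simply asks each object for its area, which is polymorphism at work.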
<p align="justify">Python has come a long way from Smalltalk or even Java, but the changes have all been for the better. If you remember to be DRY (Don’t Repeat Yourself) and shy (don’t poke into functions you shouldn’t), your code will be efficient, no matter what the language is.</p>
<p>Sources:</p>
<p><a href="https://www.tiobe.com/tiobe-index/" target="_blank" rel="noopener">https://www.tiobe.com/tiobe-index/</a></p>
<p><a href="https://en.wikipedia.org/wiki/Python_(programming_language)" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Python_(programming_language)</a></p>
<p><a href="https://mail.python.org/pipermail/tutor/2003-October/025932.html" target="_blank" rel="noopener">https://mail.python.org/pipermail/tutor/2003-October/025932.html</a></p>How the Mathematics of Fractals Can Help Predict Stock Markets Shiftstag:www.analyticbridge.datasciencecentral.com,2019-07-08:2004291:BlogPost:3930292019-07-08T16:25:57.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>In financial markets, two of the most common trading strategies used by investors are the momentum and mean reversion strategies. If a stock exhibits momentum (or trending behavior as shown in the figure below), its price on the current period is more likely to increase (decrease) if it has already increased (decreased) on the previous period.</p>
<p>When the return of a stock at time t depends in some way on the return at the previous time t-1, the returns are said to be autocorrelated. In the momentum regime, returns are positively correlated.</p>
<p>In contrast, the price of a mean-reverting stock fluctuates randomly around its historical mean and displays a tendency to revert to it. When there is mean reversion, if the price increased (decreased) in the current period, it is more likely to decrease (increase) in the next one.</p>
<p>A section of the time series of log returns of the Apple stock (adjusted closing price), shown below, is an example of mean-reverting behavior.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3211474393?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3211474393?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Note that, since the two regimes occur in different time frames (trending behavior usually occurs in larger timescales), they can, and often do, coexist.</p>
<p>In both regimes, the current price contains useful information about the future price. In fact, trading strategies can only generate profit if asset prices are either trending or mean-reverting since, otherwise, prices are following what is known as a random walk (see the animation below).</p>
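One way to make the momentum / mean-reversion distinction concrete is to simulate returns with a simple AR(1) process (my illustration, not from the article) and measure the lag-1 autocorrelation:

```python
import random

random.seed(0)

def simulate_returns(phi, n=5000):
    # AR(1) returns: r_t = phi * r_{t-1} + noise.
    # phi > 0 -> momentum, phi < 0 -> mean reversion,
    # phi = 0 -> uncorrelated returns (prices follow a random walk).
    r, out = 0.0, []
    for _ in range(n):
        r = phi * r + random.gauss(0, 1)
        out.append(r)
    return out

def lag1_autocorr(xs):
    # Sample lag-1 autocorrelation of a series.
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

momentum = lag1_autocorr(simulate_returns(0.5))    # clearly positive
reverting = lag1_autocorr(simulate_returns(-0.5))  # clearly negative
```

The estimated autocorrelations land near +0.5 and -0.5 respectively, matching the sign convention described above; for a pure random walk (phi = 0) the estimate hovers near zero.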
<p>Read full (long) article <a href="https://www.datasciencecentral.com/profiles/blogs/how-the-mathematics-of-fractals-can-help-predict-stock-markets" target="_blank" rel="noopener">here</a>. <span>For free books about machine learning and data science, </span><a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members" target="_blank" rel="noopener">follow this link</a><span>. </span></p>Where’s the Love – Trends in Data Science Career Opportunitiestag:www.analyticbridge.datasciencecentral.com,2019-07-08:2004291:BlogPost:3933392019-07-08T16:18:23.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><span> </span><em> The annual Burtch Works salary survey tells us a lot about which industries are using the most data scientists and the difference between higher and lower skilled data scientists. Salary increases show us whether demand is increasing, and finally we take a shot at determining which skills are most in demand.</em></p>
<p> What a difference a few years can make. We used to say that everyone loves a data scientist – and wants to be one. That’s still true. But as data science has increasingly been adopted by businesses at all levels, industries, and geographies, the nature of the opportunities available to data scientists has also changed.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3211466047?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3211466047?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Yes, it’s still one of the most interesting and rewarding career choices you can make. I wouldn’t trade it for anything. Where else can you create value out of previously unvalued data while basically predicting the future? Of course, I’m talking about what customers will do, what prices or values will be, or whether something is abnormal. All the things we’re involved with on a day-to-day basis.</p>
<p>Read the full article <a href="https://www.datasciencecentral.com/profiles/blogs/where-s-the-love-trends-in-data-science-career-opportunities" target="_blank" rel="noopener">here</a>. For free books about machine learning and data science, <a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members" target="_blank" rel="noopener">follow this link</a>. </p>How to learn the maths of Data Science using your high school maths knowledgetag:www.analyticbridge.datasciencecentral.com,2019-06-27:2004291:BlogPost:3931012019-06-27T18:22:15.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>By Ajit Jaokar. This post is a part of my forthcoming book on Mathematical foundations of Data Science. In this post, we use the Perceptron algorithm to bridge the gap between high school maths and deep learning. </p>
<p><strong>Background</strong></p>
<p>As part of my role as course director of the Artificial Intelligence: Cloud and Edge Computing at the University of Oxford, I see more students who are familiar with programming than with mathematics.</p>
<p>They last learnt maths years ago at university. Then, suddenly, they encounter matrices, linear algebra, and so on when they start learning data science.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3138240717?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3138240717?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Ideas they thought they would not face again after college! Worse still, in many cases, they do not know where precisely these concepts apply to data science.</p>
<p>If you consider the maths foundations needed to learn data science, you could divide them into four key areas</p>
<ul>
<li>Linear Algebra</li>
<li>Probability Theory and Statistics</li>
<li>Multivariate Calculus</li>
<li>Optimization</li>
</ul>
<p>All of these are taught (at least partially) in high schools (14 to 17 years of age). In this book, we start with these ideas and relate them to data science and AI.</p>
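<p>To give a taste of how the Perceptron bridges that gap, here is a minimal sketch (illustrative only, not code from the book) that learns the logical AND using nothing beyond high-school arithmetic: a weighted sum, a threshold, and small error-driven corrections.</p>

```python
# Minimal perceptron: learn the logical AND of two inputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]   # one weight per input
b = 0.0          # bias (threshold shift)
lr = 0.1         # learning rate

def predict(x):
    # Weighted sum followed by a hard threshold
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

for _ in range(20):                  # a few passes over the data
    for x, target in data:
        error = target - predict(x)  # -1, 0, or +1
        w[0] += lr * error * x[0]    # nudge weights toward the target
        w[1] += lr * error * x[1]
        b += lr * error

print([predict(x) for x, _ in data])  # [0, 0, 0, 1]
```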
<p>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-learn-the-maths-of-data-science-using-your-high-school" target="_blank" rel="noopener">here</a>. </p>Machine Learning and Data Science Cheat Sheettag:www.analyticbridge.datasciencecentral.com,2019-06-07:2004291:BlogPost:3931312019-06-07T02:27:48.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Originally published in 2014 and viewed more than 200,000 times, this is the oldest data science cheat sheet - the mother of all the numerous cheat sheets that are so popular nowadays. I decided to update it in June 2019. While the first half, dealing with installing components on your laptop and learning UNIX, regular expressions, and file management hasn't changed much, the second half, dealing with machine learning, was rewritten entirely from scratch. It is amazing how things changed in just five years!</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2802101885?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2802101885?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Written for people who have never seen a computer in their life, it starts with the very beginning: buying a laptop! You can skip the first half and jump to sections 5 and 6 if you are already familiar with UNIX. This new cheat sheet will be included in my upcoming book<span> </span><em>Machine Learning: Foundations, Toolbox, and Recipes</em><span> </span>to be published in September 2019, and available (for free) to Data Science Central members exclusively. This cheat sheet is 14 pages long.</p>
<p><strong>Content</strong></p>
<p>1. Hardware</p>
<p>2. Linux environment on Windows laptop</p>
<p>3. Basic UNIX commands</p>
<p>4. Scripting languages</p>
<p>5. Python, R, Hadoop, SQL, DataViz</p>
<p>6. Machine Learning</p>
<ul>
<li>Algorithms</li>
<li>Getting started</li>
<li>Applications</li>
<li>Data sets and sample projects</li>
</ul>
<p>This new cheat sheet is available <a href="https://www.datasciencecentral.com/profiles/blogs/data-science-cheat-sheet" target="_blank" rel="noopener">here</a>. </p>Questions To Answer And Factors To Consider For Web Analyticstag:www.analyticbridge.datasciencecentral.com,2019-06-06:2004291:BlogPost:3929412019-06-06T07:30:00.000ZJenny Richardshttps://www.analyticbridge.datasciencecentral.com/profile/JennyRichards
<p>It is unwise to expect that a significant amount of web traffic will, by itself, generate a lot of sales. You will need to track your website metrics properly in order to take the measures necessary to convert that traffic into business prospects. You will also need to analyze your website from time to time to ensure that it is not only accessible to users but also gives them all the guidance they need to make a purchase.</p>
<p>If you are a good online marketing strategist, you will know the answers to the various questions your visitors may raise. To ensure that you provide them with exactly the solutions they want, you will need to answer a few specific questions of your own, and a proper website analysis will help a great deal here.</p>
<p>Ideally, a successful business owner knows the answers to questions such as:</p>
<ul>
<li>Where is the traffic to their website coming from?</li>
<li>Which web pages do their visitors land on the most?</li>
<li>What percentage of their visitors come back to visit their website?</li>
<li>How many such visitors actually convert into customers?</li>
</ul>
<p>If you lack these answers, it implies that you are concerned only with the volume of traffic, not its quality. Any easy-to-reach site can attract a large number of visitors, but accessibility alone does not mean those sites are of the highest quality. If your site lacks quality, your sales will surely suffer as a result.</p>
<p>Therefore, for a website to generate leads and sales, it does not matter how many people visit it; what matters is whether those visitors are the "right" kind of people – that is, people who will buy your product or service someday.</p>
<p>It is for this reason that you should emphasize good website analytics, as that takes the mystery out of wondering who is visiting your company's website and why. You need not be an expert online marketing strategist to use these web analytics tools. You will find a plethora of <a href="https://www.tableau.com/trial/simplify-web-analytics" target="_self">website analytics packages</a><a href="https://storage.ning.com/topology/rest/1.0/file/get/2800633706?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2800633706?profile=RESIZE_710x" class="align-full"/></a> for sale on the web. However, it is best to get started with the free Google Analytics.</p>
<p><strong>Factors to consider</strong></p>
<p>Google's service provides you with a snippet of code that you can plug into each page of your website to start tracking its functionality and performance. These tools give you a proper breakdown of how many visitors came to your website, and also tell you several other things, such as:</p>
<ul>
<li>How long a visitor stayed on your site</li>
<li>Which sites they came from</li>
<li>What search terms or keywords they used to reach your website, and</li>
<li>Which pages on your website they visited the most.</li>
</ul>
<p>To answer these significant questions and get the right numbers, you will need to consider a few important points:</p>
<ul>
<li>Do the visitors of your website already know you – The primary objective of a website is to link the brand or the business with potential new customers. These customers may never have heard about you or your business before, and they are not simply looking up your web address. Ideally, only a small percentage of visitors to a well-crafted website, maybe 5%, will actually have used the name of the company to find and visit it.</li>
<li>Is your website bringing in more potential customers – You should use the proper search keywords so that people can reach your site more easily. Typically, proper keywords raise your website's <a href="https://siteimprove.com/en-us/accessibility/what-is-accessibility/">accessibility</a>, enabling you to bring in more potential customers. All you have to do then is find out what they want and deliver exactly that. They will surely buy from you if you offer the best deal.</li>
<li>How well does your social media presence work – Most businesses today use their social media connections to guide people toward their business websites. Typically, if you put 10% of your online marketing energy into social media and 25% of all visits to your website come from Facebook or Twitter, you can consider your business to be in good shape. You will be even better off if you set a few specific goals in Google Analytics. One such goal may be for visitors coming from Facebook to see a particular post on your website that links to a specific promotional offer. You can then tweet the promotion later on and see how many people followed the path.</li>
<li>Are your visitors bailing from the homepage of your site – Google Analytics tells you the bounce rate of your home page: the exact percentage of visitors to your home page who did not click through to any additional pages. If you find that the bounce rate is above 60%, you may consider that your site has a problem.</li>
</ul>
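<p>To make these numbers concrete, here is a minimal sketch computing referrer share, returning visitors, and bounce rate from a toy session log. The data and field layout are hypothetical, purely for illustration – they are not what an analytics package exports.</p>

```python
from collections import Counter

# Toy session log: (visitor_id, referrer, pages_viewed) -- illustrative data only
sessions = [
    ("a", "google",   3), ("a", "direct", 1),
    ("b", "facebook", 1), ("c", "google", 5),
    ("d", "direct",   1), ("e", "google", 2),
]

visits = len(sessions)

# Share of traffic per referrer
by_referrer = Counter(ref for _, ref, _ in sessions)

# Returning visitors: ids seen in more than one session
seen = Counter(vid for vid, _, _ in sessions)
returning = sum(1 for n in seen.values() if n > 1)

# Bounce rate: share of sessions that viewed only one page
bounces = sum(1 for _, _, pages in sessions if pages == 1)
bounce_rate = bounces / visits

print(by_referrer["google"] / visits)       # 0.5
print(returning, "returning visitor(s)")
print(f"bounce rate: {bounce_rate:.0%}")    # bounce rate: 50%
```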
<p>Typically, Google Analytics will tell you the specific search terms your visitors use to find your site, and even whether they are the people you consider to be ‘right.’ If they are, they should delve deeper into your site; if the analytics show they are not, consider redesigning your homepage to make it look more professional and its content more compelling, and to make your site less confusing, so as to ensure better conversions.</p>7 Simple Tricks to Handle Complex Machine Learning Issuestag:www.analyticbridge.datasciencecentral.com,2019-06-04:2004291:BlogPost:3925262019-06-04T18:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We propose simple solutions to important problems that all data scientists face almost every day. In short, a toolbox for the handyman, useful to busy professionals in any field.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2760849159?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2760849159?profile=RESIZE_710x" class="align-center"/></a></p>
<p><strong>1. Eliminating sample size effects</strong>. <span>Many statistics, such as correlations or R-squared, depend on the sample size, making it difficult to compare values computed on two data sets of different sizes. Based on re-sampling techniques, use this easy trick to compare apples with other apples, not with oranges. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-normalize-correlations-r-squared-and-so-on" target="_blank" rel="noopener">here</a>. </span></p>
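<p>A minimal sketch of the idea (the linked article has the full recipe; the data here is simulated): re-compute the statistic on random subsamples of a fixed common size, so that both data sets are summarized at the same sample size.</p>

```python
import random, statistics

def corr(xs, ys):
    """Pearson correlation, computed from scratch."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def resampled_corr(xs, ys, size, trials=200):
    """Average correlation over random subsamples of a fixed size,
    so data sets of different sizes are compared on equal footing."""
    vals = []
    for _ in range(trials):
        idx = random.sample(range(len(xs)), size)
        vals.append(corr([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.mean(vals)

random.seed(1)
big_x = [random.gauss(0, 1) for _ in range(1000)]
big_y = [x + random.gauss(0, 1) for x in big_x]
small_x, small_y = big_x[:50], big_y[:50]

# Both data sets are now summarized at the same sample size (50)
r_big = resampled_corr(big_x, big_y, 50)
r_small = corr(small_x, small_y)
print(r_big, r_small)
```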
<p><span><strong>2. Sample size determination, and simple, model-free confidence intervals</strong>. We propose a generic methodology, also based on re-sampling techniques, to compute any confidence interval and for testing hypotheses, without using any statistical theory. Also, it is easy to implement, even in Excel. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/modern-re-sampling-and-statistical-recipes" target="_blank" rel="noopener">here</a>. </span></p>
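<p>The spirit of the method can be sketched with a standard percentile bootstrap (the article's methodology may differ in its details), which indeed needs no statistical theory: compute the estimator on many re-samples and read off the central percentile range.</p>

```python
import random, statistics

def resampling_ci(data, estimator, level=0.95, trials=1000):
    """Model-free confidence interval: compute the estimator on many
    re-samples and take the central percentile range."""
    stats = []
    for _ in range(trials):
        sample = [random.choice(data) for _ in data]  # sample with replacement
        stats.append(estimator(sample))
    stats.sort()
    lo = stats[int((1 - level) / 2 * trials)]
    hi = stats[int((1 + level) / 2 * trials) - 1]
    return lo, hi

random.seed(2)
data = [random.expovariate(1.0) for _ in range(400)]  # skewed, non-normal data

lo, hi = resampling_ci(data, statistics.median)
print(f"95% CI for the median: [{lo:.3f}, {hi:.3f}]")
```

The same function works for any estimator – pass `statistics.mean`, `statistics.pstdev`, or your own callable instead of the median.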
<p><span><strong>3. Determining the number of clusters in non-supervised clustering</strong>. This modern version of the elbow rule also tells you how strong the global optimum is, and can help you identify local optima too. It can also be automated. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">here</a>. </span></p>
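<p>For flavor, here is the classic second-difference version of the elbow rule on an illustrative distortion curve (the article's modern variant adds a strength score for the optimum and full automation):</p>

```python
# Toy distortion curve: within-cluster sum of squares for k = 1..8 clusters.
# (Values are illustrative; in practice they come from running k-means.)
distortion = [1000, 800, 200, 180, 165, 155, 148, 143]

# Elbow heuristic: the best k is where the curve stops dropping sharply,
# i.e. where the second difference (change of slope) is largest.
drops = [distortion[i] - distortion[i + 1] for i in range(len(distortion) - 1)]
second_diff = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]

# drops[i] is the drop going from k = i+1 to k = i+2, so the curvature
# measured by second_diff[i] sits at k = i+2.
best_k = second_diff.index(max(second_diff)) + 2

print(best_k)  # 3
```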
<p><span><strong>4. Fixing issues in regression models when the assumptions are violated</strong>. If your data has serial correlation, unequal variances and other similar problems, this simple trick will remove the issue and allow you to perform more meaningful regressions, or to detect flaws in your data set. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-remove-serial-correlation-in-regression-models" target="_blank" rel="noopener">here</a>. </span></p>
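<p>One common instance of such a fix is differencing, sketched below (the article's exact transform may differ): the strong lag-1 autocorrelation of an AR(1) error series largely disappears after taking first differences.</p>

```python
import random

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a series."""
    m = sum(xs) / len(xs)
    d = [x - m for x in xs]
    return sum(a * b for a, b in zip(d, d[1:])) / sum(a * a for a in d)

random.seed(3)
# Build a strongly autocorrelated error series (AR(1) with coefficient 0.9)
errors, e = [], 0.0
for _ in range(2000):
    e = 0.9 * e + random.gauss(0, 1)
    errors.append(e)

# First differences: e[t] - e[t-1]
diffed = [b - a for a, b in zip(errors, errors[1:])]

print(lag1_autocorr(errors))  # strongly positive, near 0.9
print(lag1_autocorr(diffed))  # much closer to 0
```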
<p><strong>5. Performing joins on poor quality data</strong>. This 40-year-old trick allows you to perform a join when your data is infested with typos, multiple names representing the same entity, and other similar issues. In short, it performs a fuzzy join. Read more <a href="https://www.datasciencecentral.com/forum/topics/40-year-old-trick-to-clean-data-efficiently" target="_blank" rel="noopener">here</a>. </p>
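<p>A minimal illustration of a fuzzy join, using the standard library's <code>difflib</code> as a stand-in for whatever string-similarity measure the trick prescribes; the company names and values are made up:</p>

```python
import difflib

# Two tables keyed by company name, infested with typos -- hypothetical data
sales = {"Acme Corp": 120, "Globex Inc": 75, "Initech": 40}
emails = {"Acme Corporation": "acme@x.com", "Globex Incorporated": "gx@x.com",
          "Intech": "it@x.com"}

def fuzzy_join(left, right, cutoff=0.6):
    """Join two dicts on approximately matching keys."""
    joined = {}
    for key in left:
        # Best approximate match above the similarity cutoff, if any
        match = difflib.get_close_matches(key, right.keys(), n=1, cutoff=cutoff)
        if match:
            joined[key] = (left[key], right[match[0]])
    return joined

result = fuzzy_join(sales, emails)
print(result)
```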
<p><strong>6. Scale invariant techniques</strong>. Sometimes, transforming your data, even changing the scale of one feature, say from meters to feet, can have a dramatic impact on the results. Sometimes, you want your conclusions to be scale-independent. This trick solves this problem. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/scale-invariant-clustering-and-regression" target="_blank" rel="noopener">here</a>. </p>
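<p>Standardization (z-scores) is one simple way to achieve scale invariance, sketched below; whether or not it is the article's exact trick, it shows the goal: converting meters to feet leaves the transformed features unchanged.</p>

```python
import statistics

def standardize(xs):
    """Z-scores: subtract the mean, divide by the standard deviation.
    The result does not depend on the unit of measurement."""
    m, s = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - m) / s for x in xs]

heights_m = [1.60, 1.75, 1.82, 1.68, 1.90]
heights_ft = [h * 3.28084 for h in heights_m]  # same data, different unit

z_m = standardize(heights_m)
z_ft = standardize(heights_ft)

# Changing meters to feet leaves the standardized features unchanged
print(all(abs(a - b) < 1e-9 for a, b in zip(z_m, z_ft)))  # True
```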
<p><strong>7. Blending data sets with incompatible data, adding consistency to your metrics</strong>. We are all too familiar with metrics that change over time and result in inconsistencies when comparing the past to the present, or when comparing different segments with incompatible measurements. This trick will allow you to design systems where again, apples are compared to other apples, not to oranges. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-stabilize-data-to-avoid-decay-in-model-performance" target="_blank" rel="noopener">here</a>.</p>
<p><em>To not miss this type of content in the future,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter">subscribe</a><span> </span>to our newsletter. For related articles from the same author, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">click here</a><span> </span>or visit<span> </span><a href="http://www.vincentgranville.com/" target="_blank" rel="noopener">www.VincentGranville.com</a>. Follow me on<span> </span><a href="https://www.linkedin.com/in/vincentg/" target="_blank" rel="noopener">LinkedIn</a>, or visit my old web page<span> </span><a href="http://www.datashaping.com">here</a>.</em></p>
<p><span style="font-size: 12pt;"><strong>Resources from our sponsors</strong></span></p>
<ul>
<li dir="ltr"><a href="https://dsc.news/2WFHJ0q" target="_blank" rel="noopener">The State of Data Preparation in 2019</a> - June 25</li>
<li dir="ltr"><a href="https://dsc.news/2JWn6XR" target="_blank" rel="noopener">AI in Action: Real-time Anomaly Detection</a> - June 18</li>
<li dir="ltr"><a href="https://dsc.news/2GZmBtn" target="_blank" rel="noopener">Balancing AI Endeavors with Analytic Talent</a> - DSC Podcast</li>
</ul>
<p></p>Gentle Approach to Linear Algebra, with Machine Learning Applicationstag:www.analyticbridge.datasciencecentral.com,2019-05-29:2004291:BlogPost:3925052019-05-29T03:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><span>This simple introduction to matrix theory offers a refreshing perspective on the subject. Using a basic concept that leads to a simple formula for the power of a matrix, we see how it can solve time series, Markov chains, linear regression, data reduction, principal components analysis (PCA) and other machine learning problems. These problems are usually solved with more advanced matrix calculus, including eigenvalues, diagonalization, generalized inverse matrices, and other types of matrix normalization. Our approach is more intuitive and thus appealing to professionals who do not have a strong mathematical background, or who have forgotten what they learned in math textbooks. It will also appeal to physicists and engineers. Finally, it leads to simple algorithms, for instance for matrix inversion. The classical statistician or data scientist will find our approach somewhat intriguing. </span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/2716936013?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2716936013?profile=RESIZE_710x" class="align-center"/></a></span></p>
<p><strong>Content</strong></p>
<p>1. Power of a matrix</p>
<p>2. Examples, Generalization, and Matrix Inversion</p>
<ul>
<li>Example with a non-invertible matrix</li>
<li>Fast computations</li>
</ul>
<p>3. Application to Machine Learning Problems</p>
<ul>
<li>Markov chains</li>
<li>Time series</li>
<li>Linear regression</li>
</ul>
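<p>The Markov-chain application can be sketched directly from the power-of-a-matrix idea: raising a transition matrix to a high power makes every row converge to the chain's stationary distribution, with no eigenvalue machinery required. The weather chain below is a hypothetical example, not one from the article.</p>

```python
# Two-state weather chain: rows are today's state, columns tomorrow's.
P = [[0.9, 0.1],   # sunny -> (sunny, rainy)
     [0.5, 0.5]]   # rainy -> (sunny, rainy)

def matmul(A, B):
    """Plain matrix multiplication."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, n):
    """n-th power of a square matrix by repeated multiplication."""
    result = A
    for _ in range(n - 1):
        result = matmul(result, A)
    return result

# After many steps, each row of P^n is the stationary distribution
P100 = matpow(P, 100)
print([round(x, 4) for x in P100[0]])  # [0.8333, 0.1667]
```

For this chain the stationary distribution can be checked by hand: (5/6, 1/6), since 5/6 = 0.5 / (0.1 + 0.5).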
<p><span><a href="https://www.datasciencecentral.com/profiles/blogs/new-approach-to-linear-algebra-in-machine-learning" target="_blank" rel="noopener">Read the full article</a>. </span></p>New Book: Classification and Regression In a Weekend (in Python)tag:www.analyticbridge.datasciencecentral.com,2019-05-17:2004291:BlogPost:3927002019-05-17T00:24:08.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We have added a new free book in our selection exclusively for DSC members. See the first entry below, to get started with machine learning with Python.</p>
<p><strong>1. Book: Classification and Regression In a Weekend</strong></p>
<p>This tutorial began as a series of weekend workshops created by Ajit Jaokar and Dan Howarth. The idea was to work with a specific (longish) program such that we explore as much of it as possible in one weekend. This book is an attempt to take this idea online. The best way to use this book is to work with the Python code as much as you can. The code has comments, which you can expand on using the concepts explained here.</p>
<p>The table of contents is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/free-book-classification-and-regression-in-a-weekend" target="_blank" rel="noopener">here</a>. The book can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only).</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2626374029?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2626374029?profile=RESIZE_710x" class="align-center"/></a></p>
<p><strong>2. Book: Enterprise AI - An Application Perspective</strong> </p>
<p>Enterprise AI: An applications perspective takes a use case driven approach to understand the deployment of AI in the Enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap based on application use cases for AI in Enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications for Enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI.</p>
<p>The table of contents is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/free-ebook-enterprise-ai-an-applications-perspective" target="_blank" rel="noopener">here</a>. The book can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only).</p>
<p><strong>3. Book: Applied Stochastic Processes</strong></p>
<p>Full title:<span> </span><em>Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems</em>. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)</p>
<p>This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</p>
<p>New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability) broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.</p>
<p>The table of contents is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>. The book (PDF) can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only).</p>Confidence Intervals Without Pain, with Exceltag:www.analyticbridge.datasciencecentral.com,2019-05-09:2004291:BlogPost:3924682019-05-09T17:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition we propose a mechanism to sharpen the confidence intervals, to reduce their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation and so on) even when the data set violates the classical requirements necessary to make traditional statistical techniques work. In particular, our method also applies to observations that are auto-correlated, non-identically distributed, non-normal, and even non-stationary. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2383098025?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2383098025?profile=RESIZE_710x" class="align-center"/></a></p>
<p>No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers. </p>
<p>In particular, we provide a confidence interval (CI) for the width of confidence intervals without using Bayesian statistics. The width is modeled as<span> </span><em>L</em><span> </span>=<span> </span><em>A</em><span> </span>/<span> </span><em>n^B</em> and we compute, using Excel alone, a 95% CI for<span> </span><em>B</em><span> </span>in the classic case where<span> </span><em>B</em><span> </span>= 1/2. We also exhibit an artificial data set where<span> </span><em>L</em><span> </span>= 1 / (log<span> </span><em>n</em>)^Pi. Here<span> </span><em>n</em><span> </span>is the sample size.</p>
<p><span>Despite the apparent simplicity of our approach, we are dealing here with martingales. But you don't need to know what a martingale is to understand the concepts and use our methodology. </span></p>
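<p>The model <em>L</em> = <em>A</em> / <em>n^B</em> can be checked empirically with a short simulation (a sketch, not the article's Excel workflow): estimate the width of a 95% CI for the mean at several sample sizes, then fit <em>B</em> by least squares on the log-log scale. In this classic case <em>B</em> should come out near 1/2.</p>

```python
import random, math

def ci_width(n, trials=500):
    """Empirical width of a 95% confidence interval for the mean,
    estimated from the 2.5%-97.5% range of simulated sample means."""
    means = sorted(sum(random.gauss(0, 1) for _ in range(n)) / n
                   for _ in range(trials))
    return means[int(0.975 * trials) - 1] - means[int(0.025 * trials)]

random.seed(4)
sizes = [50, 200, 800, 3200]
widths = [ci_width(n) for n in sizes]

# Fit log L = log A - B log n by least squares
xs = [math.log(n) for n in sizes]
ys = [math.log(w) for w in widths]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
B = -(sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
print(B)  # close to 0.5
```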
<p><a href="https://www.datasciencecentral.com/profiles/blogs/confidence-intervals-without-pain" target="_blank" rel="noopener">Read the full article here</a>.</p>Re-sampling: Amazing Results and Applicationstag:www.analyticbridge.datasciencecentral.com,2019-05-04:2004291:BlogPost:3925562019-05-04T18:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>This crash course features a new fundamental statistics theorem -- even more important than the central limit theorem -- and a new set of statistical rules and recipes. We discuss concepts related to determining the optimum sample size, the optimum<span> </span><em>k</em><span> </span>in<span> </span><em>k</em>-fold cross-validation, bootstrapping, new re-sampling techniques, simulations, tests of hypotheses, confidence intervals, and statistical inference using a unified, robust, simple approach with easy formulas, efficient algorithms and illustration on complex data.</p>
<p>Little statistical knowledge is required to understand and apply the methodology described here, yet it is more advanced, more general, and more applied than standard literature on the subject. The intended audience is beginners as well as professionals in any field faced with data challenges on a daily basis. This article presents statistical science in a different light, hopefully in a style more accessible, intuitive, and exciting than standard textbooks, and in a compact format yet covering a large chunk of the traditional statistical curriculum and beyond.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2301106250?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2301106250?profile=RESIZE_710x" class="align-center"/></a></p>
<p>In particular, the concept of<span> </span><em>p</em>-value is not explicitly included in this tutorial. Instead, following the new trend after the recent <em>p</em>-value debacle (addressed<span> </span>by the president of the American Statistical Association), it is replaced with a range of values computed on multiple sub-samples. </p>
<p>Our algorithms are suitable for inclusion in black-box systems, batch processing, and automated data science. Our technology is data-driven and model-free. Finally, our approach shows the contrast between the unified, bottom-up, computationally-driven data science perspective and traditional top-down statistical analysis, which consists of a collection of disparate results and emphasizes theory. </p>
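<p>The core idea above, replacing a single p-value with a range of values computed on multiple sub-samples, can be sketched in a few lines. This is a minimal illustration of the general re-sampling principle, not the article's specific algorithm; all names and the 90% percentile choice are illustrative:</p>

```python
import random

def bootstrap_range(data, stat, n_samples=1000, seed=42):
    """Re-sample the data with replacement many times and return the
    5th-95th percentile range of the statistic across sub-samples,
    i.e. a range of values rather than a single p-value."""
    rng = random.Random(seed)
    values = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(n_samples)
    )
    return values[int(0.05 * n_samples)], values[int(0.95 * n_samples)]

def mean(xs):
    return sum(xs) / len(xs)

data = [2.1, 2.5, 1.9, 2.8, 2.2, 2.6, 2.0, 2.4]
low, high = bootstrap_range(data, mean)  # an interval around the sample mean
```

<p>The same machinery works for any statistic (median, correlation, model error), which is what makes the approach generic.</p>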
<p><a href="https://www.datasciencecentral.com/profiles/blogs/modern-re-sampling-and-statistical-recipes" target="_blank" rel="noopener">Read the full article here</a>.</p>
<p><span><strong>Contents</strong></span></p>
<p><span>1. Re-sampling and Statistical Inference</span></p>
<ul>
<li><span>Main Result</span></li>
<li><span>Sampling with or without Replacement</span></li>
<li><span>Illustration</span></li>
<li><span>Optimum Sample Size </span></li>
<li><span>Optimum <em>K</em> in <em>K</em>-fold Cross-Validation</span></li>
<li><span>Confidence Intervals, Tests of Hypotheses</span></li>
</ul>
<p><span>2. Generic, All-purpose Algorithm</span></p>
<ul>
<li><span>Re-sampling Algorithm with Source Code</span></li>
<li><span>Alternative Algorithm</span></li>
<li><span>Using a Good Random Number Generator</span></li>
</ul>
<p><span>3. Applications</span></p>
<ul>
<li><span>A Challenging Data Set</span></li>
<li><span>Results and Excel Spreadsheet</span></li>
<li><span>A New Fundamental Statistics Theorem</span></li>
<li><span>Some Statistical Magic</span></li>
<li><span>How does this work?</span></li>
<li><span>Does this contradict entropy principles?</span></li>
</ul>
<p><span>4. Conclusions</span></p>Some Fun with Gentle Chaos, the Golden Ratio, and Stochastic Number Theorytag:www.analyticbridge.datasciencecentral.com,2019-04-25:2004291:BlogPost:3923832019-04-25T13:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>So many fascinating and deep results have been written about the number (1 + SQRT(5)) / 2 and its related sequence - the Fibonacci numbers - that it would take years to read all of them. This number has been studied both for its applications (population growth, architecture) and its mathematical properties, for over 2,000 years. It is still a topic of active research.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2197458362?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2197458362?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>Lag-1 auto-correlation in digit distribution of good seeds, for b-processes</em></p>
<p>I show here how I used the golden ratio to build a new number guessing game (generating chaos and randomness in ergodic time series) and to obtain new, intriguing results, in particular:</p>
<ul>
<li>Proof that the<span> </span><a href="http://mathworld.wolfram.com/RabbitConstant.html" target="_blank" rel="noopener">rabbit constant</a><span> </span>is not normal in any base; this might be the first instance of a non-artificial mathematical constant whose normalcy status is formally established.</li>
<li>Beatty sequences, pseudo-periodicity, and infinite-range auto-correlations for the digits of irrational numbers in the numeration system derived from perfect stochastic processes.</li>
<li>Properties of multivariate<span> </span><em>b</em>-processes, including integer or non-integer bases.</li>
<li>Weird behavior of auto-correlations for the digits of normal numbers (good seeds) in the numeration system derived from stochastic<span> </span><em>b</em>-processes.</li>
<li>A strange recursion that generates all the digits of the rabbit constant.</li>
</ul>
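<p>For readers who want to experiment, the rabbit constant's binary digits admit several equivalent constructions. The sketch below shows the classical substitution rule and the Beatty-sequence formula based on floor(k*phi), and checks that they agree; this is standard material, not necessarily the "strange recursion" the article refers to:</p>

```python
from math import isqrt

def rabbit_word(iterations):
    """Generate a prefix of the rabbit (Fibonacci) binary word via the
    substitution 1 -> 10, 0 -> 1, starting from "1"."""
    w = "1"
    for _ in range(iterations):
        w = "".join("10" if c == "1" else "1" for c in w)
    return w

def floor_k_phi(k):
    """floor(k * phi) with phi = (1 + sqrt(5)) / 2, computed with exact
    integer arithmetic (no floating-point drift)."""
    return (k + isqrt(5 * k * k)) // 2

def beatty_digits(n):
    """The same digits read off the Beatty sequence floor(k*phi):
    digit k = floor((k+1)*phi) - floor(k*phi) - 1, for k = 1..n."""
    return "".join(str(floor_k_phi(k + 1) - floor_k_phi(k) - 1)
                   for k in range(1, n + 1))

prefix = rabbit_word(6)            # 21 binary digits: 101101011011...
same = beatty_digits(len(prefix))  # identical string via Beatty formula
```

<p>The exact integer arithmetic matters: naive floating-point floor(k*phi) eventually produces wrong digits for large k.</p>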
<p><strong>Content of this article</strong></p>
<p>1. Some Definitions</p>
<p>2. Digits Distribution in b-processes</p>
<p>3. Strange Facts and Conjectures about the Rabbit Constant</p>
<p>4. Gaming Application</p>
<ul>
<li>De-correlating Using Mapping and Thinning Techniques</li>
<li>Dissolving the Auto-correlation Structure Using Multivariate b-processes</li>
</ul>
<p>5. Related Articles</p>
<p><em>Read full articles, <a href="https://www.datasciencecentral.com/profiles/blogs/some-fun-with-the-golden-ratio-time-series-and-number-theory" target="_blank" rel="noopener">here</a>. </em></p>Causality – The Next Most Important Thing in AI/MLtag:www.analyticbridge.datasciencecentral.com,2019-04-25:2004291:BlogPost:3923012019-04-25T01:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><em> Finally there are tools that let us transcend ‘correlation is not causation’ and<span> </span><strong>identify true causal factors</strong><span> </span>and their relative strengths in our models. This is what prescriptive analytics was meant to be.</em></p>
<p> <a href="https://storage.ning.com/topology/rest/1.0/file/get/2132982369?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2132982369?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Just when I thought we’d figured it all out, something comes along to make me realize I was wrong. And that something in AI/ML is as simple as realizing that everything we’ve done so far is just curve-fitting. Whether it’s a scoring model or a CNN to recognize cats, it’s all about association: reducing the error between the distributions of two data sets. </p>
<p>What we should have had our eye on is CAUSATION. How many times have you repeated ‘correlation is not causation’? Well, it seems we didn’t stop to ask how AI/ML can actually determine causality. And now it turns out it can.</p>
<p>But achieving an understanding of causality requires us to cast off many of the common tools and techniques we’ve been trained to apply, and to understand the data from a wholly new perspective. Fortunately, the constant advance of research and ever-increasing compute capability now make it possible to use relatively friendly new tools to measure causality. </p>
<p>However, make no mistake, you’ll need to master the concepts of causal data analysis or you will most likely misunderstand what these tools can do.</p>
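<p>The pitfall the article warns about is easy to reproduce: two variables driven by a hidden common cause correlate strongly even though neither causes the other. This minimal simulation (illustrative only; it is not one of the causal-inference tools discussed) shows curve-fitting finding a strong association where no causal link exists:</p>

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]   # hidden common cause
x = [zi + rng.gauss(0, 0.3) for zi in z]     # x is driven by z
y = [zi + rng.gauss(0, 0.3) for zi in z]     # y is driven by z, not by x
r = pearson(x, y)  # strong correlation, yet no causal x -> y link
```

<p>No amount of curve-fitting on (x, y) alone can distinguish this from a true causal relationship; that requires the causal analysis concepts the article goes on to describe.</p>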
<p><em>Read the full article by Bill Vorhies, <a href="https://www.datasciencecentral.com/profiles/blogs/causality-the-next-most-important-thing-in-ai-ml" target="_blank" rel="noopener">here</a>. </em></p>New Stock Trading and Lottery Game Rooted in Deep Mathtag:www.analyticbridge.datasciencecentral.com,2019-04-15:2004291:BlogPost:3923672019-04-15T16:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>I describe here the ultimate number guessing game, played with real money. It is a new trading and gaming system, based on state-of-the-art mathematical engineering, robust architecture, and patent-pending technology. It offers an alternative to the stock market and traditional gaming. This system is also far more transparent than the stock market, and cannot be manipulated, as the formulas to win the biggest returns (with real money) are made public. It also simulates a neutral, efficient stock market. In short, there is nothing random; everything is deterministic, fixed in advance, and known to all users. Yet it behaves in a way that looks perfectly random, and the public algorithms offered to win the biggest gains require so much computing power that, for all practical purposes, they are useless -- except to comply with gaming laws and to establish trustworthiness.</p>
<p><span>We use private algorithms to determine the winning numbers; while they produce the exact same results as the public algorithms (we tested this extensively), they are more efficient by many orders of magnitude. It can also be mathematically proved that the public and private algorithms are equivalent, and we have actually proved it. We go through this verification process for any new algorithm introduced in our system. </span></p>
<p><span>In the last section, we offer a competition: can you use the public algorithm to identify the winning numbers computed with the private (secret) algorithm? If yes, the system is breakable, and a more sophisticated approach is needed to make it work. I don't think anyone can find the winning numbers (you are welcome to prove me wrong), so the award will be offered to the contestant providing the best insights on how to improve the robustness of this system. And if by chance you manage to identify those winning numbers, great, you'll get a bonus! But it is not a requirement to win the award.</span></p>
<p></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/2006368707?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2006368707?profile=RESIZE_710x" class="align-center"/></a></span></p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">Read the full article</a></p>
<p><strong>Content</strong></p>
<p>1. Description, Main Features and Advantages</p>
<p>2. How it Works: the Secret Sauce</p>
<ul>
<li>Public Algorithm</li>
<li>The Winning Numbers</li>
<li>Using Seeds to Find the Winning Numbers</li>
<li>ROI Tables</li>
</ul>
<p>3. Business Model and Applications</p>
<ul>
<li>Managing the Money Flow</li>
</ul>
<p>4. Challenge and Statistical Results</p>
<ul>
<li>Data Science / Math Competition</li>
<li>Controlling the Variance of the Portfolio Value</li>
<li>Probability of Cracking the System</li>
</ul>
<p>5. Designing 16-bit and 32-bit Systems</p>
<ul>
<li>Layered ROI Tables</li>
<li>Smooth ROI Tables</li>
<li>Systems with Winning Numbers in [0, 1]</li>
</ul>The graph visualization landscape 2019tag:www.analyticbridge.datasciencecentral.com,2019-04-09:2004291:BlogPost:3920612019-04-09T10:00:00.000ZElise Devauxhttps://www.analyticbridge.datasciencecentral.com/profile/EliseDevaux
<p><span style="font-size: 18pt;"><strong>Graphs are meant to be seen</strong></span></p>
<p><span><br/> The third layer of graph technology that we discuss in this article is the front-end layer: graph visualization. The visualization of information has long supported many types of analysis, including </span><a href="https://en.wikipedia.org/wiki/Social_network_analysis"><span>Social Network Analysis</span></a><span>. For decades, visual representations have helped researchers, analysts and enterprises derive insights from their data.</span></p>
<p><span><strong><br/> Visualization tools are an important bridge between graph data and analysts. They help surface information and insights, leading to the understanding of a situation or the solving of a problem.</strong></span></p>
<p><span><br/> While it’s easy to read and comprehend non-graph data in a tabular format such as a spreadsheet, you will probably miss valuable information if you try to analyze connected data the same way. Representing connected data in tables is not intuitive and often hides the connections where the value lies. Graph visualization tools turn connected data into graphical network representations that take advantage of the human brain’s proficiency at recognizing visual patterns and their variations.</span></p>
<p><span><br/> In the field of graph theory and network science, researchers started to imagine graph analysis and visualization tools as early as 1996 with the </span><a href="http://mrvar.fdv.uni-lj.si/pajek/history.htm"><span>Pajek</span></a><span> project. Even though these applications have long been confined to the field of research, it was the birth of computer tools for graph visualization.</span></p>
<div class="wp-caption aligncenter"><a href="https://storage.ning.com/topology/rest/1.0/file/get/1826401034?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1826401034?profile=RESIZE_710x" class="align-center"/></a></div>
<div id="attachment_7174" class="wp-caption aligncenter"><p class="wp-caption-text" style="text-align: center;"><em>The pajek software, initiated in 1996</em></p>
</div>
<p><span style="font-size: 18pt;"><strong>Visualization speeds up data analysis</strong></span></p>
<p><span><br/> There is a reason researchers started to develop these tools. As we previously wrote, </span><a href="https://linkurio.us/blog/why-graph-visualization-matters/"><span>graph visualization is critical for the analysis of graph data</span></a><span>. When you apply visualization methods to data analysis, you are more likely to cut the time spent looking for information because:<br/> <br/></span></p>
<ul>
<li><span>You have a greater ability to recognize trends & patterns.</span></li>
<li><span>You can digest larger amounts of data more easily.</span></li>
<li><span>You will compare situations or scenarios more easily.</span></li>
<li><span>And in addition, it will be easier to share and explain your findings through a visual medium.<br/> <br/></span></li>
</ul>
<p><span>Combined with the capabilities brought by computers, these advantages opened new doors for analysts seeking information in large volumes of data. It is also the reason graph visualization solutions are complementary to the <a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-2-graph-analytics/">graph analytics</a> and <a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-1-graph-databases/">graph databases tools</a> we discussed in the previous articles. Once data is stored and calculations are done, end-users need an intelligible way to process and make sense of the data. And graph visualization tools are useful in many scenarios.<br/> <br/></span></p>
<p><span>You need to </span><a href="https://linkurio.us/blog/big-data-technology-fraud-investigations/"><span>identify shady financial schemes in terabytes of data</span></a><span>? Graph data visualization. You need to </span><a href="https://linkurio.us/blog/critical-threats-project-delivers-timely-intelligence-linkurious/"><span>understand the human dynamic between criminal networks</span></a><span>? Graph data visualization. You need to quickly </span><a href="https://linkurio.us/blog/bforbank-detects-fraud-with-linkurious/"><span>assess the fraudulence of flagged transactions</span></a><span>? You guessed it, graph visualization.<br/> <br/></span></p>
<p><span>Most of the tools we are about to present can be plugged directly to database and analytics systems to further the analysis of graph data.<br/> <br/></span></p>
<p><span style="font-size: 18pt;"><strong>Graph visualization libraries and toolkits</strong></span></p>
<p>Among the common tools available today to visualize graph data are libraries and toolkits. These libraries allow you to build custom visualization applications adjusted to your needs: from a basic graph layout displaying data in your browser, to an advanced application embedding a full panel of graph data customization and analysis features. They do require knowledge of programming languages, or imply that you have development resources available.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1826436756?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1826436756?profile=RESIZE_710x" class="align-center"/></a></p>
<p class="wp-caption-text" style="text-align: center;"><em>The graph visualization libraries and toolkit ecosystem</em></p>
<p></p>
<p><span>The catalog is wide, with plenty of choices depending on your favorite language, license requirements, budget or project needs. In the open-source world, some libraries offer many possibilities for data visualization, including graph, or network, representations. This is the case with </span><a href="https://d3js.org/"><span>D3.js</span></a><span> and </span><a href="http://visjs.org/"><span>Vis.js</span></a><span>, for instance, which let you choose among different data representation formats.<br/> <br/></span></p>
<p><span>Other libraries solely focus on graph representations of data, such as </span><a href="http://js.cytoscape.org/"><span>Cytoscape.js</span></a><span> or </span><a href="http://sigmajs.org/"><span>Sigma.js</span></a><span>. Usually, these libraries provide more features than the generalist ones. There are libraries in Java such as </span><a href="http://graphstream-project.org/"><span>GraphStream</span></a><span> or </span><a href="http://jung.sourceforge.net/"><span>Jung</span></a><span>, or in Python, with packages like </span><a href="https://www.nodebox.net/code/index.php/Graph"><span>NodeBox Graph</span></a><span>.<br/> <br/></span></p>
<p><span>You will also find commercial graph visualization libraries such as </span><a href="https://www.yworks.com/"><span>yFiles</span></a><span> from yWorks, </span><a href="https://cambridge-intelligence.com/keylines/"><span>Keylines</span></a><span> from Cambridge Intelligence, </span><a href="https://www.tomsawyer.com/perspectives/"><span>Tom Sawyer Perspectives</span></a><span> from Tom Sawyer Software, or our own solution </span><a href="http://ogma.linkurio.us/"><span>Ogma</span></a><span>. The commercial libraries have the advantage of guaranteeing continuous technical support and advanced performance.<br/> <br/></span></p>
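<p>Whatever library you pick, the input is usually the same shape: a node-link structure. As a small sketch (the field names follow the node-link JSON convention popularized by D3.js force layouts; the account data is made up), here is how tabular relationship data might be turned into a payload a browser-side visualizer can render:</p>

```python
import json

def table_to_graph(rows):
    """Turn tabular (source, target) relationship rows into the
    node-link structure most graph visualization libraries accept."""
    nodes = {}
    links = []
    for source, target in rows:
        for name in (source, target):
            nodes.setdefault(name, {"id": name})  # deduplicate nodes
        links.append({"source": source, "target": target})
    return {"nodes": list(nodes.values()), "links": links}

# Hypothetical example: transactions between accounts
transactions = [("acct_1", "acct_2"), ("acct_2", "acct_3"), ("acct_1", "acct_3")]
graph = table_to_graph(transactions)
payload = json.dumps(graph)  # ship to the browser-side visualizer
```

<p>This is exactly the "tables hide the connections" point made earlier: the same three rows become three nodes and three links the moment they are treated as a graph.</p>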
<p><span style="font-size: 18pt;"><strong>Graph visualization software and web applications</strong></span></p>
<p><span><span style="font-size: 14pt;">Research applications <br/></span> <br/></span> <span>There are other solutions which do not require any development. These solutions are either SaaS or on-premise software and web applications. As we mentioned earlier, the first off-the-shelf solutions sprang from the work of network theory researchers. After Pajek, other solutions were released, such as </span><a href="http://www.netminer.com/product/overview.do"><span>NetMiner</span></a><span> in 2001, commercial software for exploratory analysis and visualization of large network data. Along the same lines, the </span><a href="https://gephi.org/"><span>Gephi software</span></a><span>, created in 2008, brought a powerful open source tool to many researchers in the field of </span><a href="https://en.wikipedia.org/wiki/Social_network_analysis"><span>Social Network Analysis</span></a><span>. Co-founded by Linkurious’ CEO, Sébastien Heymann, Gephi played a key role in democratizing graph visualization methods.<br/> <br/></span></p>
<p><span>Other research projects emerged, as web technologies simplified their creation. For instance, </span><a href="http://hdlab.stanford.edu/palladio/about/"><span>Palladio</span></a><span>, a graph visualization web application for history researchers was created in 2013. More recently in 2016, the </span><a href="https://osome.iuni.iu.edu/tools/networks/"><span>research project OSoMe</span></a><span> (the Observatory on Social Media) released an online graph visualization application to study the spread of information and misinformation on social media.<br/> <br/></span></p>
<p><span>However, graph visualization is no longer the preserve of the academic and research worlds. Others understood the potential of graph visualization and how such tools could help organizations and businesses in other fields: network management, financial crime investigation, cybersecurity, healthcare development, and more. Companies started to provide enterprise-ready graph visualization solutions, as did </span><a href="https://linkurio.us/blog/official-launch/"><span>Linkurious back in 2013</span></a><span>.<br/> <br/></span></p>
<p><span><span style="font-size: 14pt;">Generic and field-specific solutions</span><br/> <br/> Today you can easily find software or web applications to visualize graph data of various natures. </span><a href="http://www.bakamap.com/"><span>Bakamap</span></a><span> is a web application to visualize your spreadsheet data as interactive graphs. The cloud-based application </span><a href="https://begraph.net/"><span>BeGraph</span></a><span> offers a 3D data network visualizer. Historical open-source software such as </span><a href="https://www.graphviz.org/"><span>GraphViz</span></a><span> and </span><a href="https://cytoscape.org/what_is_cytoscape.html"><span>Cytoscape</span></a><span> also let you visualize any type of data as interactive graphs.<br/> <br/></span></p>
<p><span>Some companies propose solutions dedicated to certain use-cases. In these cases, the graph visualization application is often enhanced with features specially designed to answer needs specific to the given field. For instance, the </span><a href="https://linkurio.us/product/"><span>Linkurious Enterprise graph visualization platform</span></a><span> is dedicated today to anti-fraud, anti-money laundering, intelligence analysis, and cybersecurity scenarios. So in addition to graph visualization, it proposes alerts and pattern detection capabilities to support the work of analysts in these fields. Another example of a field-specific tool is </span><a href="https://vis.occrp.org/"><span>VIS</span></a><span> (Visual Investigative Scenarios), a tool designed by the OCCRP for journalists investigating major business or criminal networks. </span><a href="https://seeyournetwork.com/"><span>Synapp</span></a><span>, on the other hand, is an application dedicated to the visualization of human resources within organizations. As the adoption of graph technology spreads, more and more areas are witnessing the development of specific graph visualization tools.<br/> <br/></span></p>
<p><span style="font-size: 18pt;"><strong>Built-in visualizers and other add-ons</strong></span></p>
<p><span>Finally, the last set of tools dedicated to the visualization of graph data are the built-in visualizers and graph database add-ons.<br/> <br/></span></p>
<div id="attachment_7176" class="wp-caption aligncenter"><a href="https://storage.ning.com/topology/rest/1.0/file/get/1826482689?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1826482689?profile=RESIZE_710x" class="align-center"/></a><br/> <br/><p class="wp-caption-text" style="text-align: center;"><em>Built-in graph visualizers and add-ons</em></p>
</div>
<p><span>While graph visualization software and web applications are great for in-depth analysis or advanced graph data investigations, there are situations where you simply need a basic, accessible visualizer to get a glimpse of what a given graph dataset looks like. That is why some graph databases ship with built-in graph data visualizers. These features are a great asset for developers and data engineers working with graph data. Without leaving the graph database environment, you can easily access a graphical user interface to query and visualize your data. This is what the </span><a href="https://neo4j.com/developer/guide-neo4j-browser/"><span>Neo4j browser</span></a><span> offers for instance, which can be of great help when creating datasets or running graph algorithms. Similarly, TigerGraph proposes a built-in graphical user interface: </span><a href="https://www.tigergraph.com/category/graphstudio/"><span>GraphStudio</span></a><span> to visualize your database content. </span><a href="https://bitnine.net/blog-agens-solution/blog-agensbrowser/announcing-agensbrowser-web-v-1-0-release/"><span>Last year, Bitnine released AgensBrowser,</span></a><span> a visualization interface to help you manage and monitor the content of your AgensGraph graph database.<br/> <br/></span></p>
<p><span>In a similar vein, graph database vendors have started to widen their offerings with add-on visualization tools compatible with their storage products. For example, at the beginning of last year, Neo4j launched </span><a href="https://neo4j.com/bloom/"><span>Bloom</span></a><span>, an add-on application for the Neo4j desktop application. It offers a code-free visualization interface to explore data from Neo4j graph databases.<br/> <br/></span></p>
<p><span>The following presentation lists most of the visualization tools available for graph data. </span><strong><a href="https://resources.linkurio.us/graph-visualization-software" target="_self">You can request the complete list of software and web applications on Linkurious' blog.</a> <br/> <a href="https://fr.slideshare.net/Linkurious/graphtech-ecosystem-part-2-graph-visualization-139711307" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1826520404?profile=RESIZE_710x" class="align-center"/></a></strong></p>
<p style="text-align: center;"><b>This post was initially published on <a href="https://linkurio.us/blog/graphtech-ecosystem-part-3-graph-visualization/?utm_source=analyticbridge&utm_medium=post&utm_content=part3" target="_blank" rel="noopener">Linkurious blog.</a></b></p>
<p style="text-align: center;"></p>
<p><span>Read part 1: <a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-2-graph-analytics/?utm_source=analyticbridge&utm_medium=post&utm_content=part3" target="_blank" rel="noopener">The graph database landscape 2019</a></span></p>
<p><span>Read part 2: <a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-1-graph-databases/?utm_source=analyticbridge&utm_medium=post&utm_content=part3" target="_blank" rel="noopener">The graph analytics landscape 2019</a></span></p>
<p><strong><em>Summary:</em></strong><em> A new business model strategy based around intermediary platforms powered by AI/ML promises the most direct path to the fastest growth, profitability, and competitive success. Adopting this new approach requires a deep change in mindset and is quite different from just adopting AI/ML to optimize your current operations.</em></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1741416922?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1741416922?profile=RESIZE_710x" width="350" class="align-full"/></a>As a data scientist you may be wondering why you need to be concerned about strategy and business models. It’s simple. Different types of AI/ML are most appropriate for different business objectives. So whether you’re a data scientist being asked to plan and present the most appropriate portfolio of projects, or a CXO looking to support your new digital business model, you need to understand the relationship between data science and strategy.</p>
<p>In<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/now-that-we-ve-got-ai-what-do-we-do-with-it"><em><u>our last article</u></em></a><span> </span>we laid out the four major AI/ML powered business models. We set up a structure to help you think about “AI Inside”, essentially pasted on and used to optimize an existing old-style business model versus “AI-First”, business models that can lead to real digital transformation.</p>
<p>AI-First models are typically associated with startups, so they are not necessarily the first place a mature existing business would look for a strategy in its digital journey. But hidden in plain sight within AI-First is a business model strategy so bold that mature companies that have embraced it have outpaced their competitors by a wide margin. That’s adopting a “Platform Strategy”.</p>
<p><em>Read the full article, by Bill Vorhies, <a href="https://www.datasciencecentral.com/profiles/blogs/a-radical-ai-strategy-platformication" target="_blank" rel="noopener">here</a>. For more articles by the same author, <a href="https://www.datasciencecentral.com/profiles/blog/list?user=0h5qapp2gbuf8" target="_blank" rel="noopener">follow this link</a>. For more about AI applications, <a href="https://www.datasciencecentral.com/page/search?q=ai" target="_blank" rel="noopener">click here</a>. </em></p>Long-range Correlations in Time Series: Modeling, Testing, Case Studytag:www.analyticbridge.datasciencecentral.com,2019-04-01:2004291:BlogPost:3922472019-04-01T19:00:06.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We investigate a large class of auto-correlated, stationary time series, proposing a new statistical test to measure departure from the base model, known as Brownian motion. We also discuss a methodology to deconstruct these time series, in order to identify the root mechanism that generates the observations. The time series studied here can be discrete or continuous in time, they can have various degrees of smoothness (typically measured using the Hurst exponent) as well as long-range or short-range correlations between successive values. Applications are numerous, and we focus here on a case study arising from some interesting number theory problem. In particular, we show that one of the time series investigated in my article on randomness theory [<a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">see here</a>, read section 4.1.(c)] is not Brownian despite the appearance. It has important implications regarding the problem in question. Applied to finance or economics, it makes the difference between an efficient market and one that can be gamed.</p>
<p>This article is accessible to a large audience, thanks to its tutorial style, illustrations, and easily replicable simulations. Nevertheless, we discuss modern, advanced, and state-of-the-art concepts. This is an area of active research. </p>
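<p>To give a flavor of the smoothness measure mentioned above, here is a minimal Python sketch of one standard way to estimate the Hurst exponent, from the scaling of the standard deviation of lagged increments. This is an illustrative recipe added for this post, not necessarily the test developed in the full article:</p>

```python
import numpy as np

def hurst_exponent(ts, max_lag=20):
    """Estimate H from the scaling law std(X[t+lag] - X[t]) ~ lag^H."""
    lags = np.arange(2, max_lag)
    tau = [np.std(ts[lag:] - ts[:-lag]) for lag in lags]
    # The slope of log(tau) versus log(lag) estimates the Hurst exponent.
    return np.polyfit(np.log(lags), np.log(tau), 1)[0]

rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(10_000))  # ordinary random walk
h = hurst_exponent(walk)  # close to 0.5 for Brownian-like motion
```

<p>A value of H significantly different from 0.5 suggests departure from Brownian motion: H &gt; 0.5 indicates long-range persistence, H &lt; 0.5 anti-persistence.</p>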
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1741616599?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1741616599?profile=RESIZE_710x" class="align-center"/></a> <strong>Content</strong></p>
<p>1. Introduction and time series deconstruction</p>
<ul>
<li>Example</li>
<li>Deconstructing time series</li>
<li>Correlations, Fractional Brownian motions</li>
</ul>
<p>2. Smoothness, Hurst exponent, and Brownian test</p>
<ul>
<li>Our Brownian tests of hypothesis</li>
<li>Data</li>
</ul>
<p>3. Results and conclusions</p>
<ul>
<li>Charts and interpretation</li>
<li>Conclusions</li>
</ul>
<p><strong>Read the full article, <a href="https://www.datasciencecentral.com/profiles/blogs/long-range-correlation-in-time-series-tutorial-and-case-study" target="_blank" rel="noopener">here</a>. </strong></p>The importance of Alternative Data in Credit Risk Managementtag:www.analyticbridge.datasciencecentral.com,2019-03-26:2004291:BlogPost:3916422019-03-26T05:15:15.000ZNaagesh Padmanabanhttps://www.analyticbridge.datasciencecentral.com/profile/NaageshPadmanaban
<p><em>The emergence of alternative data as a key enabler in expanding credit delivery and financial inclusion is unmistakable.</em></p>
<p>The saying that the only constant is change is attributed to the Greek philosopher Heraclitus. It is very relevant today in the way lenders use technology and scoring solutions to understand the creditworthiness of applicants. Credit Risk Management has come a long way from the days when banks used a single credit score cutoff to decision loan applications. Risk managers now have a plethora of solution options to help them strike the right risk-reward balance when designing a credit policy.</p>
<p>It is common knowledge that large volumes of data are being constantly generated and a good portion of this can be used to better understand a potential borrower. This profusion of data has only provided greater depth and reach to lenders.</p>
<p>The emergence of alternative data as a key enabler in expanding credit delivery and financial inclusion is unmistakable. It not only expands the scorable population, but also deepens the understanding of their payment behavior. The three credit bureaus, realizing the value of this data asset, have embarked on an acquisition spree.</p>
<p>Basic definitions of traditional data and alternative data will help understand the scenario better.</p>
<p><strong>Traditional Data</strong></p>
<p>Traditional data typically refers to data that credit bureaus maintain on their files. This includes data provided by the customer in the loan applications, data on credit lines, loan repayment history, credit enquiries as well as public information like bankruptcies. Traditional data is FCRA compliant and the acid test is that it must be verifiable and disputable by the customer.</p>
<p>Industry research has shown that scoring solutions that use traditional data cannot score a significant section of the population. According to the Consumer Financial Protection Bureau (CFPB), these ‘credit invisibles’ number over 45 million people. It further points out that although this segment of the population may not have a regular loan payment track record, they may still be paying their other bills regularly. It is thus very important to track this payment history – e.g. utility payments – to estimate their credit risk.</p>
<p><strong>Alternative Data</strong></p>
<p>Definitions of alternative data may vary, depending on where you choose to look them up. But in a broad sense it pertains to data that includes, but is not limited to, rent payments, mobile phone payments, and cable TV payments, as well as bank account information such as deposits, withdrawals, or transfers.</p>
<p>While alternative data has a very important role in financial inclusion, it also has other important benefits. In addition to improving the assessment of customer risk, it can provide timely information to lenders on activities that may not be reflected in bureau data. Further, it enables lenders to provide an enhanced customer experience. For example, when customers share online bank account information, loan application processing may be faster.</p>
<p>Like traditional data, alternative data too is susceptible to inaccuracies. Consumers may not be able to readily review and correct alternative data, although the standards governing it are constantly changing and evolving to meet customer and regulatory expectations.</p>Fascinating Developments in the Theory of Randomnesstag:www.analyticbridge.datasciencecentral.com,2019-03-21:2004291:BlogPost:3916312019-03-21T13:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>I present here some innovative results from my most recent research on stochastic processes, chaos modeling, and dynamical systems, with applications to Fintech, cryptography, number theory, and random number generators. While covering advanced topics, this article is accessible to professionals with limited knowledge in statistical or mathematical theory. It introduces new material not covered in my recent book (available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>) on applied stochastic processes. You don't need to read my book to understand this article, but the book is a nice complement and introduction to the concepts discussed here.</p>
<p>None of the material presented here is covered in standard textbooks on stochastic processes or dynamical systems. In particular, it has nothing to do with the classical logistic map or Brownian motions, though the systems investigated here exhibit very similar behaviors and are related to the classical models. This cross-disciplinary article is targeted to professionals with interests in statistics, probability, mathematics, machine learning, simulations, signal processing, operations research, computer science, pattern recognition, and physics. Because of its tutorial style, it should also appeal to beginners learning about Markov processes, time series, and data science techniques in general, offering fresh, off-the-beaten-path content not found anywhere else, contrasting with the material covered again and again in countless, identical books, websites, and classes catering to students and researchers alike. </p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1529825331?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1529825331?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Some problems discussed here could be used by college professors in the classroom, or as original exam questions, while others are extremely challenging questions that could be the subject of a PhD thesis or even well beyond that level. This article constitutes (along with my book) a stepping stone in my endeavor to solve one of the biggest mysteries in the universe: are the digits of mathematical constants such as Pi evenly distributed? To this day, no one knows if these digits even have a distribution to start with, let alone whether that distribution is uniform or not. Part of the discussion is about statistical properties of numeration systems in a non-integer base (such as the golden ratio base) and their applications. All systems investigated here, whether deterministic or not, are treated as stochastic processes, including the digits in question. They all exhibit strong chaos, albeit easily manageable due to their ergodicity.<span> </span></p>
<p>Interesting connections to the golden ratio, Fibonacci numbers, Pisano periods, special polynomials, Brownian motions, and other special mathematical constants, are discussed throughout the article. All the analyses were done in Excel. You can download my spreadsheets from this article; all the results are replicable. Also, numerous illustrations are provided. </p>
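<p>As a quick illustration of the digit-distribution question (the article's own analyses were done in Excel, so this Python sketch is an addition, not from the original), one can generate decimal digits of Pi with Gibbons' spigot algorithm and tabulate their empirical frequencies:</p>

```python
from collections import Counter

def pi_digits(n):
    """Yield the first n decimal digits of Pi (Gibbons' spigot algorithm)."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    count = 0
    while count < n:
        if 4 * q + r - t < m * t:
            yield m  # next digit is confirmed, emit it
            count += 1
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)

digits = list(pi_digits(1000))       # 3, 1, 4, 1, 5, 9, ...
freq = Counter(digits)               # empirically close to uniform
```

<p>The frequencies look roughly uniform, but as noted above, no proof of this is known for Pi.</p>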
<p></p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">Read the full article here</a>.</p>
<p><strong>Content of this article</strong></p>
<p>1. General framework, notations and terminology</p>
<ul>
<li>Finding the equilibrium distribution</li>
<li>Auto-correlation and spectral analysis</li>
<li>Ergodicity, convergence, and attractors</li>
<li>Space state, time state, and Markov chain approximations</li>
<li>Examples</li>
</ul>
<p>2. Case study</p>
<ul>
<li>First fundamental theorem</li>
<li>Second fundamental theorem</li>
<li>Convergence to equilibrium: illustration</li>
</ul>
<p>3. Applications</p>
<ul>
<li>Potential application domains</li>
<li>Example: the golden ratio process</li>
<li>Finding other useful b-processes</li>
</ul>
<p>4. Additional research topics</p>
<ul>
<li>Perfect stochastic processes</li>
<li>Characterization of equilibrium distributions (the attractors)</li>
<li>Probabilistic calculus and number theory, special integrals</li>
</ul>
<p>5. Appendix</p>
<ul>
<li>Computing the auto-correlation at equilibrium</li>
<li>Proof of the first fundamental theorem</li>
<li>How to find the exact equilibrium distribution</li>
</ul>
<p>6. Additional Resources</p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">Read the full article here</a>.</p>How to Automatically Determine the Number of Clusters in your Data - and moretag:www.analyticbridge.datasciencecentral.com,2019-03-14:2004291:BlogPost:3912662019-03-14T00:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well separated clusters, and two human beings asked to visually tell the number of clusters by looking at a chart, are likely to provide two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, making a decision not easy.</p>
<p>For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty, not AI, not a human being, not an algorithm. </p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1405294997?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1405294997?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>How many clusters here? (source: see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-wizardry" target="_blank" rel="noopener">here</a>)</em></p>
<p>In the above picture, the underlying data suggests that there are three main clusters. But an answer such as 6 or 7 seems equally valid. </p>
<p>A number of empirical approaches have been used to determine the number of clusters in a data set. They usually fit into two categories:</p>
<ul>
<li>Model fitting techniques: an example is using a<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/decomposition-of-statistical-distributions-using-mixture-models-a" target="_blank" rel="noopener">mixture model</a> to fit your data and determine the optimum number of components; or using density estimation techniques and testing for the number of modes (see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-original-underused-statistical-tests" target="_blank" rel="noopener">here</a>.) Sometimes, the fit is compared with that of a model where observations are uniformly distributed on the entire support domain, thus with no cluster; you may have to estimate the support domain in question, and assume that it is not made of disjoint sub-domains; in many cases, the convex hull of your data set, as an estimate of the support domain, is good enough. </li>
<li>Visual techniques: for instance, the silhouette or elbow rule (very popular.)</li>
</ul>
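<p>The first (model fitting) approach above can be sketched in a few lines of Python. This is an illustrative recipe on synthetic data, not the method from the full article: fit Gaussian mixtures with an increasing number of components and keep the one with the lowest BIC.</p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three well-separated synthetic clusters in 2D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

# Fit mixtures with 1..6 components; BIC penalizes extra components
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)]
best_k = int(np.argmin(bics)) + 1  # expected: 3 for this data
```
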
<p>In both cases, you need a criterion to determine the optimum number of clusters. In the case of the elbow rule, one typically uses the percentage of unexplained variance. This number is 100% with zero clusters, and it decreases (initially sharply, then more modestly) as you increase the number of clusters in your model. When each point constitutes a cluster, this number drops to 0. Somewhere in between, the curve that displays your criterion exhibits an elbow (see picture below), and that elbow determines the number of clusters. For instance, in the chart below, the optimum number of clusters is 4.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1405610723?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1405610723?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>The elbow rule tells you that here, your data set has 4 clusters (elbow strength in red)</em></p>
<p>Good references on the topic are available. Some R functions are available too, for instance fviz_nbclust. However, I could not find in the literature how the elbow point is explicitly computed. Most references mention that it is mostly hand-picked by visual inspection, or based on some predetermined but arbitrary threshold. In the next section, we solve this problem.</p>
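<p>As a placeholder, here is one simple heuristic (not necessarily the author's method from the full article) that computes the elbow automatically: pick the point of maximum discrete curvature, i.e., the largest second difference of the criterion curve. The data below is hypothetical.</p>

```python
import numpy as np

def elbow_index(criterion):
    """Index of maximum discrete curvature (second difference) of a decreasing curve."""
    c = np.asarray(criterion, dtype=float)
    curvature = c[:-2] - 2 * c[1:-1] + c[2:]
    return int(np.argmax(curvature)) + 1  # +1: curvature is defined from index 1

# Hypothetical percentages of unexplained variance for k = 1, 2, ..., 8 clusters
unexplained = [100, 70, 45, 25, 22, 20, 19, 18]
k_opt = elbow_index(unexplained) + 1  # index 3 -> k = 4 clusters
```

<p>Normalizing both axes before computing curvature makes the heuristic insensitive to the units of the criterion.</p>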
<p><a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">Read full article here</a>. </p>Deep Analytical Thinking and Data Science Wizardrytag:www.analyticbridge.datasciencecentral.com,2019-03-07:2004291:BlogPost:3913552019-03-07T20:46:51.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Many times, complex models are not enough, too heavy, or simply not necessary to get great, robust, sustainable insights out of data. Deep analytical thinking may prove more useful, and can be done by people not necessarily trained in data science, even by people with limited coding experience. Here we explore what we mean by deep analytical thinking, using a case study, and how it works: combining craftsmanship, business acumen, and the use and creation of tricks and rules of thumb, to provide sound answers to business problems. These skills are usually acquired by experience more than by training, and data science generalists (see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/why-you-should-be-a-data-science-generalist" target="_blank" rel="noopener">here</a><span> </span>how to become one) usually possess them.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1308372685?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1308372685?profile=RESIZE_710x" class="align-center"/></a></p>
<p>This article is targeted to data science managers and decision makers, as well as to junior professionals who want to become one at some point in their career. Deep thinking, unlike deep learning, is also more difficult to automate, so it provides better job security. Those automating deep learning are actually the new data science wizards, who can think out-of-the box. Much of what is described in this article is also data science wizardry, and not taught in standard textbooks nor in the classroom. By reading this tutorial, you will learn and be able to use these data science secrets, and possibly change your perspective on data science. Data science is like an iceberg: everyone knows and can see the tip of the iceberg (regression models, neural nets, cross-validation, clustering, Python, and so on, as presented in textbooks.) Here I focus on the unseen bottom, using a statistical level almost accessible to the layman, avoiding jargon and complicated math formulas, yet discussing a few advanced concepts. <span> </span></p>
<p><span><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-wizardry" target="_blank" rel="noopener">Read full article here</a>. </span></p>
<p><strong>Content</strong></p>
<p>1. Case Study: The Problem</p>
<p>2. Deep Analytical Thinking</p>
<ul>
<li>Answering hidden questions</li>
<li>Business questions</li>
<li>Data questions</li>
<li>Metrics questions</li>
</ul>
<p>3. Data Science Wizardry</p>
<ul>
<li>Generic algorithm</li>
<li>Illustration with three different models</li>
<li>Results</li>
</ul>
<p>4. A few data science hacks</p>The graph analytics landscape 2019tag:www.analyticbridge.datasciencecentral.com,2019-02-27:2004291:BlogPost:3912352019-02-27T12:00:00.000ZElise Devauxhttps://www.analyticbridge.datasciencecentral.com/profile/EliseDevaux
<h1 style="text-align: left;"><span style="font-size: 12pt;">Read the part 1 - <a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-1-graph-databases/?utm_source=analyticsbridge&utm_medium=article&utm_content=07" target="_blank" rel="noopener">The graph database landscape</a></span></h1>
<h1 style="text-align: center;"><strong>The graph analytics landscape 2019</strong></h1>
<p><span>Graph analytics frameworks consist of a set of tools and methods developed to extract knowledge from data modeled as a graph. They are crucial for many applications because processing large datasets of complex connected data is computationally challenging. </span></p>
<p></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/1213658813?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1213658813?profile=RESIZE_710x" class="align-full"/><br/></a></span></p>
<h2><span style="font-size: 18pt;"><strong>A need for analytics at scale</strong></span></h2>
<p><span>The field of graph theory has spawned multiple algorithms on which analysts can rely to find insights hidden in graph data. From Google’s famous </span><a href="https://en.wikipedia.org/wiki/PageRank"><span>PageRank algorithm</span></a><span> to traversal and path-finding algorithms or community detection algorithms, there are plenty of calculations available to get insights from graphs.</span></p>
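<p>For instance, running PageRank takes only a few lines with the NetworkX library. The tiny directed graph below is made up for illustration:</p>

```python
import networkx as nx

# Toy directed graph: nodes are pages, edges are hyperlinks
G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")])

ranks = nx.pagerank(G, alpha=0.85)      # alpha is the classic damping factor
top_page = max(ranks, key=ranks.get)    # "c" collects the most link weight here
```
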
<p><span>The graph database storage systems we mentioned in </span><a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-1-graph-databases/?utm_source=analyticsbridge&amp;utm_medium=article&amp;utm_content=07" target="_self">the previous article</a><span> are good at storing data as graphs and at managing operations such as data retrieval, real-time queries, or local analysis. But they might fall short on graph analytics processing at scale. That’s where graph analytics frameworks step in. Shipping with common graph algorithms, processing engines and, sometimes, query languages, they handle online analytical processing and persist the results back into databases.<br/> <br/></span></p>
<h2><span style="font-size: 18pt;"><strong>Graph processing engines</strong></span></h2>
<p><span>The graph processing ecosystem offers various approaches to answer the challenges of graph analytics, and historical players occupy a large part of the market.</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/1213663551?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1213663551?profile=RESIZE_710x" class="align-full"/></a></span></p>
<p><span>In 2010, Google led the way with the </span><a href="https://dl.acm.org/citation.cfm?id=1807184"><span>release of Pregel</span></a><span>, a “large-scale graph processing” framework. Several solutions followed, such as </span><a href="https://giraph.apache.org/"><span>Apache Giraph</span></a><span>, an open source graph processing system developed in 2012 by the Apache foundation. It leverages a MapReduce implementation to process graphs and is the system used by Facebook to traverse its social graph. Other open source systems iterated on Google’s, for example, </span><a href="https://thegraphsblog.wordpress.com/the-graph-blog/mizan/"><span>Mizan </span></a><span>or </span><a href="http://infolab.stanford.edu/gps/"><span>GPS</span></a><span>.</span></p>
<p><span>Other systems, like </span><a href="https://github.com/GraphChi"><span>GraphChi</span></a><span> or </span><a href="http://www.powergraph.ru/en/soft/demo.asp"><span>PowerGraph Create</span></a><span>, were launched following GraphLab’s release in 2009. This system started as an open-source project at Carnegie Mellon University and is now known as </span><a href="https://turi.com/"><span>Turi</span></a><span>. </span></p>
<p><span>Oracle Lab developed </span><a href="https://www.oracle.com/technetwork/oracle-labs/parallel-graph-analytix/overview/index.html"><span>PGX</span></a><span> (Parallel Graph AnalytiX), a graph analysis framework including an analytics processing engine powering Oracle Big Data Spatial and Graph.</span></p>
<p><span>The distributed open source graph engine Trinity, presented in 2013 by Microsoft, is now known as </span><a href="https://www.graphengine.io/"><span>Microsoft Graph Engine</span></a><span>. </span><a href="https://spark.apache.org/graphx/"><span>GraphX</span></a><span>, introduced in 2014, is the embedded graph processing framework built on top of </span><a href="https://spark.apache.org/"><span>Apache Spark</span></a><span> for parallel computation. Some other systems have since been introduced, for example, </span><a href="https://github.com/uzh/signal-collect"><span>Signal/Collect</span></a><span>.<br/> <br/></span></p>
<h2><span style="font-size: 18pt;"><strong>Graph analytics libraries and toolkits</strong></span></h2>
<p><span>In the graph analytics landscape, there are also single-user systems dedicated to graph analytics. Graph analytics libraries and toolkits provide implementations of a number of algorithms from graph theory.<br/> <br/></span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/1213665836?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1213665836?profile=RESIZE_710x" class="align-full"/></a></span></p>
<p></p>
<p><span>There are standalone libraries such as </span><a href="https://networkx.github.io/"><span>NetworkX</span></a><span> and </span><a href="https://networkit.github.io/"><span>NetworKit</span></a><span>, Python libraries for large-scale graph analysis, or </span><a href="https://igraph.org/redirect.html"><span>iGraph</span></a><span>, a graph library written in C and available as Python and R packages, as well as libraries provided by graph database vendors such as Neo4j with its </span><a href="https://neo4j.com/graph-machine-learning-algorithms/"><span>Graph Algorithms Library</span></a><span>.</span></p>
<p><span>Other technology vendors offer libraries for high-performance graph analytics. This is the case of the GPU technology provider NVIDIA with its </span><a href="https://developer.nvidia.com/nvgraph"><span>NVGraph library</span></a><span>. The geographic information software QGIS also built its own </span><a href="https://docs.qgis.org/testing/en/docs/pyqgis_developer_cookbook/network_analysis.html#graph-analysis"><span>library for network analysis</span></a><span>.</span></p>
<p><span>Some of these libraries also propose graph visualization tools to help users build graph data exploration interfaces, but this is a topic for the third post of this series.<br/> <br/></span></p>
<h2><span style="font-size: 18pt;"><strong>Graph query languages</strong></span></h2>
<p><span>Finally, one important piece of the analytics framework landscape has not been mentioned yet: graph query languages.</span></p>
<p><span>As for any storage system, query languages are an essential element for graph databases. These languages make it possible to model the data as a graph, and their logic is very close to the graph data model. In addition to the data modeling process, graph query languages are used to query data. Depending on their nature, they can be used against database systems or as domain-specific analytics languages. Most high-level computing engines allow users to write queries in these languages.</span></p>
<p></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/1213668117?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1213668117?profile=RESIZE_710x" class="align-full"/></a></span></p>
<p><a href="https://neo4j.com/developer/cypher-query-language/"><span>Cypher</span></a><span> was created in 2011 by Neo4j for use on its own database. It has been </span><a href="https://neo4j.com/blog/open-cypher-sql-for-graphs/"><span>open-sourced in 2015</span></a><span> as a separate project named </span><a href="https://www.opencypher.org/"><span>OpenCypher</span></a><span>. Other notable graph query languages are </span><a href="https://tinkerpop.apache.org/gremlin.html"><span>Gremlin</span></a><span>, the graph traversal language of Apache TinkerPop, created in 2009, and </span><a href="https://jena.apache.org/tutorials/sparql.html"><span>SPARQL</span></a><span>, the SQL-like language created by the W3C in 2008 to query RDF graphs. More recently, TigerGraph developed its own graph query language named </span><a href="https://www.tigergraph.com/2018/05/22/crossing-the-chasm-eight-prerequisites-for-a-graph-query-language/"><span>GSQL</span></a><span>, and Oracle created </span><a href="http://pgql-lang.org/"><span>PGQL</span></a><span>, both SQL-like graph query languages. </span><a href="https://arxiv.org/abs/1712.01550"><span>G-Core</span></a><span> was proposed by the Linked Data Benchmark Council (LDBC) in 2018 as a language bridging the academic and industrial worlds. Other vendors such as OrientDB went for the </span><a href="https://orientdb.com/docs/2.0/orientdb.wiki/Tutorial-SQL.html"><span>relational query language SQL</span></a><span>.</span></p>
<p><span>Last year, Neo4j launched an initiative to unify Cypher, PGQL and G-Core under a single standard graph query language: </span><a href="https://gql.today/"><span>GQL (Graph Query Language)</span></a><span>. The initiative will be discussed during a </span><a href="https://www.w3.org/Data/events/data-ws-2019/"><span>W3C workshop in March 2019</span></a><span>. Some other query languages are especially dedicated to graph analysis such as </span><a href="https://github.com/socialite-lang/socialite"><span>SociaLite</span></a><span>.</span></p>
<p><span>While not originally a graph query language, Facebook’s </span><a href="https://graphql.org/"><span>GraphQL</span></a><span> is worth mentioning. This API language has been extended by graph database vendors to use as a graph query language. </span><a href="https://docs.dgraph.io/master/query-language/"><span>Dgraph uses it natively</span></a><span> as its query language, Prisma is planning to </span><a href="https://www.prisma.io/features/databases"><span>extend it to various graph databases</span></a><span>, and Neo4j has been pushing it into </span><a href="https://grandstack.io/"><span>GRANDstack</span></a><span> and its query execution layer </span><a href="https://github.com/neo4j-graphql/neo4j-graphql-js"><span>neo4j-graphql.js</span></a><span>.<br/> <br/></span></p>
<p>This article was originally posted on <a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-2-graph-analytics/?utm_source=analyticsbridge&utm_medium=article&utm_content=07" target="_blank" rel="noopener">Linkurious blog</a>. It is part of a series of articles about the GraphTech ecosystem. This is the second part. It covers the graph analytics landscape. The first part introduced the <a href="https://linkurio.us/blog/graphtech-ecosystem-2019-part-1-graph-databases/?utm_source=analyticsbridge&utm_medium=article&utm_content=07" target="_blank" rel="noopener">graph database vendors</a>.</p>New Perspectives on Statistical Distributions and Deep Learningtag:www.analyticbridge.datasciencecentral.com,2019-02-23:2004291:BlogPost:3909252019-02-23T18:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>In this data science article, emphasis is placed on<span> </span><em>science</em>, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. It is based on mixture models. Mixtures have been studied and used in applications for a long time, and they are still a subject of active research. Yet you will find here plenty of new material.</p>
<p><span><strong>Introduction and Context</strong></span></p>
<p>In a previous article (see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/new-perspective-on-central-limit-theorem-and-related-stats-topics" target="_blank" rel="noopener">here</a>) I attempted to approximate a random variable representing real data by a weighted sum of simple<span> </span><em>kernels</em>, such as independent, identically and uniformly distributed random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to</p>
<ul>
<li>avoid over-fitting,</li>
<li>approximate any empirical distribution (the inverse of the percentiles function) attached to real data,</li>
<li>easily compute data-driven confidence intervals regardless of the underlying distribution,</li>
<li>derive simple tests of hypothesis,</li>
<li>perform model reduction, </li>
<li>optimize data binning to facilitate feature selection and to improve visualizations of histograms,</li>
<li>create perfect histograms,</li>
<li>build simple density estimators,</li>
<li>perform interpolations, extrapolations, or predictive analytics,</li>
<li>perform clustering and detect the number of clusters,</li>
<li>create deep learning Bayesian systems.</li>
</ul>
<p>While I've found very interesting properties about stable distributions during this research project, I could not come up with a way to solve all these problems. The fact is that these weighted sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast -- a consequence of the central limit theorem. And even when using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, the sum would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of model.</p>
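<p>The convergence claim above is easy to check with a quick simulation. The sketch below (illustrative parameters, not the exact construction from the previous article) draws weighted sums of uniform kernels with slowly decaying weights and verifies that the standardized sum has near-zero skewness and excess kurtosis, i.e. is Gaussian-like:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Weighted sum of uniform kernels: S = sum_k w_k * U_k.
# With slowly decaying weights, the central limit theorem pushes
# the distribution of S toward a Gaussian.
n_kernels, n_samples = 50, 100_000
weights = 1.0 / np.sqrt(np.arange(1, n_kernels + 1))  # slowly decaying weights
U = rng.uniform(-1, 1, size=(n_samples, n_kernels))   # i.i.d. uniform kernels
S = U @ weights

# Standardize and compare moments to N(0, 1):
# skewness ~ 0, excess kurtosis ~ 0 for a Gaussian.
Z = (S - S.mean()) / S.std()
skewness = np.mean(Z**3)
excess_kurtosis = np.mean(Z**4) - 3
print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
```

<p>Both moments come out close to zero regardless of how skewed or multimodal the target data is, which is exactly why this family of models fails on most real-life data sets.</p>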
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1187940877?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1187940877?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Now, in this article, I offer a full solution, using mixtures rather than sums. The possibilities are endless. </p>
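<p>As a minimal sketch of the mixture idea, here is a basic two-component Gaussian mixture fitted by EM in Python. This is an illustration on synthetic data, not the step-wise black-box algorithm described later in the article; note how the mixture captures a bimodal shape that no weighted sum of kernels could:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bimodal data: two well-separated Gaussian clusters.
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)])

# Initial mixture parameters (deliberately rough).
w = np.array([0.5, 0.5])       # mixture weights
mu = np.array([-1.0, 1.0])     # component means
sigma = np.array([1.0, 1.0])   # component standard deviations

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(200):
    # E-step: responsibility of each component for each point.
    dens = w * normal_pdf(data[:, None], mu, sigma)   # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, standard deviations.
    nk = resp.sum(axis=0)
    w = nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", w.round(2), "means:", mu.round(2), "sigmas:", sigma.round(2))
```

<p>The fitted means land near -2 and 3, recovering both modes -- something a weighted sum, being pulled toward a single Gaussian-like bump, cannot do.</p>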
<p><span style="font-size: 14pt;"><strong>Content of this article</strong></span></p>
<p><strong>1. Introduction and Context</strong></p>
<p><strong>2. Approximations Using Mixture Models</strong></p>
<ul>
<li>The error term</li>
<li>Kernels and model parameters</li>
<li>Algorithms to find the optimum parameters</li>
<li>Convergence and uniqueness of solution</li>
<li>Find near-optimum with fast, black-box step-wise algorithm</li>
</ul>
<p><strong>3. Example</strong></p>
<ul>
<li>Data and source code</li>
<li>Results</li>
</ul>
<p><strong>4. Applications</strong></p>
<ul>
<li>Optimal binning</li>
<li>Predictive analytics</li>
<li>Test of hypothesis and confidence intervals</li>
<li>Deep learning: Bayesian decision trees</li>
<li>Clustering</li>
</ul>
<p><strong>5. Interesting problems</strong></p>
<ul>
<li>Gaussian mixtures uniquely characterize a broad class of distributions</li>
<li>Weighted sums fail to achieve what mixture models do</li>
<li>Stable mixtures</li>
<li>Nested mixtures and Hierarchical Bayesian Systems</li>
<li>Correlations</li>
</ul>
<p>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/decomposition-of-statistical-distributions-using-mixture-models-a" target="_blank" rel="noopener">here</a>. </p>A Plethora of Original, Not Well-Known Statistical Teststag:www.analyticbridge.datasciencecentral.com,2019-02-14:2004291:BlogPost:3909012019-02-14T02:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, but instead, using simulations to check whether my assumptions were plausible or not. In short, my approach to statistical testing is model-free and data-driven. Some are easy to implement even in Excel. Some of them are illustrated here, with examples that do not require statistical knowledge for understanding or implementation.</p>
<p>This material should appeal to managers, executives, industrial engineers, software engineers, operations research professionals, economists, and to anyone dealing with data, such as biometricians, analytical chemists, astronomers, epidemiologists, journalists, or physicians. Statisticians with a different perspective are invited to discuss my methodology and the tests described here, in the comment section at the bottom of this article. In my case, I used these tests mostly in the context of experimental mathematics, which is a branch of data science that few people talk about. In that context, the theoretical answer to a statistical test is sometimes known, making it a great benchmarking tool to assess the power of these tests, and determine the minimum sample size to make them valid.</p>
<p>I provide here a general overview, as well as my simple approach to statistical testing, accessible to professionals with little or no formal statistical training. Detailed applications of these tests are found<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">in my recent book</a> and in<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/new-perspective-on-central-limit-theorem-and-related-stats-topics" target="_blank" rel="noopener">this article</a>. Precise references to these documents are provided as needed, in this article.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1048765124?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1048765124?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><span><em>Examples of traditional tests</em></span></p>
<p><span><strong>1. General Methodology</strong></span></p>
<p>Despite my strong background in statistical science, over the years, I moved away from relying too much on traditional statistical tests and statistical inference. I am not the only one: these tests have been abused and misused, see for instance<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/statistical-significance-and-p-values-take-another-blow" target="_blank" rel="noopener">this article</a><span> </span>on<span> </span><em>p</em>-hacking. Instead, I favored a methodology of my own, mostly empirical, based on simulations, data- rather than model-driven. It is essentially a non-parametric approach. It has the advantage of being far easier to use, implement, understand, and interpret, especially to the non-initiated. It was initially designed to be integrated in black-box, automated decision systems. Here I share some of these tests, and many can be implemented easily in Excel. </p>
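<p>A permutation test is a simple example of this simulation-based, model-free style of testing. The sketch below (on made-up data; sample sizes and effect size are purely illustrative) asks whether the observed difference in means between two groups is larger than what random relabeling of the pooled data would produce:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-group data: does group b really have a higher mean?
a = rng.normal(0.0, 1.0, 80)
b = rng.normal(0.6, 1.0, 80)

observed = abs(a.mean() - b.mean())
pooled = np.concatenate([a, b])

# Null hypothesis: labels are exchangeable. Simulate it by shuffling.
n_sim = 10_000
count = 0
for _ in range(n_sim):
    rng.shuffle(pooled)
    diff = abs(pooled[:80].mean() - pooled[80:].mean())
    if diff >= observed:
        count += 1

p_value = count / n_sim
print(f"simulated p-value: {p_value:.4f}")
```

<p>No distributional assumption is needed: the null distribution of the statistic is built directly from the data, which is what makes this approach easy to explain to the non-initiated and to embed in automated decision systems.</p>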
<p><em><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-original-underused-statistical-tests" target="_blank" rel="noopener">Read the full article here</a>. </em></p>Machine Learning Glossarytag:www.analyticbridge.datasciencecentral.com,2019-02-12:2004291:BlogPost:3909852019-02-12T19:31:40.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><span>For background to this post, please see </span><a rel="nofollow noopener" href="https://www.datasciencecentral.com/profiles/blogs/learn-machinelearning-coding-basics-in-a-weekend-a-new-approach" target="_blank">Learn Machine Learning Coding Basics in a weekend</a><span>. Here, we present the glossary that we use for the coding, and the mindmap attached to these classes and upcoming book. About 80 terms are included in the glossary, covering Ensembles, Regression, Classification, Algorithms, Training, Validation, Model Evaluation and more. For instance, the section about Classification contains the following entries:</span></p>
<ul>
<li>Class </li>
<li>Hyperplane </li>
<li>Decision Boundary </li>
<li>False Negative (FN) </li>
<li>False Positive (FP) </li>
<li>True Negative (TN) </li>
<li>True Positive (TP) </li>
<li>Precision </li>
<li>Recall </li>
<li>F1 Score </li>
<li>Few-Shot Learning </li>
<li>Hinge Loss </li>
<li>Log Loss </li>
</ul>
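<p>Several of these entries are defined directly from the confusion-matrix counts (TP, FP, FN). A minimal computation of Precision, Recall, and F1 Score, with hypothetical counts for illustration:</p>

```python
def classification_scores(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)                       # of predicted positives, how many are right
    recall = tp / (tp + fn)                          # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = classification_scores(tp=80, fp=20, fn=40)
print(p, r, f1)  # 0.8, 0.666..., 0.727...
```
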
<p>To download the glossary, <a href="https://www.datasciencecentral.com/profiles/blogs/learn-machinelearning-coding-basics-in-a-weekend-glossary-and" target="_blank" rel="noopener">follow this link</a>. </p>
<p><span style="font-size: 12pt;"><strong>DSC Resources</strong></span></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members">Free Books</a></li>
<li><a href="https://www.datasciencecentral.com/forum">Forum Discussions</a></li>
<li><a href="https://www.datasciencecentral.com/page/search?q=cheat+sheets">Cheat Sheets</a></li>
<li><a href="https://www.analytictalent.datasciencecentral.com/">Jobs</a></li>
<li><a href="https://www.datasciencecentral.com/page/search?q=one+picture" target="_blank" rel="noopener">Search DSC</a></li>
<li><a href="https://twitter.com/DataScienceCtrl" target="_self">DSC on Twitter</a></li>
<li><a href="https://www.facebook.com/DataScienceCentralCommunity/" target="_blank" rel="noopener">DSC on Facebook</a></li>
</ul>