The product of two large primes is at the core of many encryption algorithms, because factoring such a product is very hard for numbers with a few hundred digits. The two prime factors are associated with the encryption keys (the public and private keys). Here we describe a new approach to factoring a big number that is the product of two primes of roughly the same size. It is designed specifically to handle this problem and to identify flaws in encryption algorithms.

[Figure: Riemann zeta function in the complex plane]

While at first glance the approach appears to substantially reduce the computational complexity of traditional factoring, at this stage a lot of progress is still needed to make the new algorithm efficient. An interesting feature is that its success depends on the probability that two numbers are co-prime, given that they do not share any of the first few primes (say 2, 3, 5, 7, 11, 13) as common divisors. This probability can be computed explicitly and is about 99%. The methodology relies heavily on solving systems of congruences, the Chinese Remainder Theorem, and the modular multiplicative inverse of some carefully chosen integers. We also discuss computational complexity issues. Finally, the off-the-beaten-path material presented here leads to many original exercises and exam questions for students learning probability, computer science, or number theory: proving the various simple statements made in my article.

Content
- Some Number Theory Explained in Simple English
  - Co-primes and pairwise co-primes
  - Probability of being co-prime
  - Modular multiplicative inverse
  - Chinese remainder theorem, version A
  - Chinese remainder theorem, version B
- The New Factoring Algorithm
  - Improving computational complexity
  - Five-step algorithm
  - Probabilistic optimization
- Compact Formulation of the Problem

Read the full article here.
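The "about 99%" figure can be checked numerically. A minimal sketch, using the standard fact that two random integers are co-prime with probability 6/π², and conditioning on the event that none of the small primes 2 through 13 is a common divisor (the conditional-probability decomposition is my reading of the claim, not a formula from the article):

```python
from math import pi

# Probability that two random integers are co-prime: 6/pi^2.
p_coprime = 6 / pi**2

# Probability that two random integers share none of the small primes
# 2, 3, 5, 7, 11, 13 as a common divisor: for each prime p, both numbers
# are divisible by p with probability 1/p^2, so multiply the (1 - 1/p^2).
small_primes = [2, 3, 5, 7, 11, 13]
p_no_small_common_factor = 1.0
for p in small_primes:
    p_no_small_common_factor *= 1 - 1 / p**2

# Being co-prime implies having no small common prime factor, so the
# conditional probability is a simple ratio.
p_conditional = p_coprime / p_no_small_common_factor
print(round(p_conditional, 4))  # → 0.9836, close to the 99% quoted above
```

Adding more primes to the list pushes the conditional probability arbitrarily close to 1, which is what makes the co-primality assumption cheap to verify in practice.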
Other Math Articles by the Same Author

Here is a selection of articles pertaining to experimental math and probabilistic number theory:
- Statistics: New Foundations, Toolbox, and Machine Learning Recipes
- Applied Stochastic Processes
- Variance, Attractors and Behavior of Chaotic Statistical Systems
- New Family of Generalized Gaussian Distributions
- A Beautiful Result in Probability Theory
- Two New Deep Conjectures in Probabilistic Number Theory
- Extreme Events Modeling Using Continued Fractions
- A Strange Family of Statistical Distributions
- Some Fun with Gentle Chaos, the Golden Ratio, and Stochastic Number...
- Fascinating New Results in the Theory of Randomness
- Two Beautiful Mathematical Results - Part 2
- Two Beautiful Mathematical Results
- Number Theory: Nice Generalization of the Waring Conjecture
- Fascinating Chaotic Sequences with Cool Applications
- Simple Proof of the Prime Number Theorem
- Factoring Massive Numbers: Machine Learning Approach

This post discusses what actually makes data anonymous, the misconceptions we have about it, and the problems it raises.

In the beginning, there was data

The intent of anonymization is to ensure the privacy of data. Companies use it to protect sensitive data. This category of data encompasses:
- personal data,
- business information such as financial information or trade secrets,
- classified information such as military secrets or governmental information.

So, anonymization is, for instance, a way of complying with the privacy regulations related to personal data. The personal data and business data types can overlap; this is where customer information lies. But not all business data falls under regulations. I'll focus here on the protection of personal data.

[Figure: Examples of sensitive data types]

In Europe, regulators define as "personal data" any information that relates to someone (your name, for example). Information linking to a person in any way also falls under that definition. As personal data collection democratized over the previous century, the question of data anonymization started to rise. The regulations coming into effect around the world sealed the importance of the matter.

What is data anonymization and why should we care?

Let's begin with the classic definition. The EU's General Data Protection Regulation (GDPR) defines anonymized information as follows:

"information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable."

The "identifiable" and "no longer" parts are essential. It doesn't only mean that your name shouldn't appear in the data anymore. It also means that we shouldn't be able to figure out who you are from the rest of the data.
This is the process of re-identification (sometimes called de-anonymization). The same GDPR recital also states a very important fact:

"[…] data protection should therefore not apply to anonymous information."

So, if you manage to anonymize your data, you are no longer subject to GDPR data protection rules. You could perform any processing operations, such as analysis or sales. This opens up quite a few opportunities:
- Selling data is an obvious first use. Around the world, privacy regulations are restricting the trade of personal data. Anonymized data offers an alternative for companies.
- It represents an opportunity for collaborative work. Many companies share data for innovation or research purposes. They can limit risks by using anonymized data.
- It also creates opportunities for data analysis and Machine Learning. Getting access to private, yet compliant, data is hard. Anonymized data represents a safe raw material for statistical analysis and model training.

The opportunities are clear. But truly anonymized data is often not what we think.

The spectrum of data privacy mechanisms

Privacy preservation of data is a spectrum. Over the years, experts have developed a collection of methods, mechanisms, and tools. These techniques produce data with various levels of privacy and various risks of re-identification. We could say it ranges from personally identifiable data to truly anonymized data.

[Figure: A spectrum of data privacy]

On one end, you have data that contains direct personal identifiers. Those are elements from which we can identify you, like name, address, or telephone number. On the other end, you have the anonymous data that GDPR refers to. But there is an intermediary category of data that lives between identifiable and anonymized data: pseudonymized and de-identified data. Note that I'm not certain of this delimitation.
Some presentations make pseudonymization a part of de-identification; some don't. In itself, there is nothing wrong with the techniques used to produce this "intermediary data". They are efficient data minimization techniques. Depending on the requirements of one's use cases, they will be relevant and useful. What we need to keep in mind is that they don't produce truly anonymous data. Their mechanisms offer no guarantee against re-identification. And referring to the data they produce as "anonymous data" is misleading.

There is anonymous and "anonymous"

Pseudonymization and de-identification are indeed ways of preserving certain aspects of data privacy. But they don't produce anonymized data, per the GDPR definition. Pseudonymization techniques remove or replace the direct personal identifiers in the data. For instance, you delete all the names and emails from a dataset. You can't identify someone directly from pseudonymized data. But you can do it indirectly. Indeed, the rest of the data often retains indirect identifiers: pieces of information that can be combined to identify someone, such as date of birth, zip code, or gender.

For that matter, pseudonymization has a separate definition within the GDPR framework:

"[…] the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information."

Contrary to anonymous data, pseudonymous data falls under the GDPR regulations. De-identification techniques remove both direct and indirect personal identifiers from the data. On paper, the frontier between de-identified data and anonymized data is simple: the latter offers technical safeguards that guarantee the data can never be re-identified. It's a "true until proven false" kind of situation.
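To make the failure mode of pseudonymization concrete, here is a toy sketch of a linkage attack. All records, names, and field choices below are invented for illustration; the point is only that quasi-identifiers surviving in "pseudonymized" data can be joined against a public dataset:

```python
# Hypothetical medical records: "name" is a direct identifier,
# (zip, birth_date, gender) are indirect (quasi-) identifiers.
hospital = [
    {"name": "Alice Smith", "zip": "02139", "birth_date": "1975-03-02",
     "gender": "F", "diagnosis": "flu"},
    {"name": "Bob Jones", "zip": "02139", "birth_date": "1981-11-19",
     "gender": "M", "diagnosis": "asthma"},
]

# Pseudonymization: drop the direct identifier...
pseudonymized = [{k: v for k, v in row.items() if k != "name"}
                 for row in hospital]

# ...but a public dataset (e.g. a voter roll) sharing the
# quasi-identifiers lets an attacker join the records back together.
voter_roll = [
    {"name": "Alice Smith", "zip": "02139",
     "birth_date": "1975-03-02", "gender": "F"},
]

quasi = ("zip", "birth_date", "gender")
reidentified = [
    (v["name"], r["diagnosis"])
    for r in pseudonymized
    for v in voter_roll
    if all(r[k] == v[k] for k in quasi)
]
print(reidentified)  # → [('Alice Smith', 'flu')]
```

The same join is what the medical-data and Netflix re-identifications described below relied on, at scale and with noisier matching.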
De-identified data is somehow anonymous until it's not. And experts push the line further every time they re-identify data that was supposedly de-identified.

Data re-identifications keep on redefining anonymous

The mechanism types described above do not have the same effectiveness for privacy preservation. Hence, what you intend to do with the data matters. Companies regularly release or sell data that they claim is "anonymous". It becomes a problem when the methods they used don't guarantee that. Many events have shown that pseudonymization is a poor privacy-preservation mechanism. The indirect identifiers in the data create a strong risk of re-identification. And as available data volumes grow, so do the opportunities to cross-reference datasets:
- In the 1990s, an MIT graduate student re-identified the Governor of Massachusetts by combining a de-identified medical dataset with census data.
- In 2006, AOL shared de-identified search data as part of a research initiative. Researchers were able to link search queries with the individuals behind them.
- In 2009, Netflix released an anonymized movie-rating dataset as part of a contest. Texas researchers successfully re-identified the users.
- In 2009, researchers predicted individuals' Social Security Numbers using only publicly available information.

More recently, studies showed that de-identified data was also, in fact, re-identifiable. Researchers at UCL in Belgium and Imperial College London found that:

"99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes."

Another study, conducted on anonymized cellphone data, showed that:

"four spatio-temporal points are enough to uniquely identify 95% of the individuals."

Technology is improving, and more data is being created. As a result, researchers keep pushing the delimitation between de-identified and anonymous data.
In 2017, researchers released a study stating that:

"web-browsing histories can be linked to social media profiles using only publicly available data."

Another alarming point arises from the exposure of personal data through breaches. The amount of personal information leaked keeps growing. Taken separately, some datasets aren't re-identifiable; but combined with leaked data, they represent a larger threat. Students from Harvard University were able to re-identify de-identified data using leaked data.

In conclusion, what we consider "anonymous data" often isn't. Not all data sanitization methods generate truly anonymous data. Each has its own advantages, but none offers the same level of privacy as anonymization. As we produce more data, it becomes harder to create truly anonymized data, and the risk of companies releasing potentially re-identifiable personal data grows.

Upcoming DSC Webinars
- Embracing Responsible AI from Pilot to Production - 5/27
- No-code ML for Forecasting and Anomaly Detection - 5/28
- How to Accelerate and Scale Your Data Science Workflows - 6/11

Job Spotlight
- Assistant/Associate Professor of DS - Coppin State University
- Information Specialist - Open Society Foundations
- Data Science Analyst - Orange County Transportation Authority
- Data Engineer (Contract-Remote) - Healthy Back Institute

Featured Jobs
- Vice President, Data Science and Analytics - Tripadvisor
- Data Science Practice Lead - Spotify
- Manager, Data Science & Analytics - Volkswagen of America, Inc
- Machine Learning Engineer - Harvard University
- Senior Research Scientist - Sony
- Data Analyst - Jacobs
- Data Scientist - Risk Analytics - John Deere
- Staff AI/Machine Learning Engineer - LG Electronics
- Business Strategy Analyst - Nintendo
- Planetary Data Engineer - NASA JPL
- Data Scientist - Barclays Investment Bank
- Senior Data Scientist - Procter & Gamble
- Data Scientist - Product Analytics - Zoom
- Software Engineer, Machine Learning - Instagram
- Data Scientist - Risk - TikTok
- Data Scientist, Analytics - Facebook
- ML Engineer - Speech Recognition - McDonald's
- Business Analytics Manager - Nike
- Data Analytics Engineer - Nokia
- Sr. Software and Machine Learning Engineer - eBay
- Data Science Lead, Insurance - Tesla

Check out the most recent jobs on AnalyticTalent.com


Job Spotlight
- Data Engineer (Contract-Remote) - Healthy Back Institute

Featured Jobs
- Applied Machine Learning Engineer - Twitter
- Senior Analytics Engineer - Netflix
- Data Engineer - Zoom Video Communications
- Data Engineer, Analytics - Instagram
- Data Analytics Data Scientist - Bank of America
- Senior Manager, Data Science - Sony
- Machine Learning Engineer, Apple Media Products Data Science
- Research Data Scientist - Facebook
- Data Scientist - PayPal
- Research AST, Earth Sciences Remote Sensing - NASA
- Senior Data Scientist, Growth - Twitter
- Business Strategy Analyst - Nintendo
- Autopilot - Deep Learning Infrastructure Intern (Fall 2020) - Tesla
- Data Scientist - Entry to Experienced Level - NSA
- Software Engineer, General, Core - Google
- Data Analyst - HubSpot
- Machine Learning Engineer, Generalist - Pinterest
- ML Engineer - Speech Recognition - McDonald's
- Business Analytics Manager - Nike
- Data Scientist, Artist Promotion - Spotify
- NLP Research Scientist for Clinical Data - Philips
- Sr. Software and Machine Learning Engineer - eBay
- Software Engineer - Microsoft
- Data Scientist CE Data Analytics - Bose Corporation

Check out the most recent jobs on AnalyticTalent.com

For Better or Worse, Analytics and Data Science are Converging

Author: Bill Vorhies - other articles by Bill Vorhies

Summary: Analytic platforms are rapidly being augmented with features previously reserved for data scientists. They are presented as easy to use, but the most complex of them require substantial data literacy and advanced data science skills. Business users and analysts can pursue more complex problems on their own, but they need good oversight.

Related articles: AI | ML | Deep Learning | Data Analytics | Big Data

We discuss a simple trick to significantly accelerate the convergence of an algorithm when the error term decreases in absolute value over successive iterations while oscillating (not necessarily periodically) between positive and negative values. We first illustrate the technique on a well-known and simple case: the computation of log 2 using its well-known, slow-converging series. We then discuss a very interesting and more complex case, before finally focusing on a more challenging example in the context of probabilistic number theory and experimental math.

The technique must be tested for each specific case to assess the improvement in convergence speed. There is no general, theoretical rule to measure the gain, and if the error term does not oscillate in a balanced way between positive and negative values, this technique does not produce any gain. However, in the examples below, the gain was dramatic.

Let's say you run an algorithm, for instance gradient descent. The input (model parameters) is x, and the output is f(x), for instance a local optimum. We consider f(x) to be univariate, but the technique easily generalizes to the multivariate case by applying it separately to each component. At iteration k, you obtain an approximation f(k, x) of f(x), and the error is E(k, x) = f(x) - f(k, x). The total number of iterations is N, starting with the first iteration k = 1. The idea consists in first running the algorithm as is, and then computing the "smoothed" approximations, using the following m steps.

Read the full article here.

Content
- General framework and simple illustration
- A strange function
- Even stranger functions
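The full article spells out the m smoothing steps; as a minimal instance of the idea, averaging two successive approximations already cancels most of an oscillating error. A sketch on the log 2 example mentioned above (the pairwise average is my simplification, not necessarily the article's exact scheme):

```python
from math import log

# log 2 = 1 - 1/2 + 1/3 - 1/4 + ...  (slowly converging, error ~ 1/(2N),
# and the error alternates in sign from one partial sum to the next).
N = 1000
partial = []
s = 0.0
for k in range(1, N + 1):
    s += (-1) ** (k + 1) / k
    partial.append(s)

# Because consecutive errors have opposite signs and nearly equal size,
# the "smoothed" approximations (average of two successive partial sums)
# have a much smaller error, on the order of 1/N^2.
smoothed = [(partial[k] + partial[k + 1]) / 2 for k in range(N - 1)]

err_raw = abs(log(2) - partial[-1])      # roughly 5e-4 for N = 1000
err_smooth = abs(log(2) - smoothed[-1])  # several orders of magnitude smaller
print(err_raw, err_smooth)
```

The averaging can be applied again to the smoothed sequence, gaining another factor of roughly N each time, which is the spirit of the multi-step scheme described in the full article.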
