Data Science is a broad scientific field - intersecting mathematics, statistics and computational science disciplines - that unifies processes, theories, concepts, tools and technologies that allow the analysis and extraction of knowledge from data. Data Science allows the use of theoretical, mathematical methods for the study and evaluation of data. Data Science is a field that has existed for over 50 years, but has gained more prominence in recent years due to various factors such as the emergence and popularization of Big Data and the development of areas such as Machine Learning.
A Data Scientist is a professional who specializes in analyzing, interpreting and developing complex data projects designed to meet specific business needs. A Data Scientist should possess multidisciplinary competencies ranging from mathematical and computational knowledge to the understanding of the business problem under analysis in order to extract insights from this information.
An algorithm is a mathematical or statistical formula executed by software to perform data analysis. It is a logical, finite, and defined sequence of instructions that must be followed to solve a problem or perform a task.
Predictive analysis is the use of historical data to predict trends or future events. By collecting, organizing and analyzing this data, it becomes possible to anticipate future behaviors, thus adapting and optimizing business strategies.
Data mining is the process of discovering relevant, consistent patterns, such as association rules or time sequences, in order to identify systematic relationships between variables. It uses statistical or Artificial Intelligence techniques to find information that may not be immediately visible or that is counterintuitive.
Artificial Intelligence is a subfield of Computer Science that studies how to develop computers and systems that can behave like human beings and possess the human capacity to reason: to solve problems, think or, broadly, be intelligent.
We can think of some basic characteristics of these systems, such as reasoning (applying logical rules to a set of available data to reach a conclusion), learning (learning from mistakes and successes in order to act more effectively in the future), pattern recognition (visual and sensory patterns as well as patterns of behavior) and inference (the ability to apply reasoning to everyday situations).
The development of the area began shortly after World War II, with the 1950 article "Computing Machinery and Intelligence" by the English mathematician Alan Turing, and the term itself was coined in 1956. However, only recently, with the exponential growth of computing capacity and the emergence of Big Data, has artificial intelligence gained the means and critical mass to establish itself as a science in its own right, with its own problems and methodologies. Initially AI was intended to reproduce human thought. Its development has since gone beyond classical chess or conversation programs to involve areas such as computer vision, voice analysis and synthesis, fuzzy logic, artificial neural networks, and many others.
Strong AI is a term used to describe a particular approach to artificial intelligence development. Its goal is to develop artificial intelligence to the point where the machine's intellectual capability is functionally equal to a human's.
Strong or General Artificial Intelligence
Research on Strong Artificial Intelligence addresses the creation of computer-based intelligence that can reason and solve problems as a human being does; such a form of AI is classified as self-conscious.
Weak or Narrow Artificial Intelligence
Weak Artificial Intelligence focuses its research on the creation of artificial intelligence that is not capable of truly reasoning and solving problems. A machine with this kind of intelligence would act as if it were intelligent, but it has no self-consciousness or notion of self. The classical test for measuring intelligence in machines is the Turing Test.
Levels of Artificial Intelligence
Among the theorists who study what can be achieved with AI, there is a discussion in which two basic proposals are considered: one known as "strong" and another known as "weak". Basically, the strong AI hypothesis holds that it is possible to create a conscious machine, that is, that artificial systems can replicate the human mind.
A practical contribution by Alan Turing was what became known as the Turing Test, proposed in 1950. The Turing Test assesses the ability of a machine to display intelligent behavior equivalent to, or indistinguishable from, that of a human being. The test consists of asking questions of a hidden person and a computer. The computer passes the test if, from the answers alone, the questioner cannot tell which came from the person and which from the machine. The conversation is restricted to a text channel, such as a keyboard and a screen, so that the result does not depend on the machine's ability to render words as speech audio.
Alan Turing had been addressing the notion of machine intelligence since at least 1941, and one of the earliest mentions of "computational intelligence" was made by him in 1947. In his report “Intelligent Machinery”, he investigated the question of whether it is possible for machinery to show intelligent behavior and, as part of the investigation, proposed what can be considered the precursor of what would become the Turing Test.
Cognitive Computing is the combination of several Artificial Intelligence and Signal Processing methods to simulate human thought processes. It may include hardware (e.g. sensors, IoT devices, robots, processors) and software (AI algorithms). Among the techniques used to emulate the functioning of the human brain are machine learning, natural language processing, computer vision, speech recognition, noise filtering and pattern recognition.
Bayesian Inference consists of evaluating hypotheses by their posterior probability, which follows directly from Bayes' formula: a prior belief in each hypothesis is updated in proportion to how likely the observed data are under that hypothesis. It is fundamental to computational methods related to intelligence, data mining or linguistics, whether or not those methods are explicitly Bayesian. Bayesian Inference extends Bayesian statistics and statistical inference to computational intelligence, where it is synonymous with learning, and finds applications in equally generic domains, e.g. biomedicine, cloud computing, algorithm research and computational creativity.
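To make the update rule concrete, here is a minimal sketch of Bayes' formula applied to two competing hypotheses. The function name `posterior` and all the numbers (a 1% prior, a test with 95% sensitivity and a 10% false-positive rate) are hypothetical and chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(hypothesis | data) for every hypothesis using Bayes' formula."""
    # Numerator of Bayes' formula: P(data | h) * P(h)
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    # Denominator: P(data), the sum of the numerators over all hypotheses
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical prior beliefs before seeing any data
priors = {"sick": 0.01, "healthy": 0.99}
# Hypothetical P(positive test | hypothesis)
likelihoods = {"sick": 0.95, "healthy": 0.10}

result = posterior(priors, likelihoods)
print(result)
```

Even with a positive test, the posterior probability of "sick" stays below 10% in this setup, because the prior is so low. That is the kind of conclusion the formula makes explicit.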
Machine Learning is a subfield of Computer Science and refers to algorithms and techniques through which systems "learn", autonomously, from the tasks they perform. In this way, the computer tends to improve its performance in a given task as it gains experience with it. These algorithms train a model on sample inputs so that it can make predictions or decisions guided by data rather than simply following explicitly programmed instructions. While Artificial Intelligence covers two types of reasoning (the inductive, which extracts rules and patterns from large datasets, and the deductive), Machine Learning is concerned only with the inductive.
Supervised Learning is the term used whenever the program is "trained" on a predefined set of labeled data. Based on this training, the program can make decisions when it receives new data. Example: you can use a data set of tweets labeled by humans as positive, negative or neutral to train a sentiment analysis classifier.
Unsupervised Learning is the term used when a program automatically finds patterns and relationships in a data set. Example: analyzing a set of e-mails and automatically grouping them by topic, without the program having any previous knowledge about the data.
Reinforcement Learning is concerned with how an agent should act in an environment so as to maximize some notion of reward over time. Reinforcement Learning algorithms try to find the policy that maps the states of the world to the actions that the agent should take in those states. Reinforcement Learning differs from Supervised Learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected.
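The idea of learning a policy from rewards alone can be sketched with tabular Q-learning, one common Reinforcement Learning algorithm (the text above does not name a specific one, so this choice is an assumption). The environment is a hypothetical five-state corridor in which only reaching the rightmost state gives a reward; all names and parameter values are illustrative.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal (hypothetical corridor)
ACTIONS = (-1, +1)  # move one step left or one step right


def step(state, action):
    """Apply an action; the agent cannot leave the corridor."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] estimates the discounted future reward
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _step in range(100):              # cap on episode length
            a = rng.randrange(len(ACTIONS))   # explore with random actions
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if state == N_STATES - 1:
                break
    return q


q_table = q_learning()
# Greedy policy: in every non-goal state, pick the action with the highest estimate
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(policy)
```

The agent is never told that "right" is correct. It only observes rewards, and the greedy policy read off the learned table nevertheless ends up moving toward the goal.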
Classification algorithms are a subcategory of Supervised Learning. Classification is the process of taking some kind of input and assigning it a label. Classification systems are generally used when predictions are discrete in nature, e.g. a simple "yes or no". Example: taking an image of a person and classifying it as male or female.
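As a minimal illustration of assigning a label to an input, the sketch below uses a 1-nearest-neighbor rule, which is only one of many classification algorithms. The two-dimensional points and the labels "small" and "large" are hypothetical.

```python
import math


def nearest_neighbor_label(training, point):
    """Return the label of the training example closest to `point`."""
    closest = min(training, key=lambda example: math.dist(example[0], point))
    return closest[1]


# Hypothetical labeled examples: (features, label)
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

print(nearest_neighbor_label(training, (2.0, 1.0)))
print(nearest_neighbor_label(training, (8.5, 8.0)))
```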
Regression is another subcategory of Supervised Learning, used when the value being predicted is not a "yes or no" but falls on a continuous spectrum. Regression systems could be used, for example, to answer questions such as "How much?" or "How many are there?".
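A minimal sketch of predicting a continuous value is simple linear regression fitted by ordinary least squares. The function name `fit_line` and the data points are hypothetical; the data were generated from y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # answer to "how much at x = 5?"
print(slope, intercept, prediction)
```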
Clustering is an Unsupervised Learning method that consists of assigning a set of observations to subsets (called clusters) so that observations within the same cluster are similar according to some pre-designated criterion or criteria, whereas observations in different clusters are dissimilar. Different Clustering techniques make different assumptions about the structure of the data, often defined by some similarity metric and evaluated, for example, by internal compactness (similarity between members of the same cluster) and separation between different clusters. Other methods are based on density estimates and connectivity graphs.
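One widely used technique based on compactness is k-means, sketched below in plain Python. The text above does not single out k-means, so treat it as one illustrative choice. The points and the starting centroids are hypothetical, and the starting centroids are passed in explicitly to keep the run deterministic.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical observations forming two visibly separate groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```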
Recommendation systems are Machine Learning methods that predict the rating a user would give to each item and then show the user the items with the highest predicted ratings. Companies like Amazon, Netflix and Google are known for their intensive use of Recommendation systems, from which they gain great competitive advantage.
A Decision Tree is a decision support tool that uses a tree-like diagram or model of decisions and their possible consequences. A decision tree is also a way to visually represent an algorithm.
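A minimal sketch of rating prediction is user-based collaborative filtering: the unknown rating is estimated as an average of other users' ratings, weighted by how similar those users are to the target user. This is only one family of recommendation methods, and the users, items and ratings below are hypothetical.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "bo": {"A": 5, "B": 5, "D": 5},
    "cy": {"A": 1, "C": 5, "D": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


predicted = predict_rating("ana", "D")
print(predicted)
```

Because "ana" rates items much more like "bo" than like "cy", the prediction for item D leans toward the rating "bo" gave it.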
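The "tree as algorithm" view can be sketched by storing a tree as nested dictionaries and following one branch per question until a leaf is reached. The tree below (outlook, humidity, windy) is a hypothetical hand-written example, not one learned from data.

```python
# Internal nodes are dicts with a question and branches; leaves are plain strings.
tree = {
    "question": "outlook",
    "branches": {
        "sunny": {
            "question": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "question": "windy",
            "branches": {"yes": "stay in", "no": "play"},
        },
    },
}


def decide(node, sample):
    """Walk from the root to a leaf, answering each question from `sample`."""
    while isinstance(node, dict):
        answer = sample[node["question"]]
        node = node["branches"][answer]
    return node


print(decide(tree, {"outlook": "sunny", "humidity": "high"}))
print(decide(tree, {"outlook": "overcast"}))
```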
Support Vector Machines (SVM)
Support Vector Machines (SVM) are a set of supervised Machine Learning algorithms used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm constructs a model that predicts whether a new example falls within one category or the other.
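The two-category case can be sketched as a linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a simplified stand-in for the optimizers real SVM libraries use; the function names, hyperparameters and data points are hypothetical, and the categories are encoded as +1 and -1.

```python
def train_linear_svm(samples, labels, epochs=50, learning_rate=0.1,
                     regularization=0.01):
    """Minimize regularized hinge loss with per-sample sub-gradient steps."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            if margin < 1:
                # Point is inside the margin or misclassified: push the boundary
                weights = [w + learning_rate * (y * xi - regularization * w)
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
            else:
                # Point is safely classified: only apply the regularization shrink
                weights = [w - learning_rate * regularization * w for w in weights]
    return weights, bias


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical, linearly separable training examples
samples = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
print(predict(weights, bias, (2.5, 2.5)), predict(weights, bias, (-2.5, -2.5)))
```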
In probability and statistics, a Generative Model is a model used to generate observable data values, typically when some parameters are hidden. Generative models are used in Machine Learning either to model data directly or as an intermediate step in forming a conditional probability density function. In other words, we can model P(x, y) in order to make predictions (it can be converted to P(y | x) by applying Bayes' rule), as well as to generate probable pairs (x, y), which is widely used in unsupervised learning. Examples of Generative Models include Naive Bayes, Latent Dirichlet Allocation and the Gaussian Mixture Model.
Discriminative models, or Conditional models, are a class of models used in Machine Learning to model the dependence of a variable y on a variable x. As these models attempt to calculate conditional probabilities, that is, P(y | x), they are often used in supervised learning. Examples include logistic regression, SVMs and Neural Networks.
A Genetic Algorithm is a heuristic search that mimics the process of natural selection and uses methods such as mutation and recombination to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms found some utility in the 1980s and 1990s. Conversely, Machine Learning has been used to improve the performance of genetic and evolutionary algorithms.
Deep Learning is a sub-area of Machine Learning that deals with brain-inspired models based on artificial neural networks with multiple layers, which yield complex and computationally demanding models. Deep Learning techniques have, for example, been very successful in solving image recognition problems because they learn useful features automatically and express them as layers of representation. This class of models has recently proven to be extremely effective for several Machine Learning problems, often approaching or surpassing human performance.
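Both uses of a joint distribution, prediction and generation, can be sketched with a tiny count-based model. The observations (a word feature x paired with a label y) are hypothetical, and the model is a plain frequency table rather than any of the named algorithms.

```python
import random
from collections import Counter

# Hypothetical observed (x, y) pairs
data = (
    [("offer", "spam")] * 3 + [("offer", "ham")] * 1
    + [("meeting", "spam")] * 1 + [("meeting", "ham")] * 3
)

# Estimate the joint distribution P(x, y) from relative frequencies
joint = {pair: count / len(data) for pair, count in Counter(data).items()}


def conditional(y, x):
    """P(y | x) = P(x, y) / P(x), where P(x) sums the joint over all y."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return joint.get((x, y), 0.0) / p_x


def sample_pair(rng):
    """Generate a new (x, y) pair by sampling from the joint distribution."""
    pairs = list(joint)
    return rng.choices(pairs, weights=[joint[p] for p in pairs])[0]


print(conditional("spam", "offer"))
print(sample_pair(random.Random(0)))
```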
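The selection, recombination and mutation loop can be sketched on a toy problem: evolving a bit string with as many 1s as possible. The problem, the function names and every parameter value are hypothetical choices for illustration.

```python
import random


def fitness(genome):
    """Toy objective: the number of 1s in the bit string."""
    return sum(genome)


def evolve(length=20, population_size=30, generations=40,
           mutation_rate=0.05, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    initial_best = max(fitness(g) for g in population)
    for _ in range(generations):
        # Selection: the fitter half of the population may reproduce
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = [population[0]]  # elitism: the best genome survives unchanged
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)          # recombination point
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return initial_best, max(population, key=fitness)


initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Keeping the best genome unchanged in every generation guarantees that the best fitness never gets worse from one generation to the next.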
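What "multiple layers" buys can be sketched with the forward pass of a tiny two-layer network that computes XOR, a function no single linear layer can represent. The weights below are hand-picked for illustration rather than learned, and all names are hypothetical.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Hand-picked (not trained) parameters
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUTPUT_W = [[1.0, -2.0]]
OUTPUT_B = [0.0]


def forward(inputs):
    hidden = relu(dense(inputs, HIDDEN_W, HIDDEN_B))  # hidden representation
    return dense(hidden, OUTPUT_W, OUTPUT_B)[0]       # final output


outputs = [forward(x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
print(outputs)
```

In a real Deep Learning system the weights would be found by training on data, and the networks would have far more layers and units.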
Natural Language Processing (NLP)
Natural Language Processing is a sub-area of Artificial Intelligence focused on the ability of computers to process and comprehend human language. Through this understanding, which is far from easy to achieve, systems can generate data that is used to build everything from customer service bots and digital assistants to customized products and services for particular groups of people.
Voice Recognition is the process of automatically converting voice (speech in natural language) into text. It is also known as STT (Speech to Text) or ASR (Automatic Speech Recognition).
Sentiment Analysis uses techniques and technologies to identify and extract information about the sentiment (positive, negative or neutral) of an individual or group of individuals about a particular topic.
Virtual Assistants are computer programs that simulate a human assistant providing some service to the user. They may or may not be able to chat (as chatbots do). They can serve clients, guide tasks, remember appointments and serve as an interface to other applications. Examples include Apple's Siri, Google Assistant, Microsoft’s Cortana, Amazon's Alexa and Facebook Messenger's M assistant.
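One of the simplest techniques is a lexicon-based scorer that counts positive and negative words. The word lists below are tiny, hypothetical examples; practical systems typically use much larger lexicons or trained classifiers.

```python
# Hypothetical, deliberately tiny sentiment lexicons
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}


def sentiment(text):
    """Label a text by comparing counts of positive and negative words."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("I love this phone, the camera is great!"))
print(sentiment("Terrible battery and poor support."))
print(sentiment("It arrived on Tuesday."))
```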
Chatbots are programs that use natural language features to interact with users via messages (Conversational Interface). Some chatbots use artificial intelligence to discover intent in the user's sentence, deal with ambiguities, find the best response, and learn from the interactions. They are already an option for customer service in several companies. They are used on websites, applications and social networks to talk to customers.
Internet of Things
IoT gives everyday things the ability to collect, analyze and transmit data, increasing their usefulness. This covers all sorts of things, from self-driving cars to refrigerators that generate grocery shopping lists.
Big Data is often associated with gigantic databases that require cost-effective and innovative data-processing strategies to improve the quality of insights about trends and behaviors, decision making, and process automation.
Hadoop is an open source project of the Apache Software Foundation that aims to provide a distributed platform for processing and exploiting Big Data using multiple computers interconnected in clusters. These clusters can hold up to thousands of machines, each of which provides local processing and storage capacity. In this way, instead of relying on a single piece of hardware, a software library provides high-availability services on top of a cluster of computers.
A Data Lake is a system that stores data in large volumes and in its natural state, coming from all types of sources, where users can "dive in" and take samples. That is, a "lake" full of data. Storing this type of data is more difficult, since it generally comes in different formats and from different origins. This diversity, however, can be quite positive, since it increases the possibilities for exploration.
Data Analytics is the process of collecting, processing and analyzing data to generate insights that support information-based decision making. Overall, it is a way of taking ownership of data and analyzing it.
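Hadoop's classic programming model is MapReduce, and its shape can be sketched in a single Python process with a word count. This is a simulation of the idea only: it does not use any Hadoop API, the function names are hypothetical, and on a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; each document could live on a different machine
documents = ["big data needs big clusters", "clusters process data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```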
Business Intelligence is the collection, organization and analysis of information with the objective of providing insights for business decision making.
Data Preparation is the process of collecting, cleaning, normalizing, combining, structuring and organizing data for analysis. It is the initial (and fundamental) step of any Data Science work.
A Data Warehouse is a repository of information (data) that is organized by subject, integrated and permanent, built to support the company's decision making. This repository is isolated from operational systems and is used as a centralized database for all business areas.
A Data Mart is a subset of the data in a Data Warehouse designed to meet the needs of a particular user community. Data Marts can be built for various areas of the company, such as Finance, Sales or Human Resources, so that users in each business area see only the data that is relevant to them.
Data visualization is the presentation of data in a pictorial or graphical form. Patterns, trends and correlations that could go undetected in text can be exposed and recognized more easily with visualization software. This technique makes data easier to understand for everyone who works with it, including decision makers, who can extract more and better insights from results presented visually.
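A few of these steps (cleaning, dropping incomplete records, removing duplicates and normalizing) can be sketched on a small list of records. The field names, the records and the choice of min-max scaling are hypothetical.

```python
# Hypothetical raw records with stray whitespace, a missing value and a duplicate
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "bob", "age": ""},
    {"name": "Alice", "age": "34"},
    {"name": "carol", "age": "50"},
]


def prepare(records):
    cleaned, seen = [], set()
    for record in records:
        name = record["name"].strip().title()   # clean and standardize text
        age_text = record["age"].strip()
        if not age_text:                        # drop incomplete records
            continue
        row = (name, int(age_text))             # convert to a proper type
        if row in seen:                         # remove duplicates
            continue
        seen.add(row)
        cleaned.append({"name": row[0], "age": row[1]})
    if not cleaned:
        return cleaned
    # Min-max normalization of age to the range [0, 1]
    ages = [r["age"] for r in cleaned]
    low, high = min(ages), max(ages)
    for r in cleaned:
        r["age_scaled"] = (r["age"] - low) / (high - low) if high > low else 0.0
    return cleaned


prepared = prepare(raw)
print(prepared)
```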