About BI and Data Modeling: Requirements and Data Modeling

20191657948_ff15a71f32_z

There are two most common approaches to data modeling: relational modeling, often related to OLTP systems and dimensional modeling, which is a more appropriate technique for OLAP systems. Serra (2002) states that the process of building quality data models comes before the design of the first entity, starting with the understanding of the corporate model being addressed.

Collecting the business and data requirements is the foundation of the entire data warehouse effort—or at least it should be. Collecting the requirements is an art form, and it is one of the least natural activities for an IS organization. We give you techniques to make this job easier and hope to impress upon you the necessity of spending quality time on this step. (KIMBALL, 1998, p.6).

For any project aiming to build OLAP systems or OLTP systems, the reality with respect to the requirements is the same: they are the foundation of all the structure to be built and therefore, the necessary amount of time must be invested in order to prospect anything that is relevant. The more time invested in collecting and investigating requirements, the less time will be spent making unnecessary corrections in the future of the data model and the systems involved.

Fig. 2 - Definição de Requisitos e Modelagem Dimensional no Diagrama do Ciclo de Vida Dimensional do Negócio de Kimball
Definition of Requirements and Dimensional Modeling in the Kimball Business Dimensional Life-Cycle Diagram

From the business dimensional life-cycle model, on which Kimball’s (1998) methodology is based, note the importance not only of the process of defining the business requirements, but also, how much the dimensional data modeling process depends on these requirements, the dependency of the other processes of these two processes, and the critical path of the project, which tends to be strongly configured over this whole core set of processes.

KIMBALL (1998) states that in dimensional modeling, the definition of business requirements determines the data needed to meet the analytical requirements of business users. That is, for the ways of analysis that the business users intend to use can be made feasible, it is necessary to have the required data from the definition of the business requirements. A different approach than that used for the design of operational level systems is necessary to design data models that support such analyzes.

Even if there are applicability differences (and others) between data modeling techniques, relational or dimensional, the quality of the model will always be dependent on the quality of the requirements surveyed.

Text: Pedro Carneiro Jr.
Revision: Luis Cláudio R. da Silveira


These are the posts on the same “Enum and Quality in BI” monograph:

These will be the next posts on the same theme:

  • About BI and Data Modeling: Quality of Modeling, Data and Information
  • About BI and Data Modeling: Types of Data Modeling
    • About BI and Data Modeling: Relational Modeling
      • About BI and Data Modeling: Phases of Relational Data Modeling
      • About BI and Data Modeling: How to create an Entity-Relationship Diagram
    • About BI and Data Modeling: Dimensional Modeling
      • About BI and Data Modeling: Defining Granularity
      • About BI and Data Modeling: Detailing Dimensions
      • About BI and Data Modeling: Defining the Attributes of the Fact Table (s)
      • About BI and Data Modeling: Defining Aggregates

Justification

This short text is a mere Portuguese to English translation from part of my monograph “THE PERSISTENCE OF ENUMERATIONS IN POSTGRESQL DATABASES AND QUALITY IN BUSINESS INTELLIGENCE” (free translation of the title), also aliased as “Enum and Quality in BI”, which corresponds to a minor part of the document structure.


References:

Image credits:

 

What is Data Science?

Image taken from page 2 of 'Lady Lohengrin. [A novel.]'

I am a bit dissatisfied with the multiple definitions that data science have been receiving and the lack of at least one clear and scientific approach to a definition for it as it occurs with computer science, software development science, and a lot of other subjects. So I decided to write this post expecting to produce some findings and/or to light up some discussion around it. Who knows we may reach a more scientific definition in the future.

“The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design. The UC Berkeley School of Information is ideally positioned to bring these disciplines together and to provide students with the research and professional skills to succeed in leading edge organizations.” – https://datascience.berkeley.edu/about/what-is-data-science/,
accessed on January 13rd, 2016.

Data Science Happens Not Only In California

Many people quote that a data scientist is “a data analyst who lives in San Francisco”. That alone might indicate the importance of the data analysts and all the data practitioners in California, but also it seems to be enough to determine that what we know as data science has a more practical or commercial appeal than a proper scientific definition for itself. Anyhow, we should not deny that this data science already has an identity: a fast-paced, rapidly-evolving one, just like any other field directly involved with modern technologies. But the distinct personality of data science is still a bit confusing.

Is Statistics Data Science Itself?

Many argue that data science might be statistics itself or whatsoever modern statistics does by the usage of computational means. That happens even in the academic ecosystem in a large scale, propelled by the popularity and the usage of big data, machine learning et cetera. Do statistics compose the whole data science? Does data science compose the whole statistics? In other words, are statistics and data science different sets, different sciences? The known truth so far is that statistics makes use of data science.

Data Science, According To Wikipedia

Many professors would not accept an Wikipedia definition as the basis for a scientific argument. Anyhow, let us ease things a little bit by using it. In my opinion, Wikipedia reflects what a majority think or at least tends to be an average of the mindset.

Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD). – Wikipedia, https://en.wikipedia.org/wiki/Data_science, accessed on January 12th, 2016.

Wikipedia, at this moment at least, defines data science as an interdisciplinary field. That is true. Another point of view affirms that too and provides the famous Data Science Venn Diagram. My question is: must a field be a science? A field is a subset or part of a science but the reciprocal is not necessarily true. In the citation above, Wikipedia affirms that statistics is a field too and we are considering Statistics as a science.

google-definition-datascience
Google uses Wikipedia’s definition of data science

 

A Data Science Visualization, according to Drew Conway One of the opinions that has a closer approach to a common place for a definition of Data Science is the one of Drew Conway. Despite I have not seen yet any statement that it is a definition, his visualization brings data science as an intersection of hacking skills, Statistics, and the areas of application, the famous Data Science Venn Diagram. It seems that it still misses key areas such as databases, data governance, and so on, but I think that he has put all Computer Science and databases stuff into a set called “hacking skills”. Also, that occurs probably because the world has much more programmers (people with hacking skills) than computer scientists or because those results oriented people with hacking skills are in more demand than computer scientists. Who knows computer science is so closed in itself (difficult to enter or to communicate) or it becomes so boring in university that there are more “people with hacking skills” from other areas behind the desks typing R command lines than good computer scientists doing the same.

“As I have said before, I think the term “data science” is a bit of a misnomer, but I was very hopeful after this discussion; mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits. What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram.” – Drew Conway, http://drewconway.com/zia/2013/3/26/the-data-sciencevenn-diagram, accessed on January 13rd, 2016.

According to Drew Conway, author of the DS Venn Diagram, the recent “data science” term forged for the recent usage of data may be a bit of a misnomer and I agree with him.

data_science_vd
Data Science Venn Diagram – Credits/source: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)

Data Science vs Data Science

We should ask then, what science is that one that the so called data science field sits in? Information Science, the “Data Science”, Statistics, Computer Science…? Wikipedia’s data science definition also says that DS is similar to KDD, but shouldn’t KDD be encompassed by DS simply because databases deal with data? Because of that, another question comes to mind: Is the real Data Science “the science of data” or “the science that extracts knowledge or insights from data in various forms”?

Here we encounter two definitions and only one of them is the real Data Science.

“Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies… Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiencies, recognize new market opportunities and increase the organization’s competitive advantage. Some companies are hiring data scientists to help them turn raw data into information. To be effective, such individuals must possess emotional intelligence in addition to education and experience in data analytics.” – http://searchcio.techtarget.com/definition/data-science, accessed on January 13rd, 2016.

The Data Science Venn Diagram above helps a lot with that, but there is more to be discovered, mainly because, in my opinion, this “data science” Wikipedia, data analysts, statisticians, programmers, and business men talk about is more about what these data practitioners have been doing with statistics, substantive expertise and hacking skills to turn raw data into information then, for example, the science that studies data, a systematically organized body of knowledge on the particular subject of data, in other words, the science that studies data frames, data sets, databases, meta-data, data flows, data cubes, data models, and all the domain the subject of data might encompass and its frontiers. That makes us go after the definition of science.

“There is much debate among scholars and practitioners about what data science is, and what it isn’t. Does it deal only with big data? What constitutes big data? Is data science really that new? How is it different from statistics and analytics?… In virtually all areas of intellectual inquiry, data science offers a powerful new approach to making discoveries. By combining aspects of statistics, computer science, applied mathematics, and visualization, data science can turn the vast amounts of data the digital age generates into new insights and new knowledge.”, http://datascience.nyu.edu/what-is-data-science/, accessed on January 13rd, 2016.

What Science Is

I went after a classic definition for science and the first thing that came to me was, again, an Wikipedia definition. That’s the modern days, professors. Anyway, trying to be fair to the investigation, I tried to find other online sources and found some other definitions, including one that gets close to what is better to use when one wants to prove a science and that may be helpful in our future reasonings.

Science, According to Wikipedia

Wikipedia defines science as “a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe”.

Science, According to Google’s Definition

According to Google, science is “the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment. (‘the science of criminology’)”; “a particular area of this. (‘veterinary science’)”;”a systematically organized body of knowledge on a particular
subject. (‘the science of criminology’)”; “synonyms: physics, chemistry, biology; physical sciences, life sciences (‘he teaches science at the high school’)”.

google-definition-science
Google Dictionary’s definition of science

Science, According to Merriam-Webster

At Merriam-Webster we read that science is “knowledge about or study of the natural world based on facts learned through experiments and observation; a particular area of scientific study (such as biology, physics, or chemistry); a particular branch of science; a subject that is formally studied in a college, university, etc.”

Science, According to BusinessDictionary.com

The BusinessDictionary.com defines science as “Body of knowledge comprising of measurable or verifiable facts acquired through application of the scientific method, and generalized into scientific laws or principles. While all sciences are founded on valid reasoning and conform to the principles of logic, they are not concerned with the definitiveness of their assertions or findings”. And adds, “In the words of the US paleontologist Stephen Jay Gould (1941-), ‘Science is all those things which are confirmed to such a degree that it would be unreasonable to withhold one’s provisional consent.’”

This one seams to be the best definition for science we found up to the moment as it mentions the scientific method as the way to measure and verify the facts and the laws or principles that compose a science.

A Raw First Definition of Data Science

This is raw, and maybe not sophisticated and prone to errors (we are not using the scientific method yet – let us keep that one for future posts), but let us imagine what a data science definition would be based on the definitions of science we listed above.

An Wikipedia-Would-Be Definition of Data Science

“A systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about data“.

Do we have a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about data? What we have today is about data or about other things using data as the main support?

A Google-Would-Be Definition of Data Science

From Google’s definition of Science, it looks like our data science definition should at least become something like:

1. “the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment of data. (‘the science of data’)”;
2. or “the intellectual and practical activity encompassing the systematic study of the structure and behavior of data in the physical and natural world through observation and experiment. (‘the science of data’)”.

We have first and second definitions, based on Google’s definition of science.

From recent practice and readings, I would bet that our first-created Google-would-be definition (1) is what all people involved have in mind as for what they/we think data science is. I think that is why many people tend to confuse data science with statistics, simply because the definition number one expresses very well what statistics does. But, actually, is that the proper definition for data science?

Other Google definitions would be like “a particular area of this. (‘data science’)”; “a systematically organized body of knowledge on the data subject. (‘the science of data‘)”.

Do we have a systematically organized body of knowledge on the data subject? As far as I know we have systematically organized bodies of knowledge on many subjects and they use data as a foundation.

A Merriam-Webster-Would-Be Definition of Data Science

“Knowledge about or study of data based on facts learned through experiments and observation; a particular area of scientific study (such as “DATA-o-logy”, biology, physics, or chemistry); a particular branch of science (data science); a subject that is formally studied in a college, university, etc.”

A BusinessDictionary-Would-Be Definition of Data Science

“Body of knowledge comprising of measurable or verifiable facts about data acquired through application of the scientific method, and generalized into scientific laws or principles.

We are here not to precisely inform the data science definition yet, but to throw the ball to the kicker.

Nowadays (we are in January, 2016), it is possible to find many definitions of data science and many (or all) of them still lack precision or lead to a practice that may be a misnomer of something people do with data for scientific and commercial reasons. As a science, there are people studying it, defining it (what we are trying to do), and not only using it. As a practice, people do not mind if it is a science or not since the tool set works for them. As many are trying to define it, according to their observations and experiences, it looks like everybody, while succeeding in a good definition for specific purposes, fails to discover a common place for the definition. As far as all scientists know, the proper common place for the definition of any science is Science itself.

Should one say that data science is “the science of data”, that would be vague, not precise, but that innocence would throw a light on a different perspective. What is science and what is data? That might help us reach better and more common-sense oriented definitions for both the practice of extracting knowledge or insights from data and the science of data and, who knows, turn us able to affirm that there is a lot of or no difference between the two things.

Just an important note: I searched the websites http://www.sciencecouncil.org/ and http://www.businessdictionary.com/ and found no definition for data science in their websites.

Text: Pedro Carneiro Jr.
Revision: Luis Cláudio R. da Silveira


Other Readings (References):

Other sources to find popular or different perspectives about data science are:

Image credits: