About Business Intelligence: The Adequacy of the Information to the Business Needs


For Serra (2002), each information source has three attributes: form, age and frequency. Taking as an example a “Quantity Produced by Manufacturing Order” report, we can assume it has the following characteristics:

  • As for form: quantities produced, detailed by product;
  • As for age: received at 8:30 AM, covering facts reported up to midnight of the previous day;
  • As for frequency: daily.
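Serra's three attributes map naturally onto a small data structure. The Python sketch below is ours, not Serra's; the class and field names are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class InformationSource:
    """An information source described by Serra's three attributes."""
    form: str       # how the information is detailed/presented
    age: str        # how current the facts are when delivered
    frequency: str  # how often the information is produced

# The "Quantity Produced by Manufacturing Order" report from the example:
production_report = InformationSource(
    form="quantities produced, detailed by product",
    age="delivered at 8:30 AM, facts up to midnight of the previous day",
    frequency="daily",
)

print(production_report.frequency)  # daily
```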

Kimball (1998), addressing the processes involved in the Data Warehouse lifecycle, stresses the importance of balancing the reality of business requirements against the availability of data to meet that demand. Preparation and time are fundamental to a good project, which will involve considerable dialogue between the qualified personnel of the systems area and the information-consuming staff of the business area.

Before you can do a good job of defining your data marts, you need to do some homework. You must thoroughly canvass your organization’s business needs and thoroughly canvass the data resources. (KIMBALL, 1998, p. 268).

Text: Pedro Carneiro Jr.
Revision: Luis Cláudio R. da Silveira


These are the posts from the same “Enum and Quality in BI” monograph:

Our future posts that complete the current “About Business Intelligence” theme will be:

  • About Business Intelligence: Data Warehouse

Justification

This short text is a simple Portuguese-to-English translation of part of my monograph “THE PERSISTENCE OF ENUMERATIONS IN POSTGRESQL DATABASES AND QUALITY IN BUSINESS INTELLIGENCE” (a free translation of the title), also referred to as “Enum and Quality in BI”, and corresponds to a minor part of the document structure.


References:

Image credits:


About Business Intelligence: The Relationship Between Operational Information and Managerial Information


Information can be classified according to its operational or managerial purpose. For Serra (2002), information is both the source and the outcome of executive action: complete and current facts are essential for appropriate decisions. Information is operational when generated to maintain continuity of operations in the organization’s operational cycle; it usually comes directly from transactional systems. Information is managerial when it aims to support some decision-making process. In addition, people at different management levels need managerial information at different levels of detail.
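The distinction can be made concrete with a toy Python example (the data is invented): operational information is the raw transactional records, while managerial information is an aggregate view built from them to support a decision.

```python
from collections import defaultdict

# Operational information: raw transactional records (one per manufacturing order)
orders = [
    {"order": 101, "product": "A", "qty": 120},
    {"order": 102, "product": "B", "qty": 80},
    {"order": 103, "product": "A", "qty": 50},
]

# Managerial information: an aggregate view that supports decision making,
# e.g. total quantity produced per product
totals = defaultdict(int)
for row in orders:
    totals[row["product"]] += row["qty"]

print(dict(totals))  # {'A': 170, 'B': 80}
```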


About Business Intelligence: The Quality of Information


Quality information depends directly on quality data. One problem is that software production today is still largely artisanal. Serra (2002) even classifies system development professionals as “intellectual artisans”, given the lack of controls and well-defined processes for that activity. Despite this difficulty in measuring the quality of software development processes, concrete results have been obtained by applying the methods of Kimball (1998) to Data Warehousing, which give us defined processes for measuring and handling information quality.

Consistent information means high-quality information. This means that all of the information is accounted for and is complete. (KIMBALL, 1998, p.10).

Data staging is a major process that includes, among others, the following sub-processes: extracting, transforming, loading and indexing, and quality assurance checking. (KIMBALL, 1998, p.23).
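Kimball's staging sub-processes can be sketched as a simple pipeline. The Python fragment below is our own illustration of the ordering of the steps (indexing is omitted, and a real staging tool would be far richer):

```python
def extract(source):
    # extracting: pull raw rows from the operational system
    return list(source)

def transform(rows):
    # transforming: e.g. normalize product codes
    return [{**r, "product": r["product"].strip().upper()} for r in rows]

def quality_check(rows):
    # quality assurance: reject rows with missing or negative quantities
    return [r for r in rows if r.get("qty", 0) >= 0]

def load(rows, warehouse):
    # loading: append the staged rows to the warehouse table
    warehouse.extend(rows)
    return warehouse

source = [{"product": " a ", "qty": 10}, {"product": "b", "qty": -5}]
warehouse = []
staged = quality_check(transform(extract(source)))
load(staged, warehouse)
print(warehouse)  # [{'product': 'A', 'qty': 10}]
```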


About Business Intelligence: The Quality of Data


For Serra (2002), an effective Data Management function relies on standards and policies regarding data, their definition and their usage. These standards and policies must be defined and adopted; they should be stringent, comprehensive, flexible to change (aiming at reusability), stable, and able to communicate the meaning of the data effectively, as well as to enable scalability. Tools such as data dictionaries and repositories should be used for data management. Data must be well defined, sound, consistent, reliable, safe and shared, so that each new system defines only the data within its scope and shares the remaining data with the other systems in the organization.
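In its simplest form, a data dictionary is a shared catalogue of field definitions that every system consults instead of redefining its own data. The Python sketch below is our own minimal illustration (all field names are invented):

```python
# A minimal data dictionary: one shared definition per field, reused by all systems
DATA_DICTIONARY = {
    "product_code": {"type": str, "meaning": "unique code of the manufactured product"},
    "qty_produced": {"type": int, "meaning": "units produced in the reporting period"},
}

def validate(record):
    """Check that a record uses only fields defined in the dictionary, with the right types."""
    for field, value in record.items():
        if field not in DATA_DICTIONARY:
            raise KeyError(f"field {field!r} is not defined in the data dictionary")
        expected = DATA_DICTIONARY[field]["type"]
        if not isinstance(value, expected):
            raise TypeError(f"{field!r} must be of type {expected.__name__}")

validate({"product_code": "A-10", "qty_produced": 120})  # passes silently
```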

For Kimball (1998), warehouse design often begins with a load of historical data that requires cleansing and quality control. In existing warehouses, clean data comes from two processes: inserting clean data, and cleaning and solving the problems of data already inserted. In addition, establishing accountability for data quality and integrity can be extremely difficult in a Data Warehousing environment. In most transactional systems, the important operational data is well captured, but optional fields receive little attention, and system owners do not care whether they are accurate or complete as long as the required logic is met. Thus, the business and information systems groups must identify or appoint an accountable person for each data source, whether internal or external, who treats the data from a business perspective. The quality of the data depends on a series of events, many beyond the control of the data warehousing team, such as the data collection process, which must be well designed and count on a strong commitment from the people who enter those data. Once the value of the data warehouse is established, it becomes easier to induce the modifications to the data entry processes of the source systems needed to obtain better data.

Kimball (1998) further argues that it is unrealistic to expect any system to contain perfect data, but each implementation must define its own standards of data quality acceptance. These standards are based on the characteristics of quality data, which is accurate, complete, consistent, unique and timely:

  • Accurate: the warehouse data is consistent with the system of record and, if not, the reason can be explained;
  • Complete: the data represents the entire relevant set, and users are notified of its scope;
  • Consistent: the data contains no contradictions;
  • Unique: things always have the same name when they have the same meaning;
  • Timely: the data is updated on a schedule useful to business users; the schedule is known and people accept it.

In addition, quality data simply represents the truth of the facts.
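Some of these characteristics suggest simple, automatable checks. The sketch below is our own illustration against a batch of invented rows; accuracy and timeliness are left out, since testing them requires the system of record and the load schedule:

```python
def check_complete(rows, required_fields):
    """Complete: every row carries the whole relevant set of fields."""
    return all(all(f in row and row[f] is not None for f in required_fields)
               for row in rows)

def check_unique(rows, key):
    """Unique: the same key value never appears twice."""
    seen = set()
    for row in rows:
        if row[key] in seen:
            return False
        seen.add(row[key])
    return True

def check_consistent(rows):
    """Consistent: no internal contradictions (here: the parts add up to the total)."""
    return all(row["qty_good"] + row["qty_rejected"] == row["qty_total"]
               for row in rows)

rows = [
    {"order": 101, "qty_good": 115, "qty_rejected": 5, "qty_total": 120},
    {"order": 102, "qty_good": 78, "qty_rejected": 2, "qty_total": 80},
]
print(check_complete(rows, ["order", "qty_total"]),
      check_unique(rows, "order"),
      check_consistent(rows))  # True True True
```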


What is Data Science?


I am a bit dissatisfied with the multiple definitions that data science has been receiving and with the lack of at least one clear, scientific approach to defining it, as exists for computer science, software development science and many other subjects. So I decided to write this post, hoping to produce some findings and/or spark some discussion around it. Who knows, we may reach a more scientific definition in the future.

“The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design. The UC Berkeley School of Information is ideally positioned to bring these disciplines together and to provide students with the research and professional skills to succeed in leading edge organizations.” – https://datascience.berkeley.edu/about/what-is-data-science/, accessed on January 13th, 2016.

Data Science Happens Not Only In California

Many people quote the line that a data scientist is “a data analyst who lives in San Francisco”. That alone might indicate the importance of data analysts and all the data practitioners in California, but it also seems enough to show that what we know as data science has a more practical or commercial appeal than a proper scientific definition. Anyhow, we should not deny that this data science already has an identity: a fast-paced, rapidly evolving one, just like any other field directly involved with modern technologies. But the distinct personality of data science is still a bit confusing.

Is Statistics Data Science Itself?

Many argue that data science is statistics itself, or whatever modern statistics does by computational means. That happens on a large scale even in the academic ecosystem, propelled by the popularity and usage of big data, machine learning et cetera. Does statistics compose the whole of data science? Does data science compose the whole of statistics? In other words, are statistics and data science different sets, different sciences? The known truth so far is that statistics makes use of data science.

Data Science, According To Wikipedia

Many professors would not accept a Wikipedia definition as the basis for a scientific argument. Anyhow, let us ease things a little by using it. In my opinion, Wikipedia reflects what the majority thinks, or at least tends to be an average of the mindset.

Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD). – Wikipedia, https://en.wikipedia.org/wiki/Data_science, accessed on January 12th, 2016.

Wikipedia, at this moment at least, defines data science as an interdisciplinary field. That is true. Another point of view affirms that too and provides the famous Data Science Venn Diagram. My question is: must a field be a science? A field is a subset or part of a science, but the reciprocal is not necessarily true. In the citation above, Wikipedia affirms that statistics is a field too, and we are considering statistics a science.

Google uses Wikipedia’s definition of data science

 

A Data Science Visualization, According to Drew Conway

One of the opinions that comes closest to a common ground for a definition of Data Science is that of Drew Conway. Although I have not yet seen any statement that it is a definition, his visualization presents data science as an intersection of hacking skills, statistics and the areas of application: the famous Data Science Venn Diagram. It seems to miss key areas such as databases, data governance and so on, but I think he has put all the computer science and database material into a set called “hacking skills”. That probably occurs because the world has many more programmers (people with hacking skills) than computer scientists, or because those results-oriented people with hacking skills are in more demand than computer scientists. Who knows, maybe computer science is so closed in on itself (difficult to enter or to communicate with), or becomes so boring in university, that there are more “people with hacking skills” from other areas behind the desks typing R command lines than good computer scientists doing the same.

“As I have said before, I think the term “data science” is a bit of a misnomer, but I was very hopeful after this discussion; mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits. What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram.” – Drew Conway, http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram, accessed on January 13th, 2016.

According to Drew Conway, author of the DS Venn Diagram, the term “data science”, forged for the recent usage of data, may be a bit of a misnomer, and I agree with him.

Data Science Venn Diagram – Credits/source: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
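Conway's diagram reads naturally as a set intersection. The toy Python snippet below, with skill sets entirely of our own invention, only illustrates that reading:

```python
# Toy skill sets standing in for the three circles of the Venn diagram
hacking_skills = {"python", "sql", "shell", "statistical_computing"}
math_and_stats = {"probability", "inference", "statistical_computing"}
substantive_expertise = {"domain_questions", "statistical_computing"}

# Data science, in the diagram's reading, sits at the three-way intersection
data_science = hacking_skills & math_and_stats & substantive_expertise
print(data_science)  # {'statistical_computing'}
```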

Data Science vs Data Science

We should ask, then: what science is the one that the so-called data science field sits in? Information Science, the “Data Science”, Statistics, Computer Science…? Wikipedia’s data science definition also says that DS is similar to KDD, but shouldn’t KDD be encompassed by DS, simply because databases deal with data? Because of that, another question comes to mind: is the real Data Science “the science of data” or “the science that extracts knowledge or insights from data in various forms”?

Here we encounter two definitions and only one of them is the real Data Science.

“Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies… Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiencies, recognize new market opportunities and increase the organization’s competitive advantage. Some companies are hiring data scientists to help them turn raw data into information. To be effective, such individuals must possess emotional intelligence in addition to education and experience in data analytics.” – http://searchcio.techtarget.com/definition/data-science, accessed on January 13th, 2016.

The Data Science Venn Diagram above helps a lot with that, but there is more to be discovered. In my opinion, the “data science” that Wikipedia, data analysts, statisticians, programmers and business people talk about is more about what these data practitioners have been doing with statistics, substantive expertise and hacking skills to turn raw data into information than, for example, the science that studies data: a systematically organized body of knowledge on the particular subject of data; in other words, the science that studies data frames, data sets, databases, metadata, data flows, data cubes, data models, and all the domain the subject of data might encompass, including its frontiers. That makes us go after the definition of science.

“There is much debate among scholars and practitioners about what data science is, and what it isn’t. Does it deal only with big data? What constitutes big data? Is data science really that new? How is it different from statistics and analytics?… In virtually all areas of intellectual inquiry, data science offers a powerful new approach to making discoveries. By combining aspects of statistics, computer science, applied mathematics, and visualization, data science can turn the vast amounts of data the digital age generates into new insights and new knowledge.”, http://datascience.nyu.edu/what-is-data-science/, accessed on January 13th, 2016.

What Science Is

I went after a classic definition of science, and the first thing that came to me was, again, a Wikipedia definition. That’s the modern days, professors. Anyway, trying to be fair to the investigation, I looked for other online sources and found some other definitions, including one that comes close to what is best to use when one wants to prove a science, and that may be helpful in our future reasoning.

Science, According to Wikipedia

Wikipedia defines science as “a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe”.

Science, According to Google’s Definition

According to Google, science is “the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment. (‘the science of criminology’)”; “a particular area of this. (‘veterinary science’)”; “a systematically organized body of knowledge on a particular subject. (‘the science of criminology’)”; “synonyms: physics, chemistry, biology; physical sciences, life sciences (‘he teaches science at the high school’)”.

Google Dictionary’s definition of science

Science, According to Merriam-Webster

At Merriam-Webster we read that science is “knowledge about or study of the natural world based on facts learned through experiments and observation; a particular area of scientific study (such as biology, physics, or chemistry); a particular branch of science; a subject that is formally studied in a college, university, etc.”

Science, According to BusinessDictionary.com

The BusinessDictionary.com defines science as “Body of knowledge comprising of measurable or verifiable facts acquired through application of the scientific method, and generalized into scientific laws or principles. While all sciences are founded on valid reasoning and conform to the principles of logic, they are not concerned with the definitiveness of their assertions or findings”. And adds, “In the words of the US paleontologist Stephen Jay Gould (1941-), ‘Science is all those things which are confirmed to such a degree that it would be unreasonable to withhold one’s provisional consent.’”

This one seems to be the best definition of science we have found so far, as it mentions the scientific method as the way to measure and verify the facts and the laws or principles that compose a science.

A Raw First Definition of Data Science

This is raw, perhaps unsophisticated and prone to errors (we are not using the scientific method yet – let us keep that for future posts), but let us imagine what a definition of data science would be, based on the definitions of science listed above.

A Wikipedia-Would-Be Definition of Data Science

“A systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about data”.

Do we have a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about data? Is what we have today about data, or about other things that merely use data as their main support?

A Google-Would-Be Definition of Data Science

From Google’s definition of Science, it looks like our data science definition should at least become something like:

1. “the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment of data. (‘the science of data’)”;
2. or “the intellectual and practical activity encompassing the systematic study of the structure and behavior of data in the physical and natural world through observation and experiment. (‘the science of data’)”.

We have first and second definitions, based on Google’s definition of science.

From recent practice and readings, I would bet that our first-created Google-would-be definition (1) is what all people involved have in mind as for what they/we think data science is. I think that is why many people tend to confuse data science with statistics, simply because the definition number one expresses very well what statistics does. But, actually, is that the proper definition for data science?

Other Google definitions would be like “a particular area of this. (‘data science’)”; “a systematically organized body of knowledge on the data subject. (‘the science of data’)”.

Do we have a systematically organized body of knowledge on the data subject? As far as I know we have systematically organized bodies of knowledge on many subjects and they use data as a foundation.

A Merriam-Webster-Would-Be Definition of Data Science

“Knowledge about or study of data based on facts learned through experiments and observation; a particular area of scientific study (such as “DATA-o-logy”, biology, physics, or chemistry); a particular branch of science (data science); a subject that is formally studied in a college, university, etc.”

A BusinessDictionary-Would-Be Definition of Data Science

“Body of knowledge comprising of measurable or verifiable facts about data acquired through application of the scientific method, and generalized into scientific laws or principles.”

We are not here to state the definition of data science precisely yet, but to throw the ball to the kicker.

Nowadays (we are in January, 2016), it is possible to find many definitions of data science and many (or all) of them still lack precision or lead to a practice that may be a misnomer of something people do with data for scientific and commercial reasons. As a science, there are people studying it, defining it (what we are trying to do), and not only using it. As a practice, people do not mind if it is a science or not since the tool set works for them. As many are trying to define it, according to their observations and experiences, it looks like everybody, while succeeding in a good definition for specific purposes, fails to discover a common place for the definition. As far as all scientists know, the proper common place for the definition of any science is Science itself.

Should one say that data science is “the science of data”, that would be vague and imprecise, but that innocence would throw light on a different perspective. What is science and what is data? That might help us reach better and more common-sense-oriented definitions for both the practice of extracting knowledge or insights from data and the science of data and, who knows, enable us to affirm that there is a lot of, or no, difference between the two things.

Just an important note: I searched http://www.sciencecouncil.org/ and http://www.businessdictionary.com/ and found no definition of data science on either website.


About The Software Design Science (Part 1 of 2)


Kanat-Alexander (2012), the first person to claim a science of software design, explores the concept of software design when speaking about what he calls “the missing science”. The whole foundation of the laws of software design depends on that conceptualization. The missing science is the science of software design. The approach defined by Kanat-Alexander (2012), transcribed below, reflects practice and the facts: the science of software design acts from before the beginning of the programming phase, remains throughout development, and continues after programming is finished and the program enters operation, for its maintenance.

Every programmer is a designer. (KANAT-ALEXANDER, 2012, p.6).

In the original version of this specific work by Kanat-Alexander, Code Simplicity, the title represents a fundamental truth to be followed by software developers: the simplicity of the code.

Software design, as it is practiced in the world today, is not a science. What is a science? The dictionary definition is a bit complex, but basically, in order for a subject to be a science it has to pass certain tests. (KANAT-ALEXANDER, 2012, p.7).

In his defense of this new science are the elements long perceived, but not yet organized, by the more experienced programmers. He lists the tests that software design must pass to be considered a science:

  • A science must be composed of facts, not opinions, and these facts must have been gathered somewhere (such as in a book).
  • That knowledge must have some sort of organization: it must be divided into categories, and its various parts must be properly linked to each other in terms of importance, etc.
  • A science must contain general truths or basic laws.
  • A science must tell you how to do something in the physical universe and be somehow applicable at work or in life.
  • Typically, a science is discovered and proven by means of the scientific method, which involves observing the physical universe, piecing together a theory about how the universe works, performing experiments to verify the theory, and showing that the same experiment works everywhere, to demonstrate that the theory is a general truth and not just a coincidence or something that worked only for someone.

The whole software community knows there is a lot of knowledge recorded and collected in books, in a well-organized manner. Despite that, we still lack clearly stated laws. Even if experienced software developers know what the right thing to do is, nobody knows for sure why some decisions are the right ones. Therefore, Kanat-Alexander (2012) lists definitions, facts, rules and laws for this science.

The whole art of practical programming grew organically, more like college students teaching themselves to cook than like NASA engineers building the space shuttle…. After that came a flurry of software development methods: the Rational Unified Process, the Capability Maturity Model, Agile Software Development, and many others. None of these claimed to be a science—they were just ways of managing the complexity of software development. And that, basically, brings us up to where we are today: lots of methods, but no real science. (KANAT-ALEXANDER, 2012, p. 10).

Kanat-Alexander (2012) affirms that all the definitions below are applicable when we talk about software design:

  • When you “design software”, you plan it out: the structure of the code, what technologies to use, etc. There are many technical decisions to be made. Often they are made only mentally; at other times, plans are jotted down or a few diagrams are drawn;
  • Once that is done, there is a “software design” (the plan that was elaborated), whether it is a written document or only several decisions kept in mind;
  • Code that already exists also has “a design” (“design” as the plan that an existing creation follows), which is the structure that it has or the plan that it seems to follow. Between “no design” and “a design” there are also many possibilities, such as “a partial design” or “various conflicting designs in a piece of code”. There are also effectively bad designs that are worse than having no design, as when one comes across code that is intentionally disorganized or complex: code with an effectively bad design.

The science presented here is not computer science. That’s a mathematical study. Instead, this book contains the beginnings of a science for the “working programmer” — a set of fundamental laws and rules to follow when writing a program in any language… The primary source of complexity in software is, in fact, the lack of this science. (KANAT-ALEXANDER, 2012, p. 11).

The science of software design is a science for developing plans and making decisions about software. It helps in deciding on the ideal structure of a program’s code, on the trade-off between execution speed and ease of understanding, and on which programming language is most appropriate to the case. We note, then, a new point of view on what we call software design, through the prism of the programmer: one that involves not only the activities after requirements analysis, but that also runs through programming and the whole product life cycle, including maintenance, because good maintenance requires a good design as a reference, taking its fundamental laws into account.
