Big Data represents something new in science as well as in society—no doubt about that. But the question is whether it also represents a “paradigm shift” in a Kuhnian sense and an epistemological change in our constitution of knowledge. In 2008, the editor-in-chief at Wired, Chris Anderson, wrote an article in which he stated: “Petabytes allow us to say: Correlation is enough. We can stop looking for models. We can analyse the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot” (Anderson 2008, web). Although he might originally have meant this statement partly as a journalistic provocation (Frické 2015, 652), there are also thinkers who believe there is some truth in it.
In this essay, I will first describe the theory of paradigm shift and new epistemology in the conception of Big Data and then explain how it can clarify what is at stake in the so-called Big Data revolution. My main source in this philosophical essay is Rob Kitchin’s research article ‘Big Data, new epistemologies and paradigm shifts’ (2014) from Big Data and Society. This article will serve as the starting point of my discussion; I will describe his main points, but his statements will also lead into a deeper philosophical discussion. Rob Kitchin is one of the foremost researchers within the field of Big Data. He is a professor of Geography at Maynooth University, yet his research has dealt widely with the Big Data revolution, the concept of open data, data infrastructures, and the consequences of data (Maynooth, Kitchin, web).
My thesis is that Big Data does create a new approach to science, and it does “revolutionize” the access to data in certain ways. However, it is too early to predict how Big Data will progress epistemologically and scientifically, and too early to determine whether it constitutes a paradigm shift. The use of Big Data is still mostly founded on theoretical assumptions that are quite similar to the criteria formulated by Popper and ‘normal’ science. Also, as Kitchin points out, induction is a very problematic philosophical concept in Big Data, as it is in “normal” science. As a consequence, I will, in continuation of Kitchin’s insights, suggest a change in “normal” science through, for example, the application of the concepts of ‘abduction’ and ‘retroduction’ as defined by Peirce. These might suggest another scientific logic in Big Data. In addition, I will touch upon ‘counterfactuals’ and ‘lawlike induction’ as described by Nelson Goodman. In particular, ‘counterfactuals’ in Big Data may point towards a new paradigm in a Kuhnian sense because simulation and modelling give a much better opportunity for accurate decisions in theory-choice and modes of operation.
PART I Big Data according to Kitchin
Rob Kitchin’s research article ‘Big Data, new epistemologies and paradigm shifts’ examines the epistemological and scientific consequences of the invention of Big Data. The motto of the article is a quote from Sinan Aral: “Revolutions in science have often been preceded by revolutions in measurement” (Kitchin 2014, 1). Following this, Kitchin investigates on what grounds we could say that Big Data revolutionizes science through new possibilities in measurement.
Unquestionably, Big Data is a buzzword in contemporary science and technology. It has become a hegemonic discourse describing the newest trends within data technology and science, or, as it is often put, the “explosion” of new possibilities. According to Kitchin, Big Data can, for example, be characterized as being “huge in volume”, “high in velocity”, “diverse in variety”, “exhaustive in scope”, “fine-grained in resolution and uniquely indexical in identification”, “relational in nature”, and “flexible” (Kitchin 2014, 1-2). All in all, the key terms in the discourse of Big Data underline the uniqueness, comprehensiveness, completeness, and extensiveness of Big Data.
Yet Big Data also has far-reaching consequences for the epistemology of science. It seems to be leading to a “new empiricism” where data is able to speak for itself, and there is no need for theory. Furthermore, there is more data than ever, and the gathered data also seems to be objective, free of human bias (Kitchin 2014, 5). In a radical interpretation, one could argue that scientists are exempted from constructing hypotheses and models and then testing them with experiments. Rather, scientists can “mine the complete set of data for patterns that reveal effects, producing scientific conclusions without further experimentation” (Kitchin 2014, 4). This is an important statement that goes beyond the ‘normal’ scientific method as described by, for example, critical rationalism or the logical empiricists. Popper emphasized science as a continual empirical testing of theories in which theory controls empirical research (Corvi 1997, 47); as such, theory is the touchstone of science.
Limits to New Empiricism
The background for the argument for a New Empiricism is quite fallacious, or there are at least important limitations to it, as Kitchin states. First of all, the amount of data is not exhaustive, because all that data provides is an oligoptic view of the world (Kitchin 2014, 4). Secondly, Big Data is limited by its tools and by the plain physicality of data sets. Data collection will never happen in a theory-empty universe; there will always exist pre-conditioned discursive assumptions that directly or indirectly frame the data mining. For example, there could be a feminist or a political bias in the approach (Sholl 2017, slides). Kitchin also points out that the algorithms used to mine data are themselves grounded in scientific reasoning, and that the resulting data still require interpretation within a scientific framing.
In addition, if we assume that Big Data analytics can be constituted by “correlation” only, the problem is (as I will show later) that: “Correlations between variables within a data set can be random in nature and have no or little causal association, and interpreting them as such can produce serious ecological fallacies” (Kitchin 2014, 5). This statement has pervasive consequences for the epistemology of Big Data because it questions the very foundation of knowledge production. The danger of false positives in data sets is imminent in Big Data if science does not approach the possibilities of Big Data critically.
It is also an erroneous conception that everyone can access data and interpret it outside the scientific tradition and context, as Kitchin argues. Otherwise, Big Data will potentially repeat past mistakes within the scientific field. The danger of misinterpreting Big Data is that it can lead to alarming misuse or reductionist approaches, which, in turn, can lead to fatal misconceptions about the investigated fields of research. Kitchin does acknowledge the great opportunities of Big Data within business and marketing, but this is also where the abuse of Big Data is most prevalent.
Instead of arguing for a new empiricism, Kitchin presents the concept of a “data-driven science” which “seeks to hold the tenets of scientific method, but is more open to using a hybrid combination of abductive, inductive and deductive approaches to advance the understanding of a phenomenon” (Kitchin 2014, 5). The essential difference between a new empiricism and a data-driven science is that in a data-driven science, the data decide which theoretical approach should be selected from among a variety of approaches. Insights are mainly “born from data” rather than “born from the theory” (Kitchin 2014, 6), and not vice versa.
Yet a data-driven science does not discard theory altogether; rather, theory is still used as guidance or as a strategy in the setup of data production and analysis. Data is always collected within a framework with certain assumptions based on existing theories. A data-driven science thus takes a pragmatic approach to the choice of existing theory and methods: the best approach to data is the one that gives the “most likely and valid way forward”. Science has to “tackle” the huge amount of data in such a way that it reveals information within the area of interest, or information that can be subjected to further investigation (Kitchin 2014, 6), rather than revealing totally hidden truths and relationships within data.
In continuation of this, Kitchin defines the basic logical process in Big Data analytics as that of abduction, as the concept is defined by C. S. Peirce. I will return to this concept later and just highlight here that Kitchin believes that in Big Data, abduction is epistemologically the most justified modus operandi alongside induction and deduction because it “makes reasonable and logical sense, but is not definitive in its claim” (Kitchin 2014, 6). Again, Kitchin focuses on what makes sense, almost what is “common sense”, in a proper scientific access to Big Data. Traditional theory is revitalized in a reconfigured version where sense-making is the guiding principle in the choice of theory and approach to data. Keywords are also “interconnection”, “interrelation”, “interlinking”, “integration” and “interdisciplinary”. A data-driven science will, in this respect, be able to discover facts that a knowledge-driven science would not be able to find. A knowledge-driven science has too narrow a perspective, with its sharp separation of the different scientific faculties. As such, a data-driven science represents an epistemological change because it is what one could call “commonsensical” and “interdisciplinary”.
Limitations to a Data-Driven Science
However, there are also limitations to a data-driven science, especially when it is applied to different scientific faculties, e.g. the humanities and the social sciences. It is, of course, possible to introduce Big Data to the humanities and the social sciences, in particular to the fields of digital humanities and computational social sciences. The difficulty arises with Big Data’s positivist and quantitative approach to the different research areas. Obviously, Big Data is mainly applicable where there is an observable phenomenon that can be measured, such as counts, distance, cost, time, etc. (Kitchin 2014, 7). Digital humanities, for example, is based on so-called ‘reading machines’; yet, what are the prospects of these reading machines? The humanities have traditionally not focused on large quantities of data and quantitative significance, but rather on concepts like “close analysis”, “source criticism” and “conceptual analysis”. In the traditional humanities, “closeness” is more important than the “big picture”, though the vast and broad analysis and perspective cannot be discarded entirely, for example in the historical interpretation of a broad phenomenon. The values and qualities of a singular human manifestation are only rarely fully measurable and cannot be obtained adequately by a computational device. As Kitchin says, “It is one thing to identify patterns; it is another to explain them” (Kitchin 2014, 8).
It seems that the conception of Big Data divides opinion, but, according to Kitchin, there is a middle ground with a passable epistemological conception of Big Data that fits a pragmatic use of Big Data analytics in science and technology. Kitchin argues that the self-conscious handling of the different advantages and shortcomings of Big Data analytics will eventually lead to a production of knowledge in which we can place trust. We may know that there are some epistemological problems in the application of Big Data; yet, if we approach the great possibilities within Big Data with common sense, we might still gain great insights and enhance our knowledge within a certain field, provided it is used situationally and reflexively.
However, Rob Kitchin is not convinced that Big Data also represents a paradigm shift in a Kuhnian sense. Big Data analytics can move in two directions: one goes towards a new empiricism, the other towards a data-driven science. Yet there are already some flaws in the theory of a new empiricism, as it might be too far-fetched to actually claim that Big Data is free of theory. In this case, data-driven science is, according to Kitchin, likely to become the most successful understanding of Big Data and will seriously test knowledge-driven science.
At this point, I have described one theory of Big Data, as Kitchin has uncovered the epistemology of Big Data in his article. In the remaining part of my essay, I will turn to a clarification of some of the main philosophical problems that arise from Kitchin’s article, namely the logic of scientific discovery and the question of the paradigm shift. I will not only conduct a conceptual analysis of the main concepts, viz. deduction, induction, and abduction, but also touch upon new ones like retroduction and counterfactuals.
The Logic of Big Data
Indirectly, the discussion of Big Data and its epistemology finds its predecessor in a long-standing philosophical dilemma that Karl R. Popper clarified in his work ‘Conjectures and Refutations’. In this investigation, he tries to describe how observation and reason both have an important role in knowledge production; furthermore, he argues that neither of them can stand alone as a source of knowledge (Popper 1985, 4). Theory and data are both inseparable parts of the scientific process. Popper, of course, speaks from the perspective of critical rationalism. His stance is that science develops as an interaction between hypotheses and observations, where hypotheses have to be tested before they can be called scientifically valid. It is not relevant how the hypotheses were originally created; the fundamental requirement of the scientific process is that the hypotheses are tested, which is exactly what makes science a rational activity (Okasha 2016, 74). Basically, Popper is interested in deciding the difference between science and pseudo-science, and it is in this context that he develops his criterion of demarcation: the falsifiability of scientific theories (Corvi 1997, 27).
According to Kitchin, a number of researchers argue that Big Data tends to be passive observation, not based on theory. Conducting a survey is, in many instances, a sort of passive collection of data within a certain field. Yet, as Kitchin also emphasises in his article, Big Data is rarely based only on observations that can speak for themselves; indeed, the whole methodological setup is often there beforehand. In that sense, Big Data does not violate the basic epistemological assumptions about hypotheses and observations as defined by Popper. Probably, Big Data is not a new empiricism; it just highlights some of the fundamental epistemological problems that exist in all scientific discovery. Furthermore, Big Data is also supposedly falsifiable in the sense that it is based on a trial-and-error method that leads to corroboration or refutation of certain scientific hypotheses.
The Problem of Induction
However, this does not settle the epistemological problem of Big Data. There is still a very foundational problem to be analysed, namely the problem of induction and its related logical concepts. According to Kitchin, Big Data bases itself on a blend of “aspects of abduction, induction, and deduction” (Kitchin 2014, 10). Here, I will turn to C. S. Peirce’s definitions of the concepts and introduce a new one: retroduction.
Concepts of Logic
1. Deduction is an important concept in logic, philosophy, and mathematics. The concept is also important in all data processing, e.g. in Big Data analytics. In deductive reasoning, we start with hypotheses of abstract rules—normally formulated in mathematical terms—and deduce from these rules the understanding of the particular phenomenon (C. T. Rodrigues 2011, 128). This also means that pure deductive inference does not increase our knowledge of the surrounding world; rather, deductive reasoning only says something about the validity of the reasoning itself. This strictly logical approach is, in many ways, scientifically problematic because knowledge production does not leave the realm of mathematical reasoning. Experience is secondary in relation to theory and subordinated to abstract reasoning, and there is no testing of hypotheses as we find in empirical scientific discovery. The general theory defines the truth of the particular experience as a kind of top-down inference (C. T. Rodrigues 2011, 130).
2. Kitchin emphasizes induction in his article, yet this concept has been the subject of continual philosophical debate since David Hume drew attention to the problem of induction in his work ‘An Enquiry Concerning Human Understanding’ (1748), where he argues that “even after we have experience of the operation of cause and effect, our conclusions from that experience are not founded on reasoning, or any process of the understanding” (Hume 2011, 598). Since Hume’s discovery, many philosophers have tried to come up with a solution. According to Peirce, induction is a kind of inversion of deduction, or “induction is a form of reduction of the multiplicity to the unity, allowing for an assertion about facts, very likely to be true” (C. T. Rodrigues 2011, 131). In other words, the inductive method seeks to establish a certain rule or relation from observing a sequence of occurrences or phenomena. Inductive reasoning pursues how to formulate a general theory from the particular observation and is, in that sense, a bottom-up inference. Kitchin touches upon it when he says Big Data “seeks to incorporate a mode of induction into the research design” (Kitchin 2014, 6). The problem of induction is not restricted to Big Data; rather, it touches all science based on statistics. As a way forward, Peirce claims that we can divide induction into several types. The first type is what we could call ‘crude’ induction. This is the familiar kind: when we experience something several times, we infer that future experience will be like the past (C. T. Rodrigues 2011, 134). Two other variants of induction Peirce calls abduction and retroduction:
3. According to Peirce, abduction is a kind of induction, but it has an improved ability to intensify knowledge because it is a kind of “reasoning to formulate hypotheses” (C. T. Rodrigues 2011, 132). Abduction resembles the way we investigate and interpret new scientific discoveries in order to create new and general hypotheses based on already existing knowledge. As he says, “[a]bduction is the process of forming explanatory hypotheses. It is the only logical operation which introduces any new idea” (Peirce on Abduction, Stanford Encyclopedia, web). Abduction is the logical process of inventing explanatory hypotheses from contextual or already existing knowledge. In this sense, abduction is a top-down-top inference, as Peirce argues: “we may look through the known facts and scrutinize them carefully to see how far they agree with the hypothesis and how far they call for modification of it” (C. T. Rodrigues 2011, 135). This way of reasoning might be the one that comes closest to the scientific logic in traditional science and in Big Data analytics, as Kitchin says: “[Abduction] seeks a conclusion that makes reasonable and logical sense, but is not definitive in its claim. For example, there is no attempt to deduce what is the best way to generate data, but rather to identify an approach that makes logical sense given what is already known about such data production” (Kitchin 2014, 6). Yet this order of inference does not fully describe the logic of scientific discovery in Big Data analytics. In fact, it is partly the other way around, as Kitchin also describes in his article, without going into detail about this inference: “the epistemological strategy adopted within data-driven science is to use guided knowledge discovery techniques to identify potential questions (hypotheses) worthy of further examination and testing” (Kitchin 2014, 6).
He also states – although critically – that a data-driven science “seeks to generate hypotheses and insights ‘born from data’ rather than ‘born from theory’” (Kitchin 2014, 5-6). Kitchin also indicates a down-top-down inference in Big Data, which is not fully captured in the concept of abduction. Yet Peirce also has an explanation for this logical process, and he calls it retroduction.
4. Retroduction is a kind of logical inference that starts with observation; then, on the grounds of observation, a hypothesis is formulated that forms a starting point for further investigation and observation. “Retro” has to do with “backward” reasoning from experimental facts. The first step in scientific discovery—as in Big Data analytics—is often the surprising or unanticipated observation, which then leads to the creation of hypotheses, which in turn are tested and explained, as C. T. Rodrigues paraphrases Peirce’s argumentation: “The reasoning is called retroduction exactly because the framing of the hypothesis begins with the observation of a striking fact. Its logical form is the following: The surprising fact, C, is observed; But if A were true, C would be a matter of course. Hence, there is reason to suspect that A is true” (C. T. Rodrigues 2011, 136). Retroductive reasoning is, as far as I can see, the mode of logical inference that comes closest to the logic in Big Data analytics, more so than abduction, which may be closer to traditional ‘normal’ scientific discovery. Retroduction also gives data a much more prevalent significance in scientific discovery without discarding hypotheses entirely. This is also supported by the intuitive understanding of the prospects of data-driven science in Big Data: hypotheses are subsequent to data collection, but still necessary in knowledge production.
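The down-top-down movement of retroduction, from data to hypothesis and back to testing, can be sketched schematically (a toy illustration of my own, not drawn from Peirce or Kitchin):

```python
# A toy sketch of retroduction: the data come first, a striking
# regularity suggests a hypothesis, and the hypothesis is then
# tested against observations not used to frame it.

def retroduce(observed, further):
    # Step 1: the striking fact C is observed -- successive
    # differences in the observed series are all identical.
    diffs = {b - a for a, b in zip(observed, observed[1:])}
    if len(diffs) != 1:
        return None  # no single rule suggests itself
    step = diffs.pop()
    # Step 2: if A ("each value grows by `step`") were true,
    # C would be a matter of course; hence we suspect A.
    # Step 3: test A against further observations.
    corroborated = all(b - a == step for a, b in zip(further, further[1:]))
    return step, corroborated

print(retroduce([3, 5, 7, 9], [11, 13, 15]))  # hypothesis corroborated
print(retroduce([3, 5, 7, 9], [11, 14, 15]))  # hypothesis refuted
```

The point of the sketch is only the ordering: the rule is not deduced in advance and not merely generalized from repetition; it is framed backwards from a surprising regularity in the data and then subjected to testing.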
Last Word to Popper
What would Popper say to this thesis that the logic of scientific discovery in Big Data analytics is a kind of retroduction, mixed with abduction, induction, and deduction? In continuation of Popper, one could always discuss the problem of what comes first in scientific discovery—hypothesis or data? Indeed, the problem is as unsolvable as the story about ‘Which comes first, the hen (H) or the egg (D)?’ (Popper 1985, 47). The observations in Big Data will also be presupposed by an adopted frame of reference, and as such a frame of expectations and therefore ultimately a frame of theory. Thus Big Data is, as Kitchin states, seldom born out of nothing; in other words, even Big Data analytics seems to live up to the scientific standards of critical rationalism.
There will always, even when Big Data is used for scientific purposes, be a complex reciprocal exchange of hypotheses and data. Furthermore, in his work ‘Conjectures and Refutations’, Popper argues that the logic of scientific discovery cannot be understood as a kind of induction at all, as he says “Induction, i.e. inference based on many observations, is a myth” (Popper 1985, 53). Scientific discovery does not need the justification of induction in order to formulate hypotheses. Rather, science invents daring hypotheses, which are tested against observations. The logic of scientific discovery operates with falsifiability and corroboration. Hypotheses are corroborated by surviving continual scientific testing, as he says: “the theory is successful with its unexpected prediction–then we say that it is corroborated by the experiment” (Popper 1985, 112). However, we must concede that Popper’s description of the mutual exchange of data and hypotheses with corroboration and refutation does resemble the logical process of abduction described by Kitchin.
Yet Popper’s rejection of the role of induction in the logic of scientific discovery does not describe the epistemology of Big Data analytics when it comes to the understanding of ‘counterfactuals’ in Big Data. Indeed, it is difficult to decide the falsifiability of counterfactuals, though it may be possible (Kapsner 2017, 522). Big Data is especially known for also being able to simulate or create models of possible events and worlds (Brodersen 2015, web). Counterfactuals are an important aspect of Big Data that Kitchin only implicitly describes. Popper does not really describe counterfactuals in his Conjectures either. This is a shortcoming I will touch upon in this essay, since precisely “counterfactuals” seem to be of huge importance in Big Data. Within the fields of econometrics and statistics, a research article describes how counterfactuals are used in Google analytics:
“Our method generalises the widely used difference-in-differences approach to the time-series setting by explicitly modelling the counterfactual of a time series observed both before and after the intervention. […] It improves on existing methods in two respects: it provides a fully Bayesian time-series estimate for the effect; and it uses model averaging to construct the most appropriate synthetic control for modelling the counterfactual. […] Partly because of recent interest in big data, many firms have begun to understand that a competitive advantage can be had by systematically using impact measures to inform strategic decision making” (Brodersen 2015, 247).
This text passage suggests the significance of the concept of “counterfactuals” in Big Data analytics, and it can be interpreted as a philosophical concept connected to Big Data as well (Pietsch 2016, 149). First of all, we see the description of the intermingling of data and hypotheses. The point is that “modelling the counterfactual of a time series” represents an inductive sequence. As such, counterfactuals leave us with serious philosophical problems and send us back to the problem of induction. However, precisely ‘counterfactuals’ can also help us decide the lawlikeness of inductive reasoning, as pointed out by Nelson Goodman. In his philosophical work ‘Fact, Fiction and Forecast’ (1954), Goodman states that the difference between lawlike induction and merely accidental induction is that lawlike statements inferred from induction are confirmed by instances of these statements, and that they are supported by counterfactuals (Cohnitz 2014, 41). Let’s look at Goodman’s argument:
We have two statements: (1) All butter melts at 65 °C. (2) All the coins in my pocket are silver. Why could one say that the first statement (1) is based on a lawlike induction while the second statement (2) is only a (true) accidental induction? The first statement is lawlike because there are actual examples or “data” that justify the statement, and furthermore, even though there remain many cases to be measured, we still predict these examples to conform with the statement. The statement is supported by counterfactuals (Goodman 1983, 20; Cohnitz 2014, 41). The second statement might be accepted as a true description of the circumstances, yet there can be no counterfactuals based on it. This leads Goodman to his point that “the principle we use to decide counterfactual cases is a principle we are willing to commit ourselves to in deciding unrealized cases that are still subject to direct observation.” Of course, the problem of induction and counterfactuals is not solved with this example (Goodman 1983, 27; Cohnitz 2014, 45; Chateaubriand 2011, 386), yet Goodman’s tentative criterion of lawlike induction proposes a rationally adequate way to exclude problematic statements. In continuation of this, my point is that Goodman’s analysis of lawlike induction and counterfactuals can be applied to Big Data as well, as it reminds one of the logic of retroduction.
Although Goodman analysed and compared two simple statements, the argument behind them can be applied to Big Data and the problem of induction too. Counterfactuals as possible worlds are not exactly the same as counterfactual conditionals as statements; yet, according to the volume ‘Logic, Rationality, and Interaction’, David Lewis “considers counterfactuals to be about possible worlds that bear a particular relation to the actual world: They are worlds in which the antecedent of the counterfactual conditional is true and which, moreover, are maximally similar to the actual world” (Baltag 2017, 499). In continuation of this, the amount of data behind a statement or theory does not change Goodman’s main point: if we can find examples that actually suggest the truth of a theory, and if we at the same time are able to find support from counterfactuals, then it is a presumably lawlike theory. Kitchin mentions this problem in Big Data in his article: “Patterns found within a data set are not inherently meaningful. Correlations between variables within a data set can be random in nature and have no or little causal association, and interpreting them as such can produce serious ecological fallacies” (Kitchin 2014, 5). Here, I interpret the concept of “ecological fallacies” as a kind of lack of support from “true” counterfactuals. On the other hand, if the patterns and correlations found are meaningful rather than random in nature, if they are supported by counterfactuals, and if they do not produce ecological fallacies, then we can argue that the theory based on these data is presumably lawlike, yet not necessarily true.
Goodman’s thesis itself is not something new in the philosophy of science, as Cohnitz says: “A satisfying account of induction (or corroboration [Popper’s term]) as well as a satisfying account of explanation and prediction needs such a divide” where science can tell the difference between the lawlike and the accidental (Cohnitz 2014, 42). A prevalent feature of Big Data analytics is exactly that, through simulation and modelling, it creates counterfactuals and predictions, maybe even forecasts, from data much better than traditional science. This is only a tentative analysis of counterfactuals and Big Data, yet it underlines the importance of counterfactual thinking in Big Data analytics, which might suggest the contours of a new paradigm.
Is Big Data a paradigm shift in a Kuhnian sense, as asked at the beginning of this essay? I will not describe Kuhn’s theory in detail, but only use it in my understanding of the epistemology of Big Data. According to Kuhn, a paradigm shift happens when established ideas in science are overturned by a thoroughly new set of ideas. The revolution in science occurs when certain anomalies start to be discovered that eventually become a rupture in ‘normal’ science. As a consequence, a paradigm shift is both a change in theoretical assumptions and a change in the set of exemplary scientific problems and approaches, as well as a change in the social context of science (Okasha 2016, 75). This revolution in scientific ideas occurs only very infrequently, and within the paradigm Kuhn speaks of a ‘normal’ science that describes the day-to-day performance of scientific research that does not violate the laws of the paradigm (Okasha 2016, 75). There can be no doubt that Big Data presents ‘normal’ science with a considerable number of anomalies. There are unexplained facts and observations gained through Big Data. Big Data is also characterized by its uniqueness, comprehensiveness, completeness, and extensiveness. Big Data represents new ways of measurement. But there are also great dangers in the application of Big Data analytics. In short, Big Data does create an anomaly in science that has to be dealt with, as Kitchin concludes: “there is an urgent need for wider critical reflection” (Kitchin 2014, 10). The social context of science is also changing rapidly, as Kitchin says: “Big Data is a disruptive innovation, presenting the possibility of a new approach to science” (Kitchin 2014, 10).
On the other hand, Kitchin is cautious about pronouncing Big Data a paradigm shift in theoretical assumptions and in the set of prototypes of particular scientific problems. He shows – at least superficially – that Big Data basically still uses a ‘normal’ logic of science. Scientific discovery in Big Data is more or less in keeping with the epistemological insights from Popper’s philosophy of science and Peirce’s ideas about abduction, as I have clarified in this essay. As such, Kitchin does not deviate significantly from critical rationalism in his understanding of Big Data.
Still, Big Data does in certain cases reverse or change the relationship between theory and experiment, as Kitchin has also identified in his research. It might be an oversimplification to put data-driven science on the same methodological level as theory and experiment (Pietsch 2016, 138); yet there is a change, mainly in the sequence of theory and experiment: data-driven research replaces knowledge-driven research. Kitchin focuses on abduction instead of induction, and he describes how this concept can explain the many approaches to and interpretations of Big Data as a process of testing and inventing new hypotheses. Without dismissing Kitchin’s explanation, I have also pointed out that not only Peirce’s understanding of abduction can be applied; retroduction is no less applicable to Big Data. In fact, this concept might also be a step towards a new understanding of the logic of science, because knowledge then potentially arises from data rather than from hypotheses, although hypotheses remain important in the process of scientific discovery. Retroduction also describes the logic of scientific discovery in data-driven science.
Kitchin also emphasizes a conscious theory-choice in Big Data analytics. In doing so, he moves science further away from pure logical reasoning towards a conception where one could say, in Aristotelian terms, that the user of Big Data must rely on a kind of phronesis – practical wisdom or common sense – in the handling of data and approaches. As Kuhn says in his Postscript: “There is no neutral algorithm for theory-choice, no systematic decision procedure which, properly applied, must lead each individual in the group to the same decision” (Kuhn 1970, 200). Basically, we cannot create an algorithm that tells us which theory to choose; we have to do that as human beings.
Yet, as I have clarified, counterfactuals are also a very important aspect of the prospects of Big Data. Through simulation and modelling we encounter (visualized) possibilities which might even constitute a new exemplary and prototypical way of doing ‘experimentation’. Through Big Data simulations, science has a much better starting point for the choice of theory and mode of operation. Big Data has the ability to create counterfactuals, and these might actually be the real signs of a new paradigm, because they enable scientists to make much safer and more accurate decisions from counterfactual projections.
Big Data does create a new approach to science, and it does “revolutionize” the access to data in certain ways. However, it is too early to predict how Big Data will progress epistemologically and scientifically, and too early to determine whether it is a paradigm shift. The use of Big Data is still mostly founded on theoretical assumptions that are quite similar to the criteria formulated by Popper and ‘normal’ science. Also, as Kitchin points out, induction is a very problematic philosophical concept in Big Data, as it is in ‘normal’ science. As a consequence, I have, in continuation of Rob Kitchin’s insights, suggested a change in ‘normal’ science, for example through the application of the concepts of ‘abduction’ and ‘retroduction’ as defined by Peirce. These might suggest another scientific logic in Big Data. In addition, I have touched upon ‘counterfactuals’ and ‘lawlike induction’ as described by Goodman. Especially, ‘counterfactuals’ in Big Data may point towards a new paradigm in a Kuhnian sense, because simulation and modelling give a much better opportunity for accurate decisions in theory-choice and modes of operation.
 The problem with “similarity” is that logically everything is similar to everything else in certain respects (Frické 2015, 653). Similarity claims are therefore scientifically quite uninformative, and the same goes for correlations. Correlation does not imply causation; inferring causation from correlation is a logical fallacy, since the fact that A and B occur together does not necessarily mean that one is a cause of the other.
 It might be necessary to mention that Popper actually defined his scientific testing method as hypothetico-deductive, as a way of bypassing induction in science. In short, Popper characterizes the logic of scientific discovery as ‘deductive’ because it uses deductive inference in the formulation of scientific statements from observation and trustworthy hypotheses (Popper 1985, 51).
 As he says of the so-called ‘problem of counterfactual conditionals’: “I have never been able, in spite of strenuous efforts, to understand this problem” (Popper 1985, 278).
We must bear in mind, though, that Kuhn in his early writings did not believe that two paradigms could exist simultaneously, because of their inherent incommensurability (Okasha 2016, 80).
Anderson, Chris. (2008). “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. Wired. Link: https://www.wired.com/2008/06/pb-theory/ Accessed 04.01.2018.
Baltag, Alexandru (Ed.). (2017). Logic, Rationality, and Interaction. 6th International Workshop, LORI 2017 Sapporo, Japan. Springer-Verlag GmbH Germany. eBook.
Brodersen, Kay. (2015). Causal attribution in an era of big time-series data. Link: http://www.unofficialgoogledatascience.com/2015/09/causal-attribution-in-era-of-big-time.html?m=1 Accessed 04.01.2018.
Brodersen, Kay H. and Fabian Gallusser, Jim Koehler, Nicolas Remy, Steven L. Scott. (2015). “Inferring Causal Impact Using Bayesian Structural Time-Series Models.” In: The Annals of Applied Statistics. Vol. 9, No. 1, Pp. 247–274.
Chateaubriand, Oswaldo. (2011). “Goodman and Parry on Counterfactuals.” In: Principia 15(3). Pp. 383–397. (NEL—Epistemology and Logic Research Group, Federal University of Santa Catarina (UFSC), Brazil).
Cohnitz, Daniel & Marcus Rossberg. (2014). Nelson Goodman. Routledge. London.
Corvi, Roberta. (1997). An Introduction to the Thought of Karl Popper. Routledge. London.
Frické, Martin. (2015). “Big Data and Its Epistemology”. In: Journal of the Association for Information Science and Technology. Vol. 66, No. 4. Pp. 651-661.
Goodman, Nelson. (1983 (1954)). Fact, Fiction and Forecast. 4th Edition. Harvard University Press. Cambridge, Massachusetts.
Hume, David. (2011). Hume – The Essential Philosophical Works. Wordsworth Classics of World Literature. Hertfordshire.
Kapsner, Andreas and Hitoshi Omori. (2017). “Counterfactuals in Nelson Logic”. In:
Baltag, Alexandru (Ed.). Logic, Rationality, and Interaction. 6th International Workshop, LORI 2017 Sapporo, Japan, Springer-Verlag GmbH Germany. eBook. Pp. 497-511.
Kitchin, Rob. (2014). “Big Data, new epistemologies and paradigm shifts”. In: Big Data & Society. April–June 2014. Pp. 1-12.
Kitchin, Rob. “Rob Kitchin” Web: Maynooth University Website: https://www.maynoothuniversity.ie/people/rob-kitchin#3 Accessed 07.02.2018
Kuhn, Thomas. (1970). The Structure of Scientific Revolutions. Second Edition, Enlarged. The University of Chicago. Chicago.
Okasha, Samir. (2016). Philosophy of Science – A Very Short Introduction. Oxford University Press. Oxford.
Pietsch, Wolfgang. (2016). “The Causal Nature of Modeling with Big Data”. In: Philos Technol. 29. Pp. 137-171.
Popper, Karl R. (1985 (1963)). Conjectures and Refutations – The Growth of Scientific Knowledge. Routledge and Kegan Paul. London.
Rodrigues, Cassiano Terra. (2011). “The Method of Scientific Discovery in Peirce’s Philosophy: Deduction, Induction, and Abduction.” In: Logica Universalis. Springer Basel. Pp. 127-164.
Sholl, Jonathan. (2017). Big Data, Technoscience, New Paradigms? Unpublished Lecture 31. 2017. Aarhus University.
Stanford Encyclopedia of Philosophy. (2014). “The Problem of Induction”. (Wed Nov 15, 2006); Fri Mar 14, 2014. Link: https://plato.stanford.edu/entries/induction-problem/ Accessed: 04.01.2018.
Stanford Encyclopedia of Philosophy. “Peirce on Abduction“. Link: https://stanford.library.sydney.edu.au/entries/abduction/peirce.html Accessed: 04.01.2018.
Stanford Encyclopedia of Philosophy. (2014). “Counterfactual Theories of Causation” (Wed Jan 10, 2001); Mon Feb 10, 2014. Link: https://plato.stanford.edu/entries/causation-counterfactual/ Accessed 05.01.2018.