I have been talking with folks in various fields of research, and one thing that has caught my interest is the different understandings people have of the term "big data."
I recently chatted with a friend about his job interview experience, and the confusion was terrible. My friend, an economist, was being asked about his experience with big data by a manager with an engineering background. The manager's understanding of big data was the one most IT folks have, whereas my friend's understanding of the term was the one many economists have. Needless to say, neither understood what the other was trying to say, and the interview was clouded with confusion.
When talking with several economists, I found their understanding consistent: big data refers to high-frequency data, or real-time data that is collected continuously. For example, financial data on a specific company's stock-market activity might be collected every second, or household electricity usage might be recorded every 15 minutes.
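To make the high-frequency reading concrete, here is a minimal sketch in Python (pandas and NumPy assumed; the per-second prices are simulated, not real market data) in which the challenge is the sampling rate rather than the overall size of the data:

```python
import numpy as np
import pandas as pd

# Simulated tick data: one price observation per second for one trading hour.
timestamps = pd.date_range("2013-01-02 09:30", periods=3600, freq="s")
prices = pd.Series(100 + 0.01 * np.random.randn(3600).cumsum(),
                   index=timestamps, name="price")

# The analytical challenge is frequency, not raw volume: aggregate
# second-level observations into one-minute open/high/low/close bars.
bars = prices.resample("1min").ohlc()
print(bars.head())
```

An hour of such data is tiny by IT standards, which is exactly why the two groups talk past each other.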
When talking with several IT people, I found their understanding equally consistent: big data means data sets so large that they cannot be manipulated with standard methods or tools. This doesn't necessarily refer to high-frequency data.
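As a rough illustration of this reading, the sketch below (the file name transactions.csv and the amount column are hypothetical) computes a simple statistic over a file too large to load into memory at once by streaming it in chunks; at even larger scales, this is where distributed tools take over:

```python
import pandas as pd

total = 0.0
count = 0

# Stream the file a million rows at a time instead of loading it whole,
# which is the point at which "standard" one-shot tools start to fail.
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    count += len(chunk)

print("mean transaction amount:", total / count)
```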
Even researchers Jonathan Stuart Ward and Adam Barker at the University of St. Andrews in Scotland found just how confusing the use of the term has become (http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-define-it/). They list these definitions:
1. Gartner. In 2001, a META Group (now Gartner) report noted the increasing size of data, the increasing rate at which it is produced, and the increasing range of formats and representations employed. This report predated the term "big data" but proposed a three-fold definition encompassing the "three Vs": Volume, Velocity, and Variety. This idea has since become popular and sometimes includes a fourth V: Veracity, to cover questions of trust and uncertainty.
2. Oracle. Big data is the derivation of value from traditional relational database-driven business decision making, augmented with new sources of unstructured data.
3. Intel. Big data opportunities emerge in organizations generating a median of 300 terabytes of data a week. The most common forms of data analyzed in this way are business transactions stored in relational databases, followed by documents, e-mail, sensor data, blogs, and social media.
4. Microsoft. “Big data is the term increasingly used to describe the process of applying serious computing power—the latest in machine learning and artificial intelligence—to seriously massive and often highly complex sets of information.”
5. The Method for an Integrated Knowledge Environment open-source project. The MIKE project argues that big data is not a function of the size of a data set but its complexity. Consequently, it is the high degree of permutations and interactions within a data set that defines big data.
6. The National Institute of Standards and Technology. NIST argues that big data is data which “exceed(s) the capacity or capability of current or conventional methods and systems.” In other words, the notion of “big” is relative to the current standard of computation.
In the field of education, the term big data is conflated with learning analytics (LA), the collection of large amounts of learner-produced data used to predict future learning, and with educational data mining (EDM), which focuses on developing and improving methods for extracting meaning from large data sets on learning in educational settings. EDM originated from tutoring paradigms and systems, whereas LA originated from learning management systems.
When talking with two sociologists, I found their understanding more in line with the IT folks': big data concerns huge data sets that require substantial computing power to extract any meaning. One sociologist described her use of a huge data set of tweets gathered from Twitter within a short time span on a single day.
Several points are clear:

1. There is too much confusion over the definition of the term big data.
2. Each field of study will have to come up with its own term, because "big data" causes too much confusion across the board.
3. Each field needs to resolve the confusion that exists within it over the proper terms to use.
4. Academic journal boards and editors must require authors to use the most appropriate term in their publications; for example, "high-frequency data" instead of "big data."