Big Data is Big News, a Big Deal, and Big Business, but what is it, really? What does Big Data even mean? To those in the thick of it, Big Data is obvious and I’m stupid for even asking the question. But those in the thick of Big Data find most people stupid, don’t you? So just for a moment I’ll speak to those readers who are, like me, not in the thick of Big Data. What does it mean? That’s what I am going to explore this week in what I am guessing will be three long columns.
My PBS series Triumph of the Nerds was the story of the personal computer and its rise to prominence from 1975-95. Nerds 2.0.1: A Brief History of the Internet was the story of the Internet and its rise to prominence from 1966-98. But each show was really about the effect of Moore’s Law on technology and society. Personal computers became possible only when the cost of microprocessors dropped to where they could be purchased by individuals. It could not have happened before reaching this tipping point, where the market was ready for explosive growth.
The commercial Internet in turn became possible only when the price of servers dropped another two orders of magnitude by the mid-1990s, making dial-up Internet economically feasible and creating another tipping point. Thinking in similar terms, Big Data is what happened when the cost of computing dropped yet another two orders of magnitude by 2005 making possible the most recent tipping point. We pretend that this happened earlier, in 1998, but it didn’t (that’s part of the story). 2005 marked the emergence of mobile computing, cloud computing, and the era of Big Data. And just as we did in my two documentary series, we as a people stand again on the cusp of a new era with virtually no understanding of how we got here or what it really means.
Personal computing changed America and the Internet changed the world, but Big Data is about to transform both again. Big Data will drive our technological development for the next hundred years.
Wherever you are in the world, computers are watching you and recording data about your activities, primarily noting what you watch, read, look at, or buy. If you hit the street in almost any city, surveillance video can be added to that: where are you, what are you doing, who or what is nearby? Your communications are monitored to some extent and occasionally even recorded. Anything you do on the Internet — from comments to tweets to simple browsing — never goes away. Some of this has to do with national security, but most of this technology exists simply to get you and me to buy more stuff — to be more efficient consumers. The technology that makes all this gathering and analysis possible was invented mainly by Silicon Valley technology startups.
How did we get here and where are we going? You see, technology is about to careen off on new and amazing paths, but this time, rather than inventing the future and making it all happen, the geeks will be more or less along for the ride: new advances like self-driving cars, universal language translation, and even computers designing other computers are coming not from the minds of men and women but from the machines themselves. And we can blame it all on Big Data.
Big Data is the accumulation and analysis of information to extract meaning.
Data is information about the state of something — the who, what, why, where, when, and how of a spy’s location, the spread of disease, or the changing popularity of a boy band. Data can be gathered, stored and analyzed in order to understand what is really happening, whether it is social media driving the Arab Spring, DNA sequencing helping to prevent disease, or who is winning an election.
Though data is all around us, in the past we didn’t use it much, primarily because of the high cost of storage and analysis. As hunter-gatherers for the first 190,000 years of Homo sapiens’ existence we didn’t collect data at all, having no place to keep it or even methods for recording it. Writing came about 8,000 years ago primarily as a method of storing data, as our culture organized, wanted to write down its stories, and came to need lists concerning population, taxes, and mortality.
Lists tend to be binary — you are on or off, dead or alive, tax-paying or tax-avoiding. Lists are about counting, not calculating. Lists can contain meaning, but not that much. What drove us from counting to calculating was a need to understand some higher power.
Thousands of years ago the societal cost of recording and analyzing data was so high that only religion could justify it. In an attempt to explain a mystical world our ancestors began to look to the heavens, noticing the movement of stars and planets and — for the first time — writing down that information.
Religion, which had already led to writing, then led to astronomy and astronomy led to mathematics all in search of mystic meaning in celestial movement. Calendars weren’t made up, for example: they were derived from data.
Data has been used through history for tax and census rolls and general accounting like the Domesday Book of 1086 — essentially a master tax record of Britain. There’s the operative term: count. Most of the data gathered through history was gathered by counting. If there was a lot of data to be considered (more than the few observations required in a scientific experiment) it nearly always had to do with money or some other manifestation of power (how many soldiers, how many taxpayers, how many male babies under the age of two in Bethlehem?). Every time you count, the result is a number and numbers are easy to store by writing them down.
Once we started accumulating knowledge and writing it down it was in our nature to seek ways to hide these data from others. This led to codes and cyphers and to statistical techniques for breaking them. A 9th century scientist named Abu Yusuf al-Kindi wrote A Manuscript on Deciphering Cryptographic Messages, marking the beginning of statistics — the finding of meaning in data — and of cryptanalysis, or code-breaking.
In his book al-Kindi promoted a technique now called frequency analysis to help break codes where the key was unknown. Most codes were substitution cyphers in which every letter was replaced with another. If you knew which letter was which it was easy to decode. Al-Kindi’s idea was that if you knew how frequently each letter was used in typical communication, that frequency would be carried over unchanged into the coded message.
In English the most frequent letters are e, t, a, and o in that order. So given a large enough message to decode, whatever letter appears most frequently ought to be an e, and so on. If you find a q it is nearly always followed by u, etc. That is, unless the target language isn’t English at all.
What’s key in any frequency substitution problem is knowing the relative frequency and that means counting the letters in thousands of common documents, whatever the language — data gathering and analysis circa 900 AD.
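To make the technique concrete, here is a minimal sketch of frequency analysis in Python. The letter ordering and the function are modern illustrative choices, not anything from al-Kindi’s manuscript:

```python
from collections import Counter
import string

# Approximate order of letters by frequency in typical English text.
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

def guess_key(ciphertext: str) -> dict[str, str]:
    """Guess a substitution cipher's key by pairing each ciphertext
    letter, ranked by how often it appears, with the English letter
    of the same rank -- al-Kindi's frequency analysis."""
    counts = Counter(c for c in ciphertext.lower()
                     if c in string.ascii_lowercase)
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

# Given enough ciphertext, the most common cipher letter maps to 'e',
# the next to 't', and so on. Short messages defeat the method.
```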
But it was another 800 years after al-Kindi before collected data was shown to be of much public use. In London the Bills of Mortality were published weekly starting in 1603 to give a day-by-day accounting of all recorded deaths in London (Bring out your dead!). These weekly reports were later published in an annual volume and that’s where it gets interesting. Though the point of the Bills was simply to create a public record, meaning was eventually found by analyzing those pages after the Plague of 1664-65. Experts were able to plot that plague as it spread across London from infection points mapped on the city’s primitive water and sewer systems. It became clear from those data both the sources of the infection (rats and their fleas) and how to stay away from them (be rich, not poor). And so the study of public health was born.
What made Bills of Mortality so useful was not just the data of who died — the sheer numbers — but the metadata (data about the data) saying where the victims lived, where they died, their age, and type of work. The story of the Plague of 1664 could be read by mapping the metadata.
While the Bills of Mortality were reputed to be a complete account of the plague, it’s doubtful that was the case. Many deaths were probably missed or attributed to the wrong causes. But one lesson of statistics is that it didn’t matter, as long as there was enough data for trends to be clear. In fact, as the field of statistics grew in 18th century France it became clear that nearly as much could be learned from a random sample of data as from gathering all the information. We see this today when political pollsters predict elections based on small samples of random voters. The occasional failure of these pollsters to correctly predict election outcomes shows, too, that sampling is far from perfect.
Sampling and polling generates results that we “believe” to be true, but a 100 percent sample like a census or an election generates results we can “know.”
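A quick simulation shows the difference in practice. The numbers below (a 52 percent true level of support, a 1,000-voter sample) are invented for illustration:

```python
import random

random.seed(1)  # reproducible illustration

# A census "knows": ask all one million voters.
population = [random.random() < 0.52 for _ in range(1_000_000)]
census = sum(population) / len(population)

# A poll "believes": ask a random thousand of them.
sample = random.sample(population, 1_000)
estimate = sum(sample) / len(sample)

# A 1,000-voter sample usually lands within about 3 points of the
# census figure -- which is why polls mostly work, and sometimes miss.
print(f"census: {census:.3f}  poll: {estimate:.3f}")
```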
Data processing. Storing data isn’t the same as processing it. Libraries do a pretty good job of storing data but the data isn’t very accessible. You still have to find the book, open it, and read it, and even then the level of detail we can retain is limited by our memories.
American statistician Herman Hollerith in the late 19th century envisioned a system that would automatically gather data and record it as holes punched in paper cards — cards that could then be mechanically sorted to obtain meaning from the data, a process that Hollerith called tabulating. Hollerith received a patent on the technology and his Washington, DC-based Tabulating Machine Company went on to become today’s International Business Machines (IBM).
For decades the primary machine function at IBM was sorting. Imagine each of those cards was a customer account at the electric company. It would be easy to use a machine to sort them in alphabetical order by last name, to sort them by billing date, to sort them by the amount of money owed, to sort them by those past-due and those not, etc. Early data processing meant sorting, and punched cards were good for that. Of course people are fairly good at that, too. The reason for using a machine was primarily to save time so all the bills could go out before the end of the month.
The first databases, then, were stacks of those punched cards. If you were the electric company it was easy, too, to decide what should be on the card. Name and address, electricity usage in the current billing period, the date on which the bill is to be sent out, and your current payment status: are you paying your bills?
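Think of each card as one fixed-width record. Here is a hedged sketch of the idea in Python; the 80-column field layout is hypothetical, in the spirit of the electric-company card just described:

```python
# Hypothetical 80-column card layout for the electric company example.
FIELDS = [("name", 20), ("address", 26), ("kwh", 6),
          ("balance", 8), ("bill_date", 8), ("status", 6)]  # 74 of 80 cols

def punch(values: dict[str, str]) -> str:
    """Render one customer record as an 80-column 'card'."""
    card = "".join(values[name].ljust(width)[:width]
                   for name, width in FIELDS)
    return card.ljust(80)

def read(card: str) -> dict[str, str]:
    """Card sorters worked by column position; so does this parser."""
    record, start = {}, 0
    for name, width in FIELDS:
        record[name] = card[start:start + width].strip()
        start += width
    return record

cards = [punch({"name": "SMITH, JOHN", "address": "123 ELM ST",
                "kwh": "417", "balance": "8.73",
                "bill_date": "19550601", "status": "PAID"})]
# "Sorting" the stack by billing date -- the primary machine function
# of early data processing:
cards.sort(key=lambda card: read(card)["bill_date"])
```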
But what if you wanted to add a new product or service? That would require adding a new data field to every card including those that predated the new product or service. Such changes are the things that confounded mechanical card sorters. So in the spirit of flexibility a new kind of database was born in the 1950s that changed the world of business and travel.
Transaction processing. American Airlines’ SABRE reservation system was the world’s first real-time automated business system. Not just the first reservation system but the first-ever computer system to interact with operators in real time, where business was actually conducted in the machine — a prelude to Big Data. This was back when we still tracked incoming Russian bombers by hand.
Up until SABRE, data processing had always been reactive. Accounting systems looked back at the quarter or month before and figured out how to represent what had already happened, taking as much time as needed to do so. But SABRE actually sold airline seats for future flights against an inventory of seats that existed solely in the computer.
Think of SABRE as a shoebox containing all the tickets for all the seats on AA flight 99. Selling tickets from the shoebox would prevent selling the same seat twice, but what if you wanted to sell the seats at the same time through agents in offices all over the country? That required a computer system and terminals, neither of which yet existed. It took American Airlines president C. R. Smith sitting on a flight next to IBM’s T. J. Watson Jr. to get that ball rolling.
One key point about the SABRE story is that IBM didn’t have a computer the system could run on; it was that demanding. So American became the launch customer for the biggest computers made to that time. Rather than being programmed for the task, those computers in the world’s first corporate data center in Tulsa, Oklahoma (it’s still there) were hard-wired for selling airline seats and nothing else. Software came later.
American Airlines and SABRE got IBM into the mainframe business and those first systems were designed as much by American as IBM.
SABRE set the trend for data-driven computing applications from the 1950s until the 1980s. Bank tellers eventually got computer terminals, for example, but just like airline reservation agents their terminals only understood one thing — banking — and your status at the bank was typically represented on a single 80-column punched card.
Moore’s Law. As computers were applied to processing data their speed made it possible to delve deeper into those data, discovering more meaning. The high cost of computing at first limited its use to high-value applications like selling airline seats. But the advent of solid state computers in the 1960s began a steady increase in computing power and decrease in computing cost that continues to this day — Moore’s Law. So what cost American Airlines $10 to calculate in 1955 was down to a dime by 1965, to a tenth of a penny by 1975, and to one billionth of a cent today.
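Those numbers follow a simple curve: computing cost falling roughly two orders of magnitude per decade. Here is that arithmetic as a toy Python calculation; the rate is the column’s rule of thumb, not a law of nature:

```python
def cost_per_transaction(year: int, base_cost: float = 10.0,
                         base_year: int = 1955) -> float:
    """Dollars per SABRE-style calculation, assuming computing gets
    100x cheaper every decade (the trend described above)."""
    decades = (year - base_year) / 10
    return base_cost / (100 ** decades)

for year in (1955, 1965, 1975, 2016):
    print(year, cost_per_transaction(year))
# 1955: $10, 1965: $0.10, 1975: $0.001,
# 2016: ~6e-12 dollars -- on the order of a billionth of a cent.
```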
The processing power of the entire SABRE system in 1955 was less than your mobile phone today.
This effect of Moore’s Law — and, most importantly, the ability to reliably predict where computing cost and capability would be a decade or more in advance — made it possible to apply computing power to cheaper and cheaper activities. This is what turned data processing into Big Data.
But for that to happen we had to get away from needing to build a new computer every time we wanted a new database. Computer hardware had to be replaced with computer software. And that software had to be more open to modification as the data needs of government and industry changed. The solution, called a relational database management system, was conceived by IBM but introduced to the world by a Silicon Valley startup company called Oracle Systems run by Larry Ellison.
Ellison started Oracle in 1977 with $1200. He is now (depending on the day you read this) the third richest man in the world and reputedly an archetype for the movie character Iron Man.
Before Oracle, data was in tables — rows and columns — held in computer memory if there was enough, or written to and read from magnetic tape if memory was lacking, as it usually was in the 1970s. Such flat file databases were fast but the connections that could be made within the data often couldn’t be changed. If a record needed to be deleted or a variable changed, it required changing everything, generating an entirely new database which was then written to tape.
With flat file databases change was bad and the discovery of meaning was elusive.
IBM’s Ted Codd, an expatriate mathematician from England working in San Jose, California, began to see beyond the flat file database in the late 1960s. In 1970 he published a paper describing a new relational database model where data could be added and removed and the key relationships within the data could be redefined on the fly. Where before Codd’s model a payroll system was a payroll system and an inventory system was an inventory system, the relational approach separated the data from the application that crunched away on it. Codd saw a common database that had both payroll and inventory attributes and could be modified as needed. And for the first time there was a query language — a formal way to ask questions of the data — and flexible ways to manipulate that data to produce answers.
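To make the separation of data from application concrete, here is a minimal sketch using Python’s built-in SQLite as a stand-in for any relational database. The tables and names are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id INTEGER, name TEXT, wage REAL)")
db.execute("CREATE TABLE widgets (id INTEGER, built_by INTEGER, cost REAL)")
db.execute("INSERT INTO employees VALUES (1, 'Ada', 42.0)")
db.execute("INSERT INTO widgets VALUES (10, 1, 3.50)")

# The payroll "application" is just a query against the shared data...
payroll = db.execute("SELECT name, wage FROM employees").fetchall()

# ...and so is inventory, relating the same data a different way.
# The relationship lives in the query, not wired into the system.
inventory = db.execute("""
    SELECT widgets.id, employees.name
    FROM widgets JOIN employees ON widgets.built_by = employees.id
""").fetchall()
```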
This relational model was a huge step forward in database design, but IBM was making plenty of money with its old technology so they didn’t immediately turn the new technology into a product, leaving that opportunity to Ellison and Oracle.
Oracle implemented nearly all of Codd’s ideas, then took the further step of making the software run on many types of computers and operating systems, further reducing the purchase barrier. Other relational database vendors followed, including IBM itself and Microsoft, but Oracle remains the biggest player today. And what they enabled was not just more flexible business applications, but whole new classes of applications including human resources, customer relationship management, and — most especially — something called business intelligence. Business intelligence is looking inside what you know to figure out what you know that’s useful. Business intelligence is one of the key applications of Big Data.
The Internet and the World Wide Web. Computers that ran relational databases like Oracle were, to this point, mainframes and minicomputers — so-called Big Iron — living on corporate data networks and never touched by consumers. That changed with the advent of the commercial Internet in 1987 and then the World Wide Web beginning in 1991. Though the typical Internet access point in those early years was a personal computer, that was a client. The server, where Internet data actually lived, was typically a much bigger computer easily capable of running Oracle or a similar relational database. These databases all relied on the same Structured Query Language (SQL) to ask questions of the data, so — practically from the start — web servers relied on databases.
Databases were largely stateless, which is to say that when you posed a query to the database, even a modified query was handled as a separate task. So you could ask, for example, “how many widgets did we sell last month” and get a good answer, but if you wanted a follow-up like “how many of those were blue?” your computer had to pose it as a whole new query: “how many blue widgets did we sell last month?”
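Here is that statelessness in miniature, again with SQLite standing in for a 1990s SQL server and an invented sales table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT, color TEXT, month TEXT)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [("widget", "blue", "May"), ("widget", "red", "May"),
                ("widget", "blue", "May"), ("gadget", "blue", "May")])

# "How many widgets did we sell last month?"
q1 = db.execute("SELECT COUNT(*) FROM sales "
                "WHERE item = 'widget' AND month = 'May'").fetchone()[0]

# "How many of those were blue?" The database remembers nothing of q1,
# so the follow-up must restate the whole question:
q2 = db.execute("SELECT COUNT(*) FROM sales WHERE item = 'widget' "
                "AND month = 'May' AND color = 'blue'").fetchone()[0]

print(q1, q2)  # 3 2
```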
You might wonder why this mattered. Who cared? Amazon.com founder Jeff Bezos cared, and his concern changed forever the world of commerce.
Amazon.com was built on the World Wide Web, which web inventor Tim Berners-Lee defined as being stateless. Why did Tim do that? Because Larry Tesler at Xerox PARC was opposed to modes. Larry’s license plate reads NO MODES: as an interface guy he was opposed to having different modes of operation where you’d press a control key, for example, and whatever came after that was treated differently by the computer. Modes created states and states were bad, so there were no states within a Xerox Alto computer. And since the machines and interface conventions Tim Berners-Lee used at CERN to build his grandly named World Wide Web descended largely from the Alto, there were no modes on the WWW, either. It was stateless.
But the stateless web created big problems for Amazon, where Jeff Bezos had dreams of disintermediating every brick-and-mortar store on earth — a process made very difficult if you had to continually start over from the beginning. If you were an early Amazon user, for example, you may remember that when you logged out all your session data was lost. The next time you logged into the system (if it recognized you, which it typically didn’t) you could probably access what you had previously purchased but not what you had previously looked at.
Amazon’s obsession with the customer experience, as shown in this sketch from the company’s original business plan, was an inseparable part of its unique business model.
Bezos, a former Wall Street IT guy familiar with all the business intelligence tools of the time, wanted a system where the next time you logged in the server would ask “are you still looking for long underwear?” It might even have sitting in your shopping cart the underwear you had considered the last time but decided not to buy. This simple expedient of keeping track of the recent past was the true beginning of Big Data.
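Reduced to its essence, the idea looks something like the sketch below. This is an illustration of session state in general, not Amazon’s actual design:

```python
# Server-side memory of each shopper's recent past, keyed to identity,
# so a new session resumes where the old one left off.
sessions: dict[str, dict] = {}

def log_in(user_id: str) -> dict:
    """Return the user's saved state instead of a blank slate."""
    return sessions.setdefault(user_id, {"cart": [], "viewed": []})

def browse(user_id: str, item: str) -> None:
    log_in(user_id)["viewed"].append(item)

browse("jeff", "long underwear")
log_in("jeff")["cart"].append("long underwear")
# ...log out, come back next week...
assert "long underwear" in log_in("jeff")["cart"]   # still waiting there
```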
Amazon built its e-commerce system on Oracle and spent $150 million developing the capability just described — a capability that seems like a no-brainer today but was previously impossible. Bezos and Amazon went from keeping track of what you’d bought to what you’d considered buying to what you’d looked at, to saving every keystroke and mouse click, which is what they do today, whether you are logged-in or not.
Understand we are talking about 1996, when an Internet startup cost $3-5 million venture capital dollars, tops, yet Amazon spent $150 million to create a Big Data buying experience where there never had been one before. How many standard deviations is that from the VC mean? Bezos, practically from the start, bet his entire company on Big Data.
It was a good bet, and that’s why Amazon.com is worth $347 billion today, with $59 billion of that belonging to Jeff Bezos.
This was the first miracle of Big Data, that Jeff Bezos was driven to create such a capability and that he and his team were able to do it on Oracle, a SQL RDBMS that had never been intended to perform such tasks.
(Want more? Here is Part Two.)
Wow, Bob. What a fantastic and insightful article.
An interesting article Bob, but you’ve slightly misunderstood the idea of state. I don’t want to get into a pedantic argument but I think you mean that the communication protocols and UI don’t maintain state. In fact a database is the ultimate outcome of state, that is it is the maintenance and storage of information. It’s just the query language SQL is a stateless protocol … Err … sort of if you ignore intermediate result sets on queries and update type calls. Similarly the Internet does have state all over the place. It’s just certain ways of programming on the web can (ideally) be stateless. Interestingly, good old pseudo-conversational programming within IBM CICS was probably one of the first (largely) stateless ways of designing an application where state was not maintained between user interactions. This was done to reduce resource usage within the system to increase the number of active users the system could handle in much the same way modern web programmers try to reduce/eliminate session state …
And how can the PARC be stateless? None of the logic gates or registers maintain state?
I think Bob means something a little different here.
I think stateless in this reference means “all I/O is on the same plane.”
if you hit shift-whatever or control-whatever, you are shifting the plane of the entry into another zone. add alt-whatever, you now have 4 distinct planes the same 7 bits can go into. ctrl-alt-whatever is a fifth plane, and so on. 5th-dimensional chess. it’s hard enough to assemble a programming team to deal with one plane effectively 😉
so all data is in the set of “stuff,” a hypercard stack if you’re old enough to remember them.
if all data is “stuff,” then all data is manipulated the same way. you can dynamically change the way you look at “stuff 3E-99Q” and get different results, which can be displayed, arrayed, chopped and mixed. a pivot table is one way of fooling around with “stuff” and getting different insights.
take it a couple quantum leaps further through differing sets of rules, and you have Watson. or 30 views through SAP or Hadoop or whatever you bought for reports and projections.
spend twice as much as budgeted based on the grinning salesman, and it’s Big Data. now you have to use it to pay it off.
voila, modern business.
That’s generally what I meant, thanks. I’m very open to specific suggestions about how to change the wording. As you can imagine it is VERY hard to express these concepts in a manner that is both easily understood and not so simplistic that it’s boring to some readers. I’ve been tinkering with these columns for months as a result but decided I might never be done and so decided to just hit the publish button.
Great pitch, so do you have a backer for this series, or is it going to be crowdfunded?
This is neither a pitch nor a series, though I suppose it might make a good one. Maybe for some younger Bob Cringely than me…
‘I am not the Dread Pirate Roberts’ he said. ‘My name is Ryan; I inherited the ship from the previous Dread Pirate Roberts, just as you will inherit it from me. The man I inherited it from is not the real Dread Pirate Roberts either. His name was Cummerbund. The real Roberts has been retired 15 years and living like a king in Patagonia.’
*just substitute Cringley
I’m sorry, but I’ll have to disagree. BigData, from my point of view, is just a buzzword for “using extremely poor statistics to arrive at the wrong conclusion (the one the customer wants to hear and pays for, though) from noisy datasets that would better be deleted”.
It won’t save IBM (…even if they replaced Ginni with Watson!) nor other Big Data companies, such as Data Gravity (who are already firing data scientists), from their fate. Nor will it be able to solve the current economic conundrum, which is that consumers won’t consume because they simply can’t afford to (and there is nothing that Big Data can do about that).
Bob, in the following articles, are you planning to compile a list of Big Data companies (a bunch of large ones, and a few reference start ups as well) and look at the evolution of their revenue and profit over the last two years? There might be a few surprises there (most are not doing as well as people think).
Concur. It’ll drive richer, more aromatic bad conclusions ever more quickly. It’s the human way. Next up- the Forbin Project. 🙂
Agreed. Working with big data and predictive analytics for the last few years reminded me again about garbage in, garbage out. The ability to use data is based on the quality of the data and whether it is ultimately predictive enough. It also depends on whether you can really enact a change in behavior based on what the data tells you (such as whether you can price a product differently). No doubt big data finds general trends, but how actionable are they? Big data is one of the most overly generalized terms going, sadly. It is marketed like crazy, but again, with time spent I haven’t seen the payoff.
the same tools in the hands of a child, a master craftsman, or a demented murderer will yield different results.
having Big Data, statisticians, and SVPs hopping on one foot outside your office with their ties loose demanding results does not mean all is going to be useful.
it also helps if you don’t put garbage in, but there are still an almost infinite number of ways to get garbage out.
and badly managed charlie-fox companies will find them all, first.
Everyone has their own point of view and, as such, mine is as correct as yours, eh? I’m going to great lengths here to define the term Big Data as it is generally used. From your perspective that would appear to be how it is generally mis-used. I get that. But there is some value here, despite your cynicism. Will I name Big Data companies? That’s not a specific goal of mine. In order to make columns like these have legs you have to look mainly backwards (the theory being that if you got the history right it won’t change) and when you finally turn forwards it has to be with a view far enough into the future to be beyond 99 percent of your likely readers, which is to say at least five years in this case. I don’t think I can name a company today other than the obvious Googles and Amazons that I KNOW will be a Big Data stalwart five years from now.
State is when the same I’s and O’s produce different results. Dependence on previous conditions. Interactions between data, small reiterative effects, there are a lot of paradigms where it can occur.
Believe me, when I log into Amazon and get stuck in the checkout area and cannot shop… there is STATE. It’s also big data. Amazon uses it a lot. I know when I’ve been all over searching for induction cooking units, I’m going to get inundated with related offers. It’s starting to become an obtrusive souk merchant grabbing my sleeve every 3 minutes with different offers at lower prices. I’m waiting for the computer to start haranguing me about its 5 small children that need medicine and couldn’t I please buy an Amazon Prime membership in order to support them.
Consciousness is noticed by information deriving from multiple states simultaneously. I believe a corollary of the Church Thesis is that current computers can only be paper and pencil with a really really fast eraser. So, no consciousness, real limitations on the creativity of big data, garbage in garbage out, novel situations will flummox algorithms and there will always be novel situations. Yeah, big data good. But it has ineluctable limitations.
Basically, the only induction computers can do well……and I’m really happy about it…..is getting me better and cheaper induction cookware.
And I third the comment. Without good input data, this data is garbage. Documented by the fact that I leased a new Camry a month ago, and Cars.com still keeps showing up in my yahoo email header banner with offers for Camrys.
And a real question. If 5-6-7 people are living in a household, and possibly sharing a computer, how is Google going to know which query is whose?
In my house you have 3 Bernie-bots (the kids), 1 Ted Cruz (wife), and 1 Gary Johnson (me). How is Google going to reconcile that? My daughter is looking at Antique VW Bugs and I’m looking at Turbo Porsches. And my wife is yelling that we can’t afford either while reading Dave Ramsey. I’m sure it’s easy when 1 person = 1 computer, but unless you go to the phones, our searching habits on desktops and laptops aggregated would make a sane person pretty much mad.
Hence the constant invitations for you to log-in to your Google account in the upper right corner. If you weren’t logged in you wouldn’t see your gmail in the browser at gmail.com.
People still share a computer without separate logins? Separate logins means separate browser cookies, which means more targeted ads to avoid with script blocking.
This is going to be a great series. I predict that eventually, big data will lead to big anti-data. For reasons from personal privacy to disrupting competitors as a business strategy, anti-data will be generated and injected to destroy or negate existing data. The data wars begin!
Very good point. With the end of the Safe Harbor agreement and Privacy Shield probably going down a few months from now, it doesn’t bode well for the big data companies – especially startups.
In fact, poll and survey respondents have been giving misleading responses for a long time, for a variety of reasons, from a “none of the above” type of response to just picking one at random when not having that option. Loaded poll questions get that a lot, as do simply clueless ones created by naive survey creators who have not thought through all the possible responses.
Also, when I sign up for various discussion or vendor web sites, I often enter bogus demographic info to avoid data mining/spamming/scamming, if I cannot avoid the questionnaire completely.
Indeed, GIGO!
The big differentiator is that the classic scientific method goes out the door, because we are not sampling any more. So we act on correlation, not causality.
In the spirit of post-modernism we are only looking for an adequate truth, not an absolute truth. Only needs to be adequate to inform our next decision and can then be discarded and reconstituted.
So Larry Ellison’s father was Howard Hughes?
Yeah never heard of Larry Ellison being any sort of role model for the Iron Man character. Hughes yes, absolutely, and also Elon Musk actually.
This is only one side of a story, one that elides the vital role of the US government in creating the technologies that others in the private sector adopted. Take SABRE, the Semi-Automatic Business Research Environment that became the basis of the airline reservation system. The software is a port or transfer of the software developed for the Semi-Automatic Ground Environment (SAGE), the vast continental defense system that the US developed during the early Cold War. SAGE brought us such technologies as RAM (magnetic core memory) and the first big IP battle of the Information Age (MIT vs. IBM), the modem, the CRT screen, the touch wand and a host of others, including software. SAGE matched weapons against targets, SABRE matched passengers against flights. Leaving out the military’s role, let alone the federal government’s, allows people to believe that the government is of little value, rather than the source of the technological foundations of our present. Without the ballistic missile programs we probably would not have much of the technology that we take for granted. You know this, since Silicon Valley is a DoD construction that has forgotten its origins; hence Ash Carter’s attempts to remind the Valley of its origins and to reconnect with them. Whether that will work is another story, just as Big Data is as much a product of the NSA and the intel community as of big business.
Ditto on the amazing story BOB! @Michael and @Espen have a point, though. SAGE played a huge (and usually ignored) part in the evolution of data processing. From the SAGE Wikipedia page: “It was during the testing phase of the Reservisor that a high-ranking IBM salesman, Blair Smith, was flying on an American Airlines flight from Los Angeles back to IBM in New York in 1953. He found himself sitting next to American Airlines president C. R. Smith. Noting that they shared a family name, they began talking. Just prior to this chance meeting, IBM had been working with the United States Air Force on their Semi Automatic Ground Environment (SAGE) project. SAGE used a series of large computers to coordinate the message flow from radar sites to interceptors, dramatically reducing the time needed to direct an attack on an incoming bomber. The system used teleprinter machines located all around the world to feed information into the system, which then sent orders back out to teleprinters located at the fighter bases. It was one of the first online systems.” Serendipity meets destiny on an airline flight.
Having worked on SAGE, I can tell you we owe everything from pointing devices and screens to stored instruction computers and datalink technology to it. Even Doug Engelbart acknowledged the lineage in his Mother Of All Demos in 1968. It’s not much of a stretch to see the links from there to the Internet as we know it today and the Big Data milieu we now find ourselves in, or on the edge of.
Big Data is the automation of deriving insight from data, with insight domain specified by need. Take a look at AlphaGo if you want to see where this is headed. The intersections of Big Data, Reinforcement Learning and cloud computing point to the next revolution in everything.
Interesting – I’ll be very interested to see where you go with this. I’ll have to admit that I’m personally skeptical about your arguments at the start of the article, but you make some excellent points and I’m reviewing my stand on this as a result. What will be interesting in the long term (well beyond my lifetime) is the effect that this will have on human society. I suspect that, like most things, we will adapt and it will become just part of life. It’s the immediacy of big data that fascinates me.
Great topic and great starting point. I’d like to add three points of view from my experience as a data scientist with HP, Inc. First, insights related to big data are more about columns than rows. More meaning and insights result from meaningfully connecting data from different sources than adding another 1,000,000 rows from the same source. Second, big data are great for learning about all the interrogatives but one. Big data are great at detailing the who, what, where, when, which, and how of most things. Big data are terrible at understanding the why. So far small data, (e.g. well-designed strategic-level surveys both qualitative and quantitative) are the best source of the why. Finally, I’ve never met a data set yet that is free from errors. Data are seldom what they seem. Measurement errors, naming errors, conversion errors, data missing errors, abandoned columns, etc. etc. Too often I’ve seen very smart people presume a level of data validity that just isn’t there. Just because an SQL database exists, doesn’t mean it’s accurate or well maintained.
I will go further and say that databases, as typically designed, cannot be accurate from a business perspective. bigdata is a doomed attempt to reimpose integrity upon business processes after it has been destroyed by conventional database design.
Databases almost always (I cannot prove that it has never been done right) overload every stage of the lifecycle of a business object onto a single union type, represented by a table or set of tables. Here is what I mean. Domain: finance; business object: invoice. Invoices must be approved before they can be paid. Up to a point in relative time, an invoice does not have an approver’s name associated with it. After that point, it does. How is this modelled, even in accounting software sold for tens of millions of dollars? One invoice table (with one security profile) with a “status code”, and with an approver-id field typed as nullable. Do you see the problem?
the Big Data thing is taking the existing database, perhaps buying a tubload of survey and consumer preference data and taping that on the side, and pawing through it all for correlations and anti-correlations. from those, The Wise Men you hired for the purpose will whistle up action lists and attempt to divine the future in which you dominate all who walk beside you.
or something like that.
existing data, analyzed in parallel with running (or ruining) the business, added to outside sources provides new insights when different folks from the PHBs work it over.
as always, insanely great meets insane may not equal profit
Bob !!! Wow what an impressive article detailing the evolution of computing and how we as a society have arrived to this stage of computing. I always knew you were very very smart ( I have been reading your work for more than 20 years up here in Canada) but this is one of your most impressive articles yet. I cannot wait for Part 2 and Part 3. DO NOT EVER RETIRE AND NEVER STOP EDUCATING US. LOL
Will you get into open source big data? i.e. Hadoop
Wow, just wow! You are going to make a new series, right? I work in this area, but I’ve never seen the different threads of the history pulled together, along with all the various contexts.
Reading the comments above, I agree that
a) data integrity is vital, and often overlooked, but in any large scale endeavor a 1-2% error rate isn’t going to invalidate the other 98-99% correct data (but careful about how much you bet on these things, and don’t get carried away by “unusual” insights)
b) yes, the number of column variables/fields are important, likely more so than the size/number of rows. But as computer power grows with database technology, we can grow these more and more and develop interesting new analyses and methods.
The interesting bit (which I didn’t see in your article, maybe I missed this) — as we get larger and larger healthcare databases, we can enter the era where we can truly provide individualized medical treatments. However, above and beyond the obvious privacy concerns, healthcare is fragmented (it isn’t a “system” as much as it is a cottage industry) and physicians have been slow to embrace technology, which while understandable, will have a negative impact on implementing positive developments for patients.
No mention of the 1890 census being tabulated with punch cards?
One of your best, Bob. Thanks. You hit it out of the park again!
I have no quibble with Bob leaving out the government, Ingres, Ashton-Tate, and whoever else. This was a Will Durant-sized sweep of history and we commenters can fill in the blanks. I’ll also trust that parts 2 and 3 will draw a stronger connection from the technical capability being built up by Amazon and other companies trying to maintain state on the stateless Web to the implementation of that data.
Big Data can also mean the ability to extrapolate from the lack of data. With sufficient computing power (quantum?) a system will be able to determine where you probably are based on where you are definitely not. Privacy then becomes not only a situation of hiding where you are and the activities you do there, but placing false clues as to where you might be. Such a capacity will be beyond most citizens either technically or financially.
“a system will be able to determine where you probably are based on where you are definitely not.”
Which also means there is strong motivation to collect data on absolutely everyone in order to know who isn’t the target ( or spot “ghosts” ).
Great article, Bob!
It’s good to see you back in your wheelhouse!
There were many databases on the market before Oracle from companies such as Cullinet, Sybase and others. They just weren’t relational, but typically network databases.
Granted that Codd’s relational theory won out over other database organizational paradigms of the time, and Oracle rode that wave more successfully than their competitors.
But attributing the rise of databases to Oracle and Larry is a bit disingenuous at best.
Funny that now relational databases are under pressure from NoSQL databases, which incorporate some of the old network-oriented concepts of the past.
Plus ça change….
non-relational databases rising again… probably due to the cost of Oracle’s suites, and cost of their audits. there’s a reason he cuts his beard like that…
Jolly Roger that. Pirate Larry is the scourge of many a computer and board room.
A well written article that explains a complex topic, boring to most, in terms almost anyone would want to understand.
I’m one of “the jury is out” people when it comes to Big Data, mainly because of post hoc ergo propter hoc, i.e. “after this, therefore because of this”, or in terms my first stats professor used: “correlation is NOT causation”.
That aside this is a great article, thanks
Quote: “Anything you do on the Internet — from comments to tweets to simple browsing — never goes away.”
Bob: I hope that you will include UNSTRUCTURED data in this series — email texts, PDFs, images, text messages, voice mail messages, code listings, compiled binaries, etc, etc. There’s a lot that databases can do, but unstructured data storage is not one of them. Where does unstructured data fit into the BIG DATA picture?
all as “stuff” plane. the word cloud goes into a probability matrix, and the more you rant about “terrorism,” the more likely you might get on the no-fly list, for example. not that you’re boosting it, but you’re fixated on it. facial recognition off Instagram or Facebook of every Adele or Meat Loaf concert ever held, which you attended, might stick you as a stalker.
for Big Data has no friends or enemies, only relationships to whatever rules it’s tested against.
so the question of “who owns the rules, and why?” begs answers. you always question the guy at the top.
Cringely Mineserver Kickstarter project…. Pitched on this forum so commenting on this forum. Over 8 Months late and 35 days since last updated stating yet again it would be shipping within days. Can you please throw your backers a bone and give us some updates?! Frequent Bad news is better than these once a quarter rose colored updates that never turn out to be true. 🙁
We’ll do a Mineserver update today, thanks.
Jeff Bezos’s wife is Mackenzie but he never worked at McKinsey.
Jesus, you are correct! Where the heck did I come up with that one? It’s fixed now, thanks.
I want my minecraft server or my money back!
Coupled with readers’ comments (which I have to filter through to find stuff that is useful and doesn’t troll or detract), this three-parter could be a nice .99 to 2.99 reprint on Amazon. A great history of how we got here from there (enough that it is tremendously annoying my friends more steeped in Republican Jebus, who “reject” science that conflicts with their small worldview despite using communication devices, from tablets to smart phones to laptops, some still chained to desk computers, all made possible by mathematicians and scientists).
What about the Democratic global warming protesters who use the same devices that come from fossil fuels?
They’re not Protestant liberals and don’t care about signalling virtue to that lot, so self-righteous abstinence doesn’t matter to them. I have no qualms about using the system’s own tools against it, to the extent it is effective. Shear needs inside and outside force to operate.
If the above analysis is correct then shouldn’t Amazon Echo have already become the best voice product by far?
Great article, Bob – looking forward to the rest! Amazing to see the whole story so well and entertainingly told, though you run the risk of creating a “these X individuals changed the world” myth. But fun.
On that note: Was it really C. R. Smith and T. J. Watson who sat next to each other on a flight and hashed out SABRE? I did my doctorate on AA and IT, and all I heard was “one AA exec and a sales person from IBM”. (Copeland and McKenney’s article on SABRE (https://www.jstor.org/stable/249202) doesn’t mention this either. Neither does McKenney’s later book, which I helped research and write.) IBM had created the SAGE system for the US Air Force (the first system that used modems, incidentally), and that was the basis for the idea of a centralized, remotely connected CRS, as far as I know.
Espen
The official corporate history of American Airlines, Eagle, by Robert Serling, published in 1985 by St. Martin’s/Marek, recounts the story of the meeting on page 347, and indicates it was between C. R. Smith and IBM president Thomas J. Watson, who, as we know, was one great salesman.
He certainly sold that story!
I would think the real big data historic example is the following:
—
a. The invention of papyrus (4000 BCE, around when the stone age ended and the Nilotic peoples settled along the Nile to start civilization, agriculture and so on)
b. The export of papyrus to Greece via Byblos (the source of the Greek word for papyrus)
c. 300 years later, the start of Philosophy because of this (transferring knowledge to new generations and leveraging knowledge to produce new knowledge). The Greeks traveled heavily in Egypt, and the content of their philosophy was thus heavily influenced by the mix of primitive science, mysticism, religion and so on. (Pythagoras took the soul concept and created a whole cult around it.)
d. The Sciences arise out of Philosophy; the Library of Alexandria follows, based on the Greek examples, making it the greatest university on Earth.
e. Profound influence of this Greek philosophy on Roman religion, using “the book of papyrus”, which effectively stopped the acceleration of leveraging this knowledge, because producing new knowledge brought the chance of being killed; plus the Library of Alexandria was destroyed
f. When papyrus was no longer available (due to the Roman Empire ending), the invention of the book printing press, which accelerated the volume of this data
g. Again 300 years later, this is followed by an equally accelerated growth of science, a gazillion new inventions and discoveries, leveraging the increasing speed at which data can be shared, together with the most important vehicle: the telegraph and all of its child inventions, like the telephone and all the inventions that spun off the telephone (AT&T), which includes satellites, lasers and so on
h. 200 years later, the Internet brings this knowledge globally, portably and instantly
i. So the question is whether it will take 300 years (until 2300) before we really leverage the Internet effect, or whether the effects can be seen as soon as 2020. The question is also whether the counterforces that stopped the leveraging of knowledge will rise again to slow down this process.
—
So 300 years after the export of papyrus to Greece, Philosophy arose, leveraging the capability of storing information. So whether you store a gazillion papyrus documents in a library or store them digitally in a database, what differs is only the speed at which you can process the information (having it everywhere, fast, reliable). And whether you have ancient scientists leveraging existing data to produce new data, or a computer doing it, does not matter: the difference is only speed. And if you have a super AI doing the same instead of a gazillion humans, the difference is only speed.
—
So the invention of Philosophy was the first example of how big the impact of big data can be: it brought us everything, all sciences, all inventions, all discoveries, all companies, beta, gamma or alpha.
If you want more on why big data is useless, here is William Binney of the NSA testifying to the British Parliament on bulk data collection:
https://www.theguardian.com/world/2016/jan/06/snoopers-charter-will-cost-british-lives-mps-warned
“He said: “Sixteen months before the attacks on America [9/11], our organisation [the Sigint Automation Research Center – Sarc] was running a new method of finding terrorist networks that worked by focusing on ‘smart collection’. Their plan was rejected in favour of a much more expensive plan to collect all communications from everyone. “The US large-scale surveillance plan failed. It had to be abandoned in 2005. Checks afterwards showed that communications from the terrorists had been collected, but not looked at in time.”
Binney said his experience as the lead NSA analyst for “strategic warning” concerning the Soviet Union and then Russia, and later dealing with terrorism, showed that “to be effective and timely we had to avoid burying our analysts”.
He said: “Our approach was totally different to the historic bulk collect and then word/phrase dictionary select-type approach in general use even to this day. In particular, we developed and deployed surveillance tools applying minimisation at the point(s) of collection. This approach reduces the burden on analysts required to review extremely large quantities of irrelevant material with consequent improvement to operational effectiveness.”
Terrific article!
You might mention the hierarchical databases that sprang up between flat files and relational databases. Can you give references to the early material on public health, cryptography, etc. when you put the three parts together?
We are still waiting for a technology that “Cringely” developed…. and waiting and waiting and waiting…. and waiting… https://www.kickstarter.com/projects/583591444/mineservertm-a-99-home-minecraft-server/comments
I have followed you for, gulp, decades. Every so often you write a column that is like a Roger Federer backhand crosscourt winner. Cracking good one here. And the bonus is that the comments are civil for the most part, and the threads here are just as interesting and informative as the column. Done gushing. Looking forward to more.
I think big data is like all revolutionary things that come down the pike in computing, at least in my experience. Think of the spreadsheet. VisiCalc came along, I think for the first Apple. All it really was, was a grid/matrix (man, I hate that word) where you could put data or some sort of mathematical calculation. Yet that simple idea revolutionized small computing. Through the years I have seen this come up over and over, usually with me thinking: that is interesting, not exactly sure what I will do with it, but that is very interesting. I think big data is that way. Right now we see potential and a few interesting things, like Amazon picking, about half the time, something you just might be interested in buying. Mind you, I don’t need to buy a new electric toothbrush every week. What grows out of this will take some time, but I’m sure it will be another simple idea that, looking back in ten years, we will wonder just exactly how we lived without.
The article gives an excellent, highly abbreviated version of the history of the DEDUCTIVE uses of Big Data. Will the other installments discuss the INDUCTIVE uses? Such as the inference analytics performed by Deep Learning? Sort of a Revenge of the Neural Networks 😉
This has been a weakly written article on data processing, loosely tied to big data. I would recommend some changes, especially if you are going to develop it further into a series. Start with “why we need big data”. Talk about the use cases, the problems that it solves or solved. Your black plague and census are good examples. There are more. I think the prior comments about problems the military solved and the technology that enabled it would be useful. The SAGE discussion in the comments was MUCH MORE interesting and, I suspect, more of the truth than the flight with C. R. Smith and Tom Watson Jr. C. R. Smith brought the Reservisor. Bob Crandall (actually Max Hopper) brought in the travel agencies, and AAdvantage, and scaled SABRE up to manage close to 50% of the reservations in the United States. I moved SABRE from the 2nd floor of a building at the north end of the Tulsa airport into “the bunker” where SABRE lives and is managed by HPE (formerly EDS) today. The “bunker” was full when we moved down there and now SABRE is a small part of what is in that data center (1/16th?). A second article should hit upon the technology, but what you have provided here is a general data processing and transaction processing treatise that glosses over many of the great database pioneers of the 1980’s and 1990’s. I would not start with mainframes, but start with relational databases, with maybe a slight nod to ISAM technologies like PC File and DBase II, III, and IV. The Oracle part is very good, especially where Oracle basically wrote the software from Ted Codd’s research at IBM. I think you give Oracle too much credit and maybe it will be in the 2nd article, but more credit needs to be given to the data mining tools, starting with SAS and SPSS, and then moving into Crystal Reports, Cognos, and Business Objects. I stopped at about that point and moved hard-core into SQL (I didn’t need reporting tools to write a SQL query even with roll-ups and cubes), so I really don’t know what “big data” is all about now. Maybe I’ll learn more in the 2nd article.
I agree with some posters’ comments about data quality and I think it needs to be in a 3rd or 4th article, because I recently saw a home goods store try to go on the web. They bought the big data tools and the web store software, but can’t do squat, because they can’t even understand how to run a basic inventory system in their stores. They don’t even know what GTINs, UPCs, and SKUs they have in-store, so what makes them think they can do this online? And their stores reflect the sorry state of their business. Lots of markdown items, clearance bins everywhere. No way for a worker to even know where to put an item back in stock on a shelf. PROCESS is as important or more so than software. What problem are you trying to solve? Software is just a bunch of 1’s and 0’s that can’t do anything for you if you don’t know 1) what business you’re in, 2) what problem you’re trying to solve and 3) a good process for solving that problem.
Oh, and throw in a side article on how Oracle is trying to kill java. Send me a private email and I’ll send you the link. Or you can just google it.
Google, now THEY are doing big data right. And probably Amazon Web Services as well. But I expect to hear about them in the next article anyways.
What role, if any, did the Justice Department antitrust suit against IBM play in making the IBM RDB technology into an Oracle product? Unbundling of software from hardware was part of the settlement.
You are correct that metadata is data about data but your example isn’t accurate.
“…but the metadata (data about the data) saying where the victims lived, where they died, their age, and type of work…”
In this scenario you are explaining data fields/columns that make up each record. Metadata is the collection of definitions for each data field/column. In your example, one of the columns being recorded about mortality could be weight. The metadata would be a definition of that column – “a measurement of the deceased person’s mass in kilograms.” Good metadata ensures that the data is understandable regardless of your country of origin (or department of the enterprise).
I suspect Bob is using a less technical definition: Metadata is “data that provides information about other data”. Two types of metadata exist: structural metadata and descriptive metadata. Structural metadata is data about the containers of data. Descriptive metadata uses individual instances of application data or the data content. https://en.wikipedia.org/wiki/Metadata
Nice article Bob; and it looks like you’ve gone all James Burke on us. See:
https://www.amazon.com/Day-Universe-Changed-Companion-Television/dp/0316116955/ref=sr_1_2?ie=UTF8&qid=1468352679&sr=8-2&keywords=the+day+the+universe+change
“Once we started accumulating knowledge and writing it down it was **in our nature** to seek ways to hide these data from others.”
The correct statement would be: we benefit from hiding data from others (since we have built a system where one can benefit from doing so).
btw very good article!