What Is Data Science? by Mike Loukides: Translation and Close Reading, Part 04
Making data tell its story
让数据说出它自己的故事
A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte’s Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that’s not really what concerns us here. Visualization is crucial to each stage of the data scientist. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you’ve gotten some hints at what the data might be saying, you can follow it up with more detailed analysis.
一图是否胜千言尚无定论,但一张图肯定胜过一千个数字。大多数数据分析算法的问题在于,它们产生的只是一组数字。要理解这些数字的含义,理解它们真正讲述的故事,你需要生成一张图表。Edward Tufte的《Visual Display of Quantitative Information》是数据可视化的经典,也是所有数据科学从业者的基础读物。但这并不是我们在这里真正关心的。可视化对数据科学家工作的每一个阶段都至关重要。根据Martin Wattenberg(@wattenberg,Flowing Media的创始人)所说,可视化是数据调节的关键:如果你想知道你的数据到底有多糟糕,试着把它画出来。可视化也常常是分析的第一步。Hilary Mason说,当她拿到一个新的数据集时,她会先画十几张甚至更多的散点图,试着感受哪些地方可能有意思。一旦你对数据可能在说什么有了一些线索,就可以用更细致的分析继续跟进。
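Note: Any data, whether text, images, audio, or video, is stored in a computer in binary form, that is, as numbers. The "features" extracted by AI models are likewise numeric matrices that a computer can process.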
There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Casey Reas’ and Ben Fry’s Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM’s Many Eyes, many of the visualizations are full-fledged interactive applications.
有很多用来绘制和展示数据的软件包。GnuPlot非常有效;R自带一个相当全面的图形包;Casey Reas和Ben Fry的Processing代表了当前的最高水平,尤其是当你需要制作展示事物如何随时间变化的动画时。在IBM的Many Eyes上,许多可视化作品都是功能完备的交互式应用。
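Note: Processing here is the programming language and environment created by Casey Reas and Ben Fry. In 2007 MIT Press published their book on it, Processing: A Programming Handbook for Visual Designers and Artists.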
Nathan Yau’s FlowingData blog is a great place to look for creative visualizations. One of my favorites is this animation of the growth of Walmart over time. And this is one place where “art” comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn’t just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That’s not a question we could even have asked a few years ago. There was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It’s the kind of question we now ask routinely.
要寻找有创意的可视化,Nathan Yau的FlowingData博客是一个好地方。我最喜欢的一个是展示沃尔玛(Walmart)随时间扩张的动画。这也正是“艺术”介入的地方:不只是可视化本身的美感,还在于你如何理解它。它看起来像癌症在体内的扩散吗?还是像流感病毒在人群中的传播?让数据讲出它的故事不只是呈现结果;它还包括建立联系,然后回到其它数据源去验证这些联系。一家成功的零售连锁店是否像传染病一样扩散?如果是,这是否能让我们对经济如何运作有新的认识?这是一个我们在几年前甚至无法提出的问题。那时没有足够的算力,数据都被锁在专有的数据源中,处理数据的工具也不足。而现在,这类问题我们已经习以为常地在问了。
Data scientists
数据科学家
Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:
数据科学需要从传统计算机科学到数学再到艺术的各种技能。Jeff Hammerbacher在描述他在Facebook组建的数据科学团队(可能是面向消费者的网站中第一个数据科学团队)时说到:
... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization. [3]
任何一天,一个小组成员都可能用Python编写一个多阶段处理流水线,设计一个假设检验,用R对数据样本做回归分析,在Hadoop中为某个数据密集型产品或服务设计并实现一个算法,或者把我们的分析结果传达给组织的其他成员。[3]
3. “Information Platforms as Dataspaces,” by Jeff Hammerbacher (in Beautiful Data)
脚注3,“信息平台作为一个数据空间”,Jeff Hammerbacher在《Beautiful Data》书中写到。
Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.
你要去哪里找这样多才多艺的人?根据LinkedIn首席科学家DJ Patil(@dpatil)所说,最好的数据科学家往往是“硬科学家”,尤其是物理学家,而不是计算机科学专业出身的人。物理学家有很强的数学背景和计算技能,并且来自一个生存取决于能否从数据中获得最多信息的学科。他们必须思考全局,思考大问题。当你刚刚花了大笔经费生成数据时,即使数据不如你希望的那样干净,你也不能直接把它扔掉。你必须让数据讲出它的故事。当数据讲述的故事并不是你以为的那个故事时,你需要一些创造力。
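Note: Fei-Fei Li earned her bachelor's degree in physics at Princeton University and later a PhD in electrical engineering. At a time when most of the AI field focused on algorithms and neglected data, she took a different path and started from the data. This passage will likely remind many readers of her career.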
Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn’s membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members’ profiles and made recommendations accordingly. Asking things like, did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn’s data scientists started looking at events that members attended. Then at books members had in their libraries. The result was a valuable data product that analyzed a huge database —but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.
科学家同样知道如何把大问题拆分成小问题。Patil描述了在LinkedIn创建群组推荐功能的过程。把它变成一个流程繁重的开发项目是很容易的,那样会耗费上千小时的开发者时间,再加上上千小时的计算时间,用来在LinkedIn的全体会员中做海量的关联分析。但实际过程非常不一样:它从一个相对小而简单的程序开始,这个程序查看会员的个人资料,并据此做出推荐。它会问这样的问题:你上过康奈尔大学吗?那么你可能会想加入康奈尔校友群组。然后它逐步扩展。除了查看个人资料,LinkedIn的数据科学家开始查看会员参加过的活动,然后是会员藏书中的图书。结果是一个分析了庞大数据库的有价值的数据产品,但它从来不是按这样的设想开始的。它从小处起步,并以迭代的方式增加价值。这是一个敏捷,灵活的过程,逐步朝着目标构建,而不是一次性去应付堆积如山的数据。
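The incremental approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LinkedIn's implementation: the `Profile` class, the rule tables, the group names, and the `recommend_groups` function are all hypothetical. The sketch starts with a single signal (schools on a profile) and lets further signals (events, books) be switched on later without redesigning anything.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical member profile holding the three signals named in the text."""
    name: str
    schools: set = field(default_factory=set)
    events: set = field(default_factory=set)
    books: set = field(default_factory=set)


# Hypothetical rule tables: each maps one profile signal to a group name.
SCHOOL_GROUPS = {"Cornell": "Cornell Alumni"}
EVENT_GROUPS = {"Data Meetup": "Data Meetup Attendees"}
BOOK_GROUPS = {"Beautiful Data": "Beautiful Data Readers"}


def recommend_groups(profile, use_events=False, use_books=False):
    """Return group recommendations, starting from schools only.

    Extra signals are opt-in, mirroring how the feature grew step by step.
    """
    sources = [(profile.schools, SCHOOL_GROUPS)]
    if use_events:
        sources.append((profile.events, EVENT_GROUPS))
    if use_books:
        sources.append((profile.books, BOOK_GROUPS))

    groups = []
    for signals, table in sources:
        for signal in sorted(signals):  # sorted for a stable output order
            group = table.get(signal)
            if group is not None and group not in groups:
                groups.append(group)
    return groups


member = Profile(
    name="Ada",
    schools={"Cornell"},
    events={"Data Meetup"},
    books={"Beautiful Data"},
)
first_version = recommend_groups(member)
later_version = recommend_groups(member, use_events=True, use_books=True)
```

The first version already produces a useful recommendation from profile data alone; each later signal is a small addition to the list of sources rather than a new project.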

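Figure: Cassandra jobs and companies
It is not easy to get a handle on jobs in data science. However, data from O’Reilly Research shows a steady year-over-year increase in Hadoop and Cassandra job listings, which are good proxies for the “data science” market as a whole. This graph shows the increase in Cassandra jobs, and in the companies listing Cassandra positions, over time.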
This is the heart of what Patil calls “data jiujitsu”—using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable—see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.
这就是Patil所说的“data jiujitsu”(数据柔术)的核心:用较小的辅助问题来解决一个看起来难以处理的大型难题。CDDB是data jiujitsu的一个很好的例子:通过直接分析音频流来识别音乐是一个非常困难的问题(虽然并非无法解决,例如可以看看midomi)。但是CDDB的工作人员创造性地使用数据,解决了一个好处理得多的问题,并得到了同样的结果。基于音轨长度计算一个签名,然后在数据库中查找该签名,简单得不值一提。
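A minimal sketch of the idea, assuming a made-up signature scheme: the track lengths (in whole seconds) are joined into a string and hashed, and the hash is used as a dictionary key. This is not the actual CDDB disc ID algorithm; the function names and the sample albums are hypothetical and serve only to show why the lookup is so much simpler than audio analysis.

```python
import hashlib


def disc_signature(track_lengths_seconds):
    """Hash the ordered track lengths into a short hex signature.

    Assumption: lengths are given in seconds and truncated to integers.
    """
    canonical = ",".join(str(int(n)) for n in track_lengths_seconds)
    return hashlib.sha1(canonical.encode("ascii")).hexdigest()[:16]


def build_database(albums):
    """Map each album's signature to its title."""
    return {disc_signature(lengths): title for title, lengths in albums.items()}


def identify(track_lengths_seconds, database):
    """Look up a disc by signature; returns None when it is unknown."""
    return database.get(disc_signature(track_lengths_seconds))


# Hypothetical sample data: two albums that differ by one second on one track.
database = build_database({
    "Album A": [215, 187, 242],
    "Album B": [215, 187, 243],
})
match = identify([215, 187, 242], database)
unknown = identify([100, 200, 300], database)
```

No audio is ever inspected: identification reduces to one hash computation and one dictionary lookup, at the cost of failing on discs whose track lengths are not already in the database.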
Entrepreneurship is another piece of the puzzle. Patil’s first flippant answer to “what kind of person are you looking for when you hire a data scientist?” was “someone you would start a company with.” That’s an important insight: we’re entering the era of products that are built on data. We don’t yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they’re all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they’re entrepreneurs.
创业精神是这个拼图的另一块。对于“当你雇佣一个数据科学家时,你在寻找什么样的人”,Patil最初半开玩笑的回答是“一个你愿意与之一起创办公司的人”。这是一个重要的洞见:我们正在进入一个产品建立在数据之上的时代。我们还不知道那些产品是什么,但我们确实知道,赢家将是找到那些产品的人和公司。Hilary Mason得出了同样的结论。她在bit.ly担任科学家,工作实际上就是研究bit.ly正在生成的数据,并找出如何从中构建有趣的产品。新生的数据产业中没有人试图打造2012 Nissan Stanza或Office 2015;他们都在试图发现新的产品。除了是物理学家,数学家,程序员和艺术家,他们还是创业者。
Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”
数据科学家把创业精神与耐心,逐步构建数据产品的意愿,探索的能力以及对解决方案反复迭代的能力结合在一起。他们天生就是跨学科的。他们可以应付一个问题的所有方面,从最初的数据收集和数据调节到得出结论。他们能跳出固有框架思考,提出审视问题的新方法,或者处理定义非常宽泛的问题:“这里有很多数据,你能用它做出什么?”
The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it’s mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian’s quote that nobody remembers says it all:
未来属于那些弄清楚如何成功收集和使用数据的公司。谷歌,亚马逊,脸书和LinkedIn都已经开发利用了它们的数据流,并使之成为它们成功的核心。它们是先驱者,但像bit.ly这样的新公司正在沿着它们的道路前行。无论是挖掘你的个人生理信息,从数百万旅行者的共享经验中构建地图,还是研究人们传递给其他人的URL,下一代成功的企业都将围绕数据建立。Hal Varian那段话中没有人记得的部分说明了一切:
The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.
获取数据的能力,也就是能够理解它,处理它,从中提取价值,将其可视化并把它传达出去的能力,在未来几十年将是一项极其重要的技能。
Data is indeed the new Intel Inside.
数据确实是新的“Intel Inside”。