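What Is Data Science? by Mike Loukides: Translation and Close Reading, Part 01
(According to Baidu Scholar, the report “What is data science?”, published by Mike Loukides through O’Reilly Media in 2010, has 94 citations. This post is a translation and close reading of that report.)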
What is data science? 数据科学是什么?
Mike Loukides 多伦多公共实验室的Mike Loukides
Table of Contents 目录
What is data science?
什么是数据科学?
The future belongs to the companies and people that turn data into products
未来属于能将数据转化为产品的公司和人
What is data science?
什么是数据科学?
Where data comes from
数据从哪里来
Working with data at scale
处理大规模数据
Note: besides “in proportion”, “at scale” can also mean “on a large scale”, which is the sense used here.
Making data tell its story
让数据说出它的故事
Data scientists
数据科学家
What is data science?
什么是数据科学?
The future belongs to the companies and people that turn data into products
未来属于能将数据转化为产品的公司和人
We’ve all heard it: according to Hal Varian, statistics is the next sexy job. Five
years ago, in What is Web 2.0, Tim O’Reilly said that “data is the next Intel
Inside.” But what does that statement mean? Why do we suddenly care about
statistics and about data?
翻译:我们都听说过这句话:Hal Varian 认为,统计将是下一个“性感”的职业。5年前(这篇报告作于2010年,这里的5年前是2005年),Tim O’Reilly(O’Reilly出版社创始人)在《What is Web 2.0》中说,“数据是下一个 Intel Inside”。但是这句话意味着什么?为什么我们突然间关心起统计学和数据?
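Analysis: “Intel Inside” is Intel’s branding slogan; an “Intel Inside” sticker on a computer means the machine contains an Intel processor. The phrase therefore casts data as the key component inside an application, so it should not be shortened to just “Intel”.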
In this post, I examine the many sides of data science -- the technologies, the
companies and the unique skill sets.
在这篇文章中,我考察了数据科学的多个方面:技术,公司,以及独特的技能组合。
What is data science?
什么是数据科学
The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application. There’s a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn’t
really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.
网络上到处都是“数据驱动的应用”。几乎任何电子商务应用都是数据驱动的应用:网页前端的背后有一个数据库,还有与若干其它数据库和数据服务(信用卡处理公司,银行,等等)通信的中间件。但是仅仅使用数据并不是我们所说的“数据科学”。一个数据应用从数据本身获得价值,并因此产生更多数据。它(指前面提到的数据应用)不只是一个带有数据的应用,而是一个数据产品。数据科学使数据产品的创造成为可能。
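Analysis: “a number of” means “several” or “some”; the quantity is unspecified and may be small or large. “Credit card processing” generally refers to handling credit card payments. In “But merely using data isn’t really what we mean by ‘data science’”, the what-clause as a whole is the predicative complement of “isn’t”, and within the clause “what” is the object of “mean”. “Acquire” is a transitive verb meaning to obtain or gain (through effort, ability, or conduct), or to purchase. “As a result” means “consequently”.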
One of the earlier data products on the Web was the CDDB database. The developers of CDDB realized that any CD had a unique signature, based on the exact length (in samples) of each track on the CD. Gracenote built a database of track lengths, and coupled it to a database of album metadata (track
titles, artists, album titles). If you’ve ever used iTunes to rip a CD, you’ve taken advantage of this database. Before it does anything else, iTunes reads the length of every track, sends it to CDDB, and gets back the track titles. If you have a CD that’s not in the database (including a CD you’ve made yourself), you can create an entry for an unknown album. While this sounds simple enough, it’s
revolutionary: CDDB views music as data, not as audio, and creates new value in doing so. Their business is fundamentally different from selling music, sharing music, or analyzing musical tastes (though these can also be “data products”). CDDB arises entirely from viewing a musical problem as a data problem.
网络上较早的数据产品之一是CDDB数据库。CDDB的开发者意识到,任何一张CD都有一个独一无二的签名,它基于CD上每一条音轨的精确长度(以采样数计)。Gracenote建立了一个音轨长度的数据库,并把它与一个专辑元数据(音轨标题,艺术家,专辑标题)数据库关联起来。如果你曾经使用iTunes来翻录CD,你就已经用到了这个数据库。iTunes在做任何其它事情之前,会先读取每条音轨的长度,把它发送给CDDB,再取回音轨标题。如果你有一张不在数据库中的CD(包括你自己制作的CD),你可以为这张未知专辑创建一个条目。这听起来很简单,却是革命性的:CDDB将音乐视为数据,而不是音频,并由此创造了新的价值。他们的业务与销售音乐,分享音乐或分析音乐品味(尽管这些也可以成为“数据产品”)有根本性的不同。CDDB完全源于把一个音乐问题看作数据问题。
Analysis: as a verb, “couple” means to connect or join together.
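The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea “track lengths as a key into album metadata”; it is not the actual CDDB or Gracenote disc-ID algorithm, and the names `album_db`, `signature`, `lookup`, and `submit`, as well as the sample counts, are made up for the example.

```python
from typing import Dict, Optional, Tuple

# A disc "signature" here is simply the tuple of track lengths in audio samples.
Signature = Tuple[int, ...]

# Hypothetical in-memory stand-in for the coupled databases:
# signature -> album metadata.
album_db: Dict[Signature, dict] = {}


def signature(track_lengths_in_samples) -> Signature:
    """Build the lookup key from the exact length of every track."""
    return tuple(int(n) for n in track_lengths_in_samples)


def lookup(track_lengths_in_samples) -> Optional[dict]:
    """Return album metadata for a disc, or None if the disc is unknown."""
    return album_db.get(signature(track_lengths_in_samples))


def submit(track_lengths_in_samples, album, artist, titles) -> None:
    """Create an entry for a disc that is not in the database yet."""
    album_db[signature(track_lengths_in_samples)] = {
        "album": album,
        "artist": artist,
        "titles": list(titles),
    }


# An invented three-track disc (the sample counts are not real data).
disc = [10_584_000, 9_261_000, 12_348_000]
first_result = lookup(disc)  # None: the disc has not been submitted yet
submit(disc, "Demo Album", "Demo Artist", ["One", "Two", "Three"])
second_result = lookup(disc)  # now returns the stored metadata
print(first_result, second_result["titles"])
```

The point of the sketch is that nothing in it touches audio: the disc is identified purely from numbers, which is what the text means by treating music as data.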
Google is a master at creating data products. Here’s a few examples:
谷歌是创造数据产品的大师。这是一些例子:
• Google’s breakthrough was realizing that a search engine could use input other than the text on the page. Google’s PageRank algorithm was among the first to use data outside of the page itself, in particular, the number of links pointing to a page. Tracking links made Google searches much more
useful, and PageRank has been a key ingredient to the company’s success.
谷歌的突破在于意识到搜索引擎可以利用网页文本之外的输入。谷歌的PageRank算法是最早使用网页自身之外数据的算法之一,尤其是指向某个网页的链接的数目。追踪链接让谷歌搜索变得有用得多,PageRank也一直是该公司成功的关键要素。
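Analysis: “other than” means “apart from” or “different from”. PageRank ranks a page using the links that point to it from other pages; it is not based on how many people click a link.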
• Spell checking isn’t a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. They’ve built a dictionary of common misspellings, their corrections, and the contexts in which they occur.
拼写检查并不是一个特别困难的问题,但是通过对拼写错误的搜索给出更正建议,并观察用户随后点击了什么,谷歌让拼写检查变得准确得多。他们已经建立了一个词典,收录常见的拼写错误,对应的更正,以及这些错误出现的上下文。
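Note: so it is not surprising that Google researchers published the paper “Attention Is All You Need” in 2017, which introduced the Transformer architecture. Google has a natural advantage in natural language processing, built on years of sustained investment.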
• Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data they’ve collected, and has been able to integrate voice search into their core search engine.
语音识别一直是一个难题,现在依然困难(这是2010年的文章,当年语音识别还未有现在的突破)。但是谷歌利用他们收集到的语音数据取得了巨大的进步,并且已经能够将语音搜索整合到他们核心的搜索引擎中。
• During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
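在2009年猪流感流行期间,谷歌通过跟踪与流感相关话题的搜索,得以追踪疫情的发展。
Note: on social networks “follow” means subscribing to an account (关注), but here “following searches” means tracking search queries, so it is translated as 跟踪.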

[图:谷歌流感趋势]
图题:2007-2008美国流感活跃度,中大西洋地区(Mid-Atlantic Region)。
纵轴:ILI percentage(流感样病例占门诊量百分比)。图例:(蓝色)谷歌流感趋势,(黄色)CDC数据。
图中标注:谷歌流感趋势发现了流感活跃度的一个显著增长。
图中标注:公布的CDC报告,关于两周后(意思是两周后的竖虚线时间节点2008年1月28日),并没有显示这种增长。
图注:通过分析全国不同地区的人们所做的搜索,谷歌得以发现猪流感疫情的趋势,大约比CDC(美国疾病控制与预防中心)早两周时间。
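The first Google example above, ranking a page by the links that point to it, can be illustrated with a small power-iteration sketch. This is a textbook-style simplification, not Google’s production algorithm: the damping factor of 0.85, the iteration count, and the toy link graph `toy_web` are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Invented four-page web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```

In this toy graph the page with the most incoming links ends up with the highest rank, and the text on the pages plays no role at all. That is the “input other than the text on the page” the report refers to.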
Google isn’t the only company that knows how to use data. Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with sometimes frightening accuracy. Amazon saves your searches, correlates what you search for with what other users search for, and uses it to create surprisingly appropriate recommendations. These recommendations are “data products” that help to drive Amazon’s more traditional retail business. They come about because Amazon understands that a book isn’t just a book, a camera isn’t just a camera, and a customer isn’t just a customer; customers generate a trail of “data exhaust” that can be mined and put to use, and a camera is a cloud of data that can be correlated with the
customers’ behavior, the data they leave every time they visit the site.
谷歌不是唯一一家懂得如何使用数据的公司。Facebook和LinkedIn利用好友关系的模式来推荐你可能认识或应该认识的人,有时准确得令人害怕。亚马逊保存你的搜索,把你搜索的内容与其它用户搜索的内容关联起来,并据此生成恰当得令人惊讶的推荐。这些推荐就是“数据产品”,它们帮助推动亚马逊更为传统的零售业务。这些推荐之所以出现,是因为亚马逊明白:一本书不只是一本书,一台相机不只是一台相机,一位顾客也不只是一位顾客;顾客会留下一串可以被挖掘并加以利用的“数据尾气”(data exhaust),而一台相机则是一团数据,可以与顾客的行为,也就是他们每次访问网站时留下的数据关联起来。
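One simple way to “correlate what you search for with what other users search for” is to count which search terms occur together in the same user’s history. The sketch below shows that idea only; it is not Amazon’s recommendation system, and the function names and the search histories are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(histories):
    """Count, for each term, how often every other term shares a user's history."""
    co = defaultdict(Counter)
    for terms in histories.values():
        # Use a set so repeated searches by one user count only once.
        for a, b in permutations(set(terms), 2):
            co[a][b] += 1
    return co


def recommend(co, term, k=1):
    """Return up to k terms most often searched alongside the given term."""
    return [t for t, _ in co.get(term, Counter()).most_common(k)]


# Invented search histories, one list of terms per user.
histories = {
    "user1": ["camera", "tripod", "memory card"],
    "user2": ["camera", "tripod"],
    "user3": ["camera", "lens cap"],
    "user4": ["novel"],
}
co = build_cooccurrence(histories)
suggestion = recommend(co, "camera")
print(suggestion)
```

Every new search adds to the counts, which is the feedback loop in miniature: users improve the product simply by using it.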
The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science.
把这些应用中的大多数联系在一起的主线是:从用户那里收集来的数据提供了附加价值。无论这些数据是搜索词,语音样本,还是产品评价,用户都处在一个反馈循环之中,为他们所使用的产品做出贡献。这就是数据科学的开端。
In the last few years, there has been an explosion in the amount of data that’s available. Whether we’re talking about web server logs, tweet streams, online transaction records, “citizen science,” data from sensors, government data, or some other source, the problem isn’t finding data, it’s figuring out what to do with it. And it’s not just companies using their own data, or the data contributed by their users. It’s increasingly common to mashup data from a number of sources. “Data Mashups in R” analyzes mortgage foreclosures in Philadelphia County by taking a public report from the county sheriff’s office, extracting addresses and using Yahoo to convert the addresses to latitude and longitude, then using the geographical data to place the foreclosures on a map (another data source), and group them by neighborhood, valuation, neighborhood per-capita income, and other socio-economic factors.
在过去几年里,可用数据的数量出现了爆炸式增长。无论我们谈论的是网络服务器日志,推文流,在线交易记录,“公民科学”,传感器数据,政府数据,还是其它来源,问题都不在于找到数据,而在于弄清楚该如何处理数据。而且,不只是公司在使用他们自己的数据或由用户贡献的数据。把多个来源的数据混搭(mashup)起来已经越来越普遍。《Data Mashups in R》分析了费城县的抵押房屋止赎(mortgage foreclosure)情况:它从县治安官办公室取得一份公开报告,提取其中的地址,使用雅虎把地址转换为纬度和经度,然后利用这些地理数据把止赎房产标注在地图(地图是另一个数据来源)上,并按社区,估值,社区人均收入和其它社会经济因素对它们进行分组。
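The mashup workflow in the paragraph above (public report, geocoding, grouping by neighborhood) can be sketched as follows. The sketch is written in Python rather than R, and it replaces the geocoding service with a hard-coded lookup table, so every record, coordinate, neighborhood name, and income figure below is invented for illustration.

```python
from collections import defaultdict
from statistics import median

# Source 1: stand-in for the records extracted from the sheriff's report.
foreclosures = [
    {"address": "100 Example St", "valuation": 80_000},
    {"address": "200 Example St", "valuation": 120_000},
    {"address": "300 Sample Ave", "valuation": 60_000},
    {"address": "400 Unknown Rd", "valuation": 50_000},
]

# Source 2: hypothetical geocoder output, address -> (lat, lon, neighborhood).
geocoded = {
    "100 Example St": (39.98, -75.15, "North"),
    "200 Example St": (39.99, -75.16, "North"),
    "300 Sample Ave": (39.92, -75.17, "South"),
}

# Source 3: invented per-capita income by neighborhood.
neighborhood_income = {"North": 21_000, "South": 34_000}


def mash_up(records, geocoded, income):
    """Join the three sources and summarize foreclosures per neighborhood."""
    valuations = defaultdict(list)
    for record in records:
        location = geocoded.get(record["address"])
        if location is None:
            continue  # skip addresses the geocoder could not resolve
        _lat, _lon, neighborhood = location
        valuations[neighborhood].append(record["valuation"])
    return {
        hood: {
            "count": len(values),
            "median_valuation": median(values),
            "per_capita_income": income.get(hood),
        }
        for hood, values in valuations.items()
    }


summary = mash_up(foreclosures, geocoded, neighborhood_income)
print(summary)
```

No single source answers the question on its own; the value comes from the join, and from deciding what to do with records, such as the fourth address, that one of the sources cannot match.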
The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively -- not just their own data, but all the data that’s available and relevant. Using data effectively requires something different from traditional statistics,
where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
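今天,每家公司,每家初创企业,每个非营利组织,每个想要吸引社区的项目网站所面对的问题,都是如何有效地使用数据:不止是他们自己的数据,而是所有可用且相关的数据。有效地使用数据所需要的东西不同于传统统计学;在传统统计学中,穿着西装的精算师进行着晦涩难懂但定义相当明确的各类分析。数据科学与统计学的区别在于,数据科学是一种整体性的方法。我们越来越多地在“野外”(即未经整理的真实环境中)发现数据,而数据科学家要参与收集数据,把数据整理成易处理的形式,让数据说出自己的故事,并把这个故事呈现给其它人。
Note: in data science the value of a dataset is not fixed in advance. Data is an asset, and different perspectives can yield different results, so data science is not confined to the arcane but well-defined analyses of traditional statistics. “In the wild” means in a natural, uncontrolled environment; here it describes raw data found outside curated datasets, so it is translated as 在“野外”(即未经整理的真实环境中). “Be involved with” means “take part in”, not “be trapped in”.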
To get a sense for what skills are required, let’s look at the data lifecycle: where it comes from, how you use it, and where it goes.
要了解需要哪些技能,让我们看看数据生命周期:它是从哪里来,你要如何使用它,和它将去往哪里。