
什么是数据科学?《What is data science》 by Mike Loukides翻译和精读03

2023-07-22 23:13 作者:跳舞的Jennifer

Working with data at scale

处理大规模数据

We’ve all heard a lot about “big data,” but “big” is really a red herring. Oil companies, telecommunications companies, and other data-centric industries have had huge datasets for a long time. And as storage capacity continues to expand, today’s “big” is certainly tomorrow’s “medium” and next week’s “small.” The most meaningful definition I’ve heard: “big data” is when the size

of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.

我们已经听过很多关于“大数据”的说法,但“大”其实是一条转移注意力的红鲱鱼。石油公司、电信公司和其它以数据为中心的行业,长期以来一直拥有巨大的数据集。而且随着存储容量不断扩大,今天的“大”肯定会变成明天的“中”和下周的“小”。我听过的最有意义的定义是:当数据本身的规模成为问题的一部分时,就是“大数据”。我们讨论的是从千兆字节(GB)到拍字节(PB)范围的数据问题。到了某个临界点,处理数据的传统技术就不再奏效了。

 

注:red herring红鲱鱼在英语文化中指的是为转移注意力而提出的不相干事实或论点。

 

What are we trying to do with data that’s different? According to Jeff Hammerbacher2 (@hackingdata), we’re trying to build information platforms or dataspaces. Information platforms are similar to traditional data warehouses, but different. They expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting. They accept all data formats, including the most messy, and their schemas evolve as the understanding of the data changes.

我们尝试用数据做的事情有什么不同?根据Jeff Hammerbacher(@hackingdata)所说,我们正在尝试构建信息平台或数据空间。信息平台类似于传统的数据仓库,但又有所不同。它们暴露丰富的API,其设计目的是探索和理解数据,而不是传统的分析和报告。它们接受所有的数据格式,包括最杂乱的数据,而且它们的模式(schema)会随着对数据理解的变化而演进。

 

2. “Information Platforms as Dataspaces,” by Jeff Hammerbacher (in Beautiful Data)

脚注2 “信息平台作为数据空间”,Jeff Hammerbacher(收录于《Beautiful Data》一书)

 

 

Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale. Managing sharding and replication across a horde of database servers is difficult and slow. The need to

define a schema in advance conflicts with reality of multiple, unstructured data sources, in which you may not know what’s important until after you’ve analyzed the data. Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it’s not really necessary for the kind of analysis we’re discussing here. Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative: if you’re asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren’t concerned about the difference between 5.92 percent annual growth and 5.93 percent.

大部分已经构建了数据平台的机构都发现,有必要超越关系型数据库模型。传统的关系型数据库系统在这种(数据)规模下不再有效。在一大群数据库服务器上管理分片和复制既困难又缓慢。预先定义模式的需求与多样化、非结构化数据源的现实相冲突——对于这些数据源,你可能直到分析完数据之后才知道什么是重要的。关系型数据库是为一致性而设计的,用以支持复杂事务:当一组复杂操作中的任何一个失败时,整个事务可以被轻松回滚。尽管坚如磐石的一致性对很多应用来说至关重要,但对于我们在这里讨论的这类分析并不是真正必要的。你真的在乎自己有1010个还是1012个推特粉丝吗?精确性有吸引力,但在金融之外的大部分数据驱动应用中,这种吸引力是误导性的。大部分数据分析是比较性的:如果你问的是北欧的销售额是否比南欧增长得更快,你不会在意5.92%和5.93%的年增长率之间的差别。

 

注:The need to define a schema in advance conflicts with reality of multiple, unstructured data sources, in which you may not know what’s important until after you’ve analyzed the data. 这句话的主语是the need,to define a schema in advance是主语的修饰,谓语是conflicts with,宾语是reality。什么样的reality?reality of multiple, unstructured data sources。in which引导定语从句,指的是in data sources。

事务的回滚体现的是事务的原子性(并以此保障一致性):一个事务要么完全执行,要么回滚到完全不执行的状态。
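注:下面用Python标准库sqlite3给出一个事务回滚的最小示意——其中的表名和数字都是为演示而假设的,并非正文中的实例:

```python
import sqlite3

# 内存数据库:用一张小表演示关系型数据库的原子事务与回滚
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, followers INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 1010)")
conn.commit()

try:
    with conn:  # 开启事务:成功则提交,抛出异常则整体回滚
        conn.execute("UPDATE accounts SET followers = followers + 2 "
                     "WHERE name = 'alice'")
        # 同一事务中的后一条语句失败(违反主键约束)……
        conn.execute("INSERT INTO accounts VALUES ('alice', 0)")
except sqlite3.IntegrityError:
    pass  # ……于是前面的UPDATE也被一起回滚

followers = conn.execute(
    "SELECT followers FROM accounts WHERE name = 'alice'").fetchone()[0]
print(followers)  # 仍是1010,整个事务就像从未发生过一样
```

这正是上文所说的:一组复杂操作中任何一个失败,整个事务都可以被轻松回滚。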

 

To store huge datasets effectively, we’ve seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases, though neither term is very useful. They group together fundamentally dissimilar products by telling you what they aren’t. Many of these databases are

the logical descendants of Google’s BigTable and Amazon’s Dynamo, and are designed to be distributed across many nodes, to provide “eventual consis-tency” but not absolute consistency, and to have very flexible schema. While there are two dozen or so products available (almost all of them open source), a few leaders have established themselves:

为了有效地存储巨大的数据集,我们看到一类新的数据库出现了。它们通常被称为NoSQL数据库或非关系型数据库,尽管这两个术语都不是很有用。它们通过告诉你这些产品“不是什么”,把根本上并不相似的产品归为一类。很多这类数据库是谷歌BigTable和亚马逊Dynamo在逻辑上的后裔,被设计为分布在许多节点上,提供“最终一致性”而非绝对一致性,并拥有非常灵活的模式。虽然市面上有大约二十多种产品(几乎全部开源),少数领先者已经确立了自己的地位:

注:establish除了建立,还有稳固的意思。

 

• Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites. Cassandra is designed for high performance, reliability, and automatic replication. It has a very flexible data model. A new startup, Riptano, provides commercial support.

Cassandra(数据库):由Facebook开发,在Twitter、Rackspace、Reddit和其它大型站点投入生产使用。Cassandra为高性能、高可靠性和自动复制而设计。它有着非常灵活的数据模型。一家新的初创公司Riptano为它提供商业支持。

 

• HBase: Part of the Apache Hadoop project, and modelled on Google’s BigTable. Suitable for extremely large databases (billions of rows, millions of columns), distributed across thousands of nodes. Along with Hadoop, commercial support is provided by Cloudera.

HBase:Apache Hadoop项目的一部分,以谷歌的BigTable为蓝本。适用于超大型数据库(数十亿行、数百万列),分布在数千个节点上。与Hadoop一起,其商业支持由Cloudera提供。
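注:下面是一个极简的分片(sharding)示意,仅用于说明“按键的哈希把记录分布到多个节点”这一思路;“节点”只是Python字典,每条记录的字段可以任意不同,以模拟灵活的模式。示例中的键名和数据均为假设:

```python
import hashlib

# 四个"节点",每个节点就是一个普通字典
NODES = [dict() for _ in range(4)]

def node_for(key: str) -> dict:
    """按键的哈希选择负责该键的节点"""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def put(key: str, record: dict) -> None:
    node_for(key)[key] = record

def get(key: str):
    return node_for(key).get(key)

# 模式灵活:两条记录的字段可以完全不同
put("user:1", {"name": "alice", "followers": 1010})
put("user:2", {"name": "bob", "city": "SF"})

print(get("user:1")["followers"])  # 1010
```

真实系统(如Cassandra、Dynamo)还要处理复制、节点增减时的数据再平衡(通常用一致性哈希)以及最终一致性,这里全部省略。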

 

Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task

is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem, creating large searches. It’s easy to distribute a search

across thousands of processors, and then combine the results into a single set of answers. What’s less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning.

不过,存储数据只是构建数据平台的一部分。数据只有在你能用它做些什么的时候才有用,而庞大的数据集会带来计算上的问题。谷歌推广了MapReduce方法,它本质上是一种分治策略,用于把一个超大的问题分布到一个超大的计算集群上。在“map”阶段,一个编程任务被划分为若干个相同的子任务,这些子任务随后被分布到许多处理器上;中间结果再由一个单独的reduce任务合并。事后看来,MapReduce似乎是谷歌最大的问题——构建大规模搜索——的一个显而易见的解决方案。把一次搜索分布到数千个处理器上,再把结果合并成一个答案集合,是很容易的。不那么明显的是,MapReduce已被证明可以广泛适用于许多大数据问题,从搜索到机器学习。

 

注:引号的map是映射的意思。reduce task的reduce是MapReduce的reduce,前面讲完了map阶段,后面介绍reduce task。
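注:用一个单机版的“词频统计”可以把map/shuffle/reduce三个阶段的思路写成几行Python。真实的MapReduce把这些阶段分布到成百上千台机器上,这里只是单进程的示意:

```python
from collections import defaultdict

def map_phase(chunk: str):
    """map阶段:把一段文本映射为 (单词, 1) 的中间键值对"""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """shuffle:按键把中间结果分组(真实系统中这一步跨网络进行)"""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """reduce阶段:对每个键的值求和,得到最终词频"""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data is big", "data at scale"]      # 两个输入分片
pairs = [p for c in chunks for p in map_phase(c)]  # map可以并行执行
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```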

 

The most popular open source implementation of MapReduce is the Hadoop project. Yahoo’s claim that they had built the world’s largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera,

which provides commercial support. Amazon’s Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.

MapReduce最流行的开源实现是Hadoop项目。雅虎声称他们用运行Linux的10000个核心构建了世界上最大的生产级Hadoop应用,这让Hadoop走到了舞台中央。很多核心的Hadoop开发者在Cloudera找到了归宿,Cloudera为其提供商业支持。亚马逊的Elastic MapReduce通过为其EC2集群提供预配置的Hadoop镜像,让你无需投资成机架的Linux机器就能把Hadoop投入使用。你可以按需分配和释放处理器,只需为使用的时间付费。

 

注:这一段就是现在阿里云等云服务商提供的云服务。

 

Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it’s the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.

Hadoop远不止是一种简单的MapReduce实现(这样的实现有好几种);它是数据平台的关键组件。它包含HDFS——一个为巨大数据集的性能和可靠性要求而设计的分布式文件系统;HBase数据库;让开发者用类SQL查询来探索Hadoop数据集的Hive;一种被称为Pig的高级数据流语言;以及其它组件。如果有什么东西可以被称为一站式信息平台,那就是Hadoop。

注:括号里“of which there are several”的意思是,MapReduce的实现有好几种,Hadoop只是其中最流行的一种,而不是说Hadoop包含多个MapReduce应用。

 

Hadoop has been instrumental in enabling “agile” data analysis. In software development, “agile practices” are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and

particularly Elastic MapReduce) make it easy to build clusters that can perform computations on long datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It’s easer to consult with clients to figure out whether you’re asking the right

questions, and it’s possible to pursue intriguing possibilities that you’d otherwise have to drop for lack of time.

Hadoop在实现“敏捷”数据分析方面发挥了重要作用。在软件开发中,“敏捷实践”与更快的产品周期、开发者和消费者之间更密切的互动以及测试联系在一起。传统的数据分析一直被极长的周转时间所拖累。如果你启动一次计算,它可能几个小时甚至几天都算不完。但是Hadoop(尤其是Elastic MapReduce)使得构建能在大型数据集上快速执行计算的集群变得容易。更快的计算让测试不同的假设、不同的数据集和不同的算法变得容易得多。与客户沟通、弄清你是否问对了问题也更容易了,而且你可以去追求那些有趣的可能性——否则你会因为时间不够而不得不放弃它们。

注:关于turn-around time,软件开发中,除了瀑布模型不需要迭代,其它模型的软件开发过程都需要某一或某几个阶段的不断迭代。

 

Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing. Hadoop processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require soft real-time; reports on trending topics don’t require millisecond accuracy. As with the number of followers on Twitter, a “trending topics” report only needs to be current to within five minutes -- or even an hour. According to Hilary Mason (@hmason), data scientist at bit.ly, it’s possible to precompute much of the calculation, then use one of the experiments in real-time MapReduce to get presentable results.

Hadoop本质上是一个批处理系统,但Hadoop Online Prototype(HOP)是一个支持流处理的实验性项目。它让Hadoop在数据到达时就处理数据,并(近乎)实时地交付中间结果。近乎实时的数据分析让推特这类网站上的“热门话题”之类的功能成为可能。这些功能只需要软实时;热门话题的报告不需要毫秒级的精度。和推特粉丝数的例子一样,一份“热门话题”报告只需要精确到最近五分钟——甚至一小时之内。根据bit.ly的数据科学家Hilary Mason(@hmason)所说,可以预先完成大部分计算,然后借助实时MapReduce的某个实验项目得到可以展示的结果。
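注:“软实时的热门话题”可以用一个滑动窗口来示意——只统计最近N条消息里的话题,定期输出出现最多的几个即可,不需要毫秒级精度。窗口大小和话题名称都是假设的:

```python
from collections import Counter, deque

WINDOW = 1000                    # 只保留最近1000条消息
recent = deque(maxlen=WINDOW)    # 队列满后自动丢弃最旧的消息

def observe(topic: str) -> None:
    """每收到一条消息,记录其话题"""
    recent.append(topic)

def trending(k: int = 2):
    """返回当前窗口内出现最多的k个话题"""
    return [topic for topic, _ in Counter(recent).most_common(k)]

for t in ["#hadoop", "#nosql", "#hadoop", "#r", "#hadoop", "#nosql"]:
    observe(t)

print(trending())  # ['#hadoop', '#nosql']
```

报告只反映最近的窗口,晚几分钟更新也无妨——这正是“软实时”的含义。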

 

Machine learning is another essential tool for the data scientist. We now expect web and mobile applications to incorporate recommendation engines, and building a recommendation engine is a quintessential artificial intelligence problem. You don’t have to look at many modern web applications to see classification, error detection, image matching (behind Google Goggles and SnapTell) and even face detection -- an ill-advised mobile application lets you take someone’s picture with a cell phone, and look up that person’s identity using photos available online. Andrew Ng’s Machine Learning course is one of the most popular courses in computer science at Stanford, with hundreds of students (this video is highly recommended).

机器学习是数据科学家的另一种必备工具。我们现在期望网络和移动应用内置推荐引擎,而构建推荐引擎是一个典型的人工智能问题。你不用看很多现代网络应用,就能见到分类、错误检测、图像匹配(Google Goggles和SnapTell背后的技术),甚至人脸检测——一个欠考虑的移动应用让你用手机给某人拍照,然后利用网上可获得的照片查出此人的身份。吴恩达(Andrew Ng)的机器学习课程是斯坦福计算机科学系最受欢迎的课程之一,有数以百计的学生(强烈推荐这个授课视频)。
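注:下面用纯Python写一个“最近质心”分类器的草图,仅为说明分类问题的基本形态;特征、数据和类别名都是虚构的,真实应用会使用成熟的机器学习库:

```python
import math

def centroid(points):
    """每一类的质心 = 该类所有训练点的逐维均值"""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def train(labeled: dict) -> dict:
    """labeled: {类别: [训练点]} → {类别: 质心}"""
    return {label: centroid(pts) for label, pts in labeled.items()}

def predict(model: dict, x):
    """把样本归入质心距离最近的类别"""
    return min(model, key=lambda label: math.dist(model[label], x))

# 虚构的二维特征,例如 (可疑词数量, 与联系人的亲密度)
model = train({
    "spam": [(5.0, 1.0), (6.0, 0.5)],
    "ham":  [(1.0, 4.0), (0.5, 5.0)],
})
print(predict(model, (5.5, 1.0)))  # spam
```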

 

There are many libraries available for machine learning: PyBrain in Python, Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just announced their Prediction API, which exposes their machine learning algorithms for public use via a RESTful interface. For computer vision, the OpenCV library is a de-facto standard.

机器学习有很多可用的库:Python的PyBrain、Elefant、Java的Weka,还有Mahout(与Hadoop配合使用)。谷歌刚刚发布了他们的Prediction API,通过RESTful接口向公众开放他们的机器学习算法。在计算机视觉方面,OpenCV库是事实上的标准。

 

Mechanical Turk is also an important part of the toolbox. Machine learning almost always requires a “training set,” or a significant body of known data with which to develop and tune the application. The Turk is an excellent way to develop training sets. Once you’ve collected your training data (perhaps a

large collection of public photos from Twitter), you can have humans classify them inexpensively -- possibly sorting them into categories, possibly drawing circles around faces, cars, or whatever interests you. It’s an excellent way to classify a few thousand data points at a cost of a few cents each. Even a relatively large job only costs a few hundred dollars.

Mechanical Turk同样是工具箱的一个重要部分。机器学习几乎总是需要一个“训练集”,也就是用来开发和调优应用的大量已知数据。Turk是构建训练集的绝佳途径。一旦你收集了训练数据(也许是从推特收集的大量公开照片),你就能让人以低廉的成本对它们进行分类——可能是把它们归入不同类别,也可能是在人脸、汽车或任何你感兴趣的东西周围画圈。对几千个数据点进行分类,每个只需几美分,这是一种极好的方式。即使是相对较大的任务,也只需几百美元。

 

注:possibly drawing circles around faces, cars, or whatever interests you. 就是数据标注的拉框。

 

While I haven’t stressed traditional statistics, building statistical models plays an important role in any data analysis. According to Mike Driscoll (@dataspora), statistics is the “grammar of data science.” It is crucial to “making data speak coherently.” We’ve all heard the joke that eating pickles causes death,

because everyone who dies has eaten pickles. That joke doesn’t work if you understand what correlation means. More to the point, it’s easy to notice that one advertisement for R in a Nutshell generated 2 percent more conversions than another. But it takes statistics to know whether this difference is significant, or just a random fluctuation. Data science isn’t just about the existence

of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid. Statistics plays a role in everything from traditional business intelligence (BI) to understanding how Google’s ad auctions work. Statistics has become a basic skill. It isn’t superseded by newer techniques from machine learning and other disciplines; it complements them.

虽然我还没有强调传统的统计学,但建立统计模型在任何数据分析中都扮演着重要角色。按照Mike Driscoll(@dataspora)的说法,统计学是“数据科学的语法”。它对于“让数据连贯地说话”至关重要。我们都听过那个笑话:吃腌菜会导致死亡,因为每个死去的人都吃过腌菜。如果你理解相关性意味着什么,这个笑话就不成立了。更切题的是,我们很容易注意到《R in a Nutshell》的某条广告比另一条多带来了2%的转化。但需要统计学才能知道,这个差异是显著的,还是只是随机波动。数据科学不只是数据的存在,也不只是对数据可能意味着什么做出猜测;它是检验假设,并确保你从数据中得出的结论是有效的。从传统的商业智能(BI)到理解谷歌广告竞价的运作方式,统计学在其中无处不在。统计学已经成为一项基本技能。它没有被来自机器学习和其它学科的更新技术所取代;它是对它们的补充。
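注:以正文“某条广告的转化率高2%”为例,可以用双比例z检验判断差异是否显著。下面的展示次数和转化数都是假设的数字,仅作示意:

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """双比例z检验的检验统计量"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                # 合并比例
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # 标准误
    return (p_a - p_b) / se

# 广告A:1000次展示60次转化(6%);广告B:1000次展示40次转化(4%)
z = two_proportion_z(conv_a=60, n_a=1000, conv_b=40, n_b=1000)
print(abs(z) > 1.96)  # 5%显著性水平下,|z| > 1.96 才算显著
```

注意:同样的2个百分点的差距,如果每组只展示100次(6次对4次转化),z值会小得多,差异就不显著了——这正是“需要统计学才能知道差异是否显著”的意思。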

 

While there are many commercial statistical packages, the open source R language -- and its comprehensive package library, CRAN -- is an essential tool. Although R is an odd and quirky language, particularly to someone with a background in computer science, it comes close to providing “one stop shopping” for most statistical work. It has excellent graphics facilities; CRAN includes parsers for many kinds of data; and newer extensions extend R into distributed computing. If there’s a single tool that provides an end-to-end solution for statistics work, R is it.

尽管有很多商业统计软件包,开源的R语言——以及它的综合包库CRAN——是一个必不可少的工具。虽然R是一门古怪的语言,尤其对有计算机科学背景的人来说如此,但它几乎为大部分统计工作提供了“一站式购物”。它有出色的图形功能;CRAN包含针对多种数据的解析器;更新的扩展还把R延伸到了分布式计算领域。如果有一个工具能为统计工作提供端到端的解决方案,那就是R。

 

 

注:对有计算机科学或软件背景的学生来说,C语言等语言早已深入人心,R语言和编程语言的内部逻辑很不一样。

 

