欢迎光临散文网 会员登陆 & 注册

RookieDB_Project 1: SQL

2023-02-10 14:36 作者:CodeSnake  | 我要投稿

Tips💁‍♂

  • SQL项目主要锻炼同学们的SQL语句基础与编写能力,非常建议大家认真完成其中SQL联系,会发现许多之前没发现的SQL问题

  • 原文中作者使用的python3命令来调用python,我自己的系统安装部署的python只允许我使用python命令来调用,不过问题不大,只要能调用python运行脚本即可

  • 在开始第四部分的tasks时,需要认真将第三部分的Testing认真熟悉,可以做到事半功倍

一、Getting-started

(一)、Logistics

后勤

This project is due Friday, 9/9/2022 at 11:59PM PDT (GMT-7). It is worth 5% of your overall grade in the class.

该项目应于2022年9月9日星期五在太平洋时间11:59PM(GMT-7)提交。它的价值是你在课堂上总成绩的5%。

(二)、Prerequisites

先决条件

You should watch the SQL I lecture before beginning this project. Later questions will require material from the SQL II lecture.

在开始这个项目之前,你应该观看SQL I讲座。后面的问题将需要SQL II讲座的材料。

(三)、Fetching the released code

获取发布代码

The GitHub Classroom link for this project is in the Project 1 release post on Edstem. Once your private repo is set up clone the project 1 skeleton code onto your local machine.

这个项目的GitHub教室链接在Edstem上的项目1发布帖中。一旦你的私人 repo 设置好了,就把项目1的骨架代码克隆到你的本地机器上。

(四)、Required Software

所需软件

1. SQLite3

Check if you already have sqlite3 instead by opening a terminal and running sqlite3 --version. Any version at 3.8.3 or higher should be fine.

If you don't already have SQLite on your machine, the simplest way to start using it is to download a precompiled binary from the SQLite website.

检查你是否已经有了sqlite3而不是通过打开终端并运行sqlite3--版本。任何在3.8.3或更高的版本都应该是好的。

如果你的机器上还没有SQLite,开始使用它的最简单方法是从SQLite网站上下载一个预编译的二进制文件。

安装windows版本的SQLite

  1. Visit the download page linked above and navigate to the section Precompiled Binaries for Windows. Click on the link sqlite-tools-win32-x86-*.zip to download the binary.

  2. Unzip the file. There should be a sqlite3.exe file in the directory after extraction.

  3. Navigate to the folder containing the sqlite3.exe file and check that the version is at least 3.8.3: cd path/to/sqlite_folder ./sqlite3 --version

  4. Move the sqlite3.exe executable into your fa22-proj1-yourname directory (the same place as the proj1.sql file)

  1. 访问上面链接的下载页面,并导航到用于Windows的预编译二进制文件部分。点击链接sqlite-tools-win32-x86-*.zip来下载二进制文件。

  2. 解压缩该文件。解压缩后的目录中应该有一个sqlite3.exe文件。

  3. 导航到包含sqlite3.exe文件的文件夹,检查版本是否至少是3.8.3: cd path/to/sqlite_folder ./sqlite3 --version

  4. 将sqlite3.exe可执行文件移至fa22-proj1-yourname目录(与proj1.sql文件的位置相同)。

    fa22-proj1-xx下载

Note:省略SQLite其他系统的安装


2. Python

You'll need a copy of Python 3.5 or higher to run the tests for this project locally. You can check if you already have an existing copy by running python3 --version in a terminal. If you don't already have a working copy download and install one for your appropriate platform from here.

你需要一个Python 3.5或更高版本的拷贝来在本地运行这个项目的测试。你可以通过在终端运行 python3 --版本来检查你是否已经有了一个现有的副本。如果你还没有一个可用的拷贝,请从这里下载并安装一个适合你的平台的拷贝。

image-20230105161638768


(五)、Download and extract the data set

下载并提取数据集

Download the data set for this project from the course's Google Drive here. You should get a file called lahman.db.zip. Unzip the lahman.db.zip file inside your fa22-proj1-yourname directory. You should now have a lahman.db file in your fa22-proj1-yourname directory (the same place as the proj1.sql file)

在这里 从课程的Google Drive下载本项目的数据集。你应该得到一个叫做lahman.db.zip的文件。在你的fa22-proj1-yourname目录下解压lahman.db.zip文件。你现在应该在你的fa22-proj1-yourname目录下有一个lahman.db文件(与proj1.sql文件的位置相同)。

image-20230105163923163


(六)、Running the tests

运行测试任务

If you followed the instructions above you should now be able to test your code. Navigate to your project directory and try using python3 test.py. You should get output similar to the following:

如果你按照上面的说明,你现在应该可以测试你的代码了。导航到你的项目目录,尝试使用python3 test.py。你应该得到类似于以下的输出。

image-20230105164144960

If so, move on to the next section to start the project. If you see ERRORinstead of FAIL create a followup on Edstem with details from your your_output/ folder.

如果是这样,就进入下一节,开始项目。如果你看到的是ERROR而不是FAIL,请在Edstem上创建一个跟帖,并附上你的_output/文件夹中的详细信息。

二、SQL vs. SQLite

[Note: You can skip this section for now and come back to it while you're doing project 1.]

注意:你可以暂时跳过这一节,在做项目1的时候再来讨论。

(一)、Why Are We Using SQLite in This Class?

为什么我们要在这门课上使用SQLite?

As you may have learned mostly SQL synax, it will not be the engine that we use for this project. Instead, we will use a more lightweight variant called SQLite. As noted on the docs of SQLite official website, Client/server SQL database engines strive to implement a shared repository of enterprise data. They emphasize scalability, concurrency, centralization, and control. SQLite strives to provide local data storage for individual applications and devices. As such, SQLite is very easy to set up and run, while a standard SQL engine requires setting up an entire server.

正如你可能已经了解到的大多数SQL synax(语法),它将不会是我们在这个项目中使用的引擎。相反,我们将使用一个更轻量级的变体,叫做SQLite。正如在SQLite官方网站的文档中指出的,客户/服务器 SQL数据库引擎努力实现企业数据的共享库。它们强调可扩展性、并发性、集中化和控制。SQLite努力为单个应用程序和设备提供本地数据存储。因此,SQLite非常容易设置和运行,而标准SQL引擎需要设置整个服务器。

Now, with downloading an app of several megabytes, you can quickly run SQL-like queries on any database you want!

现在,只要下载一个几兆字节的应用程序,你就可以在任何你想要的数据库上快速运行类似SQL的查询!


(二)、New Autograder

新的自动评分

Starting this semester, we will be using a new autograder integrating Cosette to grade your work. The Cosette SQL Solver will check the equivalence of two SQL queries, and that implies if you are not writing in standard SQL syntax, but somehow SQLite engine understood it, Cosette will complain. And you will be deducted 5% of points for that question even if the output produced by your query matches the output of the official solution.

从本学期开始,我们将使用一个新的自动评分器集成Cosette来评分你的作业。珂赛特SQL求解器将检查两个SQL查询的等价性,这意味着如果您不是用标准SQL语法编写,但不知为何SQLite引擎理解了它,[Cosette将会抱怨。即使你的查询结果与官方解答的结果相匹配,你也会被扣5%的分数。


(三)、SQLite Syntax Difference

SQLite语法差异

SQLite is a much more tolerant language than SQL, so a lot of queries that raise an error in SQL will be inferred and run successfully by SQLite. We do not wish that you utilize this tolerance to write "incorrect" queries. Next, we will go over some most common errors that students make and which Cosette Solver will complain about.

SQLite是一种比SQL更宽容的语言,所以很多在SQL中产生错误的查询都会被SQLite推断出来并成功运行。我们不希望你利用这种容忍度来编写 "不正确 "的查询。接下来,我们将介绍一些学生最常犯的错误,Cosette Solver会抱怨这些错误。


(四)、Most Common SQL Errors

最常见的SQL错误

  • Use the alias directly in WHERE/HAVING clause 在WHERE/HAVING子句中直接使用别名

    SELECT birthyear, AGG(col1) AS foo, ...
     FROM 186_TAs
     GROUP BY birthyear
     HAVING foo > "bar"
     ...

    The problem here is that SELECT is applied after the TAs are "GROUP BY"ed and filtered by "HAVING". At the stage of "HAVING", the SQL engine doesn't understand the alias "foo" in the SELECT clause yet.

    这里的问题是,SELECT是在TAs被 "GROUP BY "编辑并被 "HAVING "过滤后应用的。在 "HAVING "阶段,SQL引擎还不能理解SELECT子句中的别名 "foo"。


  • "=="

    Something that you may learn in the first day of a CS class includes that computers start at index 0, and "=" is the assignment operator rather than comparison. This convention will be broken in SQL world, where you should use "=" for direct comparison.

    在CS课程的第一天,你可能会学到一些东西,包括计算机从索引0开始,而"="是赋值运算符,而不是比较。这个惯例在SQL世界中会被打破,你应该使用"="来直接比较。


  • (INNER | { LEFT | RIGHT | FULL } [OUTER]) JOIN without a join condition (连接条件)

    In SQL, only NATURAL JOIN does not require a join condition as it automatically infers the common column names. It is the language's rule that you are required to give some condition with the ON clause.

    在SQL中,只有NATURAL JOIN不需要连接条件,因为它能自动推断出公共列名。这是语言的规则,你需要在ON子句中给出一些条件。


  • GROUP BY without aggregate ()

    This is probably one of the most common mistakes made by using SQLite. Let's take a look at the following example, where we are trying to gain insight into the attendance rate of each student in 186, displayed with their sid, number of appearances in sections, along with their names.

    这可能是使用SQLite最容易犯的错误之一。让我们看一下下面的例子,我们试图深入了解186班每个学生的出勤率,用他们的sid、在各科出现的次数以及他们的名字来显示。


    SELECT s.sid, SUM(a.attendance) AS attend_rate, s.name
     FROM 186_students s INNER JOIN section_attendance a ON s.sid = a.sid
     GROUP BY s.sid
     ...

    In this SQL query, s.sid will be recognized without any issue, as it's the GROUP BY key; same for SUM(a.attendance), as it is the Aggregate column. But how about s.name? It doesn't fall into either of the categories, so it is invalid to use it here.

    在这个SQL查询中,s.sid将被识别,没有任何问题,因为它是GROUP BY键;SUM(a.attend)也一样,因为它是Aggregate列。但是s.name呢?它不属于这两个类别,所以在这里使用它是无效的。

(五)、OK, so SQLite Seems Untrustworthy...

好的,所以SQLite似乎不值得信任......

Now, you may be very concerned that some code that gets executed in SQLite engine will fail the autograder check. Don't worry about that, as SQLite is a commercial use database engine, it is quite fault-tolerant, that is to say, it catches a lot of syntax issues. The ones mentioned above are just slightly more demanding syntax rules in SQL. So if your code passes the SQLite check, and it is following the rule taught in class, you should be good to go!

现在,你可能非常担心一些在SQLite引擎中执行的代码会无法通过自动检查器的检查。不要担心这个问题,因为SQLite是一个商业用途的数据库引擎,它具有相当的容错性,也就是说,它可以捕捉到很多语法问题。上面提到的那些只是SQL中要求稍高的语法规则。所以,如果你的代码通过了SQLite的检查,并且遵循了课堂上讲授的规则,你就应该可以顺利通过了!

三、Testing

You can run your answers through SQLite directly by running sqlite3 lahman.db to open the database and then entering .read proj1.sql

你可以通过SQLite直接运行你的答案,运行sqlite3 lahman.db来打开数据库,然后输入.read proj1.sql(执行 proj1.sql文件中的 SQL)。

This can help you catch any syntax errors in your SQLite.

这可以帮助你抓住SQLite中的任何语法错误。

To help debug your logic, we've provided output from each of the views you need to define in questions 1-4 for the data set you've been given. Your views should match ours, but note that your SQL queries should work on ANY data set. We will test your queries on a (set of) different database(s), so it is NOT sufficient to simply return these results in all cases! Please also note that queries that join on extra, unnecessary tables will slow down queries and not receive full credit on the hidden tests.

为了帮助调试你的逻辑,我们提供了你在问题1-4中需要定义的每个视图的输出,用于你所得到的数据集。你的视图应该与我们的一致,但请注意,你的SQL查询应该在任何数据集上工作。我们将在一个(一组)不同的数据库上测试你的查询,所以在所有情况下仅仅返回这些结果是不够的。还请注意,在额外的、不必要的表上连接的查询将减慢查询速度,并且在隐藏测试中不能得到满分。

To run the test, from within the fa22-proj1-yourname directory:

要运行测试,从fa22-proj1-yourname目录中。

image-20230105165932131

Become familiar with the UNIX diff format, if you're not already, because our tests saves a simplified diff for any query executions that don't match in diffs/. As an example, the following output for diffs/q1i.txt::

如果你还不熟悉UNIX的diff格式,因为我们的测试为任何在diffs/中不匹配的查询执行保存了一个简化的diff。作为一个例子,下面是diffs/q1i.txt的输出:

- 1|1|1
+ Jumbo|Diaz|1984
+ Walter|Young|1980

indicates that your output has an extra 1|1|1 (the - at the beginning means the expected output doesn't include this line but your output has it) and is missing the lines Jumbo|Diaz|1984 and Walter|Young|1980 (the plus at the beginning means the expected output does include those lines but your output is missing it). If there is neither a + nor - at the beginning then it means that the line is in both your output and the expected output (your output is correct for that line).

表示你的输出有一个额外的1|1|1(开头的-表示预期输出不包括这一行,但你的输出有),并且缺少Jumbo|Diaz|1984Walter|Young|1980这两行(开头的+表示预期输出包括这两行,但你的输出缺少这一行)。如果开头没有 "+"或"-",则意味着你的输出和预期输出中都有该行(你的输出对该行来说是正确的)。

If you care to look at the query outputs directly, ours are located in the expected_output directory. Your view output should be located in your solution's your_output directory once you run the tests.

如果你想直接看一下查询输出,我们的输出位于expected_output目录下。你的视图输出应该位于你的解决方案的your_output目录中,一旦你运行测试。

Note: For queries where we don't specify the order, it doesn't matter how you sort your results; we will reorder before comparing. Note, however, that our test query output is sorted for these cases, so if you're trying to compare yours and ours manually line-by-line, make sure you use the proper ORDER BY clause (you can determine this by looking in test.py). Different versions of SQLite handle floating points slightly differently so we also round certain floating point values in our own queries. A full list is specified here for convenience:

注意:对于我们没有指定顺序的查询,你如何对你的结果进行排序并不重要;我们会在比较之前重新排序。但是请注意,在这些情况下,我们的测试查询输出是排序的,所以如果你想手动逐行比较你的和我们的,请确保你使用适当的ORDER BY子句(你可以通过查看test.py来确定)。不同版本的SQLite处理浮点的方式略有不同,所以我们在自己的查询中也对某些浮点值进行了舍入。为了方便起见,这里指定了一个完整的列表。


四、Tasks

In this project we will be working with the commonly-used Lahman baseball statistics database (our friends at the San Francisco Giants tell us they use it!) The database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2019. It includes data from the two current leagues (American and National), four other "major" leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875.

在这个项目中,我们将使用常用的拉赫曼棒球统计数据库(我们在旧金山巨人队的朋友告诉我们他们使用它!)该数据库包含了1871年至2019年美国职业棒球大联盟的投球、击球和击球统计。它包括目前的两个联盟(美国和国家),其他四个 "主要 "联盟(美国协会、联盟协会、球员联盟和联邦联盟),以及1871-1875年的国家协会的数据。

At this point you should be able to run SQLite and view the database using either ./sqlite3 -header lahman.db (if in the previous section you downloaded a precompiled binary) or sqlite3 -header lahman.db otherwise. If you're using windows and you find that the previous command doesn't work, try running winpty ./sqlite3 lahman.db.

在这一点上,你应该能够运行SQLite并使用./sqlite3 -header lahman.db(如果在上一节中你下载了一个预编译的二进制文件)或sqlite3 -header lahman.db查看数据库。如果你使用的是windows系统,而你发现前面的命令不起作用,可以尝试运行winpty ./sqlite3 lahman.db。

$ sqlite3 lahman.db
SQLite version 3.33.0 2020-08-14 13:23:32
Enter ".help" for usage hints.
sqlite> .tables

Try running a few sample commands in the SQLite console and see what they do:

试着在SQLite控制台运行几个示例命令,看看它们的作用。

sqlite> .schema people sqlite>  SELECT playerid, namefirst, namelast FROM people; sqlite> SELECT COUNT(*) FROM fielding;

(一)、Understanding the Schema

了解用例库

image-20230105170142926

The database is comprised of the following main tables:

该数据库由以下主要表组成。

People - Player names, date of birth (DOB), and biographical info # 球员姓名、出生日期(DOB)和传记信息 Batting - batting statistics # 击球统计 Pitching - pitching statistics # 投手统计 Fielding - fielding statistics # 外场统计

It is supplemented by these tables:

它由这些表加以补充。

 AllStarFull - All-Star appearances # 全明星出场次数   HallofFame - Hall of Fame voting data # 名人堂投票数据   Managers - managerial statistics # 经理人的统计数据   Teams - yearly stats and standings # 年度统计和排名   BattingPost - post-season batting statistics # 季后击球统计   PitchingPost - post-season pitching statistics # 季后赛投手统计数据   TeamFranchises - franchise information # 特许经营信息   FieldingOF - outfield position data # 外场位置数据   FieldingPost- post-season fielding data # 季后赛的外野数据   FieldingOFsplit - LF/CF/RF splits # LF/CF/RF的分界线   ManagersHalf - split season data for managers # 经理人的分赛季数据   TeamsHalf - split season data for teams # 球队的分赛季数据   Salaries - player salary data # 球员薪资数据   SeriesPost - post-season series information # 季后赛系列信息   AwardsManagers - awards won by managers # 经理获得的奖项   AwardsPlayers - awards won by players # 球员获得的奖项   AwardsShareManagers - award voting for manager awards # 经理奖的投票情况   AwardsSharePlayers - award voting for player awards # 球员奖项的投票情况   Appearances - details on the positions a player appeared at # 球员出场位置的详细信息   Schools - list of colleges that players attended # 球员就读的大学列表   CollegePlaying - list of players and the colleges they attended # 球员名单及其就读的大学   Parks - list of major league ballparks # 大联盟的球场列表   HomeGames - Number of homegames played by each team in each ballpark # 每支球队在每个球场所进行的主场比赛的数量

For more detailed information, see the docs online.

更详细的信息,请见在线文档。


(二)、Writing Queries

We've provided a skeleton solution file, proj1.sql, to help you get started. In the file, you'll find a CREATE VIEW statement for each part of the first 4 questions below, specifying a particular view name (like q2i) and list of column names (like playerid, lastname). The view name and column names constitute the interface against which we will grade this assignment. In other words, don't change or remove these names. Your job is to fill out the view definitions in a way that populates the views with the right tuples.

我们提供了一个骨架解决方案文件,proj1.sql,以帮助你开始工作。在该文件中,你会发现下面前4个问题的每一部分都有一个CREATE VIEW语句,指定一个特定的视图名称(如q2i)和列名列表(如playerid、lastname)。视图名称和列名构成了界面,我们将据此对这项作业进行评分。换句话说,不要改变或删除这些名称。你的工作是填写视图的定义,以便用正确的图元来填充视图。

image-20230105200112189


For example, consider Question 0: "What is the highest era (earned run average) recorded in baseball history?".

In the proj1.sql file we provide:

例如,考虑问题0:"在棒球历史上记录的最高时代(自责分率ERA, earned run average)是什么?"。

在proj1.sql文件中,我们提供:

CREATE VIEW q0(era) AS     SELECT 1 -- replace this line ;

You would edit this with your answer, keeping the schema the same:

你可以用你的答案编辑这个,保持模式不变。

-- solution you provide CREATE VIEW q0(era) AS  SELECT MAX(era)  FROM pitching ;

To complete the project, create a view for q0 as above (via copy-paste), and for all of the following queries, which you will need to write yourself.

You can confirm the test is now passing by running python3 test.py -q 0

为了完成这个项目,为上面的q0创建一个视图(通过复制粘贴),并为下面所有的查询创建一个视图,你需要自己编写。

你可以通过运行python3 test.py -q 0来确认测试是否通过。

image-20230105200732326

More details on testing can be found in the Testing section.

关于测试的更多细节可以在测试部分找到。

1. Changes from Lecture

课件变化

SQLite doesn't support every SQL feature covered in lecture, specifically:

SQLite并不支持讲座中所涉及的每一个SQL特性,特别是:

  • There is support for LEFT OUTER JOIN but not RIGHT OUTER or FULL OUTER.

    • To get equivalent output to RIGHT OUTER you can reverse the order of the tables (i.e. A RIGHT JOIN B is the same as B LEFT JOIN A.

    • While it isn't required to complete this assignment, the equivalent to FULL OUTER JOIN can be done by UNIONing RIGHT OUTER and LEFT OUTER

  • There is no regex match (~) tilde operator. You can use LIKE instead.

  • There is no ANY or ALL operator.

  • 支持LEFT OUTER JOIN,但不支持RIGHT OUTER或FULL OUTER。

  • 为了得到与RIGHT OUTER相当的输出,你可以将表的顺序颠倒过来(即 A RIGHT JOIN B与B LEFT JOIN A相同。

  • 虽然这不是完成这项任务的必要条件,但可以通过UNION RIGHT OUTER和LEFT OUTER来实现与FULL OUTER JOIN的等价。

  • 没有regex匹配(~)的tilde操作符。你可以用LIKE代替。

  • 没有ANY或ALL操作符。


(三)、Tasks

1. Task 1: Basics

任务1:基础

i. In the people table, find the namefirst, namelast and birthyear for all players with weight greater than 300 pounds.

在people表中,找出所有体重超过300磅的球员的 namefirst, namelastbirthyear

ii. Find the namefirst, namelast and birthyear of all players whose namefirst field contains a space. Order the results by namefirst, breaking ties with namelast both in ascending order

查找所有 namefirst字段包含空格的球员的 namefirst, namelastbirthyear 。  将结果按namefirst排序,打破与namelast的并列关系,均按升序排列。

iii. From the people table, group together players with the same birthyear, and report the birthyear, average height, and number of players for each birthyear. Order the results by birthyear in ascending order.

从people表中,将具有相同出生年份的球员分组,并查询出每个出生年份的球员的出生年份、平均身高和人数。将结果按出生年份升序排列。

Note: Some birth years have no players; your answer can simply skip those years. In some other years, you may find that all the players have a NULL height value in the dataset (i.e. height IS NULL); your query should return NULL for the height in those years.

注意:有些出生年份没有球员;你的答案可以直接跳过这些年份。在其他一些年份,你可能会发现所有的球员在数据集中的身高值都是空的(即身高是空的);你的查询在这些年份的身高应该返回空值。

iv. Following the results of part iii, now only include groups with an average height > 70. Again order the results by birthyear in ascending order.

按照第iii部分的结果,现在只包括平均身高 > 70 的群体。再次将结果按出生年份以升序排列。

2. Task 2: Hall of Fame Schools

名人堂的学校

i. Find the namefirst, namelast, playerid and yearid of all people who were successfully inducted into the Hall of Fame in descending order of yearid. Break ties on yearid by playerid (ascending).

找到所有成功入选名人堂的人的namefirst、namelast、playerid 和 yearid ,按yearid降序排列。按playerid(升序)打破与yearid的并列关系。

ii. Find the people who were successfully inducted into the Hall of Fame and played in college at a school located in the state of California. For each person, return their namefirst, namelast, playerid, schoolid, and yearid in descending order of yearid. Break ties on yearid by schoolid, playerid (ascending). For this question, yearid refers to the year of induction into the Hall of Fame.

  • Note: a player may appear in the results multiple times (once per year in a college in California).

查找成功入选名人堂并在位于加利福尼亚州的学校打过大学的人。对于每个人,按年份降序返回他们的名字(namefirst)、名字(namelast)、球员(playerid)、学校(schoolid)和年份(yearid)。在yearid的基础上,通过schoolid、playerid(升序)打破并列关系。在这个问题上,yearid指的是入选名人堂的年份。

iii. Find the playerid, namefirst, namelast and schoolid of all people who were successfully inducted into the Hall of Fame -- whether or not they played in college. Return people in descending order of playerid. Break ties on playerid by schoolid (ascending). (Note: schoolid should be NULL if they did not play in college.)

查找所有成功进入名人堂的人的 "playerid"、"namefirst"、"namelast "和 "schoolid",无论他们是否在大学打球。按照 "playerid "的降序排列来返回人。按照schoolid(升序)打破playerid上的并列。(注意:如果他们没有在大学打球,schoolid应该是NULL。)

3. Task 3: SaberMetrics

棒球记录统计分析

i. Find the playerid, namefirst, namelast, yearid and single-year slg (Slugging Percentage) of the players with the 10 best annual Slugging Percentage recorded over all time. A player can appear multiple times in the output. For example, if Babe Ruth’s slg in 2000 and 2001 both landed in the top 10 best annual Slugging Percentage of all time, then we should include Babe Ruth twice in the output. For statistical significance, only include players with more than 50 at-bats in the season. Order the results by slg descending, and break ties by yearid, playerid (ascending).

找出有史以来10个最佳年度猛击率记录的球员的playerid、namefirst、namelast、yearid和单年slg(猛击率)。一个球员可以在输出中出现多次。例如,如果贝比-鲁斯在2000年和2001年的击球率都进入了历史上年度最佳击球率的前10名,那么我们应该在输出中包括贝比-鲁斯两次。为了统计学上的意义,只包括在该赛季有超过50次击球的球员。将结果按slg降序排列,并按yearid、playerid(升序排列)打破并列关系。

  • Baseball note: Slugging Percentage is not provided in the database; it is computed according to a simple formula you can calculate from the data in the database.

  • SQL note: You should compute slg properly as a floating point number---you'll need to figure out how to convince SQL to do this!

  • Data set note: The online documentation batting mentions two columns 2B and 3B. On your local copy of the data set these have been renamed H2B and H3B respectively (columns starting with numbers are tedious to write queries on).

  • Data set note: The column H o f the batting table represents all hits = (# singles) + (# doubles) + (# triples) + (# home runs), not just (# singles) so you’ll need to account for some double-counting

  • If a player played on multiple teams during the same season (for example anderma02 in 2006) treat their time on each team separately for this calculation

  • 棒球说明:数据库中没有提供猛击率;它是根据一个简单的公式计算出来的,你可以根据数据库中的数据来计算。(XXX在2006年球季535 AB打数中,击出94支一垒安打 1B、33支二垒安打 2B、1支三垒安打 3B以及49支本垒打HR。他的Slugging Percentage即为:[(1×94)+(2×33)+(3×1)+(4×49)]÷535=359÷535=0.671)

  • SQL注意:你应该把slg正确地计算成一个浮点数--你需要弄清楚如何说服SQL来做这件事。

  • 数据集说明:在线文档中提到了两列2B和3B。在你本地的数据集副本中,这两列已经分别被重新命名为H2B和H3B(以数字开头的列在编写查询时很繁琐)。

  • 数据集说明:击球表的H列代表所有击球数=(#单打)+(#双打)+(#三打)+(#全垒打),而不仅仅是(#单打),所以你需要考虑到一些重复计算。

  • 如果一名球员在同一赛季在多支球队效力(例如2006年的anderma02),在此计算中应分别处理他们在每支球队的时间

ii. Following the results from Part i, find the playerid, namefirst, namelast and lslg (Lifetime Slugging Percentage) for the players with the top 10 Lifetime Slugging Percentage. Lifetime Slugging Percentage (LSLG) uses the same formula as Slugging Percentage (SLG), but it uses the number of singles, doubles, triples, home runs, and at bats each player has over their entire career, rather than just over a single season.

根据第一部分的结果,找出终身打击率排名前10的球员的playerid、namefirst、namelast和lslg(Lifetime Slugging Percentage)。终身重击率(LSLG)使用的公式与重击率(SLG)相同,但它使用的是每个球员在整个职业生涯中的单打、二打、三打、全垒打和击球数,而不仅仅是一个赛季。

Note that the database only gives batting information broken down by year; you will need to convert to total information across all time (from the earliest date recorded up to the last date recorded) to compute lslg. Order the results by lslg (descending) and break ties by playerid (ascending)

  • Note: Make sure that you only include players with more than 50 at-bats across their lifetime.

请注意,数据库只给出了按年份划分的击球信息;你需要转换为所有时间的总信息(从最早的记录日期到最后的记录日期)来计算lslg。将结果按lslg排序(降序),并按球员ID打破平局(升序)。

  • 注意:确保你只包括一生中拥有超过50次击球的球员。

iii. Find the namefirst, namelast and Lifetime Slugging Percentage (lslg) of batters whose lifetime slugging percentage is higher than that of San Francisco favorite Willie Mays.

找出终身击球率高于旧金山最受欢迎的威利-梅斯的击球手的名字(namefirst)、名字(namelast)和终身击球率(lslg)。

You may include Willie Mays' playerid in your query (mayswi01), but you may not include his slugging percentage -- you should calculate that as part of the query. (Test your query by replacing mayswi01 with the playerid of another player -- it should work for that player as well! We may do the same in the autograder.)

  • Note: Make sure that you still only include players with more than 50 at-bats across their lifetime.

你可以在你的查询中包括威利-梅斯的球员ID(mayswi01),但是你不能包括他的击球率--你应该把它作为查询的一部分来计算。(将mayswi01替换成另一个球员的球员ID来测试你的查询--它对那个球员也应该有效!)。我们可以在autograder中也这样做)。

  • 注意:确保你仍然只包括在其一生中拥有超过50次击球的球员。

Just for fun: For those of you who are baseball buffs, variants of the above queries can be used to find other more detailed SaberMetrics, like Runs Created or Value Over Replacement Player. Wikipedia has a nice page on baseball statistics; most of these can be computed fairly directly in SQL.

Also just for fun: SF Giants VP of Baseball Operations, Yeshayah Goldfarb, suggested the following:

Using the Lahman database as your guide, make an argument for when MLBs “Steroid Era” started and ended. There are a number of different ways to explore this question using the data.

(Please do not include your "just for fun" answers in your solution file! They will break the autograder.)

只是为了好玩。对于那些爱好棒球的人来说,上述查询的变体可以用来寻找其他更详细的SaberMetrics,比如说Runs Created或Value Over Replacement Player。维基百科有一个很好的关于棒球统计的页面;其中大部分可以在SQL中直接计算出来。

也只是为了好玩。旧金山巨人队的棒球运营副总裁Yeshayah Goldfarb建议如下。

以Lahman数据库为指导,为MLB(美国职业棒球大联盟)的 "类固醇时代 "的开始和结束时间做一个论证。有许多不同的方法可以利用这些数据来探讨这个问题。

(请不要在你的解决方案文件中包括你的 "只是为了好玩 "的答案! 它们会玩坏自动评分器)。


4. Task 4: Salaries

薪资

i. Find the yearid, min, max and average of all player salaries for each year recorded, ordered by yearid in ascending order.

找到每一年记录的所有球员工资的yearid, min, max and average,按yearid升序排列。

ii. For salaries in 2016, compute a histogram. Divide the salary range into 10 equal bins from min to max, with binids 0 through 9, and count the salaries in each bin. Return the binid, low and high boundaries for each bin, as well as the number of salaries in each bin, with results sorted from smallest bin to largest.

对于2016年的工资,计算一个柱状图。将工资范围从最小到最大分为10个相等的bin,binid为0到9,并计算每个bin中的工资。返回每个binid,每个bin的低和高边界,以及每个bin中的工资数量,结果从最小的bin到最大的bin排序。

  • Note: binid 0 corresponds to the lowest salaries, and binid 9 corresponds to the highest. The ranges are left-inclusive (i.e. [low, high)) -- so the high value is excluded. For example, if bin 2 has a high value of 100000, salaries of 100000 belong in bin 3, and bin 3 should have a low value of 100000.

  • Note: The high value for bin 9 may be inclusive).

  • Note: The test for this question is broken into two parts. Use python3 test.py -q 4ii_bins_0_to_8 and python3 test.py -q 4ii_bin_9 to run the tests

  • Hidden testing advice: we will be testing the case where a bin has zero player salaries in it. The correct behavior in this case is to display the correct binid, low and high with a count of zero, NOT just excluding the bin altogether.

注意:binid 0对应的是最低工资,binid 9对应的是最高工资。这些范围是左边包含的(即[低,高])--所以高值被排除。例如,如果bin 2的高值是100000,那么100000的工资就属于bin 3,而bin 3的低值应该是100000。

注意:9号仓的高值可能是包括在内的)。

注意:本题的测试分为两部分。使用python3 test.py -q 4ii_bins_0_to_8和python3 test.py -q 4ii_bin_9来运行测试

隐藏的测试建议:我们将测试一个bin中的玩家工资为零的情况。在这种情况下,正确的行为是显示正确的binid,低位和高位,计数为0,而不是完全排除这个bin。

Some useful information:一些有用的信息:

  • In the lahman.db, you may find it helpful to use the provided helper table binids, which contains all the possible binids. Get a feel of what the data looks like by running SELECT * FROM binids; in a sqlite terminal. We'll only be testing with these possible binids (there aren't any hidden tests using say, 100 bins) so using the hardcoded table is fine

  • If you want to take the floor of a positive float value you can do CAST (some_value AS INT)

在lahman.db中,你可能会发现使用提供的辅助表binids很有帮助,它包含了所有可能的binids。通过在sqlite终端运行SELECT * FROM binids; 来感受一下数据的样子。我们只用这些可能的binids进行测试(没有任何使用100个bin的隐藏测试),所以使用硬编码表就可以了。

如果你想取一个正的浮点数的下限,你可以做CAST (some_value AS INT)

iii. Now let's compute the Year-over-Year change in min, max and average player salary. For each year with recorded salaries after the first, return the yearid, mindiff, maxdiff, and avgdiff with respect to the previous year. Order the output by yearid in ascending order. (You should omit the very first year of recorded salaries from the result.)

现在让我们来计算最小、最大和平均球员工资的年际变化。对于第一年之后的每一年的工资记录,返回相对于前一年的yearid、mindiff、maxdiff和avgdiff。按照yearid的升序来排列输出。(你应该从结果中省略第一年的工资记录)。

iv. In 2001, the max salary went up by over $6 million. Write a query to find the players that had the max salary in 2000 and 2001. Return the playerid, namefirst, namelast, salary and yearid for those two years. If multiple players tied for the max salary in a year, return all of them.

在2001年,最高工资增加了600多万美元。写一个查询,找到在2000年和2001年拥有最高工资的球员。返回这两年的球员ID、姓名第一、姓名第二、工资和年份ID。如果有多名球员在某年获得了最高工资,则返回所有的球员。

  • Note on notation: you are computing a relational variant of the argmax for each of those two years.

注意符号:你正在计算这两年中每一年的argmax的关系变体。

v. Each team has at least 1 All Star and may have multiple. For each team in the year 2016, give the teamid and diffAvg (the difference between the team's highest paid all-star's salary and the team's lowest paid all-star's salary).

每支球队至少有1名全明星,可能有多名。对于2016年的每支球队,请给出球队名称和diffAvg(球队最高薪酬的全明星球员的工资和球队最低薪酬的全明星球员的工资之间的差异)。

  • Note: Due to some discrepancies in the database, please draw your team names from the All-Star table (so use allstarfull.teamid in the SELECT statement for this).

注意:由于数据库中存在一些差异,请从全明星表中抽取你的球队名称(所以在SELECT语句中使用allstarfull.teamid来实现)。

You're done!

Rerun python3 test.py to see if you're passing tests. If so, follow the instructions in the next section to submit your work.

重新运行python test.py,看看你是否通过了测试。如果是,请按照下一节的说明提交你的工作。

五、My-answer


RookieDB_Project 1: SQL的评论 (共 条)

分享到微博请遵守国家法律