Hands-On Machine Learning on Google Cloud Platform

上QQ阅读APP看书，第一时间看更新

Using a graphical web UI

BigQuery can be accessed through a web graphical user interface that can be used to load and export data, perform queries, and perform other user and management tasks in the browser. The web user interface can be run in any browser, even if Google recommends using Chrome web browser as it produces maximum performance.

To access BigQuery via web UI, go to the following link:
https://bigquery.cloud.google.com

Once you have logged in, you will see the BigQuery console as shown in the following screenshot:

In the BigQuery console a welcome message is displayed. In it, various activities are proposed to us:

Read the BigQuery Quickstart guide
Run a query—on sample data already available to practice—by clicking "Compose Query"
Create a new dataset and load data into a table using the menu on the left
Discover the cost control options in BigQuery
Finally, refer to the BigQuery web user interface guide for more information about the user interface

To analyze in detail the BigQuery console, we will choose the second option, then click on the COMPOSE QUERY button at the top left. In this way, the window shown in the following screenshot is displayed:

In the previous screenshot two main sections are highlighted:

Navigation bar: Starting from the top, it contains a list of elements that describes what action you want to perform: compose a query, consult the query history, or consult the job history. Continuing down, we can identify a list of datasets in the current project that has read access, plus a public dataset called public data. This list shows a series of public databases made available to us for practicing. To be able to use this data, just click on the icon of the dataset expansion next to any dataset or the name of the dataset, so the link will be extended and we will be able to view the tables within that dataset.
Query box: Represents a box where you can type a query in the SQL syntax. This is the main part of the window as it represents the place where we formulate our queries. As anticipated in order to formulate a query it is necessary to use the SQL syntax that we have at least partly seen in the previous sections. Of course, in the examples that we will propose we will have the opportunity to deepen the topic. After correctly formulating the query to execute it, just click on the Run Query button at the bottom of the query box.
Dataset details: Represents the section that contains a summary of the data we have selected in the navigation bar. The name of the dataset, a brief description of its content, a series of details, and finally the tables contained in the dataset are proposed. To get a preview of the contents of a single table, simply click on the table name.

Let's take an example to understand the ease of use of the BigQuery console. We use one of the public datasets available:

bigquery-public-data.new_york.tlc_yellow_trips_2015.

This dataset is collected by the NYC Taxi and Limousine Commission (TLC) and includes trip records from all trips completed by yellow and green taxis in New York City from 2009 to present, and all trips in for-hire vehicles (FHV) from 2015 to present. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

In the following table are listed the several fields contained in the dataset with a brief description of the content:

Field name	Brief description
`vendor_id`	A code indicating the TPEP provider that provided the record. 1=Creative Mobile Technologies, LLC; 2=VeriFone Inc.
`pickup_datetime`	The date and time when the meter was engaged.
`dropoff_datetime`	The date and time when the meter was disengaged.
`passenger_count`	The number of passengers in the vehicle. This is a driver-entered value.
`trip_distance`	The elapsed trip distance in miles reported by the taximeter.
`pickup_longitude`	Longitude where the meter was engaged.
`pickup_latitude`	Latitude where the meter was engaged.
`rate_code`	The final rate code in effect at the end of the trip. 1=Standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Negotiated fare, 6=Group ride.
`store_and_fwd_flag`	This flag indicates whether the trip record was held in the vehicle's memory before sending to the vendor, also known as store and forward, because the vehicle did not have a connection to the server. Y=store and forward trip; N=not a store and forward trip.
`dropoff_longitude`	Longitude where the meter was disengaged.
`dropoff_latitude`	Latitude where the meter was disengaged.
`payment_type`	A numeric code signifying how the passenger paid for the trip. 1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided trip.
`fare_amount`	The time-and-distance fare calculated by the meter.
`extra`	Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
`mta_tax`	$0.50 MTA tax that is automatically triggered based on the metered rate in use.
`tip_amount`	Tip amount—this field is automatically populated for credit card tips. Cash tips are not included.
`tolls_amount`	Total amount of all tolls paid in trip.
`imp_surcharge`	$0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
`total_amount`	The total amount charged to passengers. Does not include cash tips.

To start, let's formulate a simple query to the database. How many trips did Yellow taxis take each month in 2015? This query must return monthly trip totals for all Yellow taxis in 2015. Someone may think that we are starting with operations that are too simple; in fact, it seems trivial to count the taxi calls made each month. This is true for most small cities but not for a metropolis like New York. In fact, we are talking about a database of dimensions equal to 18.1 GB and with a number of observations equal to 146,112,989.

Let's see then the SQL code to insert into the query box:

#standardSQL
SELECT
 TIMESTAMP_TRUNC(pickup_datetime,
    MONTH) month,
  COUNT(*) trips
FROM
  `bigquery-public-data.new_york.tlc_yellow_trips_2015`
GROUP BY
  1
ORDER BY
  1

Let's explain it line by line to understand the meaning of each command used. Let's start from the first line:

#standardSQL

The first line is inserted to let BigQuery know you want to use SQL standard. We can enable standard SQL for a query, so you do not have to insert this tag into your SQL. To do so, just click on the Show Options button located immediately below the query box. In the section that opens uncheck the Use Legacy SQL checkbox. Let's move forward in the analysis of the query:

SELECT
 TIMESTAMP_TRUNC(pickup_datetime, MONTH) month,
  COUNT(*) trips

In this piece of code we use the SELECT statement, which as already anticipated in the previous sections, retrieves data from a database. In the first part of the statement, you specify which data to retrieve. We have said that the number of trips for each month is in us. To do this we will count the number of lines for each month. Recall that each line in the database corresponds to a trip. The field containing this information is pickup_datetime. To return the data in an easily readable format, we used the TIMESTAMP_TRUNC function which truncates to TIMESTAMP value (the return value is of type TIMESTAMP). Then, we use the count() function that returns the number of rows that match a specified criteria. Let's move forward:

FROM
  `bigquery-public-data.new_york.tlc_yellow_trips_2015`

With the FROM clause, we select the table in which to search. Finally the last two lines:

GROUP BY
  1
ORDER BY
  1

These lines have been inserted to group and sort the data. The results are shown in the following table:

Row	month	trips
1	2015-01-01 00:00:00.000 UTC	12748986
2	2015-02-01 00:00:00.000 UTC	12450521
3	2015-03-01 00:00:00.000 UTC	13351609
4	2015-04-01 00:00:00.000 UTC	13071789
5	2015-05-01 00:00:00.000 UTC	13158262
6	2015-06-01 00:00:00.000 UTC	12324935
7	2015-07-01 00:00:00.000 UTC	11562783
8	2015-08-01 00:00:00.000 UTC	11130304
9	2015-09-01 00:00:00.000 UTC	11225063
10	2015-10-01 00:00:00.000 UTC	12315488
11	2015-11-01 00:00:00.000 UTC	11312676
12	2015-12-01 00:00:00.000 UTC	11460573

As you can see, the number of trips in each month ranges from a minimum of 11,130,304 to a maximum of 13,351,609. The following screenshot shows the results in the BigQuery console:

In the previous screenshot two details are highlighted:

A report on calculation times and data processed
A series of results storage options

In the first case, BigQuery tells us that to complete the operation, the Google resources have been committed to 4.5 s, and that 1.09 GB of data was processed during this time. This information will be useful for calculating the cost of the transaction.

In the previous image we highlighted the available storage options. Four options are offered to us:

Download as CSV
Download as JSON
Save as table
Save to Google Sheets

If you recall, the JSON syntax is a subset of the JavaScript syntax, while Google Sheets is an online spreadsheet app that lets users create and format spreadsheets and simultaneously work with other people.

In the previous screenshot, it is possible to notice that at the top of the results there are two tabs: Results and Details. So far, we have seen what is returned in the Results tab; let's see what we see if we click on the Details tab instead. A series of statistics on the operation performed are returned to us, as shown in the following:

Job ID   progetto-1-191608:bquijob_1d181029_1614bc7198f
Creation Time  Jan 31, 2018, 11:33:14 AM
Start Time     Jan 31, 2018, 11:33:15 AM
End Time Jan 31, 2018, 11:33:18 AM
User     xxxxxx@gmail.com
Bytes Processed      1.09 GB
Bytes Billed   1.09 GB
Slot Time (ms) 153 K
Destination Table    progetto-1-191608:_b6e2bd761c7590ee099d343a7b87889c01400431.anond9ac14f20bd65f3658af2aa65b7b8847b7d677be
Use Legacy SQL false

This information refers once again to the query we performed and to the results obtained.