Professional-Data-Engineer Exam Dumps - Google Professional Data Engineer Exam


Question # 73

You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

Import the ORC files to Bigtable tables for the data scientist team.

Import the ORC files to BigQuery tables for the data scientist team.

Copy the ORC files to Cloud Storage, then deploy a Dataproc cluster for the data scientist team.

Copy the ORC files to Cloud Storage, then create external BigQuery tables for the data scientist team.

Answer:

Explanation:

The requirements are:

Explore ORC formatted files with Hive partitioning.

Mimic the SQL on Hive query engine experience.

Cost-effective storage and processing.

Avoid impacting the on-premises Hadoop solution.

Let's analyze the options:

Option A (Import to Bigtable): Bigtable is a NoSQL database, not suited to SQL-based exploration of ORC files or to Hive-style partitioning. This would require significant data transformation and a different query paradigm. Not cost-effective for this use case.

Option B (Import to BigQuery native tables): Importing data into BigQuery native storage is an option. BigQuery can load ORC files, and native tables provide excellent query performance. However, this involves an ETL step (importing) and storage costs for the data within BigQuery, which might be higher than keeping it in its original format on Cloud Storage if queries are exploratory and do not touch all of the data frequently.

Option C (Copy to Cloud Storage, deploy Dataproc): Dataproc allows you to run Hadoop/Spark (and thus Hive) clusters on Google Cloud. This would provide a very similar experience ("SQL on the Hive query engine"). However, a persistent Dataproc cluster incurs compute costs for the cluster nodes even when no queries are running. Ephemeral clusters are possible, but they add operational overhead for exploratory queries. Storage on Cloud Storage is cost-effective.

Option D (Copy to Cloud Storage, create external BigQuery tables): This is often the most cost-effective and straightforward solution for this scenario.

Cost-effective Storage: Cloud Storage is a low-cost option for storing files like ORC.

SQL Interface: BigQuery provides a familiar SQL interface.

External Tables: BigQuery can query data directly from Cloud Storage (including ORC files) using external tables. This avoids the need to load data into BigQuery's managed storage, saving on storage costs and ETL effort.

Hive Partitioning: BigQuery external tables support Hive partitioning layouts. When you define the external table, you can specify the partitioning scheme, and BigQuery will use partition pruning to scan only relevant partitions, improving performance and reducing costs for queries that filter on partition keys. This closely mimics the Hive experience.

Processing Cost: You only pay for the data scanned by BigQuery queries, which aligns with exploratory analysis.
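To illustrate what partition pruning means for a Hive-style layout, here is a minimal pure-Python sketch. It is not how BigQuery implements pruning; it only shows the idea that `key=value` directory names let a query engine skip files without opening them. The bucket name, paths, and the helper names `parse_partitions` and `prune` are hypothetical.

```python
def parse_partitions(path, prefix):
    """Extract Hive-style key=value directory segments that follow the prefix."""
    relative = path[len(prefix):].strip("/")
    partitions = {}
    # The last segment is the file name, so it is skipped.
    for segment in relative.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep:
            partitions[key] = value
    return partitions


def prune(paths, prefix, filters):
    """Keep only the files whose partition values match every filter."""
    kept = []
    for path in paths:
        partitions = parse_partitions(path, prefix)
        if all(partitions.get(k) == v for k, v in filters.items()):
            kept.append(path)
    return kept


if __name__ == "__main__":
    prefix = "gs://example-bucket/events"  # hypothetical location
    files = [
        f"{prefix}/dt=2024-01-01/country=US/part-0.orc",
        f"{prefix}/dt=2024-01-01/country=DE/part-0.orc",
        f"{prefix}/dt=2024-01-02/country=US/part-0.orc",
    ]
    # A filter on dt means the 2024-01-02 file is never read.
    for path in prune(files, prefix, {"dt": "2024-01-01"}):
        print(path)
```

A query that filters on a partition key only has to read the files that survive this step, which is why pruning reduces the amount of data scanned.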

Comparing D with B: External tables are generally more cost-effective for storage and initial setup if the data is already in ORC and an ETL process into BigQuery native storage is to be avoided. Query performance may be lower than with native tables but is often sufficient for exploration, especially with partitioning.

Comparing D with C: BigQuery external tables are serverless, so there is no cluster to manage or pay for when idle. Dataproc requires managing and paying for a cluster. For exploration, the serverless nature of BigQuery is usually more cost-effective.

Therefore, copying ORC files to Cloud Storage and using BigQuery external tables is the most cost-effective solution that meets all requirements.
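As a rough sketch of what defining such a table could look like, the helper below builds a BigQuery `CREATE EXTERNAL TABLE` statement for ORC files in a Hive-partitioned layout. The helper name `external_table_ddl`, the table name, and the bucket path are assumptions made up for illustration; the statement is only assembled as a string here, and running it (for example through the BigQuery console) is left out.

```python
def external_table_ddl(table, uri_prefix):
    """Build DDL for an external table over Hive-partitioned ORC files.

    `table` and `uri_prefix` are caller-supplied, hypothetical values.
    WITH PARTITION COLUMNS (without a column list) asks BigQuery to
    detect the partition keys from the key=value directory names.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "WITH PARTITION COLUMNS\n"
        "OPTIONS (\n"
        "  format = 'ORC',\n"
        f"  uris = ['{uri_prefix}/*'],\n"
        f"  hive_partition_uri_prefix = '{uri_prefix}'\n"
        ")"
    )


if __name__ == "__main__":
    print(external_table_ddl("my_project.analytics.events",
                             "gs://example-bucket/events"))
```

Once such a table exists, the data scientists would query it with ordinary SQL and filter on the partition columns, much as they would against a partitioned Hive table.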

Reference:

Google Cloud Documentation: BigQuery > External data sources > Querying Cloud Storage data. "You can query data in Cloud Storage by using external tables or federated queries. External tables are tables that read data directly from files in Cloud Storage."

Google Cloud Documentation: BigQuery > External data sources > Supported formats and compression types. ORC is a supported format.

Google Cloud Documentation: BigQuery > Creating and using tables > Creating external tables. "External tables let you query data stored in Cloud Storage as if it were a standard BigQuery table. You can use external tables to query data in various formats, including... ORC..."

Google Cloud Documentation: BigQuery > Creating and using tables > Querying partitioned external tables. "You can create an external table that is partitioned on Hive partitioning keys. When you query a Hive partitioned external table, BigQuery performs partition pruning to skip reading unnecessary partitions." This directly addresses the "Hive partitioning" and "explore data in a similar way" requirements.

Google Cloud Blog: "Choosing the right data processing option on GCP: BigQuery vs. Dataproc" (and similar articles) often highlight BigQuery external tables as a cost-effective way to query data in place on Cloud Storage, especially for data lake scenarios.

Question # 74

You are building a streaming Dataflow pipeline that ingests noise level data from hundreds of sensors placed near construction sites across a city. The sensors measure noise level every ten seconds, and send that data to the pipeline when levels reach above 70 dBA. You need to detect the average noise level from a sensor when data is received for a duration of more than 30 minutes, but the window ends when no data has been received for 15 minutes. What should you do?

Use session windows with a 30-minute gap duration.

Use tumbling windows with a 15-minute window and a fifteen-minute .withAllowedLateness operator.

Use session windows with a 15-minute gap duration.

Use hopping windows with a 15-minute window, and a thirty-minute period.

Answer:

Explanation:

The key requirements for the windowing strategy are:

A window groups data for a specific sensor.

A window should contain data spanning more than 30 minutes (the "duration of more than 30 minutes" condition implies sustained activity over this period).

A window for a sensor ends when no data has been received from that sensor for 15 minutes (this is a gap).

This scenario describes session windows.

Session Windows: Session windows group elements (per key, e.g., per sensor ID) that arrive within a certain "gap duration" of each other. A new session starts if data for a key arrives after the gap duration has passed since the last data point for that key.

In this case, if data stops arriving for a sensor for 15 minutes, the current session for that sensor closes. This matches "the window ends when no data has been received for 15 minutes."

The "duration of more than 30 minutes" requirement is a condition you would apply after the session window closes. You'd calculate the duration of the data within the closed session window and only compute the average if that session's duration (span of event times within it) exceeds 30 minutes. Session windows themselves don't have a fixed duration; their duration is determined by data activity and the gap.
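The logic described above can be sketched in plain Python, without Apache Beam, to show how a 15-minute gap splits readings into sessions and how the 30-minute filter is applied afterwards. This is an in-memory illustration under assumed inputs, not a Dataflow pipeline; the tuple layout `(sensor_id, timestamp_seconds, dba)` and the function names are hypothetical.

```python
from collections import defaultdict

GAP_SECONDS = 15 * 60        # session closes after 15 minutes of silence
MIN_SPAN_SECONDS = 30 * 60   # only report sessions longer than 30 minutes


def sessionize(readings, gap=GAP_SECONDS):
    """Group (sensor_id, ts, dba) readings into per-sensor sessions."""
    by_sensor = defaultdict(list)
    for sensor_id, ts, dba in readings:
        by_sensor[sensor_id].append((ts, dba))

    sessions = []
    for sensor_id, points in by_sensor.items():
        points.sort()
        current = [points[0]]
        for point in points[1:]:
            # A reading at least `gap` after the previous one starts a new session.
            if point[0] - current[-1][0] >= gap:
                sessions.append((sensor_id, current))
                current = [point]
            else:
                current.append(point)
        sessions.append((sensor_id, current))
    return sessions


def long_session_averages(readings, gap=GAP_SECONDS, min_span=MIN_SPAN_SECONDS):
    """Return (sensor_id, first_ts, last_ts, average_dba) for long sessions."""
    results = []
    for sensor_id, points in sessionize(readings, gap):
        span = points[-1][0] - points[0][0]
        if span > min_span:
            average = sum(dba for _, dba in points) / len(points)
            results.append((sensor_id, points[0][0], points[-1][0], average))
    return results


if __name__ == "__main__":
    # 40 minutes of readings every 10 seconds, then silence.
    sample = [("sensor-1", t, 80.0) for t in range(0, 2401, 10)]
    print(long_session_averages(sample))
```

In an actual streaming pipeline the windowing step would be handled by the framework's session windowing, and only the duration filter and the average would be written as downstream transforms.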

Let's analyze why other options are less suitable:

Hopping windows with a 15-minute window and a thirty-minute period: Hopping windows have a fixed size and a fixed period, and they overlap only when the period is shorter than the window. With a 15-minute window and a 30-minute period, the windows are [0:00-0:15), [0:30-0:45), [1:00-1:15), and so on, so readings in the intervening 15 minutes fall into no window at all. In any configuration, hopping windows close on a fixed schedule rather than after a 15-minute gap of inactivity.

Tumbling windows with a 15-minute window and a fifteen-minute .withAllowedLateness operator: Tumbling windows are fixed-size, non-overlapping windows. .withAllowedLateness deals with late data arriving for a window that has already passed its end time, not with defining the window based on activity gaps.

Session windows with a 30-minute gap duration: This would mean a session ends only if there's a 30-minute gap of inactivity. The requirement is a 15-minute gap.

Therefore, session windows with a 15-minute gap duration (the third option as listed above) correctly model the requirement for windows to close after 15 minutes of inactivity from a sensor. The subsequent filtering for sessions lasting more than 30 minutes is a downstream operation.
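For contrast with session windows, the following sketch shows how fixed-schedule hopping windows assign a timestamp, independent of any activity gap. It is a plain-Python illustration with times in minutes; the function name `hopping_windows` is hypothetical and the code does not use or reproduce any Apache Beam API.

```python
def hopping_windows(ts, size, period):
    """Return every [start, end) hopping window that contains ts.

    Windows start at multiples of `period` and last `size` units.
    """
    start = (ts // period) * period  # latest window start at or before ts
    windows = []
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


if __name__ == "__main__":
    # Period shorter than size: windows overlap, so a reading lands in two.
    print(hopping_windows(20, size=30, period=15))
    # Period longer than size (the configuration in the question):
    # a reading at minute 20 falls between windows and lands in none.
    print(hopping_windows(20, size=15, period=30))
```

Either way, window boundaries depend only on the clock, which is why hopping and tumbling windows cannot express "close after 15 minutes without data".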

Reference:

Apache Beam Programming Guide > Windowing > Windowing functions > Session windows. "Session windowing assigns elements to windows that represent sessions of activity. A session window starts when the first element arrives for a key. If another element arrives for that key within the specified gap duration, that element is included in the existing session window. If an element arrives after the gap duration, a new session window starts for that element... Session windows are useful for data that is irregularly distributed with respect to time, such as user activity data."

This directly matches the sensor data behavior: data arrives when noise is high, and a period of no data for 15 minutes should close the analysis window for that sensor.

Question # 75

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?

Cloud Dataflow

Cloud Composer

Cloud Dataprep

Cloud Dataproc

Question # 76

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?

Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.

Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.

Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.

Create a Cloud Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.
