Read the text and write a short brief using the following phrases:

The article is devoted to…
The article deals (is concerned) with…
The article touches upon the issues of…
The article is about…
The purpose (aim) of the article is…
Much attention is given to…
It is reported that…
The article speaks in detail about…
The article gives a detailed analysis of…
The following conclusions are drawn…



To avoid a data swamp, set out to build a data reservoir based on a systematic approach, a sound architecture and a set of best practices.

THE SYSTEMATIC APPROACH
How do most people approach the creation of a data lake? They say, “We’ll figure it out as we go along.” But what happens is that as soon as word gets out about the data lake’s existence, employees start adding data to it. Data will come in very rapidly, and each user will do things his or her own way. Before you know it, you will have the proverbial data swamp.
A better approach is one in which major problems are anticipated, solutions are defined in advance and users work together.
Let’s take the example of sharing data. Imagine you want to have a copy of all the social media tweets relevant to your business. If the tweets have already been obtained for one employee, you don’t want someone else to have to go out to another vendor and buy them again. Effective data sharing is fundamental to the business value of the data reservoir.
It’s also important for users to be able to find out what data is already present in the data lake and learn enough about it to tell whether it’s suitable for use. This effort requires an architecture and best practices that take data sharing into account. You will also need tools that make it easy to search the data and its associated metadata.
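To illustrate what such a search tool does, here is a minimal sketch in Python of a keyword search over a dataset catalog. The catalog entries, field names and the find_datasets function are hypothetical examples, not the interface of any particular product.

# A small, hypothetical metadata catalog: each entry describes a dataset
# already in the data reservoir so that users can find and reuse it.
CATALOG = [
    {"name": "social_tweets_2021", "owner": "marketing",
     "description": "Tweets mentioning the brand, purchased from a vendor",
     "tags": ["social media", "twitter", "raw"]},
    {"name": "orders_cleansed", "owner": "sales-ops",
     "description": "Order history after de-duplication and currency normalization",
     "tags": ["orders", "curated"]},
]

def find_datasets(keyword):
    """Return catalog entries whose description or tags mention the keyword."""
    keyword = keyword.lower()
    return [entry for entry in CATALOG
            if keyword in entry["description"].lower()
            or any(keyword in tag.lower() for tag in entry["tags"])]

# Before buying tweets a second time, a user first checks whether they already exist.
for entry in find_datasets("twitter"):
    print(entry["name"], "-", entry["owner"])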
Organizations should anticipate the importance of data sharing from the beginning of a data lake project so that data is shared and reused throughout the organization. All things considered, a successful data lake approach will identify fundamental issues up front and address them with an integrated architecture and essential best practices.

SOUND ARCHITECTURE
As mentioned above, one component of data reservoir architecture is the management of metadata to encourage and support data reuse. Here are other key capabilities:
Data ingestion: It must be straightforward and efficient to bring data from a new source into the data reservoir. In particular, custom coding and scripting should be avoided.
Archiving data as sourced: Many data reservoirs require that a copy of the data as originally received be available for audit, traceability, reproducibility and some data science techniques. Thus, an automatic and efficient way to archive a copy of the source data, usually with lossless compression and sometimes in encrypted form, is required (a minimal ingest-and-archive sketch follows this list).
Data transformation: To prepare data for analytics, the set of all necessary transformations in a given data reservoir is likely to be large. A minimum amount of custom programming should be the goal.
Data publication: Once data ingestion, data transformation and metadata capture are complete, the data can be made available for use. The act of “publication” makes data ready for a specific class of reports, dashboards or queries.
Security: A data reservoir should manage access to data objects, to certain services (such as Hive or HBase), to specific applications and to the Hadoop cluster itself. A security architecture and strategy should protect the perimeter, handle identification and authorization of users, control access to data, address needs for encryption, masking and tokenization, and comply with requirements for logging, reporting and auditing.
Operations and management: When a data reservoir is operating at full scale, there will be many data pipelines in operation at the same time, each ingesting, transforming and publishing data. There will also be processes for consuming data, either via extract and download or via interactive reporting and query.
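To make the ingestion and archiving capabilities above concrete, here is a minimal Python sketch of a configuration-driven ingest step that also archives the data exactly as received, with lossless gzip compression and a checksum for traceability. The SOURCES registry, the file paths and the ingest function are illustrative assumptions, not a reference implementation; encryption of the archived copy is omitted here.

import gzip, hashlib, json, shutil, time
from pathlib import Path

# Hypothetical declarative source registry: new sources are added as
# configuration entries rather than as custom scripts.
SOURCES = {
    "tweets": {"landing": "landing/tweets.json", "archive_dir": "archive/tweets"},
}

def ingest(source_name):
    cfg = SOURCES[source_name]
    raw = Path(cfg["landing"])
    archive_dir = Path(cfg["archive_dir"])
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Archive the data exactly as received, losslessly compressed.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    archived = archive_dir / f"{raw.name}.{stamp}.gz"
    with raw.open("rb") as src, gzip.open(archived, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Record a checksum of the original bytes for audit and reproducibility.
    digest = hashlib.sha256(raw.read_bytes()).hexdigest()
    manifest = {"source": source_name, "archived_as": str(archived),
                "sha256_of_original": digest, "ingested_at": stamp}
    (archive_dir / f"{raw.name}.{stamp}.manifest.json").write_text(json.dumps(manifest))
    return manifest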

BEST PRACTICES
In the creation of data pipelines – ingestion, transformation and publication – a best practice is correctly performing the necessary steps to make data consistently usable. This means capturing metadata at every touch point, handling exceptions correctly at every step, completing appropriate data quality checks, efficiently handling incorrect data as well as correct data, and performing any data conversions or normalizations according to specified standards.
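As a hedged illustration of these practices, the Python sketch below runs records through one transformation step while capturing run metadata at the touch point, handling exceptions, applying a simple quality check, and routing incorrect records to a quarantine list rather than dropping them silently. The record layout, the quality rule and the function names are assumptions made for the example.

import datetime

def quality_ok(record):
    """Hypothetical quality rule: an order must have a positive amount and a currency."""
    return record.get("amount", 0) > 0 and bool(record.get("currency"))

def normalize(record):
    """Example normalization to a specified standard: currency codes in upper case."""
    record = dict(record)
    record["currency"] = record["currency"].upper()
    return record

def run_step(records, step_name="normalize_orders"):
    good, quarantined = [], []
    for record in records:
        try:
            if not quality_ok(record):
                quarantined.append({"record": record, "reason": "failed quality check"})
                continue
            good.append(normalize(record))
        except Exception as exc:  # handle exceptions at every step; never lose data silently
            quarantined.append({"record": record, "reason": repr(exc)})
    # Capture metadata at this touch point so the run is traceable.
    metadata = {"step": step_name,
                "run_at": datetime.datetime.utcnow().isoformat(),
                "records_in": len(records),
                "records_out": len(good),
                "records_quarantined": len(quarantined)}
    return good, quarantined, metadata

good, bad, meta = run_step([{"amount": 10, "currency": "usd"}, {"amount": -5, "currency": "usd"}])
print(meta)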