
Unit Tests for SQL Scripts with Dependencies in Dataform | by 💡Mike Shakhomirov | Dec, 2022


Image by author

Do you unit test your data warehouse scripts?

I’m going to talk about unit tests for complex SQL queries that may consist of multiple operations (actions).

I tried it before using BigQuery scripting:

However, here is another (and probably better) way to do it.

Dataform is a great free tool for SQL data transformation. It helps to keep your data warehouse clean and well organized. It has nice dependency graphs explaining data lineage, and it serves as a single source of truth for everything that happens there.

A good unit test should meet a few requirements:

  • It should test expected vs actual results.
  • It should describe the script’s logic according to its use cases.
  • It should be automated.
  • Be independent (tests should not do setup or teardown for one another).
  • It should be easy to implement.
  • Be repeatable: anyone should be able to run it in any environment.
  • Once it’s written, it should remain for future use.

Dataform supports SQL unit tests for views, and you can check it in the Dataform docs [1]. It’s indeed simple.

Let’s imagine we have a table:
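The original snippet isn’t reproduced here, but judging by the names used later in the article it could be a reputation table along these lines (the column names are my assumption):

```sql
-- A possible shape of the source table; the schema/table name reappears later
-- in the article, the columns are a guess
create table if not exists production.reputation_data (
  user_id    int64,
  reputation int64,
  updated_at date
);
```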

Consider this view, and imagine you need it to always return three columns, which means you’d want to unit test it:
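A minimal sketch of such a view as a Dataform definition, assuming the table above has been declared as a Dataform source (the view name reputation_data_v is taken from later in the article):

```sql
-- definitions/reputation_data_v.sqlx (sketch)
config {
  type: "view",
  schema: "production"
}

select
  user_id,
  reputation,
  updated_at
from
  ${ref("reputation_data")}
```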

In Dataform, unit testing the view is simple:

Image by author

So the file definition for the unit test would look like this:
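Following the unit test syntax from the Dataform docs [1], it could look roughly like this (the fake input and expected rows are illustrative):

```sql
-- definitions/test_reputation_data_v.sqlx (illustrative)
config {
  type: "test",
  dataset: "reputation_data_v"
}

-- fake input that replaces the real reputation_data dependency during the test
input "reputation_data" {
  select 1 as user_id, 100 as reputation, date '2022-12-01' as updated_at
}

-- expected output of the view for that input
select 1 as user_id, 100 as reputation, date '2022-12-01' as updated_at
```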

However, sometimes it’s not enough.

You might want to run and test a custom SQL operation / a script with multiple statements and more than one output.

Indeed, it becomes tricky when you want to unit test a SQL script that depends on some other actions, i.e. views, scripts, tables, etc. In this case you’d want to keep it atomic and run one test for all of them each time you run it again.

Let’s imagine we perform incremental updates on our table every day / hour and you’d want to unit test that script:
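The script itself isn’t shown here, but a rough sketch of such an incremental MERGE as a Dataform operation (table and column names are assumptions based on the rest of the article) might look like this:

```sql
-- definitions/user_reputation.sqlx (rough sketch, not the author's exact script)
config {
  type: "operations",
  hasOutput: true,
  schema: "production",
  tags: ["unit_tests"]
}

create table if not exists ${self()} (user_id int64, reputation int64);

merge into ${self()} t
using (
  select user_id, sum(reputation) as reputation
  from ${ref("reputation_data_v")}
  group by user_id
) s
on t.user_id = s.user_id
when matched then
  update set reputation = s.reputation
when not matched then
  insert (user_id, reputation) values (s.user_id, s.reputation);
```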

I previously wrote about merge updates here:

The plan is the following:

  1. I’ll use Dataform to create inputs in *_tests schemas for every dependency, i.e. any table or view used by the main script I’m going to test. Depending on each table’s schema, it will create an input, i.e. production_tests.reputation_data for production.reputation_data, etc.
  2. I’ll ask my unit test to run the main operation we’re going to test, which is user_reputation (a script), and save the actual output in the *_tests schema, i.e. in production_tests.user_reputation.
  3. I’ll compare my expected output (which I’ll supply in the unit test) against the actual output I got earlier.

Let’s write the unit test for the user_reputation script:
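The article’s exact file isn’t reproduced here; one possible shape, following the three steps above, is an assertion that compares hand-written expected rows against the actual output of the script (the expected values below are purely illustrative):

```sql
-- definitions/test_user_reputation.sqlx (a possible shape, not the author's exact code)
config {
  type: "assertion",
  tags: ["unit_tests"]
}

with expected as (
  -- expected output supplied by the test
  select 1 as user_id, 150 as reputation
  union all
  select 2 as user_id, 90 as reputation
),

actual as (
  -- actual output produced by the user_reputation script
  select user_id, reputation from ${ref("user_reputation")}
)

-- an assertion fails if this query returns any rows,
-- i.e. if expected and actual differ in either direction
(select * from expected except distinct select * from actual)
union all
(select * from actual except distinct select * from expected)
```

With the *_tests schema suffix applied, the reference to user_reputation points at production_tests.user_reputation, so the comparison runs against the fake inputs rather than production data.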

A unit test fails if the actual output from the dataset is not equal to the expected output.

That means:

  • the number of output rows must match
  • the number of output columns and their names must match
  • the contents of each row must match

Let’s run it:

  • Run actions with the tag unit_tests. Make sure your scripts have it.
  • Include dependencies.
  • Add a schema suffix (*_tests.any_table).

Image by author

Image by author

As a result we will see a Pass for our unit test:

Image by author

So what has just happened? Let’s take a look at the dependency graph:

Image by author

Dataform ran all the dependencies but created “fake” outputs in my new production_tests.* schema so the final script (a unit test script) could use them as inputs.

Let’s see what happens if somebody decides to change the logic of any of those dependencies in that pipeline. I’ll change reputation_data_v slightly:
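Any change would do; for illustration, imagine something like this (a hypothetical tweak, not the author’s actual edit):

```sql
-- definitions/reputation_data_v.sqlx (hypothetical change)
config {
  type: "view",
  schema: "production"
}

select
  user_id,
  reputation * 2 as reputation,  -- a "small" tweak that silently changes downstream numbers
  updated_at
from
  ${ref("reputation_data")}
```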

I’ll run the unit test (test_user_reputation.sqlx) again:

Image by author

So this is how we run unit tests for SQL scripts in Dataform:

  • Run actions with the tag unit_tests. Make sure your scripts have it.
  • Include dependencies.
  • Add a schema suffix (*_tests.any_table).

Dataform has a command-line interface, so you can run your project’s actions from the command line:
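For example, something along these lines runs the tagged actions together with their dependencies and the *_tests schema suffix (flag names are from the legacy CLI as I remember them; check dataform help run for your version):

```bash
# run only the actions tagged unit_tests, plus their dependencies,
# writing everything into *_tests schemas
dataform run --tags unit_tests --include-deps --schema-suffix tests
```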

Read more about it here [2].

Dataform also has a Docker image, so you might want to set up a Gitflow pipeline to run unit_tests actions every time you create a Pull Request:
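A CI job could then boil down to running the same command inside Dataform’s Docker image on every Pull Request; the image name, volume mount and flags below are my assumptions, and credentials/project config are omitted:

```bash
# example CI step (assumed image name; assumes the image's entrypoint is the dataform CLI)
docker run --rm -v "$PWD":/project -w /project dataformco/dataform:latest \
  run --tags unit_tests --include-deps --schema-suffix tests
```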

Image by author

So when you push changes to your repository, it will run the checks in your pipelines to see if any logic has been affected:

Image by author

Dataform is a great tool for data modeling. It’s like dbt but written in JavaScript. It has Git and CI/CD features, which makes it extremely powerful. It documents SQL scripts automatically and creates nice dependency graphs, which makes it really useful as a single source of truth for everyone who is going to use it.

At the moment new Dataform signups are closed, as the company was quietly acquired by Google. Dataform is now available in Preview mode in Google Cloud Platform, which means it will become available with the full list of features in a couple of months. I’ve already tried a few things in Preview but still prefer to use the legacy web UI and the console package locally. If I want to run it as a microservice, Dataform has a Docker image that allows me to do that. I couldn’t find dependency graphs and some other important features in the Preview version in GCP.
