Data Model Features
The bippLang data model supports various features that allow users to define reusable data structures in an easy but efficient manner.
Autogenerated datasets from Data source
The default dataset definition in bippLang is autogenerated whenever tables from a datasource are added to the project. Definitions for the original tables and columns are already available. For example, the definition for a table stock_goods
from a MySQL database warehouse_operations
would automatically be available in the data model as follows. The default dataset name will be same as the project name and may be changed to an appropriate name.
project Operations
dataset default
table stock_goods
table stock_goods
data_source MySQLDoc
sql `warehouse_operations`.`stock_goods`
//column definitions would follow here
This can subsequently be edited, to make it more suitable for generating reports.
Column Types
The data type of columns in the source database is mapped to its equivalent bippLang data type when generating a model. The mapping between different SQL data types and bippLang data types is as follows.
bippLang Data Type | SQL Data Types |
---|---|
Integer | TinyInt, SmallInt, Int, Int2, Int4, Integer, BigInt,UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64 |
Float | Decimal, Double, Float, Float32, Float64, Numeric, Real |
DateTime | Time, TimeStamp, Date, DateTime |
Bool | Bool, Boolean |
String | Char, Varchar, Character, String, Text, FixedString, Any other data type not defined above. |
Derived Tables
Derived tables can be defined as subsets or extensions of tables available from the data source. These consist of the SQL statement that would be used to retrieve the data from the data source and definitions of the columns retrieved. Following example illustrates the definition for a derived table. It consists of just two columns from the original table but with date-format and aggregation operations applied on them.
table shipping_ledger_view
data_source MySQLDoc
sql """ SELECT DATE_FORMAT(ShipDate, "%m, %Y") AS monthAndYear,
DATE_FORMAT(ShipDate, "%b, %Y") AS monthAndYearName,
Count(*) AS num_anomalies
FROM `warehouse_operations`.`shipping_ledger` WHERE ShipDate IS NOT NULL GROUP BY 1, 2 """
//column definitions for retrieved columns would follow here
Dynamic Tables
Dynamic tables are not based on any table from the data source. They may be included by the data analyst to provide additional metadata to the model. Following is an example of a dynamic table.
table indicator_names
data_source MySQLDoc
sql """ (select 'SE Indicator' as indicator_name)
union all (select 'DE Indicator' as indicator_name)
union all (select 'Overall Performance Indicator' as indicator_name) """
column indicator_name
sql indicator_name
type string
Table Joins
As seen in the project structure described previously, joins allow you to define the relationship between any two or more tables that are included in the data model. In case data is being retrieved from these two tables, the defined join would be used when fetching the data. The following example shows the data model definition with included tables and joins.
project Operations
dataset OpsDS
table efficiency_data shipping_statement_view
join shipping_statement_view join efficiency_data
on shipping_statement_view.WeekYear = efficiency_data.WeekYear
When visualizing data in the data viewer, columns may be selected from the tables efficiency_data
and shipping_statement_view
. In this situation, bipp will automatically generate the appropriate SQL using the JOIN definition given above.
Derived columns
Columns may be used as-is from the data source or may be defined as operations on existing columns. These operations could be defined using one or many other columns and include any SQL operation supported on the database. Following example shows one direct and one derived column.
column ArrivalTime //user defined column name. May or may not be the same as DB column name.
sql ArrivalTime //DB column name
type string
column ArrivalDate
sql """ STR_TO_DATE(ArrivalTime,'%m/%e/%Y %k:%i') """ //SQL Expression
type datetime
bippLang intelligently evaluates the given SQL expression as atomic or non-atomic.
In the first example above, sql ArrivalTime
, the SQL is atomic and bippLang interprets it to be a column in the backend database (“db namespace”).
The SQL in the second example for ArrivalDate is non-atomic and bippLang interprets that the expression refers to the users namespace. So ArrivalTime in the expression in the second example refers to the ArrivalTime in the first example which is in the users namespace.
bippLang also allows users to create constant columns or columns which have the same value in all rows. Due to the above interpretation for atomic and non-atomic sql, for a constant column with sql defined as sql 42
, bippLang would search for a column named 42 in the database which does not exist. A workaround for this would be to use the expression sql (42)
instead of sql 42
.
Parameter Based Columns
bippLang table definitions may include parameters with allowed values. The actual value of the parameter would be entered by the user while visualizing the dataset. Additionally columns based on these parameters may be defined such that the column value will be calculated based on the selected value for the parameter. The following code shows the definition for a parameter FromYear
and a column Age
that depends on this parameter. Parameters are defined at the same level as the column.
Parameter Definition
parameter FromYear
allowed_value 2000
allowed_value 2010
allowed_value 2020
default_value 2020
Column Definition
column Age
sql " (case when ($FromYear - Year > 20) then 'Old' else 'Recent' end) "
type string
Column Properties
Column definitions may be enhanced by using the following formatters which are usually required in reports or charts.
Label
This can be used to provide an optional label to the column and may include spaces. Following is an example of a column with label specified.
column SLIndicator
label SLIndicator in %
Format
This can be used to indicate how the column data should be formatted. bippLang number formatting is based on Excel number formats. Custom formatting of numbers may be achieved using the following format specification.
"prefix###000.000###,suffix"
Here, “#” indicates optional digits and “0”:indicates the must have digits.
For example decimal formatting for currencies may be applied as follows.
column SEIndicator
sql SEIndicator
type float
format $0.00
Using this expression, a column value of 1.122e2
would be displayed as $112.20
. Similarly the following format expressions would yield the corresponding results.
Value | Format | Result |
---|---|---|
123456789 | 0.00,, M | 123.45 M |
1.122e12 | $#,##0.00,, M | $1,122,000.00 M |
123456789 | 0.00##,, M | 123.4567 M |
1234.56 | $0.00, K | $1.23 K |
Format may be applied to date data types as well. Date format may be specified using a combination of letters to indicate days, month, year etc. Following are the options available
- d: Day of the month as digits; no leading zero for single-digit days.
- dd: Day of the month as digits; leading zero for single-digit days.
- ddd: Day of the week as a three-letter abbreviation.
- dddd: Day of the week as its full name.
- m: Month as digits; no leading zero for single-digit months.
- mm: Month as digits; leading zero for single-digit months.
- mmm: Month as a three-letter abbreviation.
- mmmm: Month as its full name.
- yy: Year as last two digits; leading zero for years less than 10.
- yyyy: Year represented by four digits.
- h: Hours; no leading zero for single-digit hours (12-hour clock).
- hh: Hours; leading zero for single-digit hours (12-hour clock).
- H: Hours; no leading zero for single-digit hours (24-hour clock).
- HH: Hours; leading zero for single-digit hours (24-hour clock).
- M: Minutes; no leading zero for single-digit minutes.
- MM: Minutes; leading zero for single-digit minutes.
- s: Seconds; no leading zero for single-digit seconds.
- ss: Seconds; leading zero for single-digit seconds.
- l: Milliseconds; gives 3 digits.
- L: Milliseconds; gives 2 digits.
- t: Lowercase, single-character time marker string: a or p.
- tt: Lowercase, two-character time marker string: am or pm.
- T: Uppercase, single-character time marker string: A or P.
- TT: Uppercase, two-character time marker string: AM or PM.
Following are some examples for using the date format
Value | Format | Result |
---|---|---|
2020-04-06T19:05:02Z | ddd mmm dd yyyy HH:MM:ssM | Mon Apr 06 2020 19:05:02 |
2020-04-06T19:05:02Z | mmmm d, yyyy | April 6, 2020 |
2020-04-06T19:05:02Z | HH:MM:ss | 19:05:02 |
2020-04-06T19:05:02.456Z | HH:MM:ss L | 19:05:02 46 |
Color Format
This property may be used to format the data cell when data is presented in the data viewer. In its most basic form, the property can be used to apply color based on the value of data in the cell. It provides multiple advanced options to define the format as follows.
- Format conditions may be based on a fixed value or range of possible values of data.
- The color to be applied may also be specified as a single color or a range of hex codes for different colors.
- The color is applied to the text by default but may be applied to the background instead.
- The value of the data itself may be treated as a color to be applied.
- This property can be also used to pass a fixed text value that should be used to fill the cell.
Following are the sub-functions or attributes used to define the color_format
-
val: This function is used to define the data values to be compared. Function argument may be a fixed value or range of values. The “-” sign may be used at one end if the range does not have an upper or lower limit. Following are a few examples for defining the
val
function.- val(1000) : Condition satisfied if data cell value = 1000
- val(1001 to 5000) : Condition satisfied if data cell value is between 1001 and 5000
- val(100 to -) : Condition satisfied if data cell value is greater than 99
- val(- to 600) : Condition satisfied if data cell value is less than 601
-
color: This function is used to define the color or range of colors to be used for formatting. When specified as a range, the color to be applied is calculated based on how far the actual value of data is from the minimum value specified in the
val
function. For example, if color format is defined ascolor_format """val(100 to 500) color(#00FF00 to #0000FF)"""
In this case, if the data value is 450, the color chosen will be closer in shade to #0000FF than it would be to #00FF00
-
background: By default the color identified by the color function will be applied to the cell text. In order to apply it to the cell background instead of the text, the background attribute may be specified.
-
datacolor: If this attribute is specified, bippLang ignores the
val
andcolor
functions and assumes that the value in the data cell itself is a color value and should be applied to the cell. -
transpose: This function is used to define a text value that should be written into the cell if the criteria specified in the val function is satisfied. It may be used to specify certain symbols or constant text that needs to be displayed in the cell.
Following is a simple example of a column definition with color_format specified.
column SEIndicator
sql """ SUM(TimeDiff)/COUNT(DISTINCT PL) """
type float
color_format """ val(0.0 to 34.0) color(red) """
color_format """ val(35.0 to 44.0) color(greenyellow) """
color_format """ val(45.0 to -) color(green) """
Following examples combine the functions and attributes described above to achieve the results as described.
color_format definition | Meaning |
---|---|
"”” val(100) color(green) background "”” | If value is 100, the background of the containing data cell is green |
"”” val(10001 to 20000) color(#FF0000) transpose(⬤) "”” | If value is between 10001 to 20000, change the text in the cell to ⬤ |
"”” datacolor "”” | Read the data cell value and apply the corresponding text color to it |
"”” transpose(▇) "”” | Change all data cell values to ▇ |
Hidden
This may be used to hide specific columns in the data viewer and should be set to true to hide columns. The hidden columns will not be available for selection in the data viewer.
column ArrivalTime
sql ArrivalTime
type string
hidden true
Default Aggregate
By default integer columns are aggregated in the viewer as a sum of values. This aggregate may be changed to other functions like count, max, min or avg. Default aggregate may also be disabled in the model or in the viewer. Following example shows a column with default_aggregate of “max”.
column SourceHours
sql SourceHours
type int
default_aggregate max
Drilldown
Datasets often contain hierarchical relationships between data elements. These could be used to represent geographical hierarchy or organizational hierarchy. The drilldown property of columns can be used to indicate this relation between any two columns. By defining such a relation, users can drilldown data for these columns when viewing this dataset.
Consider a table casedata which contains the geographical distribution of disease cases by region, country and state. You want to view the total number of cases at the region level with the ability to drilldown to the number of cases at the country or state level. The drilldown property may be included in the columns Region and Country as follows.
column Region
sql Region
type string
drilldown dl_region
table casedata
column Country
column Country
sql Country
type string
drilldown dl_region
table casedata
column State
column State
sql State
type string
column Cases
sql Cases
type int
Support for Multiple Databases
bippLang allows connection to different types of databases or data sources. Following source databases are supported by bippLang.
- BigQuery
- BQPub
- RedShift
- SQL Server 2017+
- MySQL 5.7+
- Impala 5.x+
- Teradata 15.x+
- Hive 2.x.x+
- Postgres 9.x+
- Snowflake 3.27+
- Oracle 12c+
- mongoDB 3.2+
- MS Excel
Database specific SQL Syntax.
bippLang enforces that, SQL syntax used in queries and operations should be valid for the type of database selected for that project. For example consider the following column definition for a project using MySQL data source.
column StDate
sql """ STR_TO_DATE(StartDate, '%m/%d/%Y') """
type datetime
Here STR_TO_DATE is a MySQL function used with an appropriate syntax.
Another project that uses a SQL Server data source, would require the following column definition in a similar scenario.
column StDate
sql """ CONVERT(datetime, StartDate, 101) """
type datetime
Here, Convert
is the corresponding MSSQL function and 101 is the format specifier corresponding to %m/%d/%y
format used in the MySQL function.
Note: Even if different tables in the model have different data sources specified, ideally they should all reference the same type of database to enable joins between the tables.
However, this need not be the case, when Virtualization is turned On
. When Virtualization is turned on, all the database types are “virtualized” into a common database called BippDb.