Data Dictionary File

Column Name Description Examples
name Name of the column. No spaces. ind_id, site_id, etc.
unit Units of the data collected in the column. For a column weight, the units can be pounds, kilograms, etc.
type Data type of the information collected in the column. It can be string, integer, decimal, or date. The string, integer, and decimal types can be combined with fixed_set, encoded, multi_fixed_set, or multi_encoded.
type example of corresponding values column in dictionary

string
string, encoded* M=Male|F=Female|U=Unknown
string, fixed_set* schizophrenia, disorganized|schizophrenia, paranoid|unknown
string, multi_encoded* M=Monday|W=Wednesday|F=Friday
string, multi_fixed_set* M|W|F
integer  
integer, encoded* 1=Male|2=Female|0=Unknown
integer, fixed_set* 1|2|0
integer, multi_encoded* 1=Monday|3=Wednesday|5=Friday
integer, multi_fixed_set* 1|3|5
decimal  
decimal, encoded* 1.0=One or more episodes|0.0=No reported episodes
decimal, fixed_set* 1.0|0.0
decimal, multi_encoded* 0.01=Penny|0.05=Nickel|0.10=Dime|0.25=Quarter
decimal, multi_fixed_set* 0.01|0.05|0.10|0.25
date 1900-11-11 (ISO 8601 format)

* For fixed_set, encoded, multi_fixed_set, and multi_encoded values, the corresponding values in the file must match EXACTLY, in both case and content. Using the "string, fixed_set" example above, "schizophrenia,disorganized" or "schizophrenia, Disorganized" will NOT be recognized as "schizophrenia, disorganized".

 

min

Valid for numeric data. The minimum value allowed for the data.

You may use pre-defined variables in the min/max columns; the system substitutes their values. The available variables are current_year and current_date, written in braces, e.g. {current_date}.

name type min
due_date date {current_date}
max

Valid for numeric data. The maximum value allowed for the data.

You may use pre-defined variables in the min/max columns; the system substitutes their values. The available variables are current_year and current_date, written in braces, e.g. {current_year}.

name type max
year_of_death integer {current_year}
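
The variable substitution described for min and max can be pictured with a minimal Python sketch. The helper name resolve_bound and the output formats (a four-digit year, an ISO 8601 date) are assumptions made for illustration; the system's actual substitution logic is not documented here.

```python
from datetime import date


def resolve_bound(raw, today=None):
    """Replace {current_year} / {current_date} in a min/max cell.

    Hypothetical helper: the output formats are assumptions.
    """
    today = today or date.today()
    variables = {
        "current_year": str(today.year),
        "current_date": today.isoformat(),
    }
    for name, value in variables.items():
        raw = raw.replace("{" + name + "}", value)
    return raw


# A fixed date keeps the example reproducible.
fixed_day = date(2024, 3, 15)
print(resolve_bound("{current_year}", fixed_day))  # 2024
print(resolve_bound("{current_date}", fixed_day))  # 2024-03-15
print(resolve_bound("120", fixed_day))             # 120 (no variable, unchanged)
```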
min_length

Valid for string data. The data value must be at least min_length characters long.

This column is ignored if the entered "type" is integer or decimal.

name type min_length
ssn string 10
max_length

Valid for string data. The data value must be at most max_length characters long.

This column is ignored if the entered "type" is integer or decimal.

name type min_length max_length
postal_code string 5 10
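
As an illustration of the min_length and max_length rules, the following Python sketch checks a string against both bounds. The function name check_length is hypothetical, and treating an absent bound as "no limit" is an assumption.

```python
def check_length(value, min_length=None, max_length=None):
    """Return True if the string length is within the given bounds.

    Hypothetical helper; a bound of None is assumed to mean "no limit".
    """
    length = len(value)
    if min_length is not None and length < min_length:
        return False
    if max_length is not None and length > max_length:
        return False
    return True


# Using the postal_code row above (min_length 5, max_length 10):
print(check_length("12345", 5, 10))         # True
print(check_length("1234", 5, 10))          # False, too short
print(check_length("12345-678901", 5, 10))  # False, too long
```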
unique

Assign one or more labels (separated by a pipe '|') to identify a column, or a set of columns, whose values should be unique.

Note: Unique columns allow null values.

For a column ind_id, which should contain unique data, the value can be set to u_ind_id.

For a set of columns site_id, family_id, and subject_id, which combined should contain unique data, the value can be set to u_sfs_id. The same label u_sfs_id should be assigned to all three columns.

 

name unique
ind_id u_ind_id
site_id u_sfs_id
family_id u_sfs_id
subject_id u_sfs_id

 

Example: To require that each (user, col_a) combination and each (user, col_b) combination be unique:

name unique Comment
user u_ua | u_ub Labels are separated by a pipe '|'
col_a u_ua  
col_b u_ub  
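
The label mechanism can be sketched in Python as two steps: group column names by label, then look for repeated value combinations within each group. The function names are hypothetical. Skipping any row that has a null in one of the group's columns is an assumption based on the note that unique columns allow null values; the system's exact handling of nulls in a combined key is not specified here.

```python
from collections import defaultdict


def unique_groups(dictionary_rows):
    """Map each unique label to the list of columns that carry it."""
    groups = defaultdict(list)
    for row in dictionary_rows:
        for label in row.get("unique", "").split("|"):
            label = label.strip()
            if label:
                groups[label].append(row["name"])
    return dict(groups)


def find_duplicates(records, columns):
    """Return value combinations that occur more than once.

    Assumption: a row with a null (None) in any of the columns is skipped.
    """
    seen = set()
    duplicates = []
    for record in records:
        key = tuple(record.get(col) for col in columns)
        if any(value is None for value in key):
            continue
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


rows = [
    {"name": "user", "unique": "u_ua | u_ub"},
    {"name": "col_a", "unique": "u_ua"},
    {"name": "col_b", "unique": "u_ub"},
]
groups = unique_groups(rows)
print(groups)  # {'u_ua': ['user', 'col_a'], 'u_ub': ['user', 'col_b']}

records = [
    {"user": "x", "col_a": 1, "col_b": 1},
    {"user": "x", "col_a": 2, "col_b": 1},
    {"user": "y", "col_a": None, "col_b": 3},
]
print(find_duplicates(records, groups["u_ua"]))  # []
print(find_duplicates(records, groups["u_ub"]))  # [('x', 1)]
```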
mandatory

Columns which do not allow null values.

Value Description
No value or n A value for this field is not required.
y A value for this field is always required.

<expression>

A value for this field is required if the expression evaluates to True, otherwise the value is optional.


Expression Grammar

Access the value of a field using the format c["<column-name>"], e.g. c["sex"].

Constant strings should be enclosed in single or double quotes, e.g. "a" or 'a'.

Constant numbers should not be quoted, e.g. 159, 35.0.

Constant booleans should be either True or False.

Constant dates must be represented as date("<date>"), where <date> is a date in ISO 8601 format.


Supported Operations Description
Compare Operation

<left> <cmp-op> <right>

<cmp-op> can be >, >=, <, <=, ==, !=, is, is not, in, not in

Example a > 1, b != 5, x is None

Arithmetic Operation

<left> <arithmetic-op> <right>

<arithmetic-op> can be +, -, /, *, % (Modulo)

Example

1 + 1 => 2

4 % 3 => 1

"A" + "B" => "AB"

Boolean Operation

<left> <bool-op> <right>

<bool-op> can be and, or.

Example a > 1 and a < 5

Unary Operation

<unary-op> <right>

<unary-op> can be +, -, not

Example +1, -257, etc.

Supported Functions Description
min(v1,.., vn)

Returns smallest element from a list

Example min(1, 2, 3) => 1

max(v1,.., vn)

Returns largest element from a list

Example max(1, 2, 3) => 3

lower/upper

For strings, returns the lower-cased or upper-cased version of the string

Example "A".lower() => "a"

strip, lstrip, rstrip

For string, removes leading and/or trailing whitespace

Example

" A ".lstrip() => "A "

" A ".rstrip() => " A"

" A ".strip() => "A"

str(v)

Convert int/boolean to string

Example

str(1) => "1"

str(True) => "True"

int(v)

Convert string to integer

Example int("1") => 1

bool(v)

Convert string to boolean

Example bool("1") => True

date(v)

Convert string in ISO 8601 format to date

Example date("1980") => 1980-01-01

abs(v)

For numbers, returns the absolute value

Example

abs(-100) => 100

abs(-100.67) => 100.67

range(start)

range(start, stop)

range(start, stop, step=1)

Returns a list of all numbers in range

Example

range(1, 3) => [1, 2]

range(1, 5, 2) => [1, 3]

pow(v, n)

Returns v to power of n

Example pow(2, 2) => 4

len(v)

Return length of string or list

Example

len( "ABC" ) => 3

len( [1, 3] ) => 2


Examples

Expression Description
c["subject_type"] != "D" Evaluates to true if the value of the subject_type field is not equal to the string D
c["yod"] > 2000 Evaluates to true if the value of the yod field is greater than the integer 2000
c["sex"] == "M" or c["sex"] == "m" Evaluates to true if the value of the sex field is either the string M or the string m
c["sex"] in ["M", "m"] Evaluates to true if the value of the sex field is either the string M or the string m
c["sex"].strip().lower() == "m" Evaluates to true if the value of the sex field, ignoring leading and trailing whitespace, is either the string M or the string m
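
The expression grammar closely resembles Python syntax, so the behaviour of a mandatory expression can be approximated with a short Python sketch. This is only an illustration under that assumption: the function name is_mandatory is hypothetical, the system's real evaluator is not documented here, and the date() function is left out. Python's eval is used for brevity and is not a secure sandbox for untrusted input.

```python
# Functions from the "Supported Functions" list that map directly onto
# Python built-ins. date() is omitted from this sketch.
SAFE_FUNCTIONS = {
    "min": min, "max": max, "str": str, "int": int, "bool": bool,
    "abs": abs, "range": range, "pow": pow, "len": len,
}


def is_mandatory(expression, record):
    """Evaluate a mandatory expression against one record.

    `record` plays the role of `c`, mapping column names to values.
    """
    env = {"__builtins__": {}, "c": record}
    env.update(SAFE_FUNCTIONS)
    return bool(eval(expression, env))


print(is_mandatory('c["subject_type"] != "D"', {"subject_type": "A"}))  # True
print(is_mandatory('c["yod"] > 2000', {"yod": 1999}))                   # False
print(is_mandatory('c["sex"].strip().lower() == "m"', {"sex": " M "}))  # True
print(is_mandatory('c["sex"] in ["M", "m"]', {"sex": "F"}))             # False
```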
primary_key

For a column, or a set of columns that together uniquely identify records in the phenotypic file, set this to Y.

Example: site_id, family_id, and subject_id together uniquely identify records in a file.


name primary_key
site_id y
family_id y
subject_id y
resolution

Valid for floating point values.

Data contained in the column will be truncated to n digits after the decimal point, where n is the resolution.

If the data is 4.1212 and the resolution for the column is set to 2, the data will be truncated to 4.12.
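
Truncation differs from rounding: digits beyond the resolution are dropped rather than rounded up. A minimal Python sketch using the standard decimal module shows the difference. The helper name truncate is hypothetical, and truncating toward zero for negative numbers is an assumption.

```python
from decimal import Decimal, ROUND_DOWN


def truncate(value, resolution):
    """Truncate a decimal string to `resolution` digits after the point.

    Hypothetical helper; ROUND_DOWN drops extra digits (toward zero).
    """
    quantum = Decimal(1).scaleb(-resolution)  # resolution=2 -> Decimal("0.01")
    return Decimal(value).quantize(quantum, rounding=ROUND_DOWN)


print(truncate("4.1212", 2))  # 4.12
print(truncate("4.129", 2))   # 4.12 (rounding would give 4.13)
```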

values

Valid when the type column contains fixed_set, encoded, multi_fixed_set, or multi_encoded.

Fixed Set *

Specifies the set of values which are allowed in the column of the corresponding phenotypic file.

Example: For a column Gender the valid values are Male, Female, or Unknown

Consequently, the values column should contain Male|Female|Unknown

Format: <value_1>|<value_2>|..|<value_n> separated by a pipe '|'.

NOTE: In the corresponding phenotypic file, no more than one value may be entered per cell. That is, the defined options are mutually exclusive.

Multi Fixed Set *

Specifies the set of values which are allowed (either alone or in combination) in the column of the corresponding phenotypic file.

Example: For a column TherapyDays, the valid values may be any combination of Monday, Wednesday, or Friday.

Consequently, the values column should contain Monday|Wednesday|Friday

Format: <value_1>|<value_2>|..|<value_n> separated by a pipe '|'.

NOTE: In the corresponding phenotypic file, any combination of these defined values may be entered per cell. Multiple values should be separated by a pipe '|'.

Encoded *

Specifies the set of values which are allowed in the column of the corresponding phenotypic file. These values are paired with their respective meanings.

Example: For a column Gender the valid values are M, F, or U. The corresponding meanings are Male, Female, and Unknown, respectively.

Consequently, the values column should contain M=Male|F=Female|U=Unknown

Format: <value_1>=<meaning_1>|<value_2>=<meaning_2>|..|<value_n>=<meaning_n> separated by a pipe '|'.

NOTE: In the corresponding phenotypic file, no more than one value may be entered per cell. That is, the defined options are mutually exclusive.

Multi Encoded *

Specifies the set of values which are allowed (either alone or in combination) in the column of the corresponding phenotypic file. These values are paired with their respective meanings.

Example: For a column TherapyDays, the valid values may be any combination of M, W, or F. The corresponding meanings are Monday, Wednesday, and Friday, respectively.

Consequently, the values column should contain M=Monday|W=Wednesday|F=Friday

Format: <value_1>=<meaning_1>|<value_2>=<meaning_2>|..|<value_n>=<meaning_n> separated by a pipe '|'.

NOTE: In the corresponding phenotypic file, any combination of these defined values may be entered per cell. Multiple values should be separated by a pipe '|'.

* For fixed_set, encoded, multi_fixed_set, and multi_encoded values, the corresponding values in the file must match EXACTLY, in both case and content. Using the "Encoded" example above, "m" will NOT be recognized as "M".
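
The four formats above share one structure: a pipe-separated list, optionally with value=meaning pairs. The following Python sketch parses a values cell and checks a data entry against it with exact, case-sensitive matching. The function names parse_values and is_valid are hypothetical, and the sketch assumes no whitespace trimming around the pipes.

```python
def parse_values(values_cell, encoded):
    """Turn a values cell into a mapping of allowed value -> meaning."""
    allowed = {}
    for item in values_cell.split("|"):
        if encoded:
            code, _, meaning = item.partition("=")
            allowed[code] = meaning
        else:
            allowed[item] = item
    return allowed


def is_valid(entry, allowed, multi):
    """Exact, case-sensitive check; multi entries are split on '|'."""
    parts = entry.split("|") if multi else [entry]
    return all(part in allowed for part in parts)


gender = parse_values("M=Male|F=Female|U=Unknown", encoded=True)
print(gender)                               # {'M': 'Male', 'F': 'Female', 'U': 'Unknown'}
print(is_valid("M", gender, multi=False))   # True
print(is_valid("m", gender, multi=False))   # False, case differs

days = parse_values("Monday|Wednesday|Friday", encoded=False)
print(is_valid("Monday|Friday", days, multi=True))   # True
print(is_valid("Monday|Friday", days, multi=False))  # False, only one value allowed
```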

description

Detailed text describing the information contained in this column.

This field cannot be left empty.

 

Advanced Examples

Consider a column Age which contains integer values between 0 and 120. The column also has two specially designated values: -1 for Missing and -2 for Not Collected.

name type min max values
age integer,encoded 0 120 -1=Missing|-2=Not Collected
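
One way to read this row is that an entry is acceptable if it is either one of the encoded special values or an integer within the min/max range. The Python sketch below follows that reading; the function name check_age is hypothetical, and the precedence of encoded values over the range check is inferred from the example rather than stated by the specification.

```python
def check_age(raw, minimum=0, maximum=120, special=("-1", "-2")):
    """Accept an encoded special value, or an integer within [min, max].

    Hypothetical helper; `raw` is the cell content as a string.
    """
    if raw in special:
        return True
    try:
        number = int(raw)
    except ValueError:
        return False
    return minimum <= number <= maximum


print(check_age("45"))   # True
print(check_age("-1"))   # True, encoded as Missing
print(check_age("-3"))   # False, below min and not an encoded value
print(check_age("121"))  # False, above max
```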

Consider a column Twins which can contain only the following values Monozygotic, or Dizygotic.

name type values
Twins string,fixed_set Monozygotic|Dizygotic

Consider a column TherapyDays that represents the days on which a subject may receive therapeutic intervention.

name type mandatory values
TherapyDays string,multi_encoded n M=Monday|W=Wednesday|F=Friday

Since this field is not mandatory, "NULL" is an acceptable value if a given subject receives no therapy. But note that a "NULL" entry may only appear by itself - that is, it may not be piped (concatenated) to any of the defined values (M, W, or F). For example, if a subject receives therapy on Monday and Wednesday but not Friday, the correct entry would be "M|W" and not "M|W|NULL".

In addition, defined values may appear no more than once in a given entry. For example, if a subject receives therapy twice on Monday and once on Friday, the correct entry is "M|F" and not "M|M|F".
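
The two rules for multi-value entries (NULL only by itself, no repeated values) can be sketched as a small Python check. The function name check_multi_entry is hypothetical, and representing a null as the literal string "NULL" follows the example above.

```python
def check_multi_entry(entry, allowed, mandatory=False):
    """Validate one multi_encoded / multi_fixed_set cell.

    Hypothetical helper: NULL is accepted only alone and only when the
    field is not mandatory; values must be allowed and not repeated.
    """
    if entry == "NULL":
        return not mandatory
    parts = entry.split("|")
    if len(parts) != len(set(parts)):
        return False  # a value appears more than once
    return all(part in allowed for part in parts)


allowed_days = {"M", "W", "F"}
print(check_multi_entry("M|W", allowed_days))       # True
print(check_multi_entry("NULL", allowed_days))      # True, field is not mandatory
print(check_multi_entry("M|W|NULL", allowed_days))  # False, NULL cannot be piped
print(check_multi_entry("M|M|F", allowed_days))     # False, M is repeated
```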