Phase 1 - Data Preprocessing
Python Code Details
Below are the scripts involved in this phase, covering the data cleaning, coordination, and merging processes for the Electronic Health Data from IDR.
> Run_pipeline.py
This file contains the run_IDR_pipeline function, which is called from the main.py file.
WORKFLOW
Trigger run_IDR_pipeline function
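A minimal sketch of how main.py might trigger this function is shown below; the import path and the empty argument list are assumptions, since only the function name run_IDR_pipeline is given here.
```python
# main.py -- hypothetical entry point; the import path and argument list are assumptions
from Run_pipeline import run_IDR_pipeline

if __name__ == "__main__":
    # Kick off Phase 1 (and the later phases) of the IDR data preprocessing.
    run_IDR_pipeline()
```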
> phase_I_through_III_coordinator.py
This script coordinates phases I, II, and III of the pipeline, handling the structured process for executing the three-phase data operation, starting with initialization and conditional checks for prior completion. It includes functions such as run_phase_I, run_phase_II_and_III, and run_phases_I_II_III, which handle tasks such as creating lookup tables, cleaning files, labeling files with merged IDs, and running these phases in parallel.
WORKFLOW
Initialization
- Import required modules and tools
- Load custom modules/functions necessary for coordinating phases I, II, and III.
- Decision: Check whether merging has already been completed. If completed, skip to the end; else, proceed to "Retrieve Batches."
Retrieve Batches
- Get batches from the directory using get_batches_from_directory function.
- Execute run_phase_I function
Execute run_phase_I function
This function internally calls the functions associated with Phase I processing, which cover data cleaning, processing, and saving operations. Any errors or exceptional conditions are logged and handled appropriately.
Execute run_phase_II function
Execute run_phase_III function
Ensure Completion Status of all Phases
After all of the above functions have finished executing, a decision check verifies the completion status of all phases.
- If all phases (I, II, III) are complete, it writes a success file indicating successful completion of all phases.
- If any phase is incomplete or encountered errors, it logs the specific errors or issues for the failed phase and exits without writing the success file.
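A condensed sketch of this coordination flow, assuming the helper functions named above (get_batches_from_directory, run_phase_I, run_phase_II_and_III) are in scope; the success-file path, argument lists, and error handling here are illustrative assumptions, not the actual implementation.
```python
import logging
from pathlib import Path

SUCCESS_FILE = Path("output/phases_I_II_III.success")  # assumed location


def run_phases_I_II_III(data_dir: str) -> None:
    # Skip everything if merging was already completed in a previous run.
    if SUCCESS_FILE.exists():
        logging.info("Phases I-III already completed; skipping.")
        return

    # Retrieve the batches to process, then run the phases.
    batches = get_batches_from_directory(data_dir)
    try:
        run_phase_I(batches)
        run_phase_II_and_III(batches)
    except Exception:
        logging.exception("A phase failed; not writing the success file.")
        raise

    # All phases finished: record overall success.
    SUCCESS_FILE.touch()
```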
> clean_data_phase1.py
This script automates the initial phase of data cleaning: it loads, verifies, and formats data sets, then executes a series of defined transformations and checks to ensure data integrity before saving the sanitized results for further processing.
WORKFLOW
Initialization
- Import required modules and tools specific to Phase 1
- Load custom modules/functions necessary
- Decision: Check whether lookup table generation is done. If completed, skip this pre-processing phase (Phase 1); else, proceed with the next cleanup steps in the phase.
Check IDR Data Transfer Completion Status
Decision: Check whether the IDR data transfer is complete. If yes:
- Calculate the number of batches.
- Determine the optimal number of workers using calculate_optimal_workers (sketched below).
- Prepare arguments for parallel execution.
Else, log an error and exit.
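An illustrative stand-in for calculate_optimal_workers, assuming a simple heuristic of never exceeding the available CPU cores or the number of batches; the real function may use a different rule.
```python
import os


def calculate_optimal_workers(num_batches: int, reserve_cores: int = 1) -> int:
    """Illustrative heuristic: one worker per batch, capped by available cores."""
    available = max(1, (os.cpu_count() or 1) - reserve_cores)
    return max(1, min(num_batches, available))
```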
Execute Data Cleaning
- Retrieve Data
- Format Identifiers
- Merge Data
- Check Completeness
- Save Data
Parallel Execution of clean_data_phase_I function
Execute the clean_data_phase_I function in parallel using the optimal number of workers, as sketched below.
Check completion status of individual batch processes
- If all batch processes are complete, write a success file for Phase I.
- Else, log an error and exit the program.
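A minimal sketch of the parallel execution and completion check, assuming clean_data_phase_I is in scope, accepts a single batch, and returns True on success; the argument preparation and success-file path shown here are assumptions.
```python
import logging
import sys
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def run_batches_in_parallel(batches, n_workers):
    # Run one clean_data_phase_I call per batch across the worker pool.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(clean_data_phase_I, batches))

    if all(results):
        # Every batch succeeded: mark Phase I as complete.
        Path("output/phase_I.success").touch()  # assumed success-file path
    else:
        logging.error("One or more Phase I batch processes failed.")
        sys.exit(1)
```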
> clean_internal_external_stations.py
This script contains functions to format, clean, and reconcile data from internal and external hospital stations, handling complexities such as patient transfers, overlapping records, and OR schedule integration, ultimately ensuring data integrity and usability for subsequent healthcare analytics.
WORKFLOW
Initialization
- Import required modules and tools
- Set global variables and Configure logging settings
- Format datetime columns and rename certain column prefixes
Load Data
- Load internal and external stations data with data integrity checks
- Organize data from the internal stations, handling patient transfer data within the hospital
Data Cleaning
- Resolve overlaps and inconsistencies in transfer data and correct duplications or errors (see the sketch after this list)
- Integrate OR case schedules with station data to track patient movements
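A hedged sketch of one way the overlap resolution could work: within each encounter, a station's exit time is clipped to the next station's entry time whenever the two intervals overlap. The column names (ENCOUNTER_ID, STATION_ENTRY, STATION_EXIT) are assumptions, not the actual schema.
```python
import pandas as pd


def resolve_overlaps(transfers: pd.DataFrame) -> pd.DataFrame:
    # Sort stations chronologically within each encounter.
    transfers = transfers.sort_values(["ENCOUNTER_ID", "STATION_ENTRY"]).copy()
    # Entry time of the following station for the same encounter.
    next_entry = transfers.groupby("ENCOUNTER_ID")["STATION_ENTRY"].shift(-1)
    # A patient cannot be in two stations at once: clip overlapping exit times.
    overlap = next_entry.notna() & (transfers["STATION_EXIT"] > next_entry)
    transfers.loc[overlap, "STATION_EXIT"] = next_entry[overlap]
    return transfers
```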
Consistency Checks
- Check specific columns like STATIONTYPE for discrepancies
- Handle scenarios related to patient admissions, creating or modifying data rows as needed
Data Saving
- Impute missing data, particularly exit times from stations, using discharge information (sketched after this list)
- Use utility functions for data mapping, null assignments, and classifying station priorities
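A minimal pandas sketch of the exit-time imputation, assuming hypothetical STATION_EXIT and DISCHARGE_TIME columns: a missing exit time falls back to the encounter's discharge timestamp.
```python
import pandas as pd

stations = pd.DataFrame({
    "ENCOUNTER_ID": [1, 1, 2],
    "STATION_EXIT": pd.to_datetime(["2024-01-02 10:00", None, None]),
    "DISCHARGE_TIME": pd.to_datetime(["2024-01-02 18:00", "2024-01-02 18:00", "2024-01-04 12:00"]),
})

# Where a station has no recorded exit time, fall back to the discharge time.
stations["STATION_EXIT"] = stations["STATION_EXIT"].fillna(stations["DISCHARGE_TIME"])
```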
Finalize Clean Data
- Arrange the cleaned data for accessibility and comprehension
- Sort data by encounter IDs and entry times, moving certain columns for readability
- Save the cleaned data and write a success file to indicate completion
> clean_or_encounters_3.py
This script contains the functions to clean and validate Operating Room (OR) encounter data, employing specialized utility functions to ensure accurate surgery timelines and preparing the data for seamless integration with the broader healthcare datasets.
WORKFLOW
Initialization
- Import necessary modules and configurations
- Load any required custom functions for OR encounter data
Data Extraction
- Load data for OR encounters
- Validate the structure and content of the data
- Utilize utility functions for timestamp and percentage comparisons
Data Cleaning
- Standardize, format, and sanitize the data using functions such as _fill_proposed_start_stop_times to determine accurate surgery start and end times
- Convert times from integer to proper datetime formats with _convert_integer_time_to_datetime for precise time calculations
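An illustrative stand-in for _convert_integer_time_to_datetime, assuming times are stored as HHMM integers such as 1330; the real helper's signature may differ.
```python
from datetime import datetime, date, time


def convert_integer_time_to_datetime(surgery_date: date, hhmm: int) -> datetime:
    # Split an HHMM integer (e.g. 1330) into hours and minutes,
    # then combine with the surgery date into a proper datetime.
    hours, minutes = divmod(int(hhmm), 100)
    return datetime.combine(surgery_date, time(hour=hours, minute=minutes))


# Example: 1330 on 2024-03-05 becomes 2024-03-05 13:30.
convert_integer_time_to_datetime(date(2024, 3, 5), 1330)
```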
Data Validation
- Ensure data meets specific criteria, utilizing comparison functions to assess the start and end times of surgeries
- Cross-verify data with existing reference tables or sources, ensuring consistency and accuracy
Data Merging/Integration
Data Saving
Completion Checks
- Validate whether all cleaning and processing steps were successful.
- If yes, write a success file; else, log errors specifying the nature of any discrepancies.
> merge_encounters.py
This script streamlines healthcare data management by merging overlapping patient encounters into a cohesive record and updating encounter IDs, ensuring dataset consistency and integrity for subsequent analysis. It employs a series of functions for loading, cleaning, prioritizing, and labeling encounter data effectively.
WORKFLOW
Initialization
- Load dataframes or paths from files including billing accounts and admit discharge stations
- Configure logging settings, format datetime columns, and perform initial data cleaning
Check and Load DataFrames
- Load the primary encounter dataframe and lookup dataframe
- Apply initial filters to exclude non-inpatient admissions, and handle OR cases as specified
Data Processing
- Rename columns for consistency
- Use connectivity graph logic from merge_encounters to identify and merge overlapping encounters into a single consistent record (see the sketch after this list)
- Assign priorities with functions like _highest_priority_patient_type and _highest_priority_encounter_type to ensure the most relevant information is preserved
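A hedged sketch of the connectivity-graph idea: encounters of the same patient whose time windows overlap become connected nodes, and each connected component collapses into one merged encounter. networkx and the column names used here are illustrative choices, not necessarily what merge_encounters uses.
```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Toy example: encounters 101 and 102 overlap in time for the same patient.
enc = pd.DataFrame({
    "ENCOUNTER_ID": [101, 102, 103],
    "PATIENT_ID": [1, 1, 1],
    "ADMIT": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-02-10"]),
    "DISCHARGE": pd.to_datetime(["2024-01-04", "2024-01-06", "2024-02-12"]),
})

g = nx.Graph()
g.add_nodes_from(enc["ENCOUNTER_ID"])
for _, grp in enc.groupby("PATIENT_ID"):
    rows = grp.to_dict("records")
    for a, b in combinations(rows, 2):
        # Two encounters overlap if neither ends before the other starts.
        if a["ADMIT"] <= b["DISCHARGE"] and b["ADMIT"] <= a["DISCHARGE"]:
            g.add_edge(a["ENCOUNTER_ID"], b["ENCOUNTER_ID"])

# Each connected component becomes one merged encounter: {101, 102} and {103}.
merged_groups = list(nx.connected_components(g))
```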
Updating IDs and Creating Lookup Table
- Generate new encounter IDs with update_encounter_ids for uniquely identifying merged encounters
- Create a lookup table mapping old encounter IDs to the new merged ones with create_encounter_lookup_table, a crucial step for data consistency
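A hypothetical illustration of the lookup-table step, in the spirit of update_encounter_ids and create_encounter_lookup_table: every old encounter ID in a merged group maps to one new merged ID. Column names and the ID scheme are assumptions.
```python
import pandas as pd

# e.g. the connected components from the sketch above
merged_groups = [{101, 102}, {103}]

lookup = pd.DataFrame(
    [
        {"OLD_ENCOUNTER_ID": old_id, "MERGED_ENCOUNTER_ID": new_id}
        for new_id, group in enumerate(merged_groups, start=1)
        for old_id in sorted(group)
    ]
)
# lookup now maps 101 -> 1, 102 -> 1, 103 -> 2.
```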
Data Labeling and Saving
- Label datasets with the new merged encounter IDs using label_df_with_merged_encounter_id, adaptable to various data structures and file types (sketched after this list)
- Save the processed data, now labeled and formatted correctly
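A minimal sketch of labelling an arbitrary dataset with merged encounter IDs, analogous in spirit to label_df_with_merged_encounter_id; the column names and signature are assumptions.
```python
import pandas as pd


def label_with_merged_ids(df: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    # Map each old encounter ID to its merged ID via the lookup table.
    id_map = lookup.set_index("OLD_ENCOUNTER_ID")["MERGED_ENCOUNTER_ID"]
    out = df.copy()
    out["MERGED_ENCOUNTER_ID"] = out["ENCOUNTER_ID"].map(id_map)
    return out
```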
Completion Checks
- Validate if all merging, labeling, and processing steps were successful.
- If successful, write a success file; otherwise, log errors and exit
Phase 1 Comprehensive Flowchart
Below is the comprehensive high-level flowchart of the entire Phase 1 process.
For more details on the low-level flow of Phase 1, see the flowchart at Detailed flowchart-Phase-I.
Below is the sequence flowchart showing the hierarchy of flow between the files of Phase 1.