Healthcare Analytics: Preparing Data

Trần Thế Truyền
Center for Pattern Recognition and Data Analytics
Deakin University, Australia
Email: truyen.tran@deakin.edu.au
URL: truyen.vietlabs.com
HUST, VN, Dec 2013

Jim Gray’s rule of 20 queries
  Born January 12, 1944
  Lost at sea January 28, 2007
  Declared deceased May 16, 2012

Queries for predictive tasks
  Cohort selection
    Start/end time
    Diseases
    Biomarker measurements
    Age/gender/insurance types
  Assessment points
    Fixed time
    Discharge
    First diagnosis
    Fixed time after first diagnosis
    Treatment episodes
  Outcome types
    Readmission
    Mortality
    LOS (length of stay)
    QOL (quality of life)
  Asset list (patient, doctor, medication, procedure, code, care package)
  Outcome filters
    By certain conditions
  Time-resolution
    Any events?
    Fixed intervals (e.g., week)

A predictive framework

Using intuition and domain knowledge
  Intuition is important.
    There is an infinite number of hypotheses.
    We need to search for some highly probable ones!
  But it can be dead wrong!
    A recently discharged patient can be readmitted right away (just as if not treated).
    A good doctor can be associated with a high rate of mortality and readmission.

Using intuition and domain knowledge (2)
  Domain knowledge is critical
    Check the literature!
    Do the homework, e.g., pregnancy diabetes.
  But a lot of data can support dumb tricks: for the Heritage Health Prize, I used second-order interactions without knowing anything about the true meaning of the features.

EMR Characteristics
  Time-stamped multiple databases
  Causal relationships
    Disease → disease
    Disease → intervention → outcome
  Sparse, noisy, episodic, irregular
    Organized around illness/treatment episodes
    Coding can be wrong or missing
  No recorded events does not mean absence of illness
    Absence of evidence ≠ evidence of absence
    Measurements are simply not entered
    Measurements are made when doctors feel the need to
  Free text (discharge summary, clinical/carer notes)

Time-stamped events
  Could be episodic with a clear start/end
  Could be regular
    Every 10 mins for a visit, with multiple segments if needed
  Could be just wrong
    Planned intervention time recorded instead of actual time

Repeated measurements
  Discrete: binary/categorical/ordinal
  Continuous
  Problem: missing!!!
  Biases: measurements are taken to find risk, so they are potentially biased
  Which measurements should we use?

Data preparation for analysis
  Organize data so that it supports patient-level processing
    Static: age, gender, religion
    Time-stamped: everything else
  For programming, finding the right data structure is 50% of the game!
    The data structure already suggests the most suitable algorithms.

Warning: leakage!
  Make sure the patients are counted AFTER first diagnosis
  Often, we have future data as well
    Retrospective nature
  Never use outcomes to do anything, except for training the model
  Our early suicide attempt classification from assessments was a form of leakage:
    Any attempt in history is considered an outcome. BUT:
    Previous attempts were already accounted for in the current assessment!

Data tables (1)
  Patient table
    UR/Patient ID
    Demographics
  Admission table
    UR/Patient ID
    Time stamp
    Insurance
    Source of admission
  Emergency attendance table
    UR/Patient ID
    Time stamp
    The main ICD code
    Transfer status

Data tables (2)
  Code table
    Admission ID
    Code
    Code type (ICD, Proc, DRG)
    Insurance
    Is primary
  Lab test
    UR/Patient ID
    Time stamp
    Test name
    Test value

Primitive data structures
  Array: 1D, 2D, ...
    In Perl: @a; $a[$i] = 'j'; $aa[$i][$j] = 'k';
  Dictionary/hash/associative array/struct
    In Perl: %b; $b{i} = 'j';
  Pass by reference
    In Perl: $aref = \@a; $bref = \%b;
    Saves memory
    Possibly faster

Sparse networks/matrices
  Adjacency matrices: (1,10,2)
  Inverted-index file (key → document list)
    Important in sparse settings
    Enables fast search, e.g., disease-patient
  ICD codes
  Time intervals

Internal data-structure
  An array of patients
    ID/UR
    + → admission list
      Time stamp
      Length-of-stay
      Pointer to code list
        A structure of {name, type, severity, primary/secondary}
      Pointer to medication list
        A structure of {name, class, dose, frequency}
    + → ED list
      Time stamp
      Length-of-stay
      ICD code
      Transfer status

Dealing with time (1)
  Intervals
    First diagnoses
    First treatments
  Trend analysis
  Distribution/concept drift
    Changes in intervention procedures
    Changes in disease prevalence
    Changes in policy and incentives

Dealing with time (2)
  Aggregating over a period
    Gaussian smoothing, very similar to image denoising
    Motivation behind our filter-bank technique
  Time-resolution: chronic versus acute
  Model transitions directly
    Longitudinal models

3 steps in end-to-end systems
  Dictionary building
    Initial construction
    Periodic updating, e.g., new procedures & drugs
  Model development
    Prototype building
    Periodic re-training: diseases and treatments are dynamic
  Model deployment
    Talk to database
    Communicate with GUI
    Explain the prediction

Building dictionaries
  Basis for feature engineering
    Diagnostic codes
    Procedure codes
    DRG codes
    Morphological codes
    Test names
    Medication classes

Data normalization
  Drugs
    Drug companies offer different brand names for essentially the same drug
    DDD/ATC is the central register for medication classes, maintained by WHO
  Tests
    Several test names may refer to the same test

Dictionary compression
  It may not be robust to use the original “vocabularies”
    Tens of thousands of ICD codes, thousands of procedures, hundreds of DRGs, thousands of medication classes
  Codes are usually organized in a hierarchy
    Choosing the right hierarchy is a statistical issue

Simple statistics
  Counts: rare features are very noisy
  Document frequency: very frequent features are not very helpful
  Frequency per class: the basis for t-statistic feature selection
  Odds ratio: the key for individual contributions

Robustness control
  Extreme values, e.g., LOS > 96th percentile (= 16 days)
  If no outcomes: keep only those frequently seen
    Everything else of the same type is put into the same bucket labelled “Rare”, e.g., rare diseases.
  If outcomes are available in the training data, can optionally keep those seen in positive events (e.g., readmissions, deaths). But this may lose information.

Robustness control (2)
  Missingness can be a feature itself.
  Imputation can be done, but its helpfulness can be hard to justify.
  Rules of thumb
    Number of events should be >= 10 × number of features.
    Training AUC and validation AUC shouldn’t be too far from each other.
...
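
Sketch: internal data-structure
The "Internal data-structure" slide describes an array of patients, each pointing to an admission list (with code and medication lists) and an ED list. The block below is a minimal Python sketch of that layout, not the author's implementation. All class and field names (Patient, Admission, EDVisit, Code, Medication, code_type, drug_class) are hypothetical, the unit of length-of-stay is assumed to be days, and the sample record is made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Code:
    name: str                        # e.g. an ICD code string
    code_type: str                   # "ICD", "Proc" or "DRG"
    severity: Optional[str] = None
    is_primary: bool = False         # primary vs. secondary


@dataclass
class Medication:
    name: str
    drug_class: str
    dose: str
    frequency: str


@dataclass
class Admission:
    timestamp: datetime
    length_of_stay: float            # in days (assumed unit)
    codes: List[Code] = field(default_factory=list)
    medications: List[Medication] = field(default_factory=list)


@dataclass
class EDVisit:
    timestamp: datetime
    length_of_stay: float
    icd_code: str
    transfer_status: str


@dataclass
class Patient:
    ur: str                          # UR/Patient ID
    admissions: List[Admission] = field(default_factory=list)
    ed_visits: List[EDVisit] = field(default_factory=list)


# Illustrative (made-up) record: one patient with one admission.
patients = [Patient(ur="P001")]
patients[0].admissions.append(
    Admission(
        timestamp=datetime(2013, 1, 5),
        length_of_stay=3.0,
        codes=[Code(name="E11", code_type="ICD", is_primary=True)],
    )
)
```

Keeping everything time-stamped under the patient keeps processing at the patient level, as the "Data preparation for analysis" slide recommends.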
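
Sketch: inverted index
The "Sparse networks/matrices" slide mentions an inverted-index file (key → document list) for fast disease-patient search. A minimal Python sketch follows, assuming the input is a plain dictionary from patient ID to a list of codes. The function name build_inverted_index and the sample IDs and codes are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(patient_codes):
    """Map each code to the set of patient IDs that carry it."""
    index = defaultdict(set)
    for patient_id, codes in patient_codes.items():
        for code in codes:
            index[code].add(patient_id)
    return index


# Made-up example data: patient ID -> diagnosis codes.
patient_codes = {
    "P001": ["E11", "I10"],
    "P002": ["I10"],
    "P003": ["E11", "N18"],
}

index = build_inverted_index(patient_codes)

# Cohort selection by disease becomes a single dictionary lookup.
cohort = sorted(index["E11"])
```

With tens of thousands of codes and few codes per patient, the lookup avoids scanning every patient record for each query.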
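
Sketch: avoiding leakage at the assessment point
The "Warning: leakage!" slide says future data is often available because the data is retrospective, and that outcomes must only be used for training labels. One way to enforce this, sketched below in Python, is to build features only from events at or before the assessment point and to derive the label only from events after it. This is an assumed scheme for illustration: the function make_example, the 30-day horizon, and the event names are all hypothetical.

```python
from datetime import datetime, timedelta


def make_example(events, assessment_time, horizon_days=30, outcome="readmission"):
    """Return (features, label) for one patient.

    events: list of (timestamp, event_name) tuples.
    Features count events at or before the assessment point.
    The label looks only at the window strictly after it.
    """
    window_end = assessment_time + timedelta(days=horizon_days)

    features = {}
    for t, name in events:
        if t <= assessment_time:
            features[name] = features.get(name, 0) + 1

    label = int(any(
        name == outcome and assessment_time < t <= window_end
        for t, name in events
    ))
    return features, label


# Made-up event history for one patient.
events = [
    (datetime(2013, 1, 1), "diabetes_dx"),
    (datetime(2013, 1, 10), "admission"),
    (datetime(2013, 1, 20), "readmission"),
    (datetime(2013, 3, 1), "readmission"),
]

# Assessment point: discharge on 12 January 2013.
features, label = make_example(events, datetime(2013, 1, 12))
```

Here the readmission on 20 January contributes to the label but never to the features.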
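
Sketch: Gaussian smoothing of event counts
The "Dealing with time (2)" slide mentions aggregating over a period with Gaussian smoothing, similar to image denoising. The Python sketch below smooths a series of per-interval event counts with a truncated Gaussian kernel. It illustrates the general idea only and is not the filter-bank technique the slide refers to. The truncation at three standard deviations and the renormalisation at the series edges are assumptions, as are the function names.

```python
import math


def gaussian_kernel(sigma, radius):
    """Gaussian weights for offsets -radius..radius, normalised to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(series, sigma=1.0):
    """Smooth a list of counts; weights are renormalised near the edges."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(series)
    out = []
    for i in range(n):
        acc = 0.0
        norm = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = i + k
            if 0 <= j < n:          # skip positions outside the series
                acc += w * series[j]
                norm += w
        out.append(acc / norm)
    return out


# Made-up weekly event counts with a single spike in week 3.
weekly_counts = [0, 0, 10, 0, 0]
smoothed = smooth(weekly_counts, sigma=1.0)
```

A larger sigma spreads each event over more intervals, which suits chronic conditions; a smaller one keeps the sharper resolution that acute events need.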
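
Sketch: robustness control
The "Robustness control" slide suggests handling extreme values by percentile and putting infrequent items of the same type into one bucket labelled “Rare”. The Python sketch below shows one possible reading. Capping values at the percentile is an assumption (the slide does not say whether extremes are capped or dropped), the percentile uses the nearest-rank definition, and the names cap_at_percentile, bucket_rare and min_patients are hypothetical.

```python
from collections import Counter


def cap_at_percentile(values, pct=96):
    """Cap values at the nearest-rank pct-th percentile; return (capped, cap)."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap


def bucket_rare(codes_per_patient, min_patients=2, rare_label="Rare"):
    """Replace codes seen in fewer than min_patients patients with rare_label."""
    doc_freq = Counter()
    for codes in codes_per_patient:
        doc_freq.update(set(codes))          # count each patient once per code
    return [
        [c if doc_freq[c] >= min_patients else rare_label for c in codes]
        for codes in codes_per_patient
    ]


# Made-up lengths of stay in days, with one extreme value.
los = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
capped_los, cap = cap_at_percentile(los, pct=90)

# Made-up diagnosis codes for three patients.
codes = [["E11", "I10"], ["I10"], ["Q87", "I10"]]
bucketed = bucket_rare(codes, min_patients=2)
```

In practice the cap and the frequency threshold should be computed from training data only and then reused on validation data.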
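
Sketch: odds ratio and the events-per-feature rule
The "Simple statistics" slide names the odds ratio as the key for individual contributions, and the rules of thumb ask for at least 10 events per feature. The Python sketch below computes both for binary data. The 0.5 continuity correction (added to every cell so that empty cells do not cause division by zero) is an assumption, not something the slides specify, and the function names and sample data are hypothetical.

```python
def odds_ratio(feature_present, outcome, correction=0.5):
    """Odds ratio of a binary feature against a binary outcome.

    2x2 table: a = feature & outcome, b = feature & no outcome,
               c = no feature & outcome, d = neither.
    """
    a = b = c = d = 0
    for f, y in zip(feature_present, outcome):
        if f and y:
            a += 1
        elif f and not y:
            b += 1
        elif not f and y:
            c += 1
        else:
            d += 1
    return ((a + correction) * (d + correction)) / (
        (b + correction) * (c + correction))


def enough_events(n_events, n_features, ratio=10):
    """Rule of thumb: number of events >= ratio * number of features."""
    return n_events >= ratio * n_features


# Made-up data for eight patients.
has_feature = [1, 1, 1, 0, 0, 0, 0, 0]
readmitted = [1, 1, 0, 1, 0, 0, 0, 0]

ratio_corrected = odds_ratio(has_feature, readmitted)
ratio_raw = odds_ratio(has_feature, readmitted, correction=0.0)
```

For this table (a=2, b=1, c=1, d=4) the raw odds ratio is (2×4)/(1×1) = 8, and the corrected one is (2.5×4.5)/(1.5×1.5) = 5. With 3 events, enough_events(3, 1) is False, so even a single feature is more than this sample supports.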