Healthcare Analytics: Feature Engineering

Trần Thế Truyền
Center for Pattern Recognition and Data Analytics, Deakin University, Australia
Email: truyen.tran@deakin.edu.au
URL: truyen.vietlabs.com
HUST, VN, Dec 2013

Why feature engineering?
- (Domingos, 2012): FE consumes "most of the effort in a machine learning project".
- Most of the project's time goes into feature engineering;
- 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm;
- 10% goes into algorithm selection and tuning.
- Just ask Google for "feature engineering"!

Issue: Absence of evidence
- Reminder: absence of evidence ≠ evidence of absence.
- Method 1: collect a statistic within a period of time (e.g., the past year): Max, Min, Mean, Count (see the sketch below, after the time-stamped-events slide).
- Max, min & mean are examples of smoothing.
- Method 2: borrow from the population.
- Method 3: borrow from the neighbourhood.
- Method 4: imputation from a proper model of the data, e.g., EM-style algorithms.

Issue: Weak predictors and noise – pre-filtering techniques
- Occurrence threshold
- Prevalence threshold
- Correlation threshold
- t-statistic
- Top-K, e.g., using a rule of thumb – keep the K = [N/10] best features, where N is the number of (positive) events.
- Univariate p-value policy

Issue: Outliers
- Check the reported ranges carefully.
- Keep values within the 0.5–99.5 percentiles. Use truncation if needed.

Discretization (1)
- Popular in data mining in the 2000s.
- A need for Bayesian network classifiers (see the Monash group).
- Can be useful if nonlinearity is apparent – this is a type of simple space partitioning (e.g., decision/regression trees).
- How many categories?
- Should they be treated as unordered or ordered categories?
- Several approaches: equal frequency (percentiles), MDL.
- [Figure: Alveolar echinococcosis in Chinese people. Adapted from Giraudoux et al. (2013).]

Discretization (2)
- Not preferred in advanced clinical models: loss of information; multiple hypotheses; assumption of homogeneity within a group; data-driven discretization is not generalizable; transformations/splines can be better alternatives.
- If linearity holds, then DON'T!
- But everyone does so: easier to compute prevalence and odds ratios; easier for doctors to reason.

Category collapsing
- Which age interval is the most appropriate? No brainer: every decade.
- Which level of cancer-type generality?
- Simple criteria: enough support (e.g., > 20 occurrences), or enough prevalence (e.g., > 1%).
- Again, rare categories can be grouped into a "rare-category" bucket.
- Result focused: just test level 2, level 3, level 4 in the ICD-10 tree.

Feature transformation
- Polynomial.
- For counts, a square root is good to kill outliers.
- Using thresholds: difference from the mean, in either direction; difference from a known threshold (e.g., blood glucose level >= 7.0).
- Spline functions.

Temporal resolutions
- Last hospitalization.
- Acute conditions: days – weeks.
- Acute but longer term: weeks – months.
- Chronic conditions: years – lifetime.

Exploiting time-stamped events
- Information from episodes (with a clear start/end).
- Collapsing regular measures, e.g., every 10 minutes within a visit.
- Be aware of planned intervention times recorded instead of actual times.
- Look for causal relationships.
- Ordering is sometimes important, e.g., first visit -> treatment -> discharge -> revisit.
- Proactively look out for trends.
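The following is a minimal sketch of Method 1's windowed statistics (max, min, mean, count) over time-stamped observations, as discussed in the absence-of-evidence and time-stamped-events slides above. The function name, data layout, and the one-year default window are illustrative assumptions, not taken from the slides; note that an empty window returns None rather than zero, so absence of evidence stays distinguishable from a measured zero.

```python
from datetime import datetime, timedelta

def window_stats(events, as_of, window_days=365):
    """Aggregate (timestamp, value) events inside the look-back window
    [as_of - window_days, as_of] into max/min/mean/count features."""
    start = as_of - timedelta(days=window_days)
    values = [v for t, v in events if start <= t <= as_of]
    if not values:  # no evidence in the window: keep it explicit
        return {"max": None, "min": None, "mean": None, "count": 0}
    return {
        "max": max(values),
        "min": min(values),
        "mean": sum(values) / len(values),
        "count": len(values),
    }

# Illustrative usage: blood-glucose readings for one patient
glucose = [
    (datetime(2013, 1, 5), 6.1),
    (datetime(2013, 6, 20), 7.4),
    (datetime(2013, 11, 2), 8.0),
]
print(window_stats(glucose, as_of=datetime(2013, 12, 1)))
```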
Ideas borrowed from statistics, signal processing, computer vision
- Intensity
- Salient points
- Convolution operators
- Entropy/variance
- Pulse and burst
- Finite impulse response filter
- (Source: Wikipedia)

Ideas borrowed from marketing & NLP
- RFM: Recency, Frequency, Monetary value.
- Severity (medicine-specific): "design features where the likelihood of a certain class goes up monotonically with the value of the field."
- N-gram/phrase/collocation.
- Template/regular expression.
- [Figure: prevalence by age for Class 1 vs. Class 2.]

Our own framework
- Recorded history as a temporal event image.
- The image is convolved with a one-sided filter bank at multiple scales and locations.
- Pooling operators are applied to obtain mid-level features.
- Leave feature screening to the model developer.

Filter bank – an example of finite impulse response
One-sided convolutional filters
- Gaussian kernel, uniform kernel, and filter response (formulas on the original slide; see the sketch at the end of this outline).

Check for linearity
- The most popular clinical model is in linear form.
- We need to check for linearity.
- Transform if needed.
- Piecewise linearity if needed.
- [Figure: Alveolar echinococcosis in Chinese people. Adapted from Giraudoux et al. (2013).]

More is more and more is less
- The current practice in machine learning is to extract all possible features we can think of.
- There is evidence that very weak features can also be useful when used together.
- If additivity is the main property, then more is more.
- However, knowing one feature may exclude the effect of knowing another feature: more could be less.

More advanced ideas
- Haar wavelets.
- Delay and trending.
- Our filter is a special case.
- (Source: Wikipedia)

Higher-order interactions
- Mostly product form, e.g., male & heart failure; female & breast cancer.
- (Friedman & Popescu, 2005)

More on higher-order interactions
- A simple Apriori-style rule to keep away from the exponential explosion: compute an n-order feature only if its (n-1)-order subcomponents have enough support/prevalence/correlation.
- Filtering/Lasso is almost always needed to prevent overfitting.
- Many shallow decision trees can be used for the job: RuleFit by J. Friedman.
- Simple random subspaces + subsampling can be used as well.
- Later: the idea behind my Random Learning algorithm.

Final thoughts
- Feature engineering is an integrated part of a full analytics cycle.
- Make the cycle really fast and iterate; prototyping languages are the key.
- Don't tune ML algorithms; their performance is almost the same given a decent amount of data. Famous paper: Classifier technology and the illusion of progress (Hand, 2006).
- Look into the data for patterns and noise. Visualization helps.
- Be skeptical, especially when discovered features violate biomedical understanding.
- See a Kaggle winner's talk:
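The kernel formulas on the one-sided convolutional filter slide did not survive extraction, so the sketch below only shows an assumed general shape: a causal (one-sided) Gaussian or uniform kernel convolved with a daily event series at several scales, followed by max-pooling into mid-level features. All function names, kernel definitions, bandwidths, and window widths here are illustrative assumptions, not the author's exact formulation.

```python
import numpy as np

def one_sided_kernel(width, kind="gaussian", sigma=None):
    """Build a causal (one-sided) kernel over `width` past time steps.
    Assumed forms: Gaussian decays with distance into the past; uniform
    weights the whole look-back window equally. Both sum to 1."""
    t = np.arange(width, dtype=float)  # 0 = now, larger = further back
    if kind == "gaussian":
        sigma = sigma or width / 3.0
        k = np.exp(-0.5 * (t / sigma) ** 2)
    elif kind == "uniform":
        k = np.ones(width)
    else:
        raise ValueError(kind)
    return k / k.sum()

def filter_responses(event_series, widths=(7, 30, 90)):
    """Convolve a daily event series with a one-sided filter bank at
    several scales, then max-pool each response into one feature."""
    feats = {}
    for w in widths:
        for kind in ("gaussian", "uniform"):
            k = one_sided_kernel(w, kind)
            # Full convolution truncated to the series length keeps the
            # filter causal: each output sees only current and past steps.
            resp = np.convolve(event_series, k, mode="full")[: len(event_series)]
            feats[f"{kind}_{w}d_max"] = resp.max()
    return feats

# Illustrative usage: one year of daily event counts for a single patient
rng = np.random.default_rng(0)
series = rng.poisson(0.05, size=365).astype(float)
print(filter_responses(series))
```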
