Abstract
The accelerated rate of technological innovation in the 21st century has yielded many large data sets requiring new methods and approaches for analysis. In the past, data collection was mostly done after the development of a hypothesis-based experimental design and project plan (“primary data-based studies”). In contrast, data from secondary
data-based studies are large data sets, potentially big data, collected before the method or goal of the analysis is determined. While data from secondary data-based studies can be limited by the nature of their collection method, they can compensate with their large sample sizes and the increased detail from larger numbers of variables. At the same time, large secondary data-based studies are more prone to imperfect data challenges, including rare events, high dimensionality, missing data, multilevel data, and undefined outcomes. Routinely collected animal health and production data are good examples of secondary data commonly used in veterinary epidemiology. Epidemiology has a strong foundation in using statistical methods for data analysis. However, epidemiology has recently followed trends in data science, such as the emergence of data mining and machine learning. Unlike in statistics, there is a growing movement towards automation in data science. Automation of data analysis has the potential to increase the amount of output and improve the resulting model performance. We can assume that some portions of veterinary epidemiological research will continue to follow the trends in data science towards automation, driven by the desire to develop real-time surveillance and prediction capabilities in epidemiology. However, epidemiology will always need to be focused on biological relevance and meaningful interpretation of results. Therefore, epidemiology needs to prepare for the trend towards automatic data analyses by adapting or developing methods that can be automated in the future and that do not remove the focus of the research from the biological relevance and interpretation of the results. In this dissertation, the goal was to develop methods for analyzing imperfect data and solutions for the systematic integration, comparison, and selection of statistical methods.
A wide variety of available data sets from large, secondary data-based studies were used. A range of methods to address imperfect data challenges were applied, and new methods and systematic protocols were developed, all with a focus on biological relevance and sound interpretation of the results. The studies had diverse goals that fit into three main areas: parameter estimation modeling, prediction modeling, and pattern discovery. This work is a stepping stone towards ensuring that future data analyses are not hindered by the imperfections of large data sets. Additionally, this work ensures that the many possible methods available for analyzing data become an advantage and not a hurdle as we move towards automation of data analysis. The methods described and used in this dissertation will allow access to more data for analysis while gaining the benefit of a large data set. Finally, this work suggests that automation of data analysis can coincide with a focus on biological relevance and sound interpretation.