Posts

Showing posts with the label Data Quality

2023-01-02: Survey of Data Interfaces and Archived Data

Data, Data Everywhere

Data is all around us and comes in many forms: statistical, descriptive, discrete, encoded, decoded, geocoded, qualitative, quantitative, quantum 🤯! Wow, there's a lot! Data, often statistical, forms the bedrock of informed decision making in scientific, political, academic, and economic fields, to name a few. Public figures and publications have even gone so far as to call data "the new oil". The wellspring of data generated and mined during the COVID-19 pandemic further reinforces this idea of data as oil. During the pandemic we saw an explosion of interest in, and data generated by, policy makers in government, scientists across the world, and even citizen scientists using tools such as Folding@Home. The National Institutes of Health alone lists 36 different COVID-19 related datasets, with a further 28 listed on the World Bank's website.

Fig 1. World Health Organization: Coronavirus dashboard

From China...

2022-06-17: StreamingHub - Building Reusable, Reproducible Workflows

As researchers, we often create artifacts (i.e., data and code) that others (including ourselves) might reuse in the future. Especially during the early stages of research (i.e., the exploration phase), we hack bits and pieces of code together to test different hypotheses. Upon discovering a few hypotheses that work, our focus shifts towards rigorous testing, academic writing, and publication. At which stage should we ensure the reusability of those artifacts? Ideally, as early as possible; yet this is easier said than done. For instance, at the exploration stage, it's often impractical to allocate time for data/code reusability. After all, at this point, it's quite unclear whether the hypotheses will even pan out. Likewise, at every stage that follows, it's quite easy to get consumed by testing, academic writing, and publication. Such circumstances have pushed, and continue to push, reusability down to an afterthought. On the flip side, having to work with "difficult...

2022-01-18: Evaluating Trust in User-Data Networks: What Can We Learn from Waze?

One of the research thrusts I'm currently pursuing is evaluating how much we can trust crowdsourced data provided by users in a large multi-user network with social components. As a first step, I've been considering some popular user-data networks and current efforts in the literature to gain an understanding of the general dynamics of such systems. As you'll see, these dynamics can best be described, at a very high level, as a dependency on the interactions between users (or, in some cases, autonomous agents) and their behavior in the system. The ultimate goal is to treat the quality of the data that users/agents provide as their "behavior." These networks need not be social networks; they can include social-centric services such as Waze (a navigation application that uses user-provided data on current traffic conditions and locations), Glassdoor (a job search and company information site that uses user-provided data on salaries and benefi...