Tips about what, how and where data should be collected
How data quality helps us answer our questions
You can only extract information from data if the data is of good quality. The more high-quality data you have, the more information you can get from it.
To that end, here are 23 tips to help your organization start getting, or finally get, the most value from its data.
3 factors to consider when deciding where to save data
- The amount of data you will collect.
- How often the data will be queried.
- How urgently the data is needed when it is requested.
Why should you consider these three factors? Because together they determine where each kind of data is best stored, as the following tips show.
3 tips for deciding where to save data
- If retrieving data fast is not critical, store it on storage tiers that are slower but cheaper (“cold” storage), so that volume doesn’t affect cost as much.
- Keep data that is requested frequently or needed urgently in high-speed tiers (“hot” storage). How urgent it is depends on how important it is to get that data on time.
- Architect your data solution so that the rotation between “hot” and “cold” tiers adapts to your use case. This will not only reduce costs but also improve performance in most scenarios.
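The three factors and the tier tips above can be sketched as a small decision rule. This is only an illustration: the `Dataset` fields, the `choose_tier` function, and both thresholds are hypothetical and should be tuned to your own workload and pricing.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your own workload and pricing.
HOT_QUERIES_PER_DAY = 10.0
HOT_MAX_LATENCY_SECONDS = 1.0


@dataclass
class Dataset:
    name: str
    size_gb: float              # amount of data collected
    queries_per_day: float      # how often the data is queried
    max_latency_seconds: float  # how quickly results are needed


def choose_tier(ds: Dataset) -> str:
    """Return "hot" for urgent or frequently queried data, otherwise "cold"."""
    urgent = ds.max_latency_seconds <= HOT_MAX_LATENCY_SECONDS
    frequent = ds.queries_per_day >= HOT_QUERIES_PER_DAY
    return "hot" if urgent or frequent else "cold"


datasets = [
    Dataset("orders_current_month", size_gb=50, queries_per_day=500, max_latency_seconds=0.5),
    Dataset("orders_archive_2015", size_gb=4000, queries_per_day=0.1, max_latency_seconds=3600),
]
tiers = {ds.name: choose_tier(ds) for ds in datasets}
print(tiers)  # {'orders_current_month': 'hot', 'orders_archive_2015': 'cold'}
```

Re-running a rule like this periodically is one simple way to rotate data between tiers as usage changes. In this sketch, size only affects cost, so it does not enter the decision itself.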
Now for the most important part: what data you need to store. This is what determines whether your data is quality data or not, so we recommend reading this section several times until it is completely clear.
12 tips for deciding what data to store
- Store non-urgent data on devices with high capacity and slow access times. These data can be referenced with pointers located on faster devices.
- Compressing non-urgent data is also a good idea.
- Do not worry about defining different schemas for different users of the data (most commonly needed for privacy reasons), since you can create logical schemas on top of data that is already saved.
- Study the relationships between concepts, since a query may otherwise have to retrieve a lot of unnecessary data. Think about which operations will be performed most frequently and keep the number of intermediate relationships between related concepts small.
- The more numerical data you store, the better.
- If you categorize data (for example, assigning 1 to people of legal age and 0 to everyone else), always store how those values were calculated. In this case, store the raw age value as well.
- Save dates and times for operations and transactions.
- Your data source may sometimes deliver records with missing values. Decide on a logical default value for each field that can be missing.
- Know the encoding in which you will save your data (UTF-8, for example).
- Write validations that prohibit data values that do not make sense. When a value violates a validation, decide whether to reject the insert or to substitute a default value.
- When you have a high urgency to get certain data that requires a lot of computationally expensive operations, you may well need to pre-calculate that data and store it redundantly so that it can be retrieved faster.
- In many systems, operations on date types are slower than operations on integer types. If you have no storage constraints, consider storing the date and time components (year, month, hour, etc.) as separate numeric values, so that you can operate on numeric data instead of on the date type.
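Several of the tips above (keeping the raw value next to a derived category, timestamping, default values, and validations) can be combined in one small sketch. Everything named here is an assumption made for illustration: the `PersonRecord` fields, the `build_record` function, the legal age of 18, the accepted age range, and the `"unknown"` default.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

LEGAL_AGE = 18               # assumption: threshold behind the category
MAX_PLAUSIBLE_AGE = 130      # assumption: upper bound used by the validation
DEFAULT_COUNTRY = "unknown"  # assumption: logical default for a missing value


@dataclass
class PersonRecord:
    age: int               # raw value, stored alongside the category
    is_adult: int          # derived category: 1 if age >= LEGAL_AGE, else 0
    country: str
    recorded_at: datetime  # when the record was created (UTC)


def build_record(age: int, country: Optional[str] = None) -> PersonRecord:
    # Validation: reject values that do not make sense.
    if not 0 <= age <= MAX_PLAUSIBLE_AGE:
        raise ValueError(f"age out of range: {age}")
    return PersonRecord(
        age=age,
        is_adult=1 if age >= LEGAL_AGE else 0,
        country=country if country else DEFAULT_COUNTRY,  # default for missing data
        recorded_at=datetime.now(timezone.utc),
    )


record = build_record(34)
print(record.is_adult, record.country)  # 1 unknown
```

Because the raw age is stored, the category can be recomputed later if the threshold changes. Whether a failed validation should raise, as it does here, or fall back to a default is the decision the validation tip asks you to make for each field.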
8 tips on how to start collecting data
- Always start with your most important assets and processes. Focus on collecting data from them rather than on your current goals or KPIs. If you design your data around goals and those goals change over time, you will need to completely change the schema, and the data collected so far will have little value (if any).
- Start collecting data little by little. Do not try to create a big data schema at the beginning; start small and extend your schema as you get comfortable (be agile).
- The ingestion process must be periodic and methodical. That is, set the collection frequency in advance according to how often the data needs to be collected.
- Be realistic: collecting data can be laborious until you have an automated process, so be aware of the resources you can devote to it.
- Follow standard formats; it will save you time and money.
- Make regular backups. Errors can always occur, so we consider it extremely important that you keep decentralized, redundant backups.
- Give importance to metadata (data that describes data), such as timestamps.
- Once you have all of the above in place, you can start thinking about collecting more data: for example, the weather on a specific day (if that matters to you), data about your competition, your market share, etc. In other words, data that is not directly part of your business but that has an impact on it.
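As a minimal sketch of the tips on standard formats and metadata, each collected reading can be wrapped in a JSON envelope with an ISO 8601 timestamp. The `wrap_with_metadata` function, the field names, and the `store-01` source are hypothetical, chosen only for illustration.

```python
import json
from datetime import datetime, timezone


def wrap_with_metadata(payload: dict, source: str, schema_version: int = 1) -> str:
    """Serialize one reading as JSON together with descriptive metadata."""
    envelope = {
        "metadata": {
            "source": source,                  # where the data came from
            "schema_version": schema_version,  # lets the schema grow over time
            "collected_at": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        },
        "data": payload,
    }
    return json.dumps(envelope, ensure_ascii=False, sort_keys=True)


line = wrap_with_metadata({"units_sold": 12}, source="store-01")
print(line)
```

Keeping a version number in the metadata is one way to start with a small schema and extend it later without losing track of which records follow which layout.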
Important: the results and benefits of data take time to appear. To obtain relevant information, you need a minimum amount of substantial data, so be consistent.
We sincerely hope these tips help you get the most out of your data. If you still do not know where to start, contact us and we will help you make your organization data-driven.