In our previous post, we provided an overview of the key focus areas for getting started in data analysis. This post will take a deeper look at the area of data storage for analytics.
We will discuss the common platforms used for storing and accessing data for analysis.
Data storage for analytics – Text Files
One of the most common data storage for analytics is the simple text file! It is sometimes referred to as CSV (Comma Separated Values) file.
The text file is simply a file that you can open in any text editor and contains a data dump of all your records.
Your records are usually separated by a special character or space such as | ~ or , . Hence, the name comma sparated values. Many applications that deal with data allow you to export the data from the application to a text file. See the screenshot of how the data looks like:
In the above text file or CSV file, we have the data fields or headers on the first line, followed by the data values separated by commas for each record on each line.
There is no limit to how much data a text file can hold, except for the size of your physical disk. However, it is still just a data dump and to analyse it, you will need to load it onto an analytics application.
When are text files used?
Text files are usually used when data analytics tools aren’t able to directly access the data on a system. As data analytics tools are able to read data from text files, we address this issue by exporting the system data to a text file and get the tool to read it.
Data storage for analytics – Spreadsheets – Excel / Google Sheets
Spreadsheets are widely used and hence, it is common to find data stored in spreadsheets like Excel and Google Sheets.
Your data is stored in rows and columns within worksheets.
See example below:
You are also able to read text file data into worksheets.
The good thing about spreadsheets is that you can:
- analyse your data via pivot tables
- run calculations on your data using spreadsheet formulas
- use spreadsheet formulas to create new data columns and cells to re-label your data
- create charts from your data
However, there are serious drawbacks with using spreadsheets as data storage for analytics :
- there is a limitation to the number data rows you can store (approximately 1 million for Excel)
- it is slow and inefficient if you are looking to join or merge different tables of data
- one can freely edit the data values on a worksheet and thus, makes it open to error if someone were to wrongly edit a value
- while you can run calculations on your data, it starts becoming inefficient and slow when a spreadsheet has many data rows
- spreadsheets have limited multi-user capability. You can’t use it simultaneously.
Storing and accessing data in databases
If you have large amounts of data, databases are the obvious choice to store the data.
Databases have the following advantages:
- the tables in a database can store large amounts of data
- allow for simlutaneous use (different people can access the data at the same time)
- you can summarise and run calculations on your data
- you can efficiently merge and join different database tables
- ability to maintain data integrity by only allowing specific data types for specific columns of a database or ensuring that there are no duplicate rows in a table
Most common databases are SQL (Structured Query Language) databases. You write SQL code to extract, merge, run calculations and summarise your data.
Data collection is growing at a rapid pace and you will likely, encounter the need to use a database. A common workflow in data analysis is to:
- query your required data from a database
- filter or summarise your data to the level required for analysis
- extract the filtered data and load it to a analytics tool to conduct your detailed analysis
It is therefore, crucial to have some SQL database knowledge.
Here is an example of a database table called transactions from Google’s BigQuery database. As you can see, the table has columns of specific data types. This maintains data integrity.
Here is an example of the record details stored in the table:
Visit our knowledge base, to find out more about databases and gain access some commonly used SQL code.