Data Sources

Learn about different types of databases and data sources.

In data analytics, we work with a variety of data, which can be stored in many locations. The places that data comes from are called data sources.

There are different types of data sources, including but not limited to databases (e.g., SQL and NoSQL), file shares, Software as a Service (SaaS) applications, devices, logs, and even social interactions. Any place where we can get data for potential analysis can be considered a data source.

Let’s consider what data sources can represent and which types of data can be acquired from them.

SQL and NoSQL databases

Databases are places where data is stored in an organized way, which allows users to access information easily and quickly. We describe two types of databases based on the way data is stored: SQL and NoSQL databases.

SQL databases, also known as relational databases, store information in the form of tables. The tables can be connected by specific relationships. SQL stands for Structured Query Language and is used to add, remove, update, and retrieve data in relational databases.

SQL example

Suppose there is a database table students that contains the columns student_id, full_name, and email.

Suppose there is another table courses with the columns course_id, course_name, and description.

A student might have several courses, and a course might have many enrolled students. We can create a third database table students_to_courses that captures this many-to-many relationship between student_id and course_id.

Illustrative SQL code for creating the three tables and their columns is given below. Primary keys uniquely identify each row, and foreign keys ensure that every row in students_to_courses refers to an existing student and an existing course.

-- MySQL-style syntax; other database systems spell AUTO_INCREMENT differently.
CREATE TABLE students (
    student_id INT NOT NULL AUTO_INCREMENT,
    full_name VARCHAR(200) NOT NULL,
    email VARCHAR(200) NOT NULL,
    PRIMARY KEY (student_id)
);

CREATE TABLE courses (
    course_id INT NOT NULL AUTO_INCREMENT,
    course_name VARCHAR(200) NOT NULL,
    description VARCHAR(800) NOT NULL,
    PRIMARY KEY (course_id)
);

-- Join table: one row per enrollment of a student in a course.
CREATE TABLE students_to_courses (
    student_id INT NOT NULL,
    course_id INT NOT NULL,
    PRIMARY KEY (student_id, course_id),  -- prevents duplicate enrollments
    FOREIGN KEY (student_id) REFERENCES students(student_id),
    FOREIGN KEY (course_id) REFERENCES courses(course_id)
);

Once the SQL tables have been created, we can populate them using INSERT INTO statements.

INSERT INTO students (full_name, email)
VALUES ('Happy Dwarf', 'happy_dwarf@magic_kingdom.com');
INSERT INTO courses (course_name, description)
VALUES ('Geography', 'Learn where all the dwarf groups live');

We can also use SQL to update existing data in a table.

UPDATE students
SET full_name = 'Happy Dwarf II'
WHERE email = 'happy_dwarf@magic_kingdom.com';

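Note: SQL keywords such as UPDATE, SET, and WHERE aren’t case sensitive. Whether table and column names are case sensitive depends on the database system and its configuration. In MySQL, for example, column names aren’t case sensitive, while table names can be, depending on the operating system and server settings. A SQL statement can appear on one line or span multiple lines.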
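The statements above create and modify data, but the many-to-many relationship becomes useful when we retrieve data across the tables. The following is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database. SQLite is an assumption made so the example runs without a database server, which is why the column types and AUTOINCREMENT spelling differ from the MySQL-style statements above. The enrollment row and the rejected row (student 99) are hypothetical sample data.

```python
import sqlite3

# In-memory SQLite database, used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY AUTOINCREMENT,
    full_name TEXT NOT NULL,
    email TEXT NOT NULL
);
CREATE TABLE courses (
    course_id INTEGER PRIMARY KEY AUTOINCREMENT,
    course_name TEXT NOT NULL,
    description TEXT NOT NULL
);
CREATE TABLE students_to_courses (
    student_id INTEGER NOT NULL REFERENCES students(student_id),
    course_id INTEGER NOT NULL REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")

# Populate the tables; the first generated IDs are 1 in each table.
conn.execute(
    "INSERT INTO students (full_name, email) VALUES (?, ?)",
    ("Happy Dwarf", "happy_dwarf@magic_kingdom.com"),
)
conn.execute(
    "INSERT INTO courses (course_name, description) VALUES (?, ?)",
    ("Geography", "Learn where all the dwarf groups live"),
)
conn.execute("INSERT INTO students_to_courses (student_id, course_id) VALUES (1, 1)")

# Retrieve each student together with their courses by joining through the link table.
rows = conn.execute("""
    SELECT s.full_name, c.course_name
    FROM students AS s
    JOIN students_to_courses AS sc ON sc.student_id = s.student_id
    JOIN courses AS c ON c.course_id = sc.course_id
""").fetchall()
print(rows)

# The foreign key rejects an enrollment for a student that does not exist.
try:
    conn.execute("INSERT INTO students_to_courses (student_id, course_id) VALUES (99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("Invalid enrollment rejected:", rejected)
```

The two JOIN clauses walk from students to courses through students_to_courses, which is the usual way to read a many-to-many relationship in a relational database.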

NoSQL example

The term NoSQL is relatively new, becoming prevalent in the early 2010s as people looked for new ways of storing data that could work better for large-scale web applications with specific usage patterns.

In contrast to SQL databases, where database tables have predefined columns and relationships, NoSQL databases don’t require this same predefinition. NoSQL databases organize data in other forms, such as key-value pairs, documents, or graphs. A student from our previous SQL example, along with that student’s courses, could be stored as a JavaScript Object Notation (JSON) document in a NoSQL database.

{
    "student_id": 1,
    "full_name": "Happy Dwarf",
    "email": "happy_dwarf@magic_kingdom.com",
    "courses": ["laughing", "walking", "reading"]
}
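To illustrate what the lack of predefined columns means in practice, here is a small sketch that parses two JSON documents with Python's json module. The second document (Sleepy Dwarf, with a nickname field and no courses) is hypothetical and is included only to show that documents in the same collection need not share the same fields.

```python
import json

# Two documents as they might be stored in a document database.
# The second one is hypothetical sample data with a different set of fields.
raw_documents = [
    '{"student_id": 1, "full_name": "Happy Dwarf", '
    '"email": "happy_dwarf@magic_kingdom.com", '
    '"courses": ["laughing", "walking", "reading"]}',
    '{"student_id": 2, "full_name": "Sleepy Dwarf", "nickname": "Snoozy"}',
]

students = [json.loads(doc) for doc in raw_documents]

# Code that reads the documents has to handle missing fields itself,
# because no schema guarantees that "courses" is present.
course_counts = {s["student_id"]: len(s.get("courses", [])) for s in students}
print(course_counts)
```

The flexibility moves responsibility from the database to the application: a missing field is not an error at write time, so readers use defaults such as s.get("courses", []).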

NoSQL approaches have their advantages and disadvantages. Even though NoSQL databases such as Amazon DynamoDB don’t require a full database schema (as defined by CREATE TABLE statements in SQL) beyond the keys used to look up items, properly setting up NoSQL databases isn’t a simple task.

To realize potential performance benefits for an application, NoSQL databases have to be set up carefully and with a deep understanding of how the application will store and retrieve the data. For those already familiar with SQL, it can be challenging (and unnecessary) to redesign existing applications so that they work well with NoSQL databases.
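One way to picture "understanding how the application will store and retrieve the data" is to design keys around the queries the application needs. The sketch below uses a plain Python dictionary as a stand-in for a key-value store. The put_item and query helpers and the "STUDENT#1" and "COURSE#..." key formats are hypothetical names chosen for illustration; they are not the API of DynamoDB or any other product.

```python
# partition key -> {sort key -> item}; a toy stand-in for a key-value store.
table = {}

def put_item(partition_key, sort_key, item):
    """Store an item under a (partition key, sort key) pair."""
    table.setdefault(partition_key, {})[sort_key] = item

def query(partition_key, sort_key_prefix=""):
    """Return items in one partition whose sort key starts with the prefix."""
    partition = table.get(partition_key, {})
    return [
        partition[sort_key]
        for sort_key in sorted(partition)
        if sort_key.startswith(sort_key_prefix)
    ]

# Everything about one student lives in one partition, so the expected
# access pattern "fetch a student's courses" touches a single key.
put_item("STUDENT#1", "PROFILE", {"full_name": "Happy Dwarf"})
put_item("STUDENT#1", "COURSE#laughing", {"course": "laughing"})
put_item("STUDENT#1", "COURSE#reading", {"course": "reading"})

courses = query("STUDENT#1", "COURSE#")
print(courses)
```

This layout answers "which courses does student 1 take?" cheaply, but a question such as "which students take reading?" would need a different key design or a second copy of the data. That trade-off is why the access patterns have to be known in advance.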

There are many NoSQL database options currently available within AWS and outside of AWS. The following table is a brief list of available SQL and NoSQL database options.

SQL Databases      NoSQL Databases
MySQL              MongoDB
Amazon Aurora      Amazon DynamoDB
Amazon RDS         Cloud Bigtable
PostgreSQL         Cassandra
Azure SQL          Azure Cosmos DB

If we're not sure which database to choose for a new web application, we recommend starting with a relational (SQL) database. SQL databases have had many decades of research and development, are continuing to improve, and are relatively easy to use.

When a web application reaches a scale (such as that of Amazon.com) at which SQL database options are no longer sufficient, it can be a good idea to experiment with a NoSQL database and see how well it works for specific use cases. It’s possible to use both types of databases within a large-scale software system.

File sharing

File sharing allows us to access documents and other data objects over a private or public network.

Some ways in which files can be shared include:

  • Peer-to-peer networks.

  • Private networks of organizations or homes.

  • Online services, such as Google Drive or Dropbox.

  • Public web sources, such as GitHub public repositories.
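Whichever of these channels delivers a file, using it as a data source usually comes down to reading it into a structure we can analyze. The sketch below assumes the shared file is a CSV file. The file name students.csv, its contents, and the load_rows helper are hypothetical, and a temporary directory stands in for the shared location.

```python
import csv
import tempfile
from pathlib import Path

def load_rows(path):
    """Read a CSV file into a list of dictionaries, one per data row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# A temporary directory stands in for a shared folder or a downloaded file.
with tempfile.TemporaryDirectory() as shared_dir:
    path = Path(shared_dir) / "students.csv"
    path.write_text(
        "student_id,full_name\n1,Happy Dwarf\n2,Sleepy Dwarf\n",
        encoding="utf-8",
    )
    rows = load_rows(path)

print(rows)
```

Note that csv.DictReader returns every value as a string, so numeric columns such as student_id need to be converted before any arithmetic.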

File Transfer Protocol (FTP) is one of the earliest communication protocols for sending files across networks. FTP was first introduced in 1971 and was a common way to transfer files for publication on the web in the 1990s.

Devices, logs, and social media

Devices that we use every day can also be data sources. Mobile phones, laptops, desktop computers, and even smartwatches hold a lot of data. We can analyze device data to derive insights.

Logs are records of events related to a system. They can be traced and analyzed to identify errors and help maintain the system. For example, a service can log every request that developers make to its Application Programming Interface (API). User visits to a website are typically logged as well. A service such as Google Analytics can process these events and display relevant insights for website administrators.
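As a rough illustration of turning logs into insights, the sketch below parses a few web server log lines and counts responses by status code. The log lines are made-up sample data in a layout resembling the common access-log format, and the regular expression is an assumption that matches only this layout. Real logs vary by server and configuration.

```python
import re
from collections import Counter

# Hypothetical sample log lines; the last one is deliberately malformed.
LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /courses HTTP/1.1" 200 512',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /missing HTTP/1.1" 404 128',
    '203.0.113.5 - - [10/Oct/2023:13:57:12 +0000] "POST /enroll HTTP/1.1" 500 64',
    'not a log line',
]

PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)

def parse(lines):
    """Yield one dictionary per line that matches the expected layout."""
    for line in lines:
        match = PATTERN.match(line)
        if match:
            yield match.groupdict()

events = list(parse(LOG_LINES))

# Count responses by status code and list the paths that produced errors.
status_counts = Counter(event["status"] for event in events)
error_paths = [e["path"] for e in events if e["status"].startswith(("4", "5"))]

print(status_counts)
print(error_paths)
```

Lines that do not match the pattern are skipped silently here. In a real pipeline it is usually worth counting them, since a spike in unparseable lines can itself signal a problem.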

Social media has become increasingly important as social networks have gained popularity on the internet. Even before digital social networks such as Facebook and Twitter, people were already collecting useful social data—for example, by talking to friends and colleagues or by conducting surveys and interviews. Such offline data can be stored digitally for further analysis.

SaaS applications

SaaS is a software model based on delivering an application via the web without requiring users to maintain the software. SaaS users typically license the software on a subscription basis (e.g., monthly or annually).

SaaS applications are often hosted on cloud-computing platforms and accessed through web browsers. They’re widely used today because of their ease of use and other benefits.

SaaS applications differ from software that people previously bought in shrink-wrapped packages from physical stores. They also differ from downloadable applications that don’t update automatically.

Here are some examples of SaaS applications:

  • Google Workspace: For a monthly fee, organizations can use business versions of Gmail, Calendar, and Docs. Google hosts these applications using centralized servers. Free consumer versions of these apps are also available.

  • Netflix: Consumers around the world are familiar with this subscription streaming service for watching movies and TV shows. Netflix uses AWS for hosting and other behind-the-scenes functionality. Prior to streaming content from the cloud, Netflix shipped DVDs via physical mail.

  • Zoom: Zoom’s video-conferencing tool has downloadable software clients that automatically update when new versions become available. It also uses cloud computing to run its service. Consumers and businesses can subscribe to monthly or annual plans.

  • AWS: The more than 200 services included in AWS can be considered SaaS products for developers. AWS also powers a great number of other SaaS applications (including consumer apps). AWS fees are typically based on usage, a more variable pricing model than the subscriptions of other SaaS applications.

Many SaaS applications used by businesses generate or store data that can be useful in other contexts. Therefore, SaaS applications can become data sources for further analysis.