NoSQL

RDBMS - SQL

  • RDBMS = Relational database management system
  • SQL = Structured Query Language

Problems with RDBMS in development

Problems with RDBMS in production

  • Changes in schema (migration and rollback)
  • Data size
  • Speed

Scaling vertically

  • Expensive
  • Limited to the biggest machine available.
  • Fixed location (network latency for users)

Scaling horizontally

  • Replicating to spread the read-load (introducing "replication lag").
  • Some kind of partitioning.
  • Putting different tables on different servers.
  • Splitting a table horizontally or vertically.

  • Scaling horizontally means our data might not be consistent at any given moment.
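Partitioning can be sketched with a toy hash-based shard router (an illustration of the idea only; real systems typically use consistent hashing or range partitioning, and the function name here is made up):

```python
import hashlib

def shard_for(key, num_shards):
    """Toy shard router: hash the key and take it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard, so reads find
# the data that writes placed there:
assert shard_for("user:42", 4) == shard_for("user:42", 4)
print(shard_for("user:42", 4))
```

The downside appears when `num_shards` changes: almost every key then maps to a different shard, which is exactly the problem consistent hashing mitigates.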

CAP Theorem of distributed data storage

  • Consistency
  • Availability
  • Partition Tolerance

CAP Theorem

  • Pick any 2.
  • In the face of a network split you need to decide whether you prefer Availability or Consistency.

ACID

  • Atomic
  • Consistent
  • Isolated
  • Durable

RDBMS are usually ACID, and we have come to expect ACID guarantees. Once we scale horizontally, keeping them might not be possible at all.
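The "Atomic" property is the easiest to demonstrate. A minimal sketch with SQLite (used here only because it ships with Python; any RDBMS behaves the same way): a money transfer that crashes halfway leaves no trace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Atomic transfer: either both updates are committed, or neither is.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

# Both updates were rolled back together; no money was lost or created:
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```

Providing this guarantee across several machines is what becomes hard (and slow) when scaling horizontally.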

RDBMS - data is normalized

  • In a relational database every table is flat (rows and columns).
  • We want to normalize our data.
  • DRY = Don't repeat yourself
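What normalization buys (and costs) can be sketched in a few lines of SQLite: the author's email is stored once, so reading a post's author requires a join. Table and column names here are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized: the author's email lives in one place, not on every post.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
conn.execute("INSERT INTO authors VALUES (1, 'foo@bar.com')")
conn.executemany("INSERT INTO posts VALUES (?, 1, ?)", [(1, "First"), (2, "Second")])

# The price of DRY data: a join on every read that needs the author.
row = conn.execute("""
    SELECT posts.title, authors.email
    FROM posts JOIN authors ON posts.author_id = authors.id
    WHERE posts.id = 2
""").fetchone()
print(row)  # ('Second', 'foo@bar.com')
```

Changing the email now means updating one row; the trade-off is that joins become the bottleneck once the tables no longer fit on one machine.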

NoSQL common characteristics

  • Non-relational (or no tables)
  • Cluster-friendliness (mostly)
  • open-source (mostly)
  • "web 2.0" / "web scale"
  • Schema-less

CRUD

  • Create = insert
  • Read = find
  • Update = update
  • Delete = remove
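The mapping above mirrors MongoDB's collection methods. A toy in-memory version (plain Python, no database required; `ToyCollection` is not a real driver class) makes the four operations concrete:

```python
class ToyCollection:
    """A toy document collection: a list of dicts with CRUD methods."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):            # Create
        self.docs.append(dict(doc))

    def find(self, query=None):       # Read: match on field equality
        query = query or {}
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

    def update(self, query, changes): # Update: merge fields into matches
        for d in self.find(query):
            d.update(changes)

    def remove(self, query):          # Delete
        matches = self.find(query)
        self.docs = [d for d in self.docs if d not in matches]

people = ToyCollection()
people.insert({"name": "Foo Bar", "count": 0})
people.update({"name": "Foo Bar"}, {"count": 1})
print(people.find({"name": "Foo Bar"}))  # [{'name': 'Foo Bar', 'count': 1}]
```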

Features

  • Horizontally Scalable Architectures: Sharding and Replication.
  • Some have Map-reduce.
  • Full text index.
  • "Stored procedures" simple JavaScript.

Limitations

  • Giving up on joins and complex transactions.
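Giving up joins usually means embedding related data in a single document (an "aggregate"). A sketch of what a blog post might look like when comments and author live inside it (the field names are invented for the illustration):

```python
# Without joins, related data is embedded in one document, so a single
# read fetches the post together with its comments -- at the cost of
# repeating the commenters' names inside every post they comment on.
post = {
    "title": "NoSQL notes",
    "author": {"name": "Foo Bar", "email": "foo@bar.com"},  # embedded, not referenced
    "comments": [
        {"who": "Moon Go", "text": "Nice summary"},
        {"who": "Foo Bar", "text": "Thanks!"},
    ],
}

# Everything arrives in one lookup, no join needed:
print(len(post["comments"]))  # 2
```

The trade-off is the mirror image of normalization: reads are cheap, but updating a commenter's name means touching every document that embeds it.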

NoSQL Data Models

  • Document Store
  • Key-value (tuple) store
  • Wide Column Store
  • Graph (GIS, Spatial)
  • Object Databases
  • Grid and Cloud Database
  • XML Databases
  • Multimodel Databases
  • List of NoSQL Databases (more than 225)

Document Store

Each document is usually a JSON data structure. There is no explicit schema, but there is an implicit one: when we retrieve data we know (or expect) certain fields with certain data-types to be in every document.
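With an implicit schema, it is the reading code that must tolerate missing fields. A small sketch, reusing the two documents from the session below (the second one has no "scores" field):

```python
# With no explicit schema, the *reader* enforces expectations.
docs = [
    {"name": "Foo Bar", "email": "foo@bar.com", "count": 0, "scores": [17, 23]},
    {"name": "Moon Go", "email": "moon@go.com", "count": 0},  # no "scores"
]

# Defensive access: treat a missing "scores" as an empty list.
total = sum(sum(doc.get("scores", [])) for doc in docs)
print(total)  # 40
```

A naive `doc["scores"]` would raise `KeyError` on the second document; that is the implicit schema biting back.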

MongoDB CLI - insert

$ mongo
> show dbs
admin  (empty)
local  0.078GB

> use test
switched to db test

> db.people.insert({ "name" : "Foo Bar", "email" : "foo@bar.com",
  "count" : 0, "scores" : [17, 23] })
WriteResult({ "nInserted" : 1 })
> db.people.insert({ "name" : "Moon Go", "email" : "moon@go.com", "count" : 0 })
WriteResult({ "nInserted" : 1 })

MongoDB CLI - find

> db.people.find()
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
    "name" : "Foo Bar", "email" : "foo@bar.com",
    "count" : 0, "scores" : [ 17, 23 ] }
{ "_id" : ObjectId("594819be66ad1950f69fe0af"),
    "name" : "Moon Go", "email" : "moon@go.com", "count" : 0 }

> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
    "name" : "Foo Bar", "email" : "foo@bar.com",
    "count" : 0, "scores" : [ 17, 23 ] }

MongoDB CLI - update $inc

> db.people.update({ "name" : "Foo Bar" }, { $inc : { "count" : 1 } })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
    "name" : "Foo Bar", "email" : "foo@bar.com",
    "count" : 1, "scores" : [ 17, 23 ] }

MongoDB CLI - update $push

> db.people.update({ "name" : "Foo Bar" }, { $push : { "scores" : 42 } })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
    "name" : "Foo Bar", "email" : "foo@bar.com",
    "count" : 1, "scores" : [ 17, 23, 42 ] }

Key-value store

Like a persistent hash-map. The key is usually a string. The value can be anything from a plain string to a serialized data structure, depending on the engine.

The line between document stores and key-value stores is blurry; Martin Fowler calls both "Aggregate-oriented" databases. A document store is a lot more flexible in what we can search on.

Redis CLI

$ redis-cli
> set name foo
> get name
> set name "foo bar"
> get name

> set a 1
> get a
> incr a
> get a

> set b 1
> keys *
> del b
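The session above can be mimicked without a server; a toy in-memory stand-in (`ToyRedis` is invented for this sketch, not the real `redis-py` client) shows the key semantic detail that values are strings and `INCR` interprets the stored string as an integer:

```python
class ToyRedis:
    """Toy stand-in for Redis string commands: SET, GET, INCR, DEL."""

    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = str(value)      # Redis stores strings

    def get(self, key):
        return self.store.get(key)

    def incr(self, key):
        # INCR parses the stored string as an integer, adds 1, stores it back.
        value = int(self.store.get(key, "0")) + 1
        self.store[key] = str(value)
        return value

    def delete(self, key):
        self.store.pop(key, None)

r = ToyRedis()
r.set("name", "foo bar")
r.set("a", 1)
r.incr("a")
print(r.get("name"), r.get("a"))  # foo bar 2
```

In real Redis, `INCR` on a value that does not parse as an integer is an error; the toy above would raise `ValueError` at the same point.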

Wide Column Store families

Hadoop commands

  • hadoop fs -put file.txt (split into 64 MB chunks, replicated 3 times)
  • hadoop jar some/long/path.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input inputdir -output outputdir
  • hadoop fs -get outputdir/results.txt

Cassandra

  • CQL = Cassandra Query Language (like SQL, but no joins, and need to have index (clustering) for each field in the WHERE clause)
  • RF = Replication Factor (how many times each piece of data is replicated)
  • CL = Consistency Level (ONE, QUORUM, ALL) (How many reads or writes need to happen before we consider the read or write "done")

Graph

When to use NoSQL

  • Easier development
  • Large scale data

When not to use NoSQL

  • Complex transactions (Banking, accounting etc)
  • Data Warehousing
  • Where you can't avoid complex joins

Cassandra

Cassandra: Replication

  • RF = Replication Factor (how many times each piece of data is replicated)
  • CL = Consistency Level (ONE, QUORUM, ALL) (How many reads or writes need to happen before we consider the read or write "done")
  • ONE= 1 machine
  • QUORUM = majority of the RF machines
  • ALL = all of the RF machines
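QUORUM means a majority of the RF replicas, and the useful property is that when both reads and writes use QUORUM, any read overlaps at least one replica that saw the latest write (R + W > RF). A sketch of the arithmetic:

```python
def quorum(rf):
    """Majority of RF replicas: floor(RF/2) + 1."""
    return rf // 2 + 1

for rf in (3, 5):
    q = quorum(rf)
    # A quorum read and a quorum write must share at least one replica:
    assert q + q > rf
    print(f"RF={rf}: QUORUM={q}")
```

For RF=3 this gives QUORUM=2: a write can complete with one replica down, and a read still intersects the written copies.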

Cassandra: Resources

Hadoop

Hadoop notes

  • Big Data Hadoop, Map Reduce
  • HDFS (Hadoop Distributed File System) to store data on nodes.
  • MapReduce to process data on the nodes.
  • Files are split into 64 MB blocks and the blocks are spread across the nodes.
  • HDFS creates 3 copies of each block for redundancy.
  • There is a namenode holding the metadata of where the files are split and duplicated.
  • There can also be a standby copy of the namenode to avoid problems if the active namenode goes down.
  • Volume, Variety, Velocity  (= Generating and recording a lot of data, in various formats very fast)
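The block size and replication factor determine the raw footprint of a file; a sketch of the arithmetic, using the 64 MB / 3-copies figures from the notes above (newer Hadoop versions default to 128 MB blocks):

```python
import math

BLOCK_MB = 64   # block size from the notes above
REPLICAS = 3    # default HDFS replication factor

def hdfs_footprint(file_mb):
    """Blocks a file splits into, and raw storage after replication.

    Blocks are not padded, so raw storage is roughly 3x the file size.
    """
    blocks = math.ceil(file_mb / BLOCK_MB)
    raw_mb = file_mb * REPLICAS
    return blocks, raw_mb

blocks, raw_mb = hdfs_footprint(1000)  # a 1000 MB file
print(blocks, raw_mb)  # 16 3000
```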

Hadoop ecosystem

(e.g. Hive and Pig, Impala, Sqoop, Flume, HBase, Hue, Oozie, Mahout). Cloudera offers CDH, a distribution bundling all the parts.

Hadoop commands

  • hadoop fs -ls
  • hadoop fs -put file.txt
  • hadoop fs -get file.txt
  • hadoop fs -tail file.txt
  • hadoop fs -cat file.txt
  • hadoop fs -mv
  • hadoop fs -rm
  • hadoop fs -mkdir
  • hadoop jar some/long/path.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input inputdir -output outputdir

Hadoop entities

  • job tracker (separate machine)
  • task tracker (daemon on every node)

Hadoop streaming allows us to write our map and reduce code in any language.
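A minimal word-count pair in the shape Hadoop streaming expects: the mapper emits `word<TAB>1` lines, the shuffle phase sorts them by key, and the reducer receives them grouped. Here they are wired together locally with `sorted()` standing in for the shuffle, just to show the data flow (in a real job, `mapper.py` and `reducer.py` would read stdin and print to stdout):

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word\t1' pair per word (what mapper.py would print)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Sum the counts for each word; pairs must arrive sorted by key."""
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield word, sum(int(v) for _, v in group)

# Locally, sorting stands in for Hadoop's shuffle-and-sort phase:
text = ["big data big deal"]
result = dict(reducer(sorted(mapper(text))))
print(result)  # {'big': 2, 'data': 1, 'deal': 1}
```

Because the reducer only relies on its input being sorted by key, the same two scripts scale from this local pipe to a cluster run with the streaming jar.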
