NoSQL
NoSQL
RDBMS - SQL
- RDBMS = Relational database management system
- SQL = Structured Query Language
Problems with RDBMS in development
- Tables in database vs Objects in code
- Object-relational impedance mismatch (see the sketch after this list)
- ORMs - Object Relational Mappings
- Representing Inheritance?
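A minimal illustration of the mismatch in plain Python (the class and table names are invented for the example): the nested object below is a single unit in code, but storing it relationally means flattening it into two tables joined by a foreign key.

class Person:
    def __init__(self, name, email, scores):
        self.name = name      # maps to a people.name column
        self.email = email    # maps to a people.email column
        self.scores = scores  # a list: needs a separate scores table
                              # with a person_id foreign key

p = Person("Foo Bar", "foo@bar.com", [17, 23])
# In code: one object. In an RDBMS: one row in people
# plus one row per score in a scores table.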
Problems with RDBMS in production
- Changes in schema (migration and rollback)
- Data size
- Speed
Scaling vertically
- Expensive
- Limited to the biggest machine available.
- Fixed location (network latency for users)
Scaling horizontally
- Replication to spread the read-load (at the cost of "replication lag").
- Some kind of partitioning (see the sharding sketch after this list).
- Putting different tables on different servers.
- Splitting a table horizontally or vertically.
- Scaling horizontally means our data is no longer guaranteed to be consistent.
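A common way to split a table horizontally ("sharding") is to route each row by a hash of its key. A minimal sketch in Python; the server names are made up for the example:

import hashlib

servers = ["db1.example.com", "db2.example.com", "db3.example.com"]

def shard_for(key):
    # Stable hash of the key, mapped onto one of the servers.
    # Real systems use consistent hashing so that adding a server
    # does not reshuffle almost every key.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

print(shard_for("foo@bar.com"))  # the same key always lands on the same server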
CAP Theorem of distributed data storage
- Consistency
- Availability
- Partition Tolerance
- Pick any 2.
- In the face of a network split you need to decide if you prefer Availability or Consistency.
ACID
- Atomic
- Consistent
- Isolated
- Durable
RDBMS are usually ACID, and we have come to expect ACID. With horizontal scaling these guarantees might not hold at all.
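A small demonstration of atomicity with Python's built-in sqlite3 module (the accounts table and the simulated crash are invented for the example): either both updates take effect or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50"
                     " WHERE name = 'alice'")
        raise RuntimeError("simulated crash between the two updates")
        conn.execute("UPDATE accounts SET balance = balance + 50"
                     " WHERE name = 'bob'")
except RuntimeError:
    pass

# The first UPDATE was rolled back: alice still has 100.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())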
RDBMS - data is normalized
- In a relational database everything is flat.
- We want to normalize our data (see the sketch after this list).
- DRY = Don't repeat yourself
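To illustrate with plain Python data structures (the example data is made up): the normalized, relational-style form stores each fact once; the denormalized, document-style form repeats the author inside every post.

# Normalized (relational style): each fact stored once, linked by ids.
authors = {1: {"name": "Foo Bar", "email": "foo@bar.com"}}
posts = [
    {"author_id": 1, "title": "First post"},
    {"author_id": 1, "title": "Second post"},
]

# Denormalized (document style): the author is repeated in every post.
post_docs = [
    {"author": {"name": "Foo Bar", "email": "foo@bar.com"}, "title": "First post"},
    {"author": {"name": "Foo Bar", "email": "foo@bar.com"}, "title": "Second post"},
]

# DRY favors the normalized form; avoiding joins favors the denormalized one.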
NoSQL common characteristics
- Non-relational (or no tables)
- Cluster-friendliness (mostly)
- open-source (mostly)
- "web 2.0" / "web scale"
- Schema-less
CRUD
- Create = insert
- Read = find
- Update = update
- Delete = remove
Features
- Horizontally Scalable Architectures: Sharding and Replication.
- Some have Map-reduce.
- Full text index.
- "Stored procedures" simple JavaScript.
Limitations
- Giving up on joins and complex transactions.
NoSQL Data Models
- Document Store
- Key-value (tuple) store
- Wide Column Store
- Graph (GIS, Spatial)
- Object Databases
- Grid and Cloud Database
- XML Databases
- Multimodel Databases
- List of NoSQL Databases (more than 225)
Document Store
Each document is usually a JSON data structure. No explicit schema, but there is an implicit schema: when we retrieve data we know (or expect) certain fields with certain data types to be in every document.
- MongoDB (10gen/MongoDB)
- CouchDB (Apache)
- RavenDB (Hibernating Rhinos)
- Elasticsearch (Elasticsearch)
- RethinkDB (RethinkDB)
MongoDB CLI - insert
$ mongo
> show dbs
admin (empty)
local 0.078GB
> use test
switched to db test
> db.people.insert({ "name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 0, "scores" : [17, 23] })
WriteResult({ "nInserted" : 1 })
> db.people.insert({ "name" : "Moon Go", "email" : "moon@go.com", "count" : 0 })
WriteResult({ "nInserted" : 1 })
MongoDB CLI - find
> db.people.find()
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 0, "scores" : [ 17, 23 ] }
{ "_id" : ObjectId("594819be66ad1950f69fe0af"),
"name" : "Moon Go", "email" : "moon@go.com", "count" : 0 }
> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 0, "scores" : [ 17, 23 ] }
MongoDB CLI - update $inc
> db.people.update({ "name" : "Foo Bar" }, { $inc : { "count" : 1 } })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 1, "scores" : [ 17, 23 ] }
MongoDB CLI - update $push
> db.people.update({ "name" : "Foo Bar" }, { $push : { "scores" : 42 } })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 1, "scores" : [ 17, 23, 42 ] }
Key-value store
Like a persistent hash-map. The key is usually a string. The value can be a simple string or a more complex data structure (Redis, for example, also supports lists, sets, and hashes).
- MemcacheDB (not developed any more)
- Riak (Basho)
- Redis
The line between document stores and key-value stores is blurry; Martin Fowler calls both "aggregate-oriented" databases. A document store is a lot more flexible in what we can search on: we can query on any field, not just the key.
Redis CLI
$ redis-cli
> set name foo
OK
> get name
"foo"
> set name "foo bar"
OK
> get name
"foo bar"
> set a 1
OK
> get a
"1"
> incr a
(integer) 2
> get a
"2"
> set b 1
OK
> keys *
1) "name"
2) "a"
3) "b"
> del b
(integer) 1
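The same from a program, as a minimal sketch using the redis-py client (assumes pip install redis and a server on localhost):

import redis

r = redis.Redis(host="localhost", port=6379)

r.set("name", "foo bar")
print(r.get("name"))  # b'foo bar' (values come back as bytes)

r.set("a", 1)
r.incr("a")           # atomic increment, like INCR in the CLI
print(r.get("a"))     # b'2'

r.delete("a")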
Wide Column Stores
- Hadoop (Apache)
- HBase (Apache)
- Cassandra (Apache)
- Hypertable (Hypertable)
Hadoop commands
- hadoop fs -put file.txt (the file is split into 64 MB blocks, each replicated 3 times)
- hadoop jar some/long/path/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input inputdir -output outputdir
- hadoop fs -get outputdir/results.txt
Cassandra
- CQL = Cassandra Query Language (like SQL, but no joins, and need to have index (clustering) for each field in the WHERE clause)
- RF = Replication Factor (how many times each piece of data is replicated)
- CL = Consistency Level (ONE, QUORUM, ALL) (How many reads or writes need to happen before we consider the read or write "done")
Graph
- Graphs/trees/hierarchies/etc.
- Neo4j (Neo Technology)
When to use NoSQL
- Easier development
- Large scale data
When not to use NoSQL
- Complex transactions (banking, accounting, etc.)
- Data Warehousing
- Where you can't avoid complex joins
Cassandra
Cassandra: Replication
- RF = Replication Factor (how many times each piece of data is replicated)
- CL = Consistency Level (ONE, QUORUM, ALL) (How many reads or writes need to happen before we consider the read or write "done")
- ONE = one machine
- QUORUM = a majority of the RF machines
- ALL = all of the RF machines
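A minimal sketch of choosing the consistency level per query with the DataStax Python driver (assumes pip install cassandra-driver, a running cluster, and a made-up demo keyspace with a users table):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # the "demo" keyspace is assumed to exist

# Require a majority of the RF replicas to acknowledge this read (CL = QUORUM).
query = SimpleStatement(
    "SELECT name, email FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(query, (42,)).one())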
Hadoop
Hadoop notes
- Big Data: Hadoop and MapReduce.
- HDFS (Hadoop Distributed File System) to store data on nodes.
- MapReduce to process data on the nodes.
- Files are split into 64 MB blocks and each block is stored on a separate node.
- HDFS creates 3 copies of each block for redundancy.
- There is a namenode holding the metadata: where each file's blocks are stored and replicated.
- There can also be a standby copy of the namenode to avoid problems if the active namenode goes down.
- Volume, Variety, Velocity (= generating and recording a lot of data, in various formats, very fast)
Hadoop ecosystem
E.g. Hive, Pig, Impala, Sqoop, Flume, HBase, Hue, Oozie, Mahout. Cloudera's CDH is a distribution bundling all the parts.
Hadoop commands
- hadoop fs -ls
- hadoop fs -put file.txt
- hadoop fs -get file.txt
- hadoop fs -tail file.txt
- hadoop fs -cat file.txt
- hadoop fs -mv
- hadoop fs -rm
- hadoop fs -mkdir
- hadoop jar some/long/path/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input inputdir -output outputdir
Hadoop entities
- JobTracker (runs on a separate machine)
- TaskTracker (a daemon running on every node)
Hadoop streaming allows us to write our map and reduce code in any language.
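A minimal word-count sketch of such a mapper and reducer in Python (the file names match the streaming command above; streaming feeds lines on stdin and expects tab-separated key/value pairs on stdout):

#!/usr/bin/env python
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so we can accumulate
# the count until the word changes
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + "\t" + str(count))

The pipeline can be tested locally without Hadoop: cat input.txt | ./mapper.py | sort | ./reducer.py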