NoSQL
NoSQL
RDBMS - SQL
- RDBMS = Relational database management system
- SQL = Structured Query Language
Problems with RDBMS in development
- Tables in database vs Objects in code
- Object-relational impedance mismatch (see the sketch after this list)
- ORMs - Object Relational Mappings
- Representing Inheritance?
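A minimal illustration of the mismatch in plain Python (the class and table names are invented for the example): the nested object below is a single unit in code, but storing it relationally means flattening it into two tables joined by a foreign key.

class Person:
    def __init__(self, name, email, scores):
        self.name = name      # maps to a people.name column
        self.email = email    # maps to a people.email column
        self.scores = scores  # a list: needs a separate scores table
                              # with a person_id foreign key

p = Person("Foo Bar", "foo@bar.com", [17, 23])
# In code: one object. In an RDBMS: one row in people
# plus one row per score in a scores table.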
Problems with RDBMS in production
- Changes in schema (migration and rollback)
- Data size
- Speed
Scaling vertically
- Expensive
- Limited to the biggest machine available.
- Fixed location (network latency for users)
Scaling horizontally
- Replication to spread the read-load (at the cost of "replication lag").
- Some kind of partitioning (see the sharding sketch after this list).
- Putting different tables on different servers.
- Splitting a table horizontally or vertically.
- Scaling horizontally means our data is no longer guaranteed to be consistent.
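A common way to split a table horizontally ("sharding") is to route each row by a hash of its key. A minimal sketch in Python; the server names are made up for the example:

import hashlib

servers = ["db1.example.com", "db2.example.com", "db3.example.com"]

def shard_for(key):
    # Stable hash of the key, mapped onto one of the servers.
    # Real systems use consistent hashing so that adding a server
    # does not reshuffle almost every key.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

print(shard_for("foo@bar.com"))  # the same key always lands on the same server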
CAP Theorem of distributed data storage
- Consistency
- Availability
- Partition Tolerance
- Pick any 2.
- In the face of a network split you need to decide if you prefer Availability or Consistency.
ACID
- Atomic
- Consistent
- Isolated
- Durable
RDBMS are usually ACID, and we have come to expect ACID. With horizontal scaling these guarantees might not hold at all.
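A small demonstration of atomicity with Python's built-in sqlite3 module (the accounts table and the simulated crash are invented for the example): either both updates take effect or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50"
                     " WHERE name = 'alice'")
        raise RuntimeError("simulated crash between the two updates")
        conn.execute("UPDATE accounts SET balance = balance + 50"
                     " WHERE name = 'bob'")
except RuntimeError:
    pass

# The first UPDATE was rolled back: alice still has 100.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())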
RDBMS - data is normalized
- In a relational database everything is flat.
- We want to normalize our data (see the sketch after this list).
- DRY = Don't repeat yourself
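To illustrate with plain Python data structures (the example data is made up): the normalized, relational-style form stores each fact once; the denormalized, document-style form repeats the author inside every post.

# Normalized (relational style): each fact stored once, linked by ids.
authors = {1: {"name": "Foo Bar", "email": "foo@bar.com"}}
posts = [
    {"author_id": 1, "title": "First post"},
    {"author_id": 1, "title": "Second post"},
]

# Denormalized (document style): the author is repeated in every post.
post_docs = [
    {"author": {"name": "Foo Bar", "email": "foo@bar.com"}, "title": "First post"},
    {"author": {"name": "Foo Bar", "email": "foo@bar.com"}, "title": "Second post"},
]

# DRY favors the normalized form; avoiding joins favors the denormalized one.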
NoSQL common characteristics
- Non-relational (or no tables)
- Cluster-friendliness (mostly)
- open-source (mostly)
- "web 2.0" / "web scale"
- Schema-less
CRUD
- Create = insert
- Read = find
- Update = update
- Delete = remove
Features
- Horizontally Scalable Architectures: Sharding and Replication.
- Some have Map-reduce.
- Full text index.
- "Stored procedures" simple JavaScript.
Limitations
- Giving up on joins and complex transactions.
NoSQL Data Models
- Document Store
- Key-value (tuple) store
- Wide Column Store
- Graph (GIS, Spatial)
- Object Databases
- Grid and Cloud Database
- XML Databases
- Multimodel Databases
- List of NoSQL Databases (more than 225)
Document Store
Each document is usually a JSON data structure. No explicit schema, but there is an implicit schema: when we retrieve data we know (or expect) certain fields with certain data types to be in every document.
- MongoDB (10gen/MongoDB)
- CouchDB (Apache)
- RavenDB (Hibernating Rhinos)
- Elasticsearch (Elasticsearch)
- RethinkDB (RethinkDB)
MongoDB CLI - insert
$ mongo
> show dbs
admin (empty)
local 0.078GB
> use test
switched to db test
> db.people.insert({ "name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 0, "scores" : [17, 23] })
WriteResult({ "nInserted" : 1 })
> db.people.insert({ "name" : "Moon Go", "email" : "moon@go.com", "count" : 0 })
WriteResult({ "nInserted" : 1 })
MongoDB CLI - find
> db.people.find()
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 0, "scores" : [ 17, 23 ] }
{ "_id" : ObjectId("594819be66ad1950f69fe0af"),
"name" : "Moon Go", "email" : "moon@go.com", "count" : 0 }
> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 0, "scores" : [ 17, 23 ] }
MongoDB CLI - update $inc
> db.people.update({ "name" : "Foo Bar" }, { $inc : { "count" : 1 } })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 1, "scores" : [ 17, 23 ] }
MongoDB CLI - update $push
> db.people.update({ "name" : "Foo Bar" }, { $push : { "scores" : 42 } })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.people.find({ "name" : "Foo Bar" })
{ "_id" : ObjectId("5948198866ad1950f69fe0ae"),
"name" : "Foo Bar", "email" : "foo@bar.com",
"count" : 1, "scores" : [ 17, 23, 42 ] }
Key-value store
Like a persistent hash-map. The key is usually a string. The value can be a simple string or a more complex data structure (Redis, for example, also supports lists, sets, and hashes).
- MemcacheDB (not developed any more)
- Riak (Basho)
- Redis
The line between document stores and key-value stores is blurry; Martin Fowler calls both "aggregate-oriented" databases. A document store is a lot more flexible in what we can search on: we can query on any field, not just the key.
Redis CLI
$ redis-cli
> set name foo
OK
> get name
"foo"
> set name "foo bar"
OK
> get name
"foo bar"
> set a 1
OK
> get a
"1"
> incr a
(integer) 2
> get a
"2"
> set b 1
OK
> keys *
1) "name"
2) "a"
3) "b"
> del b
(integer) 1
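The same from a program, as a minimal sketch using the redis-py client (assumes pip install redis and a server on localhost):

import redis

r = redis.Redis(host="localhost", port=6379)

r.set("name", "foo bar")
print(r.get("name"))  # b'foo bar' (values come back as bytes)

r.set("a", 1)
r.incr("a")           # atomic increment, like INCR in the CLI
print(r.get("a"))     # b'2'

r.delete("a")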
Wide Column Stores
- Hadoop (Apache)
- HBase (Apache)
- Cassandra (Apache)
- Hypertable (Hypertable)
Hadoop commands
- hadoop fs -put file.txt (the file is split into 64 MB blocks, each replicated 3 times)
- hadoop jar some/long/path/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input inputdir -output outputdir
- hadoop fs -get outputdir/results.txt
Cassandra
- CQL = Cassandra Query Language (like SQL, but no joins, and need to have index (clustering) for each field in the WHERE clause)
- RF = Replication Factor (how many times each piece of data is replicated)
- CL = Consistency Level (ONE, QUORUM, ALL) (How many reads or writes need to happen before we consider the read or write "done")
Graph
- Graphs/trees/hierarchies/etc.
- Neo4j (Neo Technology)
When to use NoSQL
- Easier development
- Large scale data
When not to use NoSQL
- Complex transactions (banking, accounting, etc.)
- Data Warehousing
- Where you can't avoid complex joins
Cassandra
Cassandra: Replication
- RF = Replication Factor (how many times each piece of data is replicated)
- CL = Consistency Level (ONE, QUORUM, ALL) (How many reads or writes need to happen before we consider the read or write "done")
- ONE = one machine
- QUORUM = a majority of the RF machines
- ALL = all of the RF machines
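A minimal sketch of choosing the consistency level per query with the DataStax Python driver (assumes pip install cassandra-driver, a running cluster, and a made-up demo keyspace with a users table):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # the "demo" keyspace is assumed to exist

# Require a majority of the RF replicas to acknowledge this read (CL = QUORUM).
query = SimpleStatement(
    "SELECT name, email FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(query, (42,)).one())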
Hadoop
Hadoop notes
- Big Data: Hadoop and MapReduce.
- HDFS (Hadoop Distributed File System) to store data on nodes.
- MapReduce to process data on the nodes.
- Files are split into 64 MB blocks and each block is stored on a separate node.
- HDFS creates 3 copies of each block for redundancy.
- There is a namenode holding the metadata: where each file's blocks are stored and replicated.
- There can also be a standby copy of the namenode to avoid problems if the active namenode goes down.
- Volume, Variety, Velocity (= generating and recording a lot of data, in various formats, very fast)
Hadoop ecosystem
E.g. Hive, Pig, Impala, Sqoop, Flume, HBase, Hue, Oozie, Mahout. Cloudera's CDH is a distribution bundling all the parts.
Hadoop commands
- hadoop fs -ls
- hadoop fs -put file.txt
- hadoop fs -get file.txt
- hadoop fs -tail file.txt
- hadoop fs -cat file.txt
- hadoop fs -mv
- hadoop fs -rm
- hadoop fs -mkdir
- hadoop jar some/long/path/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input inputdir -output outputdir
Hadoop entities
- JobTracker (runs on a separate machine)
- TaskTracker (a daemon running on every node)
Hadoop streaming allows us to write our map and reduce code in any language.
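A minimal word-count sketch of such a mapper and reducer in Python (the file names match the streaming command above; streaming feeds lines on stdin and expects tab-separated key/value pairs on stdout):

#!/usr/bin/env python
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so we can accumulate
# the count until the word changes
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + "\t" + str(count))

The pipeline can be tested locally without Hadoop: cat input.txt | ./mapper.py | sort | ./reducer.py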