[Getting and Cleaning data] Week 2

  • Week 2
  • Reading data from MySQL
  • Reading data from HDF5
  • Reading data from websites
  • Reading data from APIs


  • For more detail, download the HTML file here.

    Week 2


    Reading data from MySQL


    What is MySQL? SQL is short for Structured Query Language, and MySQL is one of the world’s most popular databases. (Further information can be found on the wiki page and the MySQL page.)
  • Free and widely used open source database software
  • Widely used in internet-based applications
  • Data are structured in:
  • Databases
  • Tables within databases
  • Fields within tables

  • Each row is called a record

  • And why use it? The important point to keep in mind is that, as a data scientist, your role will likely be to collect data from a database, and perhaps later to put some data back into it. Usually the basic data collection has already been done before you get there, so you will typically be handed a database and have to get data out of it.
    Now we will focus on how to access a MySQL database using R. First you need to install the R package RMySQL. Instructions can be found in my blog post How to install RMySQL package on Windows. On a Mac, the usual install.packages("RMySQL") is enough. Then we will access the database and collect some information about it.
  • step 1: connect to a database using the dbConnect function.
  • library(RMySQL)
    ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
  • step 2: apply a query to the database using the dbGetQuery function. (Here the result contains all the databases on this server.)
  • result <- dbGetQuery(ucscDb, "show databases;")
  • step 3: disconnect using the dbDisconnect function. (It is very important that whenever you have finished analyzing or collecting data from a MySQL server, you disconnect from it.)
  • dbDisconnect(ucscDb)

    Or you can use the dbSendQuery function in step 2. The difference between dbGetQuery and dbSendQuery is:
    1. dbSendQuery only submits and synchronously executes the SQL statement on the database engine. It does not extract any records; for that you need the function dbFetch, and then you must call dbClearResult when you finish fetching the records you need.
    2. dbGetQuery comes with a default implementation that calls dbSendQuery, then, if dbHasCompleted is TRUE, uses fetch to return the results. on.exit is used to ensure the result set is always freed by dbClearResult. Subclasses should override this method only if they provide some sort of performance optimisation.
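    For example, the same query can be run either way; a minimal sketch, where con and someTable are placeholders for a dbConnect connection and one of its tables:
    # one call: submit the query, fetch all records, and free the result set
    result <- dbGetQuery(con, "select count(*) from someTable")
    # three calls: submit, fetch explicitly (dbFetch is the newer name for fetch),
    # then free the result set yourself
    res <- dbSendQuery(con, "select count(*) from someTable")
    result <- dbFetch(res)
    dbClearResult(res)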
    The code in steps 1–3 above lists all databases on the connection. If we want to focus on a specific database, we need to use the argument dbname to give the name of the database.
  • step 1: connect to a specific database using the dbConnect function with the dbname argument.
  • hg19 <- dbConnect(MySQL(), user="genome", dbname="hg19", host="genome-mysql.cse.ucsc.edu")
  • step 2: list all tables in the database hg19.
  • allTables <- dbListTables(hg19)
  • step 3: view the number of tables and the first few table names in hg19.
  • length(allTables) # number of tables (data frames) in database hg19
    allTables[1:4]
  • step 4: list the column names (fields) of a specific table, ‘affyU133Plus2’.
  • dbListFields(hg19, "affyU133Plus2") 
  • step 5: get the number of records in the table ‘affyU133Plus2’.
  • num <- dbGetQuery(hg19, "select count(*) from affyU133Plus2")
    num
  • step 6: get a full data frame out of the table. (The full table can be too much data for R, so warnings are temporarily suppressed.)
  • oldw <- getOption("warn")
    options(warn = -1)
    affyData <- dbReadTable(hg19, "affyU133Plus2")
    head(affyData)
    options(warn = oldw)
  • step 7: send a query using the dbSendQuery function.
  • oldw <- getOption("warn")
    options(warn = -1)
    query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
    options(warn = oldw)
  • step 8: retrieve the results via fetch. (The where clause subsets the data by the value of a feature.)
  • affyMis <- fetch(query)
  • step 9: retrieve only the next 10 entries. Note: you must clear the query afterwards. (This subsets via the observation dimension.)
  • affyMisSmall <- fetch(query, n = 10)
    dbClearResult(query)
  • step 10: close the connection
  • dbDisconnect(hg19)

    Reading data from HDF5


    What is HDF? HDF stands for Hierarchical Data Format, and HDF5 is a data model, library, and file format for storing and managing data. (More details can be found here.)
  • Used for storing large data sets
  • Supports storing a range of data types
  • Hierarchical data format
  • groups containing zero or more data sets and metadata
  • have a group header with group name and list of attributes
  • have a group symbol table with a list of objects in group

  • datasets: multidimensional arrays of data elements with metadata
  • have a header with name, datatype, dataspace, and storage layout
  • have a data array with the data


  • Now we begin to play with HDF5.
  • step 1: install the rhdf5 package, which is distributed through Bioconductor. (This will install packages from Bioconductor, which is primarily used for genomics but also has good “big data” packages.)
  • #source("http://bioconductor.org/biocLite.R")
    #biocLite("rhdf5")
  • step 2: create a new hdf5 file
  • library(rhdf5)
    created "./data/example.h5")
  • step 3: create groups(dataset + metadata) in the hdf5 file
  • created "./data/example.h5", "foo")
    created "./data/example.h5", "baa")
    created "./data/example.h5", "foo/foobaa")
  • step 4: list the groups of the HDF5 file
  • h5ls("./data/example.h5")
  • step 5: write content to the groups
  • # matrix
    A <- matrix(1:10, 5, 2)
    # write the matrix to a particular group
    h5write(A, "./data/example.h5", "foo/A")
    # multidimensional array (20 values to fill a 5 x 2 x 2 array)
    B <- array(seq(0.1, 2, by = 0.1), dim = c(5, 2, 2))
    # add an attribute
    attr(B, "scale") <- "liter"
    # write the array to a particular group
    h5write(B, "./data/example.h5", "foo/foobaa/B")
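    One caveat: by default h5write does not store R attributes such as scale in the file. Based on the rhdf5 documentation, passing write.attributes = TRUE should store them, and h5readAttributes reads them back; a minimal sketch:
    # store the R attribute "scale" alongside the data
    h5write(B, "./data/example.h5", "foo/foobaa/B", write.attributes = TRUE)
    # read the attributes back to verify
    h5readAttributes("./data/example.h5", "foo/foobaa/B")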
  • step 6: list the content in the HDF5 file
  • h5ls("./data/example.h5")
  • step 7: write a data set in the top-level group
  • # data frame
    df <- data.frame(1L:5L, seq(0, 1, length.out = 5), c("ab", "cde", "fghi", "a", "s"), stringsAsFactors = FALSE)
    # write the data frame to a particular group
    h5write(df, "./data/example.h5", "df")
    h5ls("./data/example.h5")
  • step 8: read content from the HDF5 file
  • # read HDF5 data
    readA <- h5read("./data/example.h5", "foo/A")
    readB <- h5read("./data/example.h5", "foo/foobaa/B")
    readdf <- h5read("./data/example.h5", "df")
    readA
  • step 9: change content in the HDF5 file. (The index argument writes only to the selected entries.)
  • h5write(c(12, 13, 15), "./data/example.h5", "foo/A", index = list(1:3, 1))
    h5read("./data/example.h5", "foo/A")

    Reading data from websites


    Webscraping: programmatically extracting data from the HTML code of a website
  • It can be a great way to get data
  • Many websites have information you may want to programmatically read
  • In some cases this is against the terms of service of the website
  • Attempting to read too many pages too quickly can get your IP address blocked (a polite pattern is sketched just after this list)
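    One simple way to avoid being blocked is to pause between requests; a minimal sketch, using hypothetical example.com URLs:
    # hypothetical list of pages to scrape politely
    urls <- c("http://example.com/page1", "http://example.com/page2")
    pages <- lapply(urls, function(u) {
        Sys.sleep(1)   # wait one second between requests
        readLines(u)   # readLines accepts a URL string directly
    })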

  • (1) the readLines function
  • step 1: open a connection to a URL using the url function
  • con <- url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")
  • step 2: read the data from the website
  • htmlCode <- readLines(con)
  • step 3: close connection
  • close(con)
  • step 4: view the content
  • #htmlCode

    (2) the XML package
  • step 1: give the URL to parse
  • library(XML)
    url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
  • step 2: parse the URL, using internal nodes to get the complete structure out
  • html <- htmlTreeParse(url, useInternalNodes = TRUE)
  • step 3: get the title of the page
  • xpathSApply(html, "//title", xmlValue)
  • step 4: get the number of citations
  • xpathSApply(html, "//td[@id='col-citedby']", xmlValue)

    (Maybe this last script no longer works, since the page structure may have changed, but it still gives a good picture of how to extract information from a website.)
    (3) the httr package
  • step 1: get the URL
  • library(httr)
    url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
    html2 <- GET(url)
  • step 2: extract the content from that page
  • content2 <- content(html2, as = "text")
  • step 3: parse out the content
  • parseHtml <- htmlParse(content2, asText = TRUE)
  • step 4: extract the title of the page
  • xpathSApply(parseHtml, "//title", xmlValue)
  • step 5: try to access a password-protected website
  • pg1 <- GET("http://httpbin.org/basic-auth/user/passwd")
  • step 6: view the result (the request fails without credentials)
  • pg1
  • step 7: authenticate with a username and password
  • pg2 <- GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
    pg2
  • step 8: get the names of the response elements
  • names(pg2)

    Reading data from APIs


    What is an API? API stands for application programming interface, through which pieces of software can interact with each other. For example, most internet companies, like Twitter or Facebook, have an application programming interface from which you can download data: data about which users are tweeting and what they are tweeting about, or information about what people are posting on Facebook.
  • step 1: create an account, not a user account but a developer account with the API team of the particular organization. Then create a new application, which will give you the keys used to authenticate the application through R and access data later. (This is done on the website.)
  • step 2: start the authorization process for your application
  • library(httr)
  • myapp = oauth_app("twitter",
                     key="YourConsumerKey",
                     secret="YourConsumerSecret")
  • step 3: sign the OAuth credentials for your application
  • sig = sign_oauth1.0(myapp, 
                        token="YourAccessToken", 
                        token_secret="YourAccessTokenSecret")
  • step 4: get content
  • homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
    # use the content function to extract the information
    json1 = content(homeTL)
    # convert json1 to a data frame via jsonlite
    json2 = jsonlite::fromJSON(jsonlite::toJSON(json1))
    json2[1, 1:4]
