[Getting and Cleaning data] Week 2

  • Week 2
  • Reading data from MySQL
  • Reading data from HDF5
  • Reading data from websites
  • Reading data from APIs


  • For more detail, download the HTML file here.

    Week 2


    Reading data from MySQL


    What is MySQL? SQL is short for Structured Query Language, and MySQL is one of the world’s most popular databases. (Further information can be found on the wiki page and the MySQL page.)
  • Free and widely used open source database software
  • Widely used in internet-based applications
  • Data are structured in:
  • Databases
  • Tables within databases
  • Fields within tables

  • Each row is called a record

  • And why use it? The important point to keep in mind is that, as a data scientist, your role will likely be to collect data from a database, and perhaps later to put some data back into it. Usually the basic data collection has already been done before you get there, so you will typically be handed a database and have to get data out of it.
    Now we will focus on how to access a MySQL database using R. First you need to install the R package RMySQL. Instructions can be found in my blog post How to install RMySQL package on Windows. On a Mac, the usual install.packages("RMySQL") is enough. Then we will access the database and collect some information about it.
  • step 1: connect to a database using the dbConnect function.
  • library(RMySQL)
    ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
  • step 2: apply a query to the database using the dbGetQuery function. (Here the result contains all the databases on this server.)
  • result <- dbGetQuery(ucscDb, "show databases;")
  • step 3: disconnect using the dbDisconnect function. (It is very important that whenever you have finished analyzing or collecting data from a MySQL server, you disconnect from it.)
  • dbDisconnect(ucscDb)

    Or you can use the dbSendQuery function in step 2. The difference between dbGetQuery and dbSendQuery is:
    1. dbSendQuery only submits and synchronously executes the SQL statement on the database engine. It does not extract any records; for that you need the function dbFetch, and then you must call dbClearResult when you finish fetching the records you need.
    2. dbGetQuery comes with a default implementation that calls dbSendQuery, then, if dbHasCompleted is TRUE, uses fetch to return the results. on.exit is used to ensure the result set is always freed by dbClearResult. Subclasses should override this method only if they provide some sort of performance optimisation.
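    For example, the same query can be run either way; a minimal sketch, where con and someTable are placeholders for a dbConnect connection and one of its tables:
    # one call: submit the query, fetch all records, and free the result set
    result <- dbGetQuery(con, "select count(*) from someTable")
    # three calls: submit, fetch explicitly (dbFetch is the newer name for fetch),
    # then free the result set yourself
    res <- dbSendQuery(con, "select count(*) from someTable")
    result <- dbFetch(res)
    dbClearResult(res)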
    The code in steps 1–3 above lists all databases on the connection. If we want to focus on a specific database, we need to use the argument dbname to give the name of the database.
  • step 1: connect to a specific database using the dbConnect function with the dbname argument.
  • hg19 <- dbConnect(MySQL(), user="genome", dbname="hg19", host="genome-mysql.cse.ucsc.edu")
  • step 2: list all tables in the database hg19.
  • allTables <- dbListTables(hg19)
  • step 3: view the number of tables and the first few table names in hg19.
  • length(allTables) # number of tables (data frames) in database hg19
    allTables[1:4]
  • step 4: list the column names (fields) of a specific table, ‘affyU133Plus2’.
  • dbListFields(hg19, "affyU133Plus2") 
  • step 5: get the number of records in the table ‘affyU133Plus2’.
  • num <- dbGetQuery(hg19, "select count(*) from affyU133Plus2")
    num
  • step 6: get a full data frame out of the table. (The full table can be too much data for R, so warnings are temporarily suppressed.)
  • oldw <- getOption("warn")
    options(warn = -1)
    affyData <- dbReadTable(hg19, "affyU133Plus2")
    head(affyData)
    options(warn = oldw)
  • step 7: send a query using the dbSendQuery function.
  • oldw <- getOption("warn")
    options(warn = -1)
    query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
    options(warn = oldw)
  • step 8: retrieve the results via fetch. (The where clause subsets the data by the value of a feature.)
  • affyMis <- fetch(query)
  • step 9: retrieve only the next 10 entries. Note: you must clear the query afterwards. (This subsets via the observation dimension.)
  • affyMisSmall <- fetch(query, n = 10)
    dbClearResult(query)
  • step 10: close the connection
  • dbDisconnect(hg19)

    Reading data from HDF5


    What is HDF? HDF stands for Hierarchical Data Format, and HDF5 is a data model, library, and file format for storing and managing data. (More details can be found here.)
  • Used for storing large data sets
  • Supports storing a range of data types
  • Hierarchical data format
  • groups containing zero or more data sets and metadata
  • have a group header with group name and list of attributes
  • have a group symbol table with a list of objects in group

  • datasets: multidimensional arrays of data elements with metadata
  • have a header with name, datatype, dataspace, and storage layout
  • have a data array with the data


  • Now we begin to play with HDF5.
  • step 1: install the rhdf5 package, which is distributed through Bioconductor. (This will install packages from Bioconductor, which is primarily used for genomics but also has good “big data” packages.)
  • #source("http://bioconductor.org/biocLite.R")
    #biocLite("rhdf5")
  • step 2: create a new hdf5 file
  • library(rhdf5)
    created "./data/example.h5")
  • step 3: create groups(dataset + metadata) in the hdf5 file
  • created "./data/example.h5", "foo")
    created "./data/example.h5", "baa")
    created "./data/example.h5", "foo/foobaa")
  • step 4: list the groups of the HDF5 file
  • h5ls("./data/example.h5")
  • step 5: write content to the groups
  • # matrix
    A <- matrix(1:10, 5, 2)
    # write the matrix to a particular group
    h5write(A, "./data/example.h5", "foo/A")
    # multidimensional array (20 values to fill a 5 x 2 x 2 array)
    B <- array(seq(0.1, 2, by = 0.1), dim = c(5, 2, 2))
    # add an attribute
    attr(B, "scale") <- "liter"
    # write the array to a particular group
    h5write(B, "./data/example.h5", "foo/foobaa/B")
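    One caveat: by default h5write does not store R attributes such as scale in the file. Based on the rhdf5 documentation, passing write.attributes = TRUE should store them, and h5readAttributes reads them back; a minimal sketch:
    # store the R attribute "scale" alongside the data
    h5write(B, "./data/example.h5", "foo/foobaa/B", write.attributes = TRUE)
    # read the attributes back to verify
    h5readAttributes("./data/example.h5", "foo/foobaa/B")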
  • step 6: list the content in the HDF5 file
  • h5ls("./data/example.h5")
  • step 7: write a data set in the top-level group
  • # data frame
    df <- data.frame(1L:5L, seq(0, 1, length.out = 5), c("ab", "cde", "fghi", "a", "s"), stringsAsFactors = FALSE)
    # write the data frame to a particular group
    h5write(df, "./data/example.h5", "df")
    h5ls("./data/example.h5")
  • step 8: read content from the HDF5 file
  • # read HDF5 data
    readA <- h5read("./data/example.h5", "foo/A")
    readB <- h5read("./data/example.h5", "foo/foobaa/B")
    readdf <- h5read("./data/example.h5", "df")
    readA
  • step 9: change content in the HDF5 file. (The index argument writes only to the selected entries.)
  • h5write(c(12, 13, 15), "./data/example.h5", "foo/A", index = list(1:3, 1))
    h5read("./data/example.h5", "foo/A")

    Reading data from websites


    Webscraping: programmatically extracting data from the HTML code of a website
  • It can be a great way to get data
  • Many websites have information you may want to programmatically read
  • In some cases this is against the terms of service of the website
  • Attempting to read too many pages too quickly can get your IP address blocked (a polite pattern is sketched just after this list)
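    One simple way to avoid being blocked is to pause between requests; a minimal sketch, using hypothetical example.com URLs:
    # hypothetical list of pages to scrape politely
    urls <- c("http://example.com/page1", "http://example.com/page2")
    pages <- lapply(urls, function(u) {
        Sys.sleep(1)   # wait one second between requests
        readLines(u)   # readLines accepts a URL string directly
    })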

  • (1) the readLines function
  • step 1: open a connection to a URL using the url function
  • con <- url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")
  • step 2: read the data from the website
  • htmlCode <- readLines(con)
  • step 3: close connection
  • close(con)
  • step 4: view the content
  • #htmlCode

    (2) the XML package
  • step 1: give the URL to parse
  • library(XML)
    url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
  • step 2: parse the URL, using internal nodes to get the complete structure out
  • html <- htmlTreeParse(url, useInternalNodes = TRUE)
  • step 3: get the title of the page
  • xpathSApply(html, "//title", xmlValue)
  • step 4: get the number of citations
  • xpathSApply(html, "//td[@id='col-citedby']", xmlValue)

    (Maybe this last script no longer works, since the page structure may have changed, but it still gives a good picture of how to extract information from a website.)
    (3) the httr package
  • step 1: get the URL
  • library(httr)
    url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
    html2 <- GET(url)
  • step 2: extract the content from that page
  • content2 <- content(html2, as = "text")
  • step 3: parse out the content
  • parseHtml <- htmlParse(content2, asText = TRUE)
  • step 4: extract the title of the page
  • xpathSApply(parseHtml, "//title", xmlValue)
  • step 5: try to access a password-protected website
  • pg1 <- GET("http://httpbin.org/basic-auth/user/passwd")
  • step 6: view the result (the request fails without credentials)
  • pg1
  • step 7: authenticate with a username and password
  • pg2 <- GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
    pg2
  • step 8: get the names of the response elements
  • names(pg2)

    Reading data from APIs


    What is an API? API stands for application programming interface, through which pieces of software can interact with each other. For example, most internet companies, like Twitter or Facebook, have an application programming interface from which you can download data: data about which users are tweeting and what they are tweeting about, or information about what people are posting on Facebook.
  • step 1: create an account, not a user account but a developer account with the API team of the particular organization. Then create a new application, which will give you the keys used to authenticate the application through R and access data later. (This is done on the website.)
  • step 2: start the authorization process for your application
  • library(httr)
  • myapp = oauth_app("twitter",
                     key="YourConsumerKey",
                     secret="YourConsumerSecret")
  • step 3: sign the OAuth credentials for your application
  • sig = sign_oauth1.0(myapp, 
                        token="YourAccessToken", 
                        token_secret="YourAccessTokenSecret")
  • step 4: get content
  • homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
    # use the content function to extract the information
    json1 = content(homeTL)
    # convert json1 to a data frame via jsonlite
    json2 = jsonlite::fromJSON(jsonlite::toJSON(json1))
    json2[1, 1:4]
