Welcome to Highline BI348

class video number 12. If you want to download this

Excel file BI348 Chapter 02 Start or the finished one or

the PowerPoint BI348 Chapter 02, click on the link

below the video. We’re in chapter two. And in this chapter,

we want to talk about descriptive statistics. Now before we can

actually start learning how to chart and do

numerical measures, we want to talk about data. So I’m going to go

to slide number six. And we want to define, right

from the beginning, what is raw data. It is simply data stored

in its smallest size. No, this is not going to work. What we really want is this. So next week, when we learn how

to import clean and transform data, anytime we’re

given data like this, we’re actually going

to have to transform it into a proper data

set with field names at the top and records in rows. Why? Because over here, if we

had customer addresses and some sales

number, and we wanted to summarize by zip code, that’s

flat out just going to be hard. Whereas here, if we have

sales number over here, we can simply

summarize by zip code, because it’s got its own column. Why? Because it’s easier

to analyze data when it’s stored in its

smallest parts. Now data. Data, here straight from the

textbook, facts or figures collected, analyzed, and

summarized for presentation and interpretation. Another way to

think of data, it’s all the unorganized raw

data in the proper data set. We have our field

names at the top. Each record or observation

has four bits of raw data. Now before we go on to

define a proper data set, I actually want to

go over to Excel and talk about data types and

default alignment in Excel. And we’re going to jump

over to Excel for this. We’re on the sheet

Excel Data Types. And it’s very important

to know in Excel we do not have as many data

types as most database systems or even a feature in

Excel called Power Pivot. We are given text

numbers, Boolean– which is true or false– errors,

and empty cells is not really a data type, but

it’s something we have to be aware of,

because it can cause trouble when we’re doing data analysis. Now if I have text like

Excel, the default alignment is to the left. If I have a number, the default

alignment is to the right. If I have a Boolean

like true or false, it’s always going to be

capitalized in center. If I have an error, it’s

always going to be centered. And finally, there’s

an empty cell. Now this is important,

especially these two right here, because we are

dealing with data. And we’re going to import

data into Excel all the time. So here we simply want

to add Alt-equals. And I’m going to

redirect it so I don’t have an empty cell

right there and enter. No problem, the sum

function is programmed to understand

numbers and add them. But over here,

Alt-equals, it didn’t even see these as numbers. If I highlight them,

not knowing anything about the default alignment

in Excel, and hit Enter, I’m going to be baffled

that I see a zero here. This is a problem. Many times when we import data,

it will be imported as text. Now Excel really does

see this as text. And sometimes when

you import data, numbers will be stored as text. But watch this. If I come over here and on

the Home ribbon and I do left alignment, of course

that’s just alignment. And it’s going to have

nothing to do with the fact that those are still numbers. You don’t want to

left align numbers like that, because then we

lose our visual cue that they are, in fact, numbers. Now I’m going to Control-Z.

Visual cue very important. And we’ll learn next chapter how

to deal with bad data like this and convert it back

to proper numbers. Now why did the sum not add it? Because it’s programmed

to ignore text. Now I want to show you a

couple of other examples. Now here’s a data set. And I want to show you one of

the things that often plagues real world Excel spreadsheets. People love to do this. They love to go up to home

and alignment and center everything. I cannot tell you how many

spreadsheets out there when I’m doing consulting I see,

and everything’s centered. And they email, or you

go into their office and you try and track down

what’s causing the trouble. Here they’re

baffled, because they don’t understand why

the sum function isn’t seeing both of them. Well, if we had avoided

using alignment center– and I’m going to

Control-Z to undo that, Control-Z– we

would automatically have our visual cue. There it is, that’s

being considered text, and we’d have to fix it. So we want to avoid going

away from our default left for text, right for numbers. Now there are

exceptions, of course. Here’s an example of some

stocks in the industry. And now notice that we

have a letter there. So this letter means billion. So it’s sitting with a

number, so of course it’s going to be text. But here is an actual number. This is sort of confusing,

if you’re consuming it, looking at it as a report. So in this case, when

it’s the final product and you’re printing

it, then it sometimes makes sense to go up

and do your centering. But in general, if you can

avoid adding any extra alignment and keeping the

default alignment, things are made easier in Excel. Now I want to jump back

over to our PowerPoint. There is a summary slide there. I’m going to go to our next

slide, slide number nine. Now a proper data

set, 1, 2, 3, 4. Fields in the first

row, no empty cell. So you’ve got to have

field names or column headers– sometimes call

variables– in the first row. None of our features are going

to work like a Pivot table unless we have our field

names in the first row. Records or observations–

each one of these is a record or observation–

are going to be in rows. Empty cells or Excel

column row headers have to be all the way

around our data set. That’s particular to Excel. In other systems– we

don’t have to worry, because tables are kept

completely separate. But in Excel, we have

to make sure we either have the Excel row headers–

like 1, 2, 3, 4 and ABC– and empty cells all the way

around or our data analysis features, like Pivot table,

Sorting Filter, won’t work. And try not to have empty

cells in the data set. Again, if we have an empty

cell for a date or a number, when we go to use

a Pivot table, we get into some potential trouble. And we’ll see that later. Now in the working world, you

can’t always avoid empty cells in the data set. Now I want to go to the

next slide, number 10. It’s important that we

have the correction terms. These are called Fields. That is a databasing term that’s

been around for a long time. It’s inconsistently

used in Excel. In a Pivot table, it

will call these Fields. Some other features, it

will call them headers. But no problem, we’re

going to call them Fields. Another important term

is Record or Observation. In general these are called

Records in databasing. Our textbook calls

them Observations. Another important

term is Element. The definition of

an element, entities on which data are collected. Now our element in this data

set is transaction number. So for this transaction number,

we have three bits of raw data that we’re collecting. We need to know the date, the

sale, and the sales rep name. Transaction number

is the element. Now in databasing, this would

be called a primary key. We would call all of these

fields and our Pivot table– we’ll call these all

fields, for example. But a primary key is simply

a list of unique elements. Now if you think about

Highline College’s database, if you’re a student,

it has your student ID in a column that

says Student ID. Only one student

ID, and then they collect data on

that one student ID, so all the raw data gets

associated with you. Now in the textbook, they

call these variables. This first column is a

list of unique elements– in databasing,

called a primary key. Everything else is

called a variable. We’re pretty much always

going to call them fields. That’s the databasing term. Now, proper data set. Not all proper data sets for

us, as data analysts in Excel, are going to have a primary key

or a list of unique elements. Let’s go to slide number

12, that’s two slides ahead. Here’s a proper data

set with a primary key or list of unique elements. Boom, no problem, we

just saw that one. Let’s go to our next slide. This is a proper data

set with no primary key, no list of unique elements. This is the kind of data

we often get in Excel. And what are we going to do? Well, we can create a

unique list of elements by using a Pivot table. So we took this, dumped

it into the Pivot table, dragged product down to the row

area, and our sales over here. Instantly, we get a unique

list from our Product column. The Pivot table did that for us. So we’re going to have

proper data sets in Excel, as long as there’s Field Names

at the top, empty cells all the way around, records

in rows, and try not to have empty cells,

we’re good to go for analyzing our data in Excel. Now I want to go two slides

ahead to slide 15 and talk about variation. Now variation in our

prerequisite class, Business 210, we talked about all

sorts of cool calculations, like variation, standard

deviation, interquartile range. But variation is very important. It’s the difference

in the variable measured over observations. So these are

observations or records. It could be a

difference over time, like this data set right here. We are really interested in

our Yahoo adjusted close price, and how it’s changing over time. We’re interested

in the variation. Probably, if we’re

owning a stock, we would like it to go up

consistently over time, right? But oftentimes

that’s not the case. So we’ll learn later how

to measure this variation. We can also have

variation like differences between customers or products. So if I go back two slides,

let’s see, slide number 13, we might be interested

in the Quad boomerang. And for each one

of these records, there’s a different number here. So we might be interested

in the variation amongst our different products. Go back to slide number 15. The role of

descriptive statistics is often going to be to

collect past observed values for variables. Here’s a variable, or a field,

and we’re collecting past data. And then we’re going to analyze

that data to gain a better understanding of the

variation and its impact on a business

setting or situation. Another example of variation

in a business setting, one of our examples we’ll do

when we’re calculating standard deviation

in this chapter is we’ll look at two

different suppliers and how many days it takes them

to deliver our products to us. If one supplier has

much more variation, the other one has

much less variation. We’re probably going to want

the supplier with the smaller variation, because it’s going

to help us in our business situation to plan more easily. Now I want to go to our

next slide, slide number 16. Here’s something we

already are fluent in, the definition of

population and sample. Population is all the

elements, sample is a subset. And oftentimes, we’re going

to be dealing with samples. Next slide, 17. Terminology– quantitative

data, categorical data. We need to know these terms,

because these different types of data require that we use

different types of charts or numerical measures or

statistical techniques. Now quantitative data

just means number data. Categorical data just

means not number data. Quantitative could

be number of units. Categorical could

be product name. An example of a

numerical measure, where the two different

calculations we make are different is average. For quantitative data,

number, we use mean. Add them up, divide

by the count. For categorical data, when

we’re calculating an average, we use the mode, the one

that occurs more frequently. For charting

quantitative, often if we have continuous quantitative

data, the chart we would use is a histogram. That’s column chart

with no gap width. We’re asked for

categorical data. We need to visually

indicate that it’s categorical by using a column

chart with no gap width. Now within quantitative,

there’s discrete. That’s counting–

1, 2, 3– and then continuous quantitative data. That means there’s no

gaps between numbers, like weight, time, and money. And the actual number depends

on the measurement instrument. Now this is important,

because also we’ll have different calculations,

like in charting. For continuous

quantitative data, if we want to visually

portray frequency, we’re going to use a histogram. But wait a second. If we have discrete,

there’s actual gaps, so we will use a

column chart with gaps. So terminology, very important. Now last slide,

slide number 18, data terminology–

cross-sectional time series. Quite easy,

cross-sectional means we have several different

elements that we’re collecting various data points

on at the same or near the same point in time. So here we have various

stocks in the industry, on this particular day. Time series needs

no explanation. It means we’re collecting

data points over time. Here is Yahoo historical

price over a long time period. All right, that’s it for our

introduction to data and data sets and data terminology. Our next video, number

13, we will briefly look at sorting, filtering,

and Pivot tables for raw data. All right, we’ll

see you next video.

Question about termonology for you.

You call "raw data" the "data stored in its smallest form"

I've called "raw data" the "data as it sits in the database"

I bring this up because I often ask clients for "raw data" or "a data dump" that I then turn into the smallest form.

The data might sit in the database with middle initials mixed with last names, preventing the ability to sort by last name. Or, regular zip codes and zip+4 in the same column.

I'm dealing with a project now, and one column of data comes in like:

ChefsChoice:Veggie

ChefsChoice:Meat

SoupFor4:Meat

SoupFor2:Vegan

Bread

Salad

SoupFor2:Veggie

What I call "raw data" returns that list, which isn't immediately useful. But that's how it comes out of the database.

Part of my role is to identify the colon as a delimiter and get everything in the smallest form, in 2 different columns.

Do you make a distinction between the 2 types of data?

Great video! Never to much talking about proper data sets!

I am also using the term raw data when I consult.

The raw data is always the most useful because we can get anything we want out of it 🙂

Raw data rules!

please tell me which video editor software do you use?

Thanks for your sharing!! May I know what formula makes the red words show after cell F19 is typed? Thanks!!

Mike, how do you clear data from the data model after it changes at the source. I have used several queries using the raw data and adde to the data model. Now the data has changed the data model does not want to remove old data after I have refreshed every possible way.

Thanks