October 15, 2019
Basic Excel Business Analytics #12: Raw Data, Data, Proper Data Sets and Data Terminology

Basic Excel Business Analytics #12: Raw Data, Data, Proper Data Sets and Data Terminology


Welcome to Highline BI348
class video number 12. If you want to download this
Excel file BI348 Chapter 02 Start or the finished one or
the PowerPoint BI348 Chapter 02, click on the link
below the video. We’re in chapter two. And in this chapter,
we want to talk about descriptive statistics. Now before we can
actually start learning how to chart and do
numerical measures, we want to talk about data. So I’m going to go
to slide number six. And we want to define, right
from the beginning, what is raw data. It is simply data stored
in its smallest size. No, this is not going to work. What we really want is this. So next week, when we learn how
to import clean and transform data, anytime we’re
given data like this, we’re actually going
to have to transform it into a proper data
set with field names at the top and records in rows. Why? Because over here, if we
had customer addresses and some sales
number, and we wanted to summarize by zip code, that’s
flat out just going to be hard. Whereas here, if we have
sales number over here, we can simply
summarize by zip code, because it’s got its own column. Why? Because it’s easier
to analyze data when it’s stored in its
smallest parts. Now data. Data, here straight from the
textbook, facts or figures collected, analyzed, and
summarized for presentation and interpretation. Another way to
think of data, it’s all the unorganized raw
data in the proper data set. We have our field
names at the top. Each record or observation
has four bits of raw data. Now before we go on to
define a proper data set, I actually want to
go over to Excel and talk about data types and
default alignment in Excel. And we’re going to jump
over to Excel for this. We’re on the sheet
Excel Data Types. And it’s very important
to know in Excel we do not have as many data
types as most database systems or even a feature in
Excel called Power Pivot. We are given text
numbers, Boolean– which is true or false– errors,
and empty cells is not really a data type, but
it’s something we have to be aware of,
because it can cause trouble when we’re doing data analysis. Now if I have text like
Excel, the default alignment is to the left. If I have a number, the default
alignment is to the right. If I have a Boolean
like true or false, it’s always going to be
capitalized in center. If I have an error, it’s
always going to be centered. And finally, there’s
an empty cell. Now this is important,
especially these two right here, because we are
dealing with data. And we’re going to import
data into Excel all the time. So here we simply want
to add Alt-equals. And I’m going to
redirect it so I don’t have an empty cell
right there and enter. No problem, the sum
function is programmed to understand
numbers and add them. But over here,
Alt-equals, it didn’t even see these as numbers. If I highlight them,
not knowing anything about the default alignment
in Excel, and hit Enter, I’m going to be baffled
that I see a zero here. This is a problem. Many times when we import data,
it will be imported as text. Now Excel really does
see this as text. And sometimes when
you import data, numbers will be stored as text. But watch this. If I come over here and on
the Home ribbon and I do left alignment, of course
that’s just alignment. And it’s going to have
nothing to do with the fact that those are still numbers. You don’t want to
left align numbers like that, because then we
lose our visual cue that they are, in fact, numbers. Now I’m going to Control-Z.
Visual cue very important. And we’ll learn next chapter how
to deal with bad data like this and convert it back
to proper numbers. Now why did the sum not add it? Because it’s programmed
to ignore text. Now I want to show you a
couple of other examples. Now here’s a data set. And I want to show you one of
the things that often plagues real world Excel spreadsheets. People love to do this. They love to go up to home
and alignment and center everything. I cannot tell you how many
spreadsheets out there when I’m doing consulting I see,
and everything’s centered. And they email, or you
go into their office and you try and track down
what’s causing the trouble. Here they’re
baffled, because they don’t understand why
the sum function isn’t seeing both of them. Well, if we had avoided
using alignment center– and I’m going to
Control-Z to undo that, Control-Z– we
would automatically have our visual cue. There it is, that’s
being considered text, and we’d have to fix it. So we want to avoid going
away from our default left for text, right for numbers. Now there are
exceptions, of course. Here’s an example of some
stocks in the industry. And now notice that we
have a letter there. So this letter means billion. So it’s sitting with a
number, so of course it’s going to be text. But here is an actual number. This is sort of confusing,
if you’re consuming it, looking at it as a report. So in this case, when
it’s the final product and you’re printing
it, then it sometimes makes sense to go up
and do your centering. But in general, if you can
avoid adding any extra alignment and keeping the
default alignment, things are made easier in Excel. Now I want to jump back
over to our PowerPoint. There is a summary slide there. I’m going to go to our next
slide, slide number nine. Now a proper data
set, 1, 2, 3, 4. Fields in the first
row, no empty cell. So you’ve got to have
field names or column headers– sometimes call
variables– in the first row. None of our features are going
to work like a Pivot table unless we have our field
names in the first row. Records or observations–
each one of these is a record or observation–
are going to be in rows. Empty cells or Excel
column row headers have to be all the way
around our data set. That’s particular to Excel. In other systems– we
don’t have to worry, because tables are kept
completely separate. But in Excel, we have
to make sure we either have the Excel row headers–
like 1, 2, 3, 4 and ABC– and empty cells all the way
around or our data analysis features, like Pivot table,
Sorting Filter, won’t work. And try not to have empty
cells in the data set. Again, if we have an empty
cell for a date or a number, when we go to use
a Pivot table, we get into some potential trouble. And we’ll see that later. Now in the working world, you
can’t always avoid empty cells in the data set. Now I want to go to the
next slide, number 10. It’s important that we
have the correction terms. These are called Fields. That is a databasing term that’s
been around for a long time. It’s inconsistently
used in Excel. In a Pivot table, it
will call these Fields. Some other features, it
will call them headers. But no problem, we’re
going to call them Fields. Another important term
is Record or Observation. In general these are called
Records in databasing. Our textbook calls
them Observations. Another important
term is Element. The definition of
an element, entities on which data are collected. Now our element in this data
set is transaction number. So for this transaction number,
we have three bits of raw data that we’re collecting. We need to know the date, the
sale, and the sales rep name. Transaction number
is the element. Now in databasing, this would
be called a primary key. We would call all of these
fields and our Pivot table– we’ll call these all
fields, for example. But a primary key is simply
a list of unique elements. Now if you think about
Highline College’s database, if you’re a student,
it has your student ID in a column that
says Student ID. Only one student
ID, and then they collect data on
that one student ID, so all the raw data gets
associated with you. Now in the textbook, they
call these variables. This first column is a
list of unique elements– in databasing,
called a primary key. Everything else is
called a variable. We’re pretty much always
going to call them fields. That’s the databasing term. Now, proper data set. Not all proper data sets for
us, as data analysts in Excel, are going to have a primary key
or a list of unique elements. Let’s go to slide number
12, that’s two slides ahead. Here’s a proper data
set with a primary key or list of unique elements. Boom, no problem, we
just saw that one. Let’s go to our next slide. This is a proper data
set with no primary key, no list of unique elements. This is the kind of data
we often get in Excel. And what are we going to do? Well, we can create a
unique list of elements by using a Pivot table. So we took this, dumped
it into the Pivot table, dragged product down to the row
area, and our sales over here. Instantly, we get a unique
list from our Product column. The Pivot table did that for us. So we’re going to have
proper data sets in Excel, as long as there’s Field Names
at the top, empty cells all the way around, records
in rows, and try not to have empty cells,
we’re good to go for analyzing our data in Excel. Now I want to go two slides
ahead to slide 15 and talk about variation. Now variation in our
prerequisite class, Business 210, we talked about all
sorts of cool calculations, like variation, standard
deviation, interquartile range. But variation is very important. It’s the difference
in the variable measured over observations. So these are
observations or records. It could be a
difference over time, like this data set right here. We are really interested in
our Yahoo adjusted close price, and how it’s changing over time. We’re interested
in the variation. Probably, if we’re
owning a stock, we would like it to go up
consistently over time, right? But oftentimes
that’s not the case. So we’ll learn later how
to measure this variation. We can also have
variation like differences between customers or products. So if I go back two slides,
let’s see, slide number 13, we might be interested
in the Quad boomerang. And for each one
of these records, there’s a different number here. So we might be interested
in the variation amongst our different products. Go back to slide number 15. The role of
descriptive statistics is often going to be to
collect past observed values for variables. Here’s a variable, or a field,
and we’re collecting past data. And then we’re going to analyze
that data to gain a better understanding of the
variation and its impact on a business
setting or situation. Another example of variation
in a business setting, one of our examples we’ll do
when we’re calculating standard deviation
in this chapter is we’ll look at two
different suppliers and how many days it takes them
to deliver our products to us. If one supplier has
much more variation, the other one has
much less variation. We’re probably going to want
the supplier with the smaller variation, because it’s going
to help us in our business situation to plan more easily. Now I want to go to our
next slide, slide number 16. Here’s something we
already are fluent in, the definition of
population and sample. Population is all the
elements, sample is a subset. And oftentimes, we’re going
to be dealing with samples. Next slide, 17. Terminology– quantitative
data, categorical data. We need to know these terms,
because these different types of data require that we use
different types of charts or numerical measures or
statistical techniques. Now quantitative data
just means number data. Categorical data just
means not number data. Quantitative could
be number of units. Categorical could
be product name. An example of a
numerical measure, where the two different
calculations we make are different is average. For quantitative data,
number, we use mean. Add them up, divide
by the count. For categorical data, when
we’re calculating an average, we use the mode, the one
that occurs more frequently. For charting
quantitative, often if we have continuous quantitative
data, the chart we would use is a histogram. That’s column chart
with no gap width. We’re asked for
categorical data. We need to visually
indicate that it’s categorical by using a column
chart with no gap width. Now within quantitative,
there’s discrete. That’s counting–
1, 2, 3– and then continuous quantitative data. That means there’s no
gaps between numbers, like weight, time, and money. And the actual number depends
on the measurement instrument. Now this is important,
because also we’ll have different calculations,
like in charting. For continuous
quantitative data, if we want to visually
portray frequency, we’re going to use a histogram. But wait a second. If we have discrete,
there’s actual gaps, so we will use a
column chart with gaps. So terminology, very important. Now last slide,
slide number 18, data terminology–
cross-sectional time series. Quite easy,
cross-sectional means we have several different
elements that we’re collecting various data points
on at the same or near the same point in time. So here we have various
stocks in the industry, on this particular day. Time series needs
no explanation. It means we’re collecting
data points over time. Here is Yahoo historical
price over a long time period. All right, that’s it for our
introduction to data and data sets and data terminology. Our next video, number
13, we will briefly look at sorting, filtering,
and Pivot tables for raw data. All right, we’ll
see you next video.

6 thoughts on “Basic Excel Business Analytics #12: Raw Data, Data, Proper Data Sets and Data Terminology

  1. Question about termonology for you.

    You call "raw data" the "data stored in its smallest form"
    I've called "raw data" the "data as it sits in the database"

    I bring this up because I often ask clients for "raw data" or "a data dump" that I then turn into the smallest form.

    The data might sit in the database with middle initials mixed with last names, preventing the ability to sort by last name. Or, regular zip codes and zip+4 in the same column.

    I'm dealing with a project now, and one column of data comes in like:
    ChefsChoice:Veggie
    ChefsChoice:Meat
    SoupFor4:Meat
    SoupFor2:Vegan
    Bread
    Salad
    SoupFor2:Veggie

    What I call "raw data" returns that list, which isn't immediately useful. But that's how it comes out of the database.
    Part of my role is to identify the colon as a delimiter and get everything in the smallest form, in 2 different columns.

    Do you make a distinction between the 2 types of data?

  2. Great video! Never to much talking about proper data sets!
    I am also using the term raw data when I consult.
    The raw data is always the most useful because we can get anything we want out of it 🙂
    Raw data rules!

  3. Mike, how do you clear data from the data model after it changes at the source. I have used several queries using the raw data and adde to the data model. Now the data has changed the data model does not want to remove old data after I have refreshed every possible way.

Leave a Reply

Your email address will not be published. Required fields are marked *