The What, Why and How of Web Analytics
Modern Web Podcast Transcript
Calvin French-Owen, Andy Elliott and I discuss web analytics. We talk about what business metrics to measure and how to select analytics vendors for businesses large and small. We also discuss the technical challenges of analytics instrumentation, such as measuring web apps vs. websites, automation and data schema design.
Show Notes
Panelists
Links
- Segment
- Google Analytics
- Mixpanel pushing for meaningful metrics beyond pageviews
- Adobe Analytics (Omniture)
- Amplitude
- Customer.io
- Amazon Redshift
- Google BigQuery
- Google Analytics autotrack - automated instrumentation
- Heap
- Coverage on “data lake” 1 2
The following is a transcript of the interview from the episode.
Ray Shan
Hi everyone, welcome to the Modern Web podcast. I’m your host Ray Shan. In this episode, Calvin French-Owen, CTO of Segment, and Andy Elliott, Business Analyst at Google, join us to discuss analytics. We talk about what business metrics to measure and how to select analytics vendors for businesses large and small. We also discuss the technical challenges of analytics instrumentation, such as measuring web apps versus websites, automation, and data schema design. Enjoy the show.
Ray
Welcome to the show, Calvin and Andy.
Calvin French-Owen
Thank you.
Ray
Calvin, would you like to introduce yourself please?
Calvin
Yeah, absolutely. I’m Calvin French-Owen, CTO and cofounder of Segment. Segment is kind of a universal API for all of your analytics and customer data. Basically, you instrument some code and send data about what your users are doing to us once, and then we can send it to any number of services or data warehouses for you, all on the fly.
Ray
Cool. Andy, would you like to introduce yourself?
Andy Elliott
Sure. I’m a business analyst with Google. I work on the Play team, on our prepaid product that supports customers buying digital content in the Play Store.
Ray
Awesome. Well, welcome. We had this idea to invite you guys on to talk about analytics because we think analytics is a subject that all web engineers have to deal with on pretty much every single project. From a business perspective, if you cannot measure, you cannot understand what your customers are doing. But in my previous experience, both on the business side and as a web engineer, analytics has always been left to the very end of the project, right before shipping, and people just scramble and do whatever it takes to put it together. I think it will be interesting to talk about analytics from both a business perspective and a technical perspective: what to measure, how to measure, and how to analyze the data. So let’s start off by talking about what to measure. Typically, web engineers will get some sort of requirement from, say, product managers or business people. How does that typically work? Andy, I think you have a lot of experience in this area.
Andy
So I think for a lot of business folks, the KPIs or key metrics drive your business goals and the actions that you take, and they influence a lot of the decisions that folks within the business make about how to spend their time or their money. So they’re pretty important in terms of identifying the major driver for your business and what you’re going to focus on as success, whether that’s revenue or customers. I’ve heard from a number of different folks that the success of a business can be set up entirely by what it measures and by the incentives that this drives within the organization. I think typically for most companies there is a revenue goal, and the question is how you measure revenue. There is a customer goal that’s linked to that revenue goal, because customers are driving the top-line revenue numbers. People are looking for ways to influence that revenue number, and the way they choose to do that is through their marketing initiatives. They’re looking to figure out, if they spend money in certain places, what kind of benefit that has in terms of revenue. So when they’re setting up metrics, a lot of the time there’ll be a total top-line revenue number. They’ll try to disentangle that from the broader organization to look at their group’s contribution, and then they’ll try to tie that in some way to a customer who can be linked to a marketing initiative, so that they have some understanding of their investment and how it influences top-line revenue.
Ray
So how would we translate a revenue goal or a customer goal into web analytics? For a lot of web properties, such as Amazon or the Google Play Store, customers will go to the website and perform a number of activities. How would we go about translating these into the goals that the business is looking for?
Andy
So at least from my experience, it’s really been about looking at aggregate numbers to understand various segments. Typically a business will look at its new customers and existing customers, and then it will try to segment those, potentially by acquisition channel or by certain characteristics that the customers share, to identify groups of interest that may be responsive to certain initiatives or interested in certain product enhancements or new releases. So once you get the aggregate number, people are interested in a segmentation on top of it that they can then use to look at the business through.
Ray
Right. So in order to identify the customer segment, I think there will be some sort of profiling of the customers, right? Based on where the customers are coming from and what websites they visited, there are some ways to generalize about what kind of cohort a customer belongs to.
Andy
Yeah, that’s very true. I think it’s interesting because the top-line metrics are usually pretty easy to identify. People can tell where the sales are coming from, and they can usually tell what their strongest channel is as far as revenue. But the customer piece can be a little trickier, because you have to track customers across multiple properties. They can do things that show up in interesting ways in your data: they can show up multiple times, they can be very productive for your business in one period of time and less productive in others, and they can move between the products that you offer. So there’s a lot of complexity when you start looking at things from a customer perspective. That complexity gets incorporated into the way that you have to track the data and work with it from an analytics perspective, and also into how you present and explain it to your key stakeholders and what you recommend about how they work with those customer segments.
Ray
I think another important thing to track will be customer behaviors, maybe the more granular behaviors. When customers arrive on your site, before they make a purchase or otherwise achieve your business’s key performance metric, they will act in a certain way or perform a certain number of actions. I think that’s another thing that’s really important, and it can be quite challenging to track, because sometimes it’s not clear where it’s happening or which part of the app you should be measuring. Have you had a lot of experience dealing with these more granular things to measure?
Andy
Less so in my current role, but in prior roles, a lot of the time this is viewed as a funnel. You’ll say, here’s my funnel, and I’m trying to make sure that consumers drive towards a certain goal. You set that up as an objective, and then you look at where people are falling out in the process of moving towards that goal through your web or app experience. I think that’s a fairly common approach, but again, it tends to look at things in aggregate, and it gets more difficult when you’ve got certain segments that you want to track. For a lot of web-based businesses, the customers that bring the most value may be a small segment of the total population. So you’re really trying to track not the majority of your users, but a smaller, concentrated segment that is deeply engaged with your product or that is one of the major top-line revenue drivers.
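The funnel view Andy describes can be sketched in a few lines of Python. This is only an illustration: the step names and the `(user, step)` event shape are assumptions made up for the example, and the sketch ignores event ordering and time windows, which a real funnel report would usually take into account.

```python
# Hypothetical funnel steps; a real product would define its own.
FUNNEL = ["visit", "add_to_cart", "checkout", "purchase"]

def funnel_counts(events, steps=FUNNEL):
    """Count users who reached each step and every step before it."""
    seen = {}
    for user, step in events:
        seen.setdefault(user, set()).add(step)
    counts = []
    for i in range(len(steps)):
        required = set(steps[: i + 1])
        counts.append(sum(1 for reached in seen.values() if required <= reached))
    return counts

def drop_off(counts):
    """Fraction of users lost between each pair of consecutive steps."""
    return [
        0.0 if prev == 0 else 1 - cur / prev
        for prev, cur in zip(counts, counts[1:])
    ]
```

To look at a single segment rather than the aggregate, one option is to filter `events` down to the users in that segment before calling `funnel_counts`.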
Ray
Right. So that comes down to maybe still tracking everybody and everything, but then having enough data to be able to filter down to the segment that you’re looking for. Yeah, cool. Well, Calvin, I think you guys mainly deal with a developer audience, right? But do you also deal with business people, and talk to your customers about their business requirements and business needs?
Calvin
Absolutely. On our end, the two most common questions that people ask us are: one, what should I be tracking as a business and how do I make sense of my data, and two, what tools should I be using to really get the most out of it? Honestly, I think the reason it’s so hard for people to make sense of those things is that there’s not really a one-size-fits-all solution. For instance, we’re a B2B business, so we view everyone basically in terms of the companies that are using us. We say, oh, HotelTonight is different than Bonobos is different than Warby Parker, et cetera. Whereas a consumer app has a completely different business model, right? They have thousands or potentially millions of users who are maybe purchasing one-off items from an online store or buying subscriptions to a particular service.
Calvin
So they have to have a much different set of metrics, a much different set of goals, and different ways to analyze that data. Typically in both cases it will boil down to revenue, but the ways that they get there are very different. On our end, what we’ll do is measure the top-line things that Andy referred to earlier, like people adding their credit cards, signing up for a plan, or upgrading their plan, as well as things like signups. But then we’ll also track a long tail of user actions, like people loading particular pages, visiting our help documentation, or even emailing support. We try to take some of the expertise that we’ve seen across all of these different three or four thousand businesses that are using Segment, and we try to say: given that you’re a consumer app, a marketplace, a B2B app, or an enterprise app, here’s what we think are the most important things that we’ve seen. And by filling us in about your business, we can give you better ideas about how to track that information.
Ray
That’s pretty interesting, because I would think a lot of people would simply think of you guys as basically a tool vendor, right? But you guys also have to provide some sort of consulting service and help guide your customers in a certain direction.
Calvin
Yeah. I guess in our case, we don’t view it as a consulting service so much as just helping people get on board and be successful, because that’s ultimately what we want, right? We’re hoping that Segment is helping grow your business, and as a result we’re growing with you. So in our case, what we’re mainly focused on is figuring out how we can get these people in initially, thinking of us as a data router, and over time sending more and more of their data into Segment. It becomes this really powerful place where eventually all of your customer data lives. If you say, hey, I now want to put all that data into Amplitude because I want to analyze it in these new ways, you can immediately funnel it in and replay it there, and we’ll take care of piping it for you.
Ray
Right. I think you also brought up a really interesting point, which is that there are many different types of businesses, and also many different types of properties that people could be tracking. One of the trends over the past five or so years is that a lot of websites are not really websites anymore. They’re really web-based applications. If you think of Gmail, it’s a single-page app, as they call it, and it’s really not a website per se. In this area, Mixpanel, I believe, is one of the pioneers of thinking about instrumenting quote-unquote websites in a different manner, more like instrumenting an application such as an iOS or Android app. When Andy and I were at Shutterfly, I think there was probably a mixture of both. For certain parts of the property, we thought of it more as a website with different pages, but in certain other parts of the property we thought of them as apps. I know, Andy, you worked on the photo books section. The photo book editor is definitely a small application that runs on the website. So in that area, people think of it more as an application, and we think more in terms of what customers are doing, what actions they are performing with that part of the website.
Andy
Yeah, that was definitely true. That was where I was referring to the funnel view. We used to look at things in Omniture or another kind of web tool that would help us get a sense of where customers were going, where they were having difficulty with the site, and how we could try to improve people’s experience to get them from the start to the end of the process. I think that is always a goal. The challenge is that in all of these kinds of experiences, you rarely have an absolute, where everyone is running into the same issue and it becomes very obvious. Instead, certain groups are challenged by different parts of the site at different times. That makes the analysis more challenging: you have to pull out a segment and say that this group of users is either benefiting from or potentially having challenges with this particular part, so we should try to take some action on it.
Andy
I think it’s very much like marketing initiatives when you’re trying to send, say, email. One of the great things about email is that it’s very cheap to send, so people just include their entire customer database in every email. But clearly, if you’re more targeted, you can tailor offers to certain segments. I think you see an evolution in the way that people treat email lists, in that they start to get more selective about how they send out those offers. They’ll try to encourage new buyers with different sorts of offers than existing users, where perhaps they’re trying to broaden those users’ engagement with multiple products or get them to try different services. So, at least from my experience, the real challenge has been that it starts with this top-line aggregation, and you try to look for a story there. The business users typically come up with a narrative as to why things look the way they do and what should be done to improve them. But once you’ve made those first large steps or improvements, which are likely incremental, it becomes more difficult to know what to do next and how to make a meaningful impact across smaller segments as opposed to the entire audience.
Ray
Right. And I think you brought up another interesting point there, which is that people typically want to track what’s happening on the property itself, but even before people reach the property, whether it’s a website or a web app, it’s still interesting to know where they came from. Email would be one example, where you’d be entering through some sort of URL, and there will be some sort of campaign parameter that you can track. So you can think more in terms of marketing campaigns instead of just random people showing up on the site. Calvin, have you worked a lot with helping people figure out where their traffic is coming from?
Calvin
Yeah, we definitely have to some extent. In particular, we’ve tried to tailor our libraries and our API to exactly the problem that you’re describing, where a lot of these tools support only the idea of page views, and Mixpanel sort of pioneered this idea of, oh, we actually want events for things like a single-page app. Segment itself is kind of a hybrid of those things. We’re not a single-page app, but within the different pages on the website, we have different events that happen. So we end up making multiple calls through our API to say either, hey, here’s a page, or this is an event that happened separate from a page. I think that helps get around some of the discrepancy that you’re getting at, where old-style analytics tools only allow for page views, but that’s not really the granular information that you want.
Calvin
When it comes to actually figuring out where your users are coming from, we use the regular techniques that most tools on the market support right now, whether it’s via URL parameters like UTM campaign parameters or referrers, so that you can at least get a high-level view of the site immediately preceding the one that you’re on. We also support some of the IDFA tags on iOS and mobile, just to get a sense of where exactly the user was coming from, or whether they were sent via a deep link or something like that. But I do think it’s the biggest problem that everyone is trying to solve right now, at least from what we’ve seen among our customers. In particular, they want to know: depending on where I’m spending my marketing budget, of which I have a limited amount, how do I get the most users coming into my app, the most users coming into my website, and the most profitable users?
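As a rough illustration of the URL-parameter and referrer techniques Calvin mentions, the following Python sketch pulls the common UTM campaign parameters out of a landing URL and the host out of a referrer. The function name and the shape of the returned dictionary are assumptions for this example, not any vendor’s API.

```python
from urllib.parse import urlparse, parse_qs

# The five conventional UTM campaign parameters.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")

def attribution(landing_url, referrer=""):
    """Return the UTM parameters on the landing URL and the referring host."""
    query = parse_qs(urlparse(landing_url).query)
    # parse_qs returns a list per key; keep the first value of each UTM key.
    utm = {key: query[key][0] for key in UTM_KEYS if key in query}
    referrer_host = urlparse(referrer).hostname if referrer else None
    return {"utm": utm, "referrer_host": referrer_host}
```

A link in a newsletter might carry `?utm_source=newsletter&utm_medium=email&utm_campaign=spring`, which lets visits be grouped by campaign rather than treated as anonymous traffic.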
Calvin
And for them it’s honestly really hard to figure out, if someone saw an ad on their mobile device and then logged into their site elsewhere. Or take Atlassian, who’s one of our customers. They own hipchat.com, jira.com, confluence.com; they have this whole suite of different sites. For them it’s really hard to figure out, for a user who came to HipChat, where they get their own set of cookies that are uniquely identified by the domain, whether that user can now also go to Jira and still be tied together as the same user. So often that’s where we recommend sending the data through server-side tools and tying it together on the email, or, if you have an analyst, having them get the canonical user ID and keeping that all in a single database. But it’s definitely a really hard problem today, and it’s one that we’re trying to make steps forward to solve in ways that don’t compromise the privacy of the user.
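The idea of tying per-domain cookies together on the email can be sketched as follows. This is a minimal illustration under stated assumptions: the event fields (`domain`, `anonymous_id`, `email`) and the `user-N` ID format are invented for the example, and a production system would also need to address the privacy concerns Calvin raises.

```python
def stitch(events):
    """Map each (domain, anonymous_id) pair to one canonical ID per email."""
    canonical = {}  # normalized email -> canonical user ID
    mapping = {}    # (domain, anonymous_id) -> canonical user ID
    for event in events:
        email = event.get("email")
        if not email:
            continue  # no shared key, so this cookie cannot be tied to a user
        key = email.strip().lower()
        if key not in canonical:
            canonical[key] = f"user-{len(canonical) + 1}"
        mapping[(event["domain"], event["anonymous_id"])] = canonical[key]
    return mapping
```

Events that never carry an email stay unlinked, which is exactly the gap that makes cross-domain and cross-device attribution hard.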
Ray
Right. So we’ve talked about a few vendors already. Of course Segment is a vendor. We talked about Mixpanel. I’m a big fan of Google Analytics and have used it for years. Thanks to Google for making it free for everybody. Well, mostly free for everybody. At Shutterfly, we used Omniture and various other tools. Andy, what does Google use internally? Does Google actually just use Google Analytics internally, or does it have some sort of proprietary analytics stack?
Andy
I have to be especially careful commenting on Google’s tools. What I would say is that a lot of what you see Google present to the marketplace is used by Google directly.
Ray
I see. Yeah, that seems to be what I’ve heard from various people: Google has a very nice proprietary stack that does pretty much everything, so you can pretty much stay in the ecosystem. At Shutterfly, we didn’t actually work on the vendor selection. Correct me if I’m wrong, but when we got there we were already using Omniture, and some of the sub-properties, like Tiny Prints, were using Google Analytics when we acquired them. The way that we thought about vendor selection, I think a big part of it was how much access to the data we could actually get. One downside of using Google Analytics, at least the free version, is that you’re pretty much using it as a combination of instrumentation tooling, a data store, and analysis and reporting tools, which is really nice because everything comes in one box. But at Shutterfly, because we also had a lot of proprietary data pipelines, tooling, and analysis, we actually wanted to get the data in house. So we used various other vendors that allowed us to set up an ETL pipeline to get the data in house so we could do further analysis. Calvin, how do your customers typically think about vendor selection? Do people typically pick one or two? Or do people pick multiple and use you guys to spread the data around?
Calvin
Yeah. It’s actually kind of funny that you ask, because five years ago when we were first starting Segment, the whole impetus for the project was the fact that we were basically three college students straight out of college. We were looking around at these different tools as we were starting to build our startup, and we saw that there was Google Analytics out there, there’s Kissmetrics, there’s Mixpanel, and honestly we couldn’t tell the difference between the three of them, and we weren’t sure which one we should be using. I think we spent an hour or so checking out the different parts of them and poking around. Finally we just took the lazy engineer’s way out, where we said, okay, we’re going to write a layer of abstraction, send the same data to all three of them, and figure out later which one we actually want to use.
Calvin
And that’s consistently been the way that the analytics ecosystem has been going, particularly with our customers who are using our product. Instead of buying a massive suite like Omniture, Eloqua, or Responsys, these sort of old, archaic tools that require multimillion-dollar contracts a year and make you buy into everything, they can now say: if I’m looking for the best-in-class analytics tool, I can go with Mixpanel or Amplitude. If I’m looking for the best email tool, I can send emails through Customer.io. If I’m looking for the raw data, I can dump all that data into S3 or Redshift. I think our most sophisticated customers understand that if they really want best-in-class tools, they want to use things that are a lot more focused on the job that the tool is supposed to do.
Calvin
They’re sort of Unix-like in terms of philosophy. But the catch has always been getting the data into those tools, where the big suites have the advantage that all your data already exists there. What Segment is trying to do is allow you to get your data into any of these tools on your terms and use it however you want. We’ve seen a lot of that in the open source ecosystem as well. Obviously we have our analytics.js library, which you can use free of charge. It gives you a single API from the client over these sorts of tools. There’s one for Ruby called Analytical. There’s one for iOS called ARAnalytics. Basically, no matter what platform you’re using, it’s gotten easier and easier to switch where data is going. We find our customers taking advantage of that: the most sophisticated ones are setting up a custom pipeline that’s sending to maybe 20 or 30 different tools, all through different avenues, and then different people on those teams are using it. Salespeople are using Salesforce, support people are using Zendesk, customer analytics people are using Mixpanel, and the deeper analysts are using Redshift.
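The “layer of abstraction” that sends the same data to several tools can be sketched as a small fan-out router. This is a hypothetical illustration only: the `Router` class, the payload field names, and the destination callables are assumptions for the example and do not reproduce Segment’s or any other library’s actual API.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, object]
Destination = Callable[[Event], None]

class Router:
    """Tiny fan-out layer: one track() call, many destinations."""

    def __init__(self) -> None:
        self._destinations: Dict[str, Destination] = {}

    def add(self, name: str, send: Destination) -> None:
        """Register a destination under a name; `send` delivers one event."""
        self._destinations[name] = send

    def track(self, user_id: str, event: str,
              properties: Optional[Dict[str, object]] = None) -> List[str]:
        """Send one event to every destination; return names that failed."""
        payload: Event = {
            "type": "track",
            "userId": user_id,
            "event": event,
            "properties": properties or {},
        }
        failed = []
        for name, send in self._destinations.items():
            try:
                send(dict(payload))  # shallow copy per destination
            except Exception:
                failed.append(name)  # one broken tool must not block the rest
        return failed
```

With this shape, switching or adding a tool means registering another destination, while the instrumentation calls in the application stay unchanged.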
Ray
Right. And how do business analysts typically think about the tooling and the vendor selection? I think a lot of the ones that we mentioned are mainly for instrumentation, which might be more of an engineering requirement. But of course the tool chain is quite long. It’s not just the Mixpanels, but also the various reporting tools. So what kind of business requirements do people typically think about when they think about their selection?
Andy
So, remembering our days together, I would say that most things probably start out in an Excel spreadsheet of some kind, or the equivalent in a Google Doc. From an analyst’s perspective, there are certain layers that you end up going through in order to get to those business insights. I think the biggest impact that an analyst can have on a business is to help change the perception of what’s driving the business, or to bring attention to an area that people hadn’t thought about before in terms of the dynamics of what actually makes a meaningful difference in some of those key metrics we talked about earlier. In order to do that, you have to start with the data. Typically, an analyst will have to spend time understanding what’s there, what’s incomplete, and what else he or she needs to put together to answer the question they’re focused on.
Andy
Then it’s taking that data and turning it into the dimensions and metrics that you need. From there you can build a dashboard, which in most cases starts out as some sort of spreadsheet version. After you have that dashboard level, you can start to get into the analysis, dig deeper, interrogate the data further, and hopefully come up with some insights and recommendations. But I think each of those levels, at least from an analyst’s perspective, takes an incrementally larger amount of effort. So the success of an analyst really starts with the data. Within the companies that I’ve been in, there are often a lot of resources put into getting that data into a usable format. That can mean bringing it together in a data warehouse, or making sure that you can bring together different pieces of customer behavior.
Andy
So those are the things that I guess are most important from an analyst’s perspective in terms of becoming more efficient and proficient. As for the tools, as Calvin mentioned, there obviously have been a lot of changes, and some of the tools that his customers are working with are likely choices that they made earlier on, when those were best in class. Once you’ve made that decision, in some sense you’re tied to that ecosystem, both through the technology commitment and through the stakeholders’ investment in using the interfaces and understanding how to extract the data and manipulate it in the tools. So making it easier to switch between things is, I think, very powerful in terms of giving people the freedom to use the best product, whether it’s to send an email or to do analysis.
Andy
The trend that I’ve seen, at least in my career in analytics, which is really over the last five years, has been that people are more focused on automating analysis, either through more advanced dashboarding tools like Tableau or other tools that we use within Google. I think it’s becoming more programmatic, both in terms of fetching the data and in turning that into an analysis, and having that be repeatable. So you take what used to be several SQL queries that are dumped into a spreadsheet and then pivoted to look at the final output, and you have that run through a pipeline that starts from extracting the data, does your transformations on it, and then presents the output.
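The extract, transform, and present pipeline Andy describes might look something like this Python sketch. It uses the standard library’s `sqlite3` as a stand-in for a real warehouse; the `orders` table and its `channel`, `month`, and `revenue` columns are assumptions invented for the example.

```python
import sqlite3
from collections import defaultdict

def extract(conn):
    """Run the SQL that used to be pasted into a spreadsheet by hand."""
    return conn.execute(
        "SELECT channel, month, SUM(revenue) FROM orders GROUP BY channel, month"
    ).fetchall()

def transform(rows):
    """Pivot long rows into one row per channel with one column per month."""
    table = defaultdict(dict)
    for channel, month, revenue in rows:
        table[channel][month] = revenue
    return table

def present(table):
    """Render the pivot as CSV text; missing cells are shown as 0."""
    months = sorted({m for row in table.values() for m in row})
    lines = ["channel," + ",".join(months)]
    for channel in sorted(table):
        cells = [str(table[channel].get(m, 0)) for m in months]
        lines.append(channel + "," + ",".join(cells))
    return "\n".join(lines)

def run(conn):
    """The whole repeatable pipeline in one call."""
    return present(transform(extract(conn)))
```

Because each stage is a plain function, the same report can be re-run on a schedule instead of being rebuilt by hand each time.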
Calvin
Yeah. And just to add to what Andy said, one of the things that I’ve been most surprised at as we’ve grown and matured our customer base over time is just how much gravity data has. We have a couple of customers who are publicly traded companies with 15-year-old Omniture installations. They’re doing all of their public reporting based upon that data. It’s basically their source of truth for how well the company is doing, and that’s what they report to shareholders. So removing it, tweaking it, or swapping it out is something that’s very, very sensitive for them to do. More often than not, they end up keeping whatever they have, but adding tools as they need them.
Ray
Yeah, that makes sense. I think a lot of what we’re saying echoes the message that, for large companies, having access to the raw data is very important. It’s important for the business, and it’s also important for the analysts to be able to build custom reporting. How about smaller companies, like startups? At the company that I’m at, we’re about 20 people, and I think we definitely think about the tooling and the data a little differently. Andy, when you worked on your startup and were in a smaller environment, did you think about analytics and tooling a little differently than how you think about it at a much larger company?
Andy
I think definitely. At a larger company, you have an established business and you understand, as Steve Blank would say, how to make it a repeatable, ongoing enterprise. When you’re in a startup, I think it’s very clear whether you’re succeeding or not, based on whether you’re bringing in more revenue than you’re burning, or whether you see a path to that. In a startup you also need less of the nuance that data provides at scale. In a very large business like Google, if your data can help you grow a few percent more each quarter or each year, that has a profound impact on your top-line revenue. Whereas in a startup, where your revenue is much smaller, that percentage impact isn’t going to be significant enough. What’s going to make a bigger difference is acquiring a new customer. I guess maybe the exception is if you’re a web-based business and you have to think about things like customer acquisition: if you’ve got a budget that you need to spend, how can you use it most efficiently? So I think that’s probably where a startup would benefit the most from thinking about how to use their data.
Ray
Right. For us, I think a lot of what we think about is ease of use, probably for two reasons. One is that people just want to keep it simple, because everyone has a lot of other things to do. Two is just cost, because at a much smaller company there isn’t a team of dedicated analysts, or even an engineering team dedicated to instrumentation, which is why we use Segment. Calvin, what do your smaller customers typically ask for?
Calvin
Yeah. Typically most of them will start out basically with just the bare minimum of Google Analytics, and they’ll send data through to Segment, pass it through to Google Analytics, and they’ll get enough reporting to kind of satisfy their needs. And then over time as they grow, they’ll typically start layering in tools that are maybe a little more specialized, so adding Kissmetrics or Amplitude for analysis and Customer.io or Vero for sending emails, et cetera. And we found kind of empirically that around the time they hit maybe 25, 30 people, there are questions that none of those tools do a great job answering because they’re so specific and so specifically tuned to the business. And for that, we’ve seen a real increase in adoption of Redshift, which is Amazon’s hosted data warehouse, where basically for under a thousand dollars a year, you can get something that scales to terabytes of data nicely.
Calvin
Uh, we’ve seen more of them using BigQuery, which is Google’s hosted solution, uh, which also allows you to do a very similar thing, but it’s even cheaper because you only pay for what you query. And typically they’ll take those kind of raw tools where they can ask any question of any of the data that they’ve collected and then layer on a visualization layer like Looker or Mode or Tableau, uh, basically any of these tools for kind of generating and sharing these reports, and then they’ll pass those around. And honestly that’s the stage that we’re in right now, um, along with maybe some more slightly mad-scientist approaches where we’re also sending some parts of the data into Zapier and then connecting it to some custom functions.
Ray
Right. Let’s talk a little bit about the actual implementation of the instrumentation. So after having worked with various different instrumentation libraries like Google Analytics or Segment, a lot of it ended up looking very, very similar. Um, which is, I’m sure the idea of building an abstraction layer to route data to different services has been tried, I’m sure, by many, many different businesses, and you guys cover them all. Uh, but, um, what are some of the challenges people typically run into with analytics instrumentation?
Calvin
I’d say by far the biggest one comes down to the schema and how the data is represented. Uh, I can’t tell you how many times customers have said, oh, we were sending this one field which was really important to monitoring our revenue or our signups or our customer engagement, and then we had some sort of code push that just erased it or messed up the metric or stopped it sending somehow. And it’s a real problem because if you’re not collecting the data at the edge layer, when the user is first loading that page or performing that action, then at that point you’ve just lost it forever and there’s really nothing that you can do. So I’d say the ability to lock your schema, to report when there are anomalies, when suddenly you get a big spike of events, those are kind of the biggest issues with instrumentation that people are going through.
Calvin
As far as actually installing the code, uh, most people end up being able to do that fairly easily and in a fairly straightforward manner. Like what you said, I think maybe 95 percent of companies out there that are above 20 or 30 people, they’ve had some sort of half-baked implementation of Segment internally, if they’re not using us already, where they basically kind of multiplex all of this data depending on where it needs to go. So I’d say those are kind of the main ones. The last issue that people run into often is knowing where it makes sense to collect the data from, because when you’re sending data into these tools, you can choose: oh, do I send it directly from the page where there’s JavaScript? You can send it from your back end if you’re sending via Ruby or Go or Python code anywhere on the back end, or you can kind of queue it up and then send it from some sort of delayed background job or process.
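The two ideas Calvin describes here, locking a schema and multiplexing one event to several tools, can be sketched in a few lines of Python. This is only an illustration under assumptions: `LOCKED_SCHEMA`, `validate`, `Router`, and the destination callables are hypothetical names made up for this sketch, not Segment's actual API.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]

# Hypothetical "locked" schema: event name -> required property names and types.
LOCKED_SCHEMA: Dict[str, Dict[str, type]] = {
    "Signed Up": {"plan": str},
    "Order Completed": {"revenue": float, "currency": str},
}


def validate(event: Event) -> List[str]:
    """Return a list of schema violations; an empty list means the event is valid."""
    required = LOCKED_SCHEMA.get(event.get("event"))
    if required is None:
        return [f"unknown event: {event.get('event')!r}"]
    props = event.get("properties", {})
    problems = []
    for name, expected in required.items():
        if name not in props:
            problems.append(f"missing property: {name}")
        elif not isinstance(props[name], expected):
            problems.append(f"wrong type for {name}")
    return problems


class Router:
    """Send each valid event to every registered destination (the multiplexing step)."""

    def __init__(self) -> None:
        self.destinations: List[Callable[[Event], None]] = []
        self.rejected: List[Tuple[Event, List[str]]] = []

    def add_destination(self, send: Callable[[Event], None]) -> None:
        self.destinations.append(send)

    def track(self, event: Event) -> bool:
        problems = validate(event)
        if problems:
            # Keep the bad event and the reasons so the breakage can be reported.
            self.rejected.append((event, problems))
            return False
        for send in self.destinations:
            send(event)
        return True
```

In use, each destination would be a small function that forwards to one vendor; a code push that drops the `revenue` property would show up in `rejected` instead of silently corrupting the metric.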
Calvin
What ends up being right ultimately depends a lot on the business as well. There’s kind of no silver bullet there either. For the really early businesses, it just makes sense to drop some JavaScript on the page. Like you said, it’s super easy. Uh, they can track all the events there, and the fidelity doesn’t matter as much if people are blocking JavaScript or using an ad blocker or the requests just fail before they get to Segment. For things which you need to know are absolutely right and where you need to have very high fidelity of data, we’d recommend running in your server-side processes. So those might be payments data, which is only running after the page has submitted, signups, logins, those sorts of things.
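As a rough illustration of the "queue it up and send it from a background job" option for server-side tracking, here is a minimal Python sketch. The class name `BackgroundSender` and the `deliver` callback are assumptions made for this example; a real deployment would more likely use a job queue and add retries.

```python
import queue
import threading
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


class BackgroundSender:
    """Queue events in-process and deliver them in batches from a worker thread."""

    def __init__(self, deliver: Callable[[List[Event]], None], batch_size: int = 20) -> None:
        self._deliver = deliver  # hypothetical function that talks to the analytics vendor
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def track(self, event: Event) -> None:
        """Called from request handlers; returns immediately without network I/O."""
        self._queue.put(event)

    def close(self) -> None:
        """Flush whatever is still queued and stop the worker."""
        self._queue.put(None)  # sentinel telling the worker to finish
        self._worker.join()

    def _run(self) -> None:
        batch: List[Event] = []
        while True:
            item = self._queue.get()
            if item is None:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._deliver(batch)
                batch = []
        if batch:
            self._deliver(batch)  # send the final partial batch
```

The request path only pays for a queue `put`, so a slow or failing analytics endpoint does not slow down the signup or payment itself.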
Ray
Right. I do agree that the actual instrumentation isn’t that technically challenging. If anything, it just ends up being quite a bit of busy work, because when I instrument it myself, I just basically have a giant list of things I need to track and then I look through the code and figure out where those events are happening, where the conversion is happening, and then I just add it. So it’s just quite a bit of busy work, um, which is pretty interesting because you would think that if it’s very repetitive, it can be automated. Uh, and there are services who are trying to do something like this. So Google Analytics just came out with something called autotrack, which I haven’t looked too deeply into, but my assumption would be that it just looks at what events are happening and automatically sends them. Um, so perhaps if the naming convention is good, then you just automagically have these events show up on your dashboard. Um, have you guys thought about auto tracking, or have you guys worked with vendors who do this kind of auto tracking?
Calvin
We have. So the most popular service that has basically staked their claim when it comes to auto tracking is this analytics tool called Heap, H-E-A-P, and their whole philosophy is you should track everything. It should be tracked automatically, and then it’s up to you to take bits and pieces of that and say, oh, this click on this button, this actually corresponds to my signup event. Uh, but I can add that after the fact, later, because Heap was already collecting every single button click, and then all I have to do is figure out in Heap which button corresponds to signup. And we actually have seen really good adoption of their tool. Uh, it’s definitely one of the most actively growing integrations that Segment supports. And people really seem to love it for that reason, that they no longer have to instrument as many things.
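The "capture everything, name it later" idea can be illustrated with a small Python sketch. The record shape, the CSS selectors, and the `events_named` helper are all hypothetical and are not Heap's actual data model or API.

```python
from typing import Dict, List

# Raw, automatically captured clicks (hypothetical record shape).
raw_clicks: List[Dict[str, str]] = [
    {"user_id": "u1", "selector": "#signup-button", "ts": "2016-03-01T10:00:00"},
    {"user_id": "u1", "selector": "#nav-pricing", "ts": "2016-03-01T10:01:00"},
    {"user_id": "u2", "selector": "#signup-button", "ts": "2016-03-02T09:30:00"},
]

# Definitions added after the fact: business event name -> UI selector.
definitions: Dict[str, str] = {"Signed Up": "#signup-button"}


def events_named(name: str, clicks: List[Dict[str, str]], defs: Dict[str, str]) -> List[Dict[str, str]]:
    """Apply a later-defined event name to clicks that were already collected."""
    selector = defs[name]
    return [click for click in clicks if click["selector"] == selector]


signups = events_named("Signed Up", raw_clicks, definitions)
```

Because the definition is only a mapping onto data that already exists, it also covers clicks recorded before anyone decided that the button meant "Signed Up".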
Calvin
We’ve had a lot of internal debates about whether this is a good feature to support. Um, recently we had John Hopkins, who just started as the director of growth at Time Inc., come into Segment and talk a little bit about how he viewed tracking and philosophy for a company that’s several orders of magnitude bigger than we are and hosts a lot more traffic. And the thing that he said that I thought was particularly interesting is that he’d tried this idea of a data lake that’s now becoming popular, where for a long time people thought, okay, you have to very carefully organize your data and put it into a warehouse and make sure it fits a certain schema. The data lake goes against that idea and it just says, hey, dump a bunch of data into this format and then we don’t care exactly what it looks like.
Calvin
Just put it there so you have the raw thing, and then at query time we’ll actually put it into the correct schema, clean it, filter it, and we’ll give you something that looks good. But because you’re constructing it from the raw data, you can always get back everything you need. And his take on it was basically that the data lake just doesn’t work. Uh, he was like, no, we basically just end up with a bunch of garbage. Uh, it takes a long time for anyone to make heads or tails of it, no one’s sure what they want. And so we ended up with 95 percent useless stuff that just adds to cognitive overhead while we’re trying to get at that real five percent of good data. So for the most part, that’s where we’ve kind of been aligned as well, where we say, hey, the only way that you’re going to get really good data in here is if you actually tie it to the code, because the code is representing what’s actually happening.
Calvin
It’s under the surface actually performing those API calls. Maybe you’re tying to UI elements; chances are those could change if you’re A/B testing or modifying the way that your page looks, such that even though the functionality doesn’t change, the UI does. That said, uh, I know Heap has done this, Mixpanel has also added kind of a visual tracker, so you can say for your iOS app, you don’t need to add any code. Google Analytics has started going this route. Uh, and I do think it’s really powerful, especially for people who, uh, are not developers, who are just PMs or marketers or analysts and want just to get their data into a tool. I think for them it really shines. So it is definitely a feature that we’ve talked about adding to lower the barrier to entry.
Ray
Right. I think that was a good point about the quote-unquote data lake perhaps, um, having too low of a signal-to-noise ratio. Uh, we do something similar. We try to basically dump all of our data into Redshift, uh, in the hopes that one day we’ll get round to analyzing it, but at the end of the day, going back to what I was saying, we’re just trying to keep it simple. We basically look at the main things that Google Analytics can report, um, because we don’t have a dedicated team to do something like this. Of course, if we can grow to Shutterfly size, then we can afford to analyze all the data, uh, but that is just sort of a dream for the eventual future. And let’s also talk a little bit about the data schema design, because that definitely impacts how data goes into the backend data store. And from a business analyst perspective, what are some of the gotchas that you have to keep in mind when you think about data schema for analytics data?
Andy
Uh, well, I think some of the things that you learn as sort of a novice analyst is, uh, the challenges of doing joins across tables. So you can definitely, if you have the wrong schema, right, and you try and do a join, you can end up with unusual results that show that your users are spending hundreds of times more than they actually are. Uh, you can also kind of... yeah, I think that’s where you learn the lesson of sort of validation and triangulation. So I think every good analyst probably checks, uh, their results against another thing, or at least tries to think, like, does this ratio of revenue to customers make sense? Tries to do some sort of a derived metric to just double-check their results. So I think that’s probably the biggest thing. I think the other thing is that if you have, you know, a very raw schema that shows, uh, the individual events, for example, that happened, you can spend a lot of time sort of with each analysis trying to roll those up and have less time for analysis.
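The join problem Andy mentions is easy to reproduce. The following Python sketch uses the standard-library `sqlite3` module with made-up tables (`orders`, `sessions`) and made-up numbers, purely to show how a one-to-many join inflates revenue and how a second, independent figure catches it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id TEXT, amount REAL);
CREATE TABLE sessions (session_id INTEGER PRIMARY KEY, user_id TEXT);
INSERT INTO orders VALUES (1, 'u1', 10.0), (2, 'u2', 20.0);
INSERT INTO sessions VALUES (1, 'u1'), (2, 'u1'), (3, 'u1'), (4, 'u2');
""")

# Naive join: every order row is repeated once per session of the same user.
inflated = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN sessions s ON s.user_id = o.user_id
""").fetchone()[0]

# Safer: reduce each table to one row per user before joining.
correct = conn.execute("""
    SELECT SUM(o.revenue)
    FROM (SELECT user_id, SUM(amount) AS revenue FROM orders GROUP BY user_id) o
    JOIN (SELECT user_id, COUNT(*) AS n FROM sessions GROUP BY user_id) s
      ON s.user_id = o.user_id
""").fetchone()[0]

# Triangulation: compare against a figure computed without any join.
baseline = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
customers = conn.execute("SELECT COUNT(DISTINCT user_id) FROM orders").fetchone()[0]
revenue_per_customer = baseline / customers  # derived metric for a sanity check
```

Here the naive query reports 50.0 while the joinless baseline is 30.0, which is exactly the kind of mismatch the revenue-per-customer sanity check is meant to surface.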
Andy
So when I do some interviews, one of the challenges that I like to give people is a question around: I have these tables, how can I get this particular view? And really the challenge there is that you’ve got to create another table from the tables that you’re given to help you bring the data together in a way that allows you to avoid some of, you know, the problems of either missing data or, um, kind of these join problems. And I think that’s probably a sign that someone has been in the trenches and sort of recognizes some of the challenges of raw data. Um, one of probably the easiest examples is we get a lot of event data, right, which will show this customer did this at this time, and then, you know, a few seconds later or minutes later they’ll do something else. And what you’re often looking for is like the last thing that the customer did or the state that they were in, um, at that last time period.
Andy
And so you can either write SQL to try and take like the maximum event or the event at the maximum time, or you can kind of create a view on top of that raw data that gives that to you so you don’t have to perform that step every time in the analysis. So I think those are the things that I try and keep in mind, uh, when I’m doing analysis. But, um, you know, as was mentioned, it’s always helpful to be able to go back and look at the raw data, because sometimes you’ll find that things haven’t come through in the right way, that some pipeline has gotten broken and you’re missing a portion of data, and at these higher-level tables it’s difficult to know, um, what exactly caused that. But when you go down to the detail, you can start to see these holes more easily in the data.
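A minimal sketch of the "event at the maximum time" view, again using Python's built-in `sqlite3` with an invented `events` table. The table, column names, and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, event TEXT, ts TEXT);
INSERT INTO events VALUES
  ('u1', 'Viewed Page',   '2016-03-01 10:00:00'),
  ('u1', 'Added To Cart', '2016-03-01 10:00:05'),
  ('u2', 'Viewed Page',   '2016-03-01 11:00:00'),
  ('u2', 'Purchased',     '2016-03-01 11:02:00');

-- One row per user: the event whose timestamp equals that user's MAX(ts).
-- If two events share the same latest timestamp, both rows are returned.
CREATE VIEW last_event AS
SELECT e.user_id, e.event, e.ts
FROM events e
JOIN (SELECT user_id, MAX(ts) AS max_ts FROM events GROUP BY user_id) m
  ON m.user_id = e.user_id AND m.max_ts = e.ts;
""")

rows = conn.execute(
    "SELECT user_id, event FROM last_event ORDER BY user_id"
).fetchall()
```

Analysts can then select from `last_event` directly instead of repeating the subquery in every analysis, while the raw `events` table stays available for debugging broken pipelines.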
Ray
Right. Definitely. Um, I think another thing to think about, uh, when you’re creating these like summarized tables or derived tables is performance. Uh, and it’s important for reporting tools, because we just went through something similar at work where, uh, we were literally doing some customer reporting on top of the raw data, uh, and it’s just simply too large. It becomes both a time constraint and a cost constraint, um, because, uh, a lot of the data stores now charge you for egress of data. So the simple solution is basically create a summary table where you pull data from, so everything becomes a lot faster.
Andy
Yeah, I guess the other thing I would say is that, you know, as you work with multiple analysts as well, it’s very helpful to have some summary tables to go off of, right? So that people are starting from an agreed-upon, like, place as opposed to all the way down at the raw level, and then you wind up with, like, different revenue numbers because certain exceptions are being treated differently.
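To make the summary-table point concrete, here is a small Python and SQLite sketch. The `raw_orders` table, the `status` values, and the rule that only `'paid'` orders count as revenue are all assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (order_id INTEGER, day TEXT, amount REAL, status TEXT);
INSERT INTO raw_orders VALUES
  (1, '2016-03-01', 10.0, 'paid'),
  (2, '2016-03-01', 25.0, 'refunded'),
  (3, '2016-03-02', 40.0, 'paid'),
  (4, '2016-03-02',  5.0, 'test');

-- The agreed-upon rule lives in one place: only 'paid' orders count as revenue.
CREATE TABLE daily_revenue AS
SELECT day, SUM(amount) AS revenue, COUNT(*) AS orders
FROM raw_orders
WHERE status = 'paid'
GROUP BY day;
""")

# Reports read the small summary table instead of scanning the raw data.
summary = conn.execute(
    "SELECT day, revenue, orders FROM daily_revenue ORDER BY day"
).fetchall()
```

Two analysts querying `daily_revenue` get the same revenue figure, whereas two analysts querying `raw_orders` might each handle refunds and test orders differently.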
Ray
Right. Calvin, what are your thoughts on data schema design for analytics?
Calvin
And this was something that we thought a lot about in the initial version of Segment. We basically had, I think it was two different types of calls, where we’d say one is just track, which tracks an event about your user. So you always pass us a user ID, an event name that you want to track, and then some properties related to that track. And then the second, uh, we called identify, as in saying, hey, I want to identify this user or tag them with specific, what we call, traits, which are just things like email, first name, last name, account, anything that you kind of want to paint the user with. And we expanded that set of calls over time to include things like the page or the screen that they’re looking at, uh, as well as the group that they’re a part of, if they’re part of a business or an account or something like that.
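The shape of the two calls described here can be sketched as plain Python functions that build payloads. These helpers and their field names (`userId`, `properties`, `traits`) are illustrative assumptions for this sketch; they are not a Segment client library and do not send anything over the network.

```python
from typing import Any, Dict, Optional


def track(user_id: str, event: str, properties: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Describe something a user did: who, what, and details about the action."""
    return {
        "type": "track",
        "userId": user_id,
        "event": event,
        "properties": properties or {},
    }


def identify(user_id: str, traits: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Describe who a user is by attaching traits such as email or name."""
    return {
        "type": "identify",
        "userId": user_id,
        "traits": traits or {},
    }


signup = track("u1", "Signed Up", {"plan": "pro"})
profile = identify("u1", {"email": "ada@example.com", "first_name": "Ada"})
```

Both payloads carry the same user ID, which is what lets every downstream tool tie actions and traits back to one person.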
Calvin
And I think having that fairly limited set of APIs that’s also specific when it comes to tying everything back to a user was actually really, really useful in the early days, because it was simple enough for people to understand and create a good conceptual model, but it was comprehensive enough in terms of kind of allowing these different types of data. More recently though, uh, one of my cofounders and I have been working on kind of a redux of this API that’s a little bit more general, because there are certain parts of the data that we can’t really model well. These are things like Zendesk tickets. If someone sends in an email, then do you create an event saying, hey, they sent an email? Do you create like a tag on that user saying they have a ticket open and then you update it later? Or for instance, say they buy a product, do you, like, embed that product object on the user?
Calvin
If they’re like a lead in Salesforce, do you say, oh, this is not exactly the user, but it’s related to the user in kind of a weird way? And after thinking about it for a long time, we kind of realized that there’s sort of two main types of data that analysts seem to care about. There’s event data, where it happens once, it’s atomic, just like Andy was saying; it basically ties some action to the user. They viewed a page, they bought an item. And then what we found that we’re missing is this idea of what we call object data, which instead of being this kind of change log of items that are happening is just the full representation of kind of the current state of the world. Uh, so you’d say, hey, this object is just a Zendesk ticket and right now it’s open or closed, and you could create events around it for being created, it being answered, it being closed, et cetera.
Calvin
But it’s the kind of thing where it’s actually very useful to have a stateful representation as well as a change log related to it. So I think those are kind of the two types of tables that we base most of our analysis off of, and now we’re realizing that we have to change our API to fit with those as well. We want to have the stateful representation, which just contains, oh, okay, here’s all the objects, and then the event log, which contains all of the changes that happened as well. So yeah, I think if you have both of those things and you have a nice way of tying them together, where basically everything is just related via good foreign keys and you’ve written that down, uh, then it’s the kind of thing where it’s actually fairly easy to analyze.
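One way to picture the object table plus event log tied together by foreign keys is the following Python and SQLite sketch. The `users`, `tickets`, and `ticket_events` tables and their sample rows are hypothetical and are not Segment's actual warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;  -- SQLite only enforces foreign keys when asked to

CREATE TABLE users (user_id TEXT PRIMARY KEY, email TEXT);

-- Object data: the current state of each ticket.
CREATE TABLE tickets (
  ticket_id INTEGER PRIMARY KEY,
  user_id   TEXT REFERENCES users(user_id),
  status    TEXT
);

-- Event data: the change log of what happened to each ticket.
CREATE TABLE ticket_events (
  ticket_id INTEGER REFERENCES tickets(ticket_id),
  event     TEXT,
  ts        TEXT
);

INSERT INTO users VALUES ('u1', 'ada@example.com'), ('u2', 'bob@example.com');
INSERT INTO tickets VALUES (1, 'u1', 'closed'), (2, 'u2', 'open');
INSERT INTO ticket_events VALUES
  (1, 'created',  '2016-03-01 09:00:00'),
  (1, 'answered', '2016-03-01 09:30:00'),
  (1, 'closed',   '2016-03-01 10:00:00'),
  (2, 'created',  '2016-03-02 14:00:00');
""")

# Current state and history are joined through the shared keys.
ticket_history = conn.execute("""
    SELECT u.email, t.status, COUNT(e.event) AS changes
    FROM tickets t
    JOIN users u ON u.user_id = t.user_id
    JOIN ticket_events e ON e.ticket_id = t.ticket_id
    GROUP BY t.ticket_id, u.email, t.status
    ORDER BY t.ticket_id
""").fetchall()
```

The `tickets` table answers "what is open right now?" without replaying the log, and `ticket_events` answers "what happened and when?", with the keys making it straightforward to move between the two.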
Ray
Right. That’s pretty interesting. I think, um, a lot of your decisions are driven by the fact that you guys cover a lot of different use cases. Um, so I think it all goes back to what we were talking about earlier, which is a lot of the decision points are really case by case, depending on your type of business, the kind of data that you’re collecting. Um, yeah, uh, there’s definitely a lot to think about.
Calvin
Yeah. Which is definitely why it’s hard, because we’re trying to just appeal to everyone, right? So how do we make it both easy to use and understand for a given business, but then also fairly general?
Ray
Right. Cool. Well, um, we’re actually running a little short on time, um, but we talked a lot about, uh, analytics today, uh, the various different challenges. We talked about what to measure. We talked about how to measure. We talked about the different properties and the different customer types that we’ll be measuring for. We talked about vendor selection, we talked about some of the trends in automated tracking. We also talked about data schema design, and we also talked about what the business, uh, and the business partners look for when they think about analytics instrumentation. Um, so that makes it a pretty fully loaded episode. Um, so let’s wrap up here. Um, Calvin, how would people find you on the web if they want to follow what you’re working on?
Calvin
Yeah, you can follow me on Twitter, and it’s probably best at @calvinfo, uh, or you can go to my personal blog, which I write on occasionally, called calv.info. Uh, it’s calvinfo, yeah, just with the dot in the middle.
Ray
Cool. And Andy, how will people find you on the web?
Andy
Oh, probably like you’d find any good business person, if you went to LinkedIn. Yeah. Cool. Awesome.
Ray
Well, thanks everyone for joining us. If you enjoyed this episode, please rate us on iTunes. It really helps. You can also find Modern Web podcast episodes and meetups on Twitter @modernweb_ and on the web at modern-web.org. We’ll see you soon in the next episode.