April 1, 2020
From startups to big-business: Using functional programming techniques to transform line of

It's my great pleasure to introduce Scott Havens. Scott works at Jet.com, and he's... One of the things I like is that we are starting to bring in people from the industry. It's not just Microsoft talking to you about our technology, which we love; you have industry experts who have been taking this technology to solve real-world problems, so they can tell you how it's done. Scott works at Jet.com and Walmart Labs. He's focused on a bunch of the data systems. Interesting note: Jet.com is one of the largest, if not the largest, employer of functional programmers. If you or your friends want to spend your life coding F#, this is the guy you want to talk to. In 2016 Walmart took a look at this digital transformation and decided that if they were not on that bus, they would be run over by that bus. They concluded that the best and fastest way for them to get on that bus was to acquire Jet.com. So now Scott and his team are taking the techniques they brought to Jet.com and applying them to the world's largest retailer. So let's give a big welcome to Scott Havens.
>> [Applause]
>> Thank you for that introduction. Jet.com is a mass-market e-commerce company, and we've been dealing with scale from our earliest days. In 2016 we were purchased by Walmart. For a few months after the purchase we had started some early integrations of systems, making sure that our catalogs could talk to each other, but never any advanced, full-fledged integration. There are politics at large companies, especially when two companies are being merged together. One of the reasons that Walmart bought Jet is because our tech looked cool, looked transformative. But not everyone goes into that scenario convinced that our techniques are going to provide real benefits. Well, it wasn't long before we were fortunate enough to get the chance to demonstrate these benefits. By fortunate I mean disaster struck. In the middle of March last year, a little bit after 3:30 a.m.,
I got paged. I woke up, hopped on our PagerDuty bridge in case anyone wanted to talk about the problem, and started looking into it. Almost immediately I was joined by coworkers from several other teams. It turned out our production Kafka cluster was down. Kafka for us is the primary method of communication among all of our back-end services. We try to stick with async messages, and Kafka had worked well for us in that role. However, in this case it was not working. Everything came to a dead stop. Before long we realized that it wasn't just down, it was dead. Every single message in flight was gone. Customer orders, replenishment requests, catalog changes, inventory updates, warehouse replenishment notifications, pricing updates: every single one was gone. We were going to have to rebuild the cluster from the ground up. Now, this could have been catastrophic. This could have been the end of the grand Jet experiment, enough to convince our new Walmart compatriots that it sounds good on paper but doesn't work in a real enterprise compared to tried-and-true systems. What happened with that? You'll have to wait until the end to find out. First let's talk about those tenets, those principles that have guided us. What does Jet do differently? There are several practices. When I was first interviewing at Jet, the parts that made me really excited were that they were so big into functional programming, and F# in particular. I had done a lot of functional-style programming in C# at other organizations. I was really excited to work with a language that was functional-first, that made doing the right thing easier. There had been a lot of spikes in early systems, done before I got there, that showed there could be a lot of benefits from adopting F# as our language of choice. Another big thing, one of the early principles I learned joining Jet, was that we treat microservices as stream processors, and we make everything a microservice.
It works really well with the functional paradigm. The third thing, which I had tried at other places but didn't have a real good grasp on the exact right way to do, was event sourcing. I'll talk more about all three of these things later. The benefits, as we'll discuss a little later, include resiliency, the fact that it's really easy to add features, it makes systems really scalable, it makes systems really testable, and it gives you a time machine. That's not me misspeaking. You do get a time machine with this. We'll talk about what that means in a little bit. So now that we've gone through all these concepts, let's talk about them in a real-world system. First, naming systems. In the early days at Jet we were spinning up an e-commerce system, which meant a lot of new systems and teams. The first names were the obvious ones: Batman and Superman. Two years into building systems and teams, a lot of the big names were taken. We were in the early stages of planning out a new system, and I tried to think of the name of a superhero who wasn't popular enough to have been chosen yet but might have some clout a couple of years down the line. So, ladies and gentlemen, for today's case study, allow me to introduce Panther. I couldn't get the media rights, but this is a better option. This is the inventory tracking and reservation management system. On the supply side, it aggregates and tracks inventory not just from Jet- and Walmart-owned warehouses but from partner merchants. On the demand side, it acts as the source of truth for reservations against available inventory. When a customer is checking out, the contents of their cart get reserved. If the inventory is not available at that point, the reservation fails, and either the items must be resourced from a different location or a different merchant, or the customer is given the option to choose a different item. Between those two, our primary goals are to maximize on-site availability while minimizing reject rates due to lack of inventory.
Our secondary goals are to improve the customer experience by reserving inventory earlier in the order pipeline; to enhance insights for the markets and operations teams by providing more historical data and better analytics; and to unify the inventory management responsibilities that are typically spread across multiple systems. Of course, along with all these business goals, our solution had a lot of nonfunctional goals as well, like high availability, geo-replication, and really fast performance backed up by SLAs. Now, geo-replication and performance are a fun problem in this domain, because actions taken by any customer in the world can affect the validity of the data. To get the lowest latency, we want to serve our East Coast customers' data from an East Coast data center and our West Coast customers from a West Coast data center. But we don't want one customer in each region to try to purchase the last baseball sitting in our Kansas City warehouse and both be told they successfully ordered it. Now, a lot of people will recognize this as a fundamental problem of distributed systems. You can get perfect consistency, guaranteeing that the data is exactly the same no matter which node you check, which is really good for making sure you don't lose anything if an entire region falls into the ocean; or you can get good uptime and latency, using all your nodes to make sure the customer can hit the site and it's responsive; but you can't get both. There's a trade-off, a spectrum from consistency to availability. This informs our design choices, as it does in many domains that have people from different geographic regions interacting. Let's jump into how we built Panther and how it solves these problems. Panther uses the same building block we use across our data-intensive systems. We call this the command processor pattern. You may have heard of this elsewhere. We have a stream of commands that come in that are processed by a process-command microservice.
I'll go into more detail on all of these stages in just a second, but as a quick overview: the command microservice will pull in whatever state is relevant and execute the command against that state. This execution will produce an event as output. The event gets written into the appropriate event stream, and then those events are emitted, via Kafka, via Event Hubs, or any other mechanism that you may choose, down to all other downstream systems. The state itself is built up over time with a snapshot service. You don't want to have a stream that goes on forever and try to rebuild the state from scratch every time. So instead you cache the most recent state of the stream, using a service, back into a snapshot data store that can be used in real time with lower latencies. We'll go into the details of all of these in a moment. So first let's dig into the first part. What does it mean to execute a command and produce an event? For us it's a several-stage process. One: you ingest the message. This could be coming in from Kafka, coming in over HTTP, or from some kind of queue. Any would work. We have tried a few different inputs and a few different protocols; we've been using plain JSON, zipped, and are starting to move toward protobuf. We are very heavy users of Kafka for our messages, and of direct HTTP connections when it has to be synchronous. The next step is that we'll deserialize that payload. In F# we deserialize into F# types. This is an example of the kind of data we would see in a command. They are usually not heavy. It has an ID that we use for idempotency; whatever the relevant data is to identify the context, in this case an order ID and a SKU, or item identifier; at which fulfillment node or warehouse we want to reserve it; how much we want to reserve; and then a little bit of metadata: who is doing it and at what time.
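A command payload like the one just described might be modeled as an F# record. This is only an illustrative sketch; the field names are assumptions based on the description above, not Panther's actual schema:

```fsharp
open System

// Hypothetical shape of a reserve-inventory command payload.
// Commands are light: an idempotency ID, the context identifiers,
// the requested quantity, and a little metadata.
type ReserveInventory = {
    CommandId         : Guid            // used for idempotency
    OrderId           : string          // context: which order this belongs to
    Sku               : string          // item identifier
    FulfillmentNodeId : string          // which warehouse to reserve at
    Quantity          : int             // how much to reserve
    RequestedBy       : string          // metadata: who is acting
    Timestamp         : DateTimeOffset  // metadata: when
}
```

Because the record is immutable, a deserialized command can be passed safely through the whole processing pipeline without defensive copying.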
The set of commands, the messages that express intent to change state, are named with imperative verbs followed by a direct object: update inventory, reserve inventory, cancel reservation, or ship order are things that you might see in this domain. Now, in F# we have the really nice property of being able to describe the entire set of commands that could possibly be processed by a particular command processor via discriminated unions. If you're not familiar with discriminated unions: in languages with algebraic data types like F#, you can do a match against any particular case of a discriminated union, kind of like a switch. Some languages call it a choice type or a tagged union; here it's a discriminated union. It works really well for message handling when you have a set of known message types. Here are a couple of examples of what a discriminated union might look like, if you're not familiar with it. For colors you would have three options: red, green, and blue. These cases are untyped, which looks like an enum. But you can add types to your discriminated unions. Instead of just being a list of strings or numbers, the cases can be full types themselves, or tuples of types. So we have an example on the lower half where you might have a union of shapes, where a rectangle is defined by height and width, whereas a circle would be defined just by its radius. This is what it looks like when you have a set of commands in a discriminated union. You have the different cases in the discriminated union, each of which refers to an entire standalone type of its own. I showed you the reserve command earlier. You have update, ship, and cancel. I had tried something like this a couple of different ways with typecasting: I would have a base message class, and then messages would inherit from it; and then I tried a message of type T. In both cases I did typecasting with a lot of switch statements. What makes this model really nice is that the set is exhaustive. You need to handle every single case.
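The examples described above might look like this in F#. The command payload types here are minimal stand-ins for the full standalone types mentioned in the talk:

```fsharp
// An untyped discriminated union: looks like an enum.
type Color =
    | Red
    | Green
    | Blue

// Cases can carry data: a tuple of height and width for a rectangle,
// a single radius for a circle.
type Shape =
    | Rectangle of height: float * width: float
    | Circle of radius: float

// Minimal stand-in payload types for the commands.
type UpdateInventory   = { Sku: string; Quantity: int }
type ReserveInventory  = { Sku: string; Quantity: int }
type ShipOrder         = { OrderId: string }
type CancelReservation = { ReservationId: string }

// The full set of commands a processor can handle, each case wrapping
// a standalone type of its own.
type InventoryCommand =
    | Update of UpdateInventory
    | Reserve of ReserveInventory
    | Ship of ShipOrder
    | Cancel of CancelReservation

// Matching is exhaustive: add a new case to InventoryCommand and the
// compiler flags every match expression that does not handle it.
let describe cmd =
    match cmd with
    | Update u  -> sprintf "set %s to %d" u.Sku u.Quantity
    | Reserve r -> sprintf "reserve %d of %s" r.Quantity r.Sku
    | Ship s    -> sprintf "ship %s" s.OrderId
    | Cancel c  -> sprintf "cancel %s" c.ReservationId
```

The exhaustiveness check is what the base-class and generic-message approaches lacked: a switch over casts compiles happily even when a case is missing.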
If I add a new command to this list and I don't add the code to handle that case in every place an item from the command discriminated union is used, the compiler will tell me, and I know I need to go in and put in the code I need to have there. So once we've deserialized, we move on to step three: retrieving the state from the database. For inventory commands the appropriate identifier is usually the SKU ID plus the fulfillment node ID, or warehouse ID, where it's being stored. So it will retrieve, for that point, the current inventory or available-inventory counts for that SKU at that warehouse. Also, and this will be important in a moment, it retrieves a logical clock for that state: an increasing integer that identifies the last update. That clock is called a sequence number. If there have been a million changes to this inventory, the sequence number would be one million. Raw inventory counts are not the only state in our domain. Some commands, like ship order, expect that a reservation probably exists as well, and will include the order ID to retrieve the matching reservation state in addition to just the SKU availability state. You can get as complex as you want with the amount of state that you need to retrieve. Once you have the command that's describing your intent and the current state that you want to change, you execute the command against the state. Executing the command itself does not change the state. It executes the business logic, which usually consists of validation rules. In some cases, like update inventory, the command is coming straight from the merchant or straight from the warehouse management service. They're the source of truth, so the validation is trivial: you just reset the quantity at that point. But some other commands may have some real validation rules. Reserve inventory would be a great example, but still simple enough to talk about here: is the quantity being reserved less than or equal to the quantity available to sell? Great. In that case it worked.
If not, then it failed. So if executing the command itself doesn't change the state, what does it do? It emits an event as output. An event indicates that something has happened. It usually flips the syntax to the past tense of the verb. Instead of update inventory we have inventory updated; instead of reserve inventory we have inventory reserved, and so on. Failures are events too. If the quantity wasn't enough, instead of getting an inventory-reserved event you might get a reserve-inventory-failed event as output. The idea is that the events cover every single potential domain output from the execution of the command. So with that output event in hand, you commit the event to the event stream. It's only provisional until it's committed. You attempt to write the new event to the stream at the next sequence number. If the write is successful, with no conflicts, then the event at that point has officially happened. If the write attempt fails due to a conflicting write, because another writer wrote an event with that sequence number, then this event is invalidated. We know that the state has changed, or potentially has changed, so we have to re-retrieve the changed state and re-execute the original command. Now, it should be noted that this model of processing a stream of commands lends itself extremely well to a functional style of coding. You're starting with your immutable inputs: your command, which is just an immutable message, and your current state, which is also immutable. You run them through a chain of stateless functions: your decode, your retrieval of state, your execution of the command, your emission of the event, and then you write the new output to your data store. This works great in a functional style. So now we have an event stream stored somewhere. How do we build the current state from that? Well, it turns out this is also really well suited to functional programming. There are three things we have to do. First, we have to define a zero state.
In our domain this is really easy. It's just an unknown SKU at an unknown warehouse with zero on hand. That's what we use when we are starting from scratch in the event stream. Second, we need an apply function that takes a state and an arbitrary event as parameters and produces a new state as output. This is where the meat of the business logic usually is in most domains, and that's the case here. After an inventory-updated event, the on-hand count in the state will be reset to whatever the merchant told us in that event. The count of completed, in other words shipped, reservations that we were keeping around for accounting purposes can be reset to zero. We don't need to worry about those anymore; we have a new count. After an inventory-reserved event, the on-hand count would stay the same, the total reserved would increase by some amount, and thus the available-to-sell would be decremented by that same amount. Finally, we have a fold function that takes the apply function we just defined, a state, and a sequence of events to apply. So we can start from scratch with an arbitrary sequence of events and keep folding in those events until we get to the current state. This is exactly what the snapshot service does. It reads in a batch of just-committed events from the change feed, retrieves the latest snapshot from the snapshot store, applies the entire batch of events, and overwrites the old snapshot with the updated one. That's how it happens in our domain. It's a little tricky because we have multiple ways of snapshotting a stream, the quantities along with the latest state of the individual reservations in the stream, but the idea is the same. Now, while I described doing this after the fact, after the event is written, in the snapshot service, it's important to know you could also do this in the process-command microservice if the snapshots are not up to date, especially if there are a lot of events or commands coming in that are producing a lot of events.
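A minimal sketch of the three pieces just described, a zero state, an apply function, and a fold, using simplified state and event shapes (the real Panther state tracks more detail, such as individual reservations):

```fsharp
// Illustrative event shapes; not the actual Panther event schema.
type InventoryEvent =
    | InventoryUpdated of onHand: int
    | InventoryReserved of quantity: int
    | ReservationCancelled of quantity: int

type InventoryState = { OnHand : int; Reserved : int }

// Zero state: an unknown SKU at an unknown warehouse with nothing on hand.
let zero = { OnHand = 0; Reserved = 0 }

// Apply: (state, event) -> new state. The meat of the business logic.
let apply state event =
    match event with
    // Merchant is the source of truth: reset on-hand; simplification here
    // also resets reservations, which the real system handles more carefully.
    | InventoryUpdated onHand -> { OnHand = onHand; Reserved = 0 }
    | InventoryReserved q     -> { state with Reserved = state.Reserved + q }
    | ReservationCancelled q  -> { state with Reserved = state.Reserved - q }

// Available-to-sell is derived, so reserving implicitly decrements it.
let availableToSell s = s.OnHand - s.Reserved

// Fold an arbitrary sequence of events into the current state.
let rebuild events = Seq.fold apply zero events
```

The snapshot service is essentially `Seq.fold apply latestSnapshot batchOfEvents`, persisted back to the snapshot store.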
Maybe it's not always keeping up with every last event in time for the next one. So the command executor cannot just get the latest snapshot of the state; instead it can get the latest snapshot plus any events written since that snapshot. It has a sequence number. Even if it's at one million, you can query the event store for all events greater than one million, apply the resulting events to the snapshot in memory, and continue on your merry way. We've learned a lot in practice about what makes good event stream design. We've made a lot of mistakes along the way as well. Since event streams are defined as every event that happens to an aggregate over time, in a strictly ordered fashion, this is a constraint that has a lot of real-world practical implications. Different event streams don't always have a global ordering. There's nothing that forces a given event from stream A to have happened before or after any given event from stream B. You may know every single change that's happened to the item in the warehouse over here, and you may know every change from the warehouse over there, but you have no idea which came first. We want to define an aggregate, a stream, as narrowly as possible, because imposing order has a cost, usually in the form of a minimum amount of time it takes per event, which inhibits scaling throughput on a given stream. We started in our particular domain by defining streams as all events that happened to a particular item, because it was convenient. We eventually realized that an item at a given warehouse is more appropriate. We were conflating independent events in the same stream: a reservation on a box of tissues in warehouse A in New Jersey has zero impact on whether we should be permitted to reserve a similar box of tissues in a different warehouse in California. We found that we could then achieve higher maximum throughput by splitting.
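The snapshot-plus-tail-events rehydration just described can be sketched like this. The `readSnapshot` and `readEventsAfter` functions are hypothetical stand-ins for real event store calls, not an actual Panther API:

```fsharp
type Snapshot<'state> = { State : 'state; SequenceNumber : int64 }

// Rebuild current state when the snapshot may lag behind the stream:
// load the snapshot, read any events committed after its sequence
// number, and fold them in before executing the next command.
let rehydrate (readSnapshot : string -> Snapshot<'state>)
              (readEventsAfter : string -> int64 -> 'event list)
              (apply : 'state -> 'event -> 'state)
              (streamId : string) : Snapshot<'state> =
    let snap = readSnapshot streamId
    let tail = readEventsAfter streamId snap.SequenceNumber
    { State = List.fold apply snap.State tail
      SequenceNumber = snap.SequenceNumber + int64 (List.length tail) }
```

Passing the store calls in as functions keeps the core logic stateless and easy to test, in keeping with the pipeline-of-stateless-functions style described earlier.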
If you have your items spread evenly across ten warehouses, then by using ten different SKU-plus-fulfillment-node streams instead of just one SKU stream, we can get ten times the throughput. Now, once you have streams that are independent, it's good if the number of streams can grow and distribute easily. That makes a good fit for storage systems that are partitioned and achieve their scalability through partitioning. Given that this is what the core flow of a data-intensive microservice, or set of microservices, at Jet looks like, how does it fit into a larger architecture? In the middle you'll see that same core flow, with a lot of extra stuff on the outside. That's not a mistake. You'll see this pattern over and over when you look at services at Jet: the central core flow with the command processor and event streams, then emitted as events. What's really different from one architecture to another is the inputs and the outputs. In this case, for our domain, we have both synchronous and asynchronous inputs. The ones on the bottom come in through Kafka. They are coming from external merchants, where there might be some kind of delay, maybe just a few milliseconds, but it works really well in that fashion. On the top side, coming in, we have the customers who are actually on the website checking out as we speak. We want to be able to give them an immediate response every time, so they can know whether that item has actually been reserved or not. We don't want to make them wait any longer than would be expected. So instead of going in through Kafka, they go through HTTP. We have two different processors, but the domain is exactly the same. They both work the same way and write the events to the same data store. Now, on the other end, not everyone listening to the events being emitted from your system may care about every single event that you have.
They might need some kind of transform into whatever format they desire, or they might want to filter it down to just the events they want. You'll see that in the upper right-hand corner. We have the inventory in a raw form that's being emitted out as an event over Kafka, but our different marts, whether we are actually selling an item on Walmart.com or Jet.com, on Hayneedle or any other mart, may not care about the inventory at every warehouse or every single SKU. Hayneedle, which is a site that sells home goods, probably isn't going to be carrying the grocery items that we are also tracking. So we filter it out in that way. It also gives you a chance to put in a buffer or any logic for a downstream consumer. It doesn't change the core processing flow. Now, how do we write the events to a data store? Originally, and some of the systems at Jet still do this, we were using an open-source product called Event Store. It has a great API for streaming events and for reading back a stream of events. It's very clean, but it turned out it does not scale to the performance and geo-replication characteristics that we want. So we ended up with Cosmos DB. We found it by far the best choice for designing an event store compared to a lot of the other competitors we've gotten to look at. We've been lucky enough to work with the Cosmos DB team on feature design to make it an even better event store. So we use it in several different ways. One is for the event stream storage itself. We've written an API very similar to the Event Store API with which we were familiar before. The storage is partitioned by stream ID, and each event is stored as a new document; within the name of that document we have a counter that matches the sequence number mentioned earlier. We take advantage of the consistency models that Cosmos DB offers. The one we use is called consistent prefix consistency.
Without going into technically what that means, one very important guarantee it provides is that you can't accidentally create
