Good morning, and welcome everyone to our short Jupyter Meets the Earth EarthCube workshop. We first would like to thank the EarthCube organizers, in particular Lynne Schreiber, who has been extremely helpful in getting us set up, and I also want to thank the team, in particular Lindsey Heagy, who has done a huge amount of work on the logistics and the material to make this a successful workshop. We're going to try to give you an overview of what this team's project has done, and hopefully we'll have a chance to talk with you folks afterwards.

First, the usual Zoom etiquette that by now everyone is probably used to: please stay muted unless you're speaking so we don't get too much background noise, and remember that we are recording, so if you would rather not be on camera you can leave your camera off. The link right there is for a Slack channel that you should all have been invited to; if for some reason you weren't, you can go to that URL and join. You can post discussion questions there throughout the workshop, and we'll do our best to monitor that channel and respond in the extended Q&A session we have later on.

For those of you who haven't used Slack much, this is what it looks like in the web client. The channel we are using is called "jupyter meets the earth"; you'll see it on the left, and you type your chat messages at the bottom. (Excuse me while I resize this window a little so the Google Slides controls don't cover the speaking area.) If you click on the little push-pin icon, the useful links and resources that get posted will appear on the right-hand side.

So let us tell you a little about what drives this project. Here we are taking very strong cues from Joe Hamman, Ryan Abernathey, and others in the Pangeo project, who used this framing in the design of Pangeo, and we find it a particularly valuable way to look at the problem: think about what drives progress in the sciences in general, and in the geosciences in particular. The geosciences ought, in principle, to be a virtuous feedback cycle between the development of theoretical ideas and models and the observations that test them. In many areas we have ODEs and PDEs that we consider to describe the fundamental physical driving mechanisms of geophysical, atmospheric, fluid, and geological processes, and these are fundamentally mathematical models. We know they are imperfect and approximate and that they leave processes out, but they still form one of the cornerstones of our description of the natural world. We combine them with the observations and data that feed into them, and we incarnate those models in simulations and computational techniques that are becoming increasingly sophisticated. The volumes of data we have are larger and larger; the models are increasingly complex, they are multi-scale, we pay much more attention to nonlinear terms, and we incorporate noisy, stochastic terms.
The size of the observations and the data is becoming absolutely astronomical; that should be news to nobody here. As a quick example, the data set for CMIP6, the Coupled Model Intercomparison Project, in its current iteration is estimated to be on the order of 15 to 30 petabytes. That is a lot of data. Virtually everyone dealing with experimental data today is handling rates on the order of terabytes a day, and data that is very complex as well; it's not just a single sensor anymore, you typically need to integrate data of multiple kinds. To make sense of all of that, you need very complex computational machinery: exascale computing at the national labs and the HPC centers of the world; a lot of folks doing their work in the cloud, which requires its own set of software engineering and DevOps skills; and machine learning, which is now widely used and is basically its own industry with its own engineering complexities.

So this virtuous cycle is getting gummed up, as Joe and Ryan and others framed it in their Pangeo discussions; the gears of the engine are starting to grind. Part of this is simply complexity: there is a huge amount of complexity in integrating all of this. We're well past the stage where a single scientist could understand the theory and the physical models, gather or get access to some data, and run their own codes on their workstation with a little bit of Fortran they wrote on their own. That might have worked in the 70s, or even in the 90s; it is completely infeasible today.

So our framing is: how do we get these gears turning effectively again? If we want to focus on geoscience questions, we need to bring together teams (it's not going to be possible to do this as an individual) with expertise in the domain; expertise in data science, statistical methods, and the protocols of good statistical data analysis; data management, which has effectively become its own complex discipline; data engineering; and a huge amount of software engineering and tool building. The Jupyter Meets the Earth project is, in part, an attempt to bring the software engineering aspect in as a first-class partner to this world.

It was funded by the NSF through a grant proposal that we wrote together with the co-PIs on the team here: Kevin Paul, Joe Hamman, Laurel Larsen, Lindsey Heagy, and myself. It was funded by the EarthCube program, and the idea was to argue for a real partnership between software engineering and the domain sciences. A lot of federally funded projects follow a pattern we're all probably familiar with: either there is a very strong focus on cyberinfrastructure and computer science research with, if I can be slightly provocative, a bit of lip service paid to the notion that the computer science research will engage with some domain topics, or, on the other hand, projects are funded purely on a domain emphasis, with a wink and a nod that somebody, somehow, will write the software.
Instead of either of those approaches, which we feel are somewhat imbalanced, we chose a real partnership between the open source software tools we're building and the domain scientists. This comes in large part from our experience, at least in my case, in how we've built Project Jupyter, a long-running open source software project.

What I want to emphasize for a moment is that a project like this is, at this point, much more than software. Yes, there is a focus on code, and when people think about open source the operative word is obviously software, importantly software that should be extensible and reusable by others. But for a project to have a large impact in a community, and to live beyond the original intent of its first authors, it needs to consider a layered set of questions that go far beyond software. These layers, in a nod to the classic Maslow hierarchy, include services and content that are presented alongside the software; standards and protocols, ways to interoperate and build an ecosystem; and a human community that needs to be managed. So I want to spend a couple of minutes highlighting some of these points from the perspective of Project Jupyter.

Some of you may know Jupyter, or its predecessor IPython, which began life as a simple interactive environment for experimenting with Python code in data analysis and scientific computing workflows. The Jupyter Notebook, originally named the IPython Notebook, which many of you have probably used, is an environment that lets you combine text, code, and the results of code execution, accessible through a web browser. But what has really brought Jupyter to a very large scientific community is that, on top of the software you can download, the project provides content and services that people use. Binder is a service that lets you turn any properly prepared Git repository into a collection of live, interactive notebooks with one click. nbviewer is a tool that lets you take a publicly available notebook and share a web view of it with others, say a colleague who doesn't have these tools installed, or on social media, or as part of a website. JupyterHub lets you put all of these tools, accessible through a web browser, on shared infrastructure, whether that's a supercomputing center, a research cluster, or a cloud computing environment. The point is that these are not things you download; they are services you can run, and many of them you can access for free. They have created a far larger impact than the original software we wrote, larger than we had imagined at the beginning, because we realized that the entry point for many people is actually the content: what they want to read, access, and share with others, rather than the tools themselves, as perhaps we tool makers tend to think.

And below the software sit core ideas.
In the case of Jupyter, those ideas are fundamentally a computing protocol for doing this kind of interactive work: yes, in Python, which is what most of us use in our research, but also in languages like Julia and R, and ultimately in virtually any programming language. Today there are implementations of what is called the Jupyter protocol in over a hundred different programming languages, including things like C++. My point is that in Jupyter we took the time not only to write the software but to step back and ask: what is the software actually doing, and what of this can we abstract and formalize into a standard that others can use for their own purposes, even if they don't use Python, our tool of choice? Taking the time and effort to engage with other communities to create that standard has been enormously valuable, because now there is much greater interoperability across programming languages, regardless of which one you use, and you can share content and tools back and forth across language communities.

Finally, as I mentioned, at the bottom of all of these tools are humans, and if you think about the role of open source software in science, it's very important to think about these issues. Jupyter is a project where we've spent a lot of time on governance, and we are currently trying to finalize a restructuring of it. Even though it began with IPython as me procrastinating on my PhD, today it is a huge project with many, many contributors. We have a formalized governance model that includes a body called the Steering Council, a formal process for institutional partners, and a 501(c)(3) called NumFOCUS that provides fiscal sponsorship for the project. This takes an enormous amount of time, it's mostly not recognized as real work in academic settings, and yet it is critical to maintaining a healthy community that can grow, engage with different stakeholders, become more diverse, and take on different use cases. It's effort we're not well trained for and typically not recognized for, but it is necessary in the construction of a high-impact, long-lived scientific computing project built around open source software.

And yes, we do build tools and software. JupyterLab is one tool that has been at the forefront of the work in Jupyter over the last few years, and I want to spend a couple of minutes flagging its extensibility, because that is part of what we want to engage this community about. JupyterLab is an evolution of the Jupyter Notebook interface that goes well beyond notebooks, to consider, for example, the fact that data should be a first-class citizen. This is the JupyterLab interface viewing not a notebook but different data files: an image file, a CSV file. Importantly, extensibility means that if we have, say, a JSON file that actually encodes geospatial data, in this case the locations of museums in Washington, DC, and the file honors the GeoJSON schema, then a plugin can visualize it as a live, pannable map using Leaflet. The point is that in JupyterLab the community can write its own plugins that treat data as a first-class citizen in whatever way is most appropriate to a particular data format or modality.
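As a small, concrete illustration of that idea (a sketch, not part of the original demo): the IPython display machinery can emit GeoJSON directly, and JupyterLab's GeoJSON renderer extension, when installed, draws it on a Leaflet map. The coordinates and name below are made up.

```python
from IPython.display import GeoJSON

# A single made-up point feature; with the jupyterlab GeoJSON renderer installed,
# JupyterLab shows this as an interactive Leaflet map instead of raw JSON text.
GeoJSON({
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-77.02, 38.89]},
    "properties": {"name": "A museum (hypothetical)"},
})
```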
This kind of extensibility has been taken advantage of very successfully by a community of neuroscientists at Columbia University, in the group led by Aurel Lazar, and I want to show this extremely short video; let's see if this works. You'll see the JupyterLab interface with a notebook on the left, but also a WebGL 3D view of the fruit fly brain. Neuronal circuits in the fruit fly are simulated as electrical circuits, and those are viewed in the lower right-hand panel; the circuit simulations run on a GPU cluster. Genomics data about the fruit fly, served from a custom database, is visible in the panel on the right. This is still the same JupyterLab that you install, and you can see the notebook right in the middle, but that team built a custom interface, with their own plugins and tools, that turns the generic scientific JupyterLab interface into a custom environment for studying the data that matters most to them. We want to use this as inspiration: we're building some of these things for our own use cases, and we want all of you to think of this as a tool you can extend and mold to your own scientific needs.

Pangeo (I'm realizing I'm already running late, so I'll finish quickly) is the other leg of this project, which is a collaboration between the Jupyter team and the Pangeo team. Pangeo is the union of Jupyter, for interactive computing, with Dask, a high-level system for distributed computing, along with xarray, a numerical array project that takes something like traditional NumPy arrays and blends them with the netCDF data model, building a platform that scientists can use interactively for very large-scale data analysis. Here's a quick example from a blog post by Scott Henderson: a scientist zooms into what looks like a small figure, some little color bars wiggle at the bottom, and the image zooms. That seems like no big deal, but the zoom requires churning through over 100 gigabytes of Landsat data covering the state of Washington on a big distributed computing cluster. By doing this on Pangeo, the scientist can just log in with a browser and zoom, and Pangeo orchestrates the cluster and schedules the jobs automatically, letting the scientist focus on data exploration rather than effectively becoming an Amazon or Google cloud software engineer. So Pangeo, by joining these tools, tries to make interactive data analysis in the cloud, on analysis-ready data, a reality. We've left links here to two talks by Ryan Abernathey and Joe Hamman that tell you a lot more about Pangeo; we could spend the whole day talking only about Pangeo and the impact it has had, and Joe will talk a little more later about how you can get involved.

So in this project, as I said earlier, our perspective is to join research use cases in four specific areas, climate data analysis (specifically the CMIP6 data), cryosphere science, hydrology, and geophysics, to drive developments in the Pangeo and Jupyter ecosystems, especially around interactivity, data discovery, and infrastructure, both in the cloud and on HPC.
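To make the zooming example a bit more concrete: the pattern underneath it is that xarray plus Dask keep computations lazy, so only the pixels you actually look at ever get computed. Here is a minimal sketch with synthetic data standing in for the real Landsat archive; the sizes, variable names, and chunking are made up.

```python
import dask.array as da
import pandas as pd
import xarray as xr

# Roughly 100 GB of synthetic imagery, chunked but never loaded into memory.
data = da.random.random((3650, 2000, 2000), chunks=(100, 500, 500))
ds = xr.Dataset(
    {"reflectance": (("time", "y", "x"), data)},
    coords={"time": pd.date_range("2010-01-01", periods=3650, freq="D")},
)

monthly = ds["reflectance"].resample(time="1M").mean()   # still lazy: no data read yet

# Only the small window being "zoomed into" is ever computed; a Dask cluster
# (local here, distributed on Pangeo) does the work behind the scenes.
window = monthly.isel(time=0, y=slice(0, 200), x=slice(0, 200)).compute()
print(window.shape)
```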
This is the team of scientists and developers working on the project. What we hope to do today is give you an overview of these projects, present avenues for you to get involved, and leave time to discuss with you on Slack and in the Q&A session. We hope to stay engaged so that we understand your needs and questions better and get better ideas from all of you, so that this project can be as impactful as possible; these four areas are just our team's scientific focus, but our intent is a much broader impact.

I'm already five minutes late, but we have a little slack in the schedule, so I'll end quickly. We'll continue with Scott Henderson's overview of Pangeo, then a talk by Kevin Paul; we'll take a short break at 8:30, then a sequence of five-minute lightning talks, and we'll conclude with a longer session of community Q&A. We'll probably lose a few minutes of that, but we knew that would be the case. To remind those of you who came in a little later: if you're not on Slack, this is the URL to join, ask questions, and so on. I'll stop here, stop my sharing, and we'll continue with Scott Henderson right away so he can hop on.

All right, just a second everyone. Good morning, can people see my screen now? (Yes, looks good.) Okay, great, I'm going to jump right in to keep us on schedule. I'm Scott Henderson, a research scientist at the University of Washington; I work at the eScience Institute and in the Department of Earth and Space Sciences. In the next ten minutes I'm going to give you a bit more detail, jumping off what Fernando mentioned, about the Pangeo project, which I've been involved in for the last year or so.

If you're new to Pangeo, I highly recommend going to the website, pangeo.io. This screenshot is taken from the website, and you'll see straight off the bat that Pangeo is a community platform for big data geoscience. If you read further down, the first statement, which I've copied verbatim here, is that Pangeo is first and foremost a community promoting open, reproducible, and scalable science. I really like that Maslow-style diagram that Fernando showed, where community sits at the bottom as the foundation for all the other work happening in a project. There are a lot of venues, which we'll talk about later this afternoon, for getting involved in this community; I've pointed arrows here at some of the online forums that are central to bringing together scientists and software developers as part of this community effort.

Pangeo got started around early 2017 with EarthCube funding, and since then it has grown to involve a lot of other funding sources and to bring in scientists and software engineers from a bunch of different institutions. I got started working with this community through a NASA grant focused on developing capabilities for analyzing NASA Earth observation data, which is starting to move to AWS for hosting.

Why is this effort happening now? As already stated, with growing archives and new technologies there is a need for better, scalable tools for doing scientific computing with large data sets.
In the case of NASA, you can see the plot in the lower left showing the growth of NASA's archive over the years, with a pretty big step change in archive size due to new satellites launching in the near future, one of them being NISAR, now postponed a bit but slated for around 2021. On the right you see the size of CMIP global climate model outputs. These data sets are becoming cumbersome to work with, and the agencies hosting them are starting to indicate the need to host them on central servers, potentially cloud providers, for improved access.

This move to cloud servers (I'll focus on the NASA case for a minute) is a big deal because it changes a lot of the typical workflow we're accustomed to in scientific computing. The schematic in the lower left is a redrawing of the one we saw from Fernando, illustrating the envisioned architecture that the platform, or computing, side of Pangeo advocates for. The platform is really centered on a JupyterHub system: a server running in the same data center where the large data sets are stored. In this case that's a commercial cloud, but it could also be an HPC system. The idea is to give people interactive access to these large data sets without having to download them to their local computers; if we instead move algorithms to the data, we improve on the current state of the art.

On the right I've listed a few key aspects of this style of computing. Some benefits: we have instant access; in the commercial cloud we often don't have to deal with queues, and we can fire up as many computers as we want, on demand. We also democratize access, because people only need a web browser on a personal laptop to engage with these larger computing resources. Downloading, which is often a bottleneck for scientific workflows these days, is avoided. We have scalable computing power: we can plug GPUs into our workflow when we need them, and when we don't we're not using them. And by packaging everything up to run on these heterogeneous systems, including commercial cloud providers, we tend to improve the reproducibility of workflows, because data sets are accessible over networks and all the software that analyzes them is containerized.

That's the vision. There are also real concerns around it, and I'm listing these points to spur some discussion later in the day. The cost model is unfamiliar to many scientists; people aren't used to cloud-based infrastructure, and there's a steep learning curve if you set up that infrastructure yourself. There's concern over commercial management of public data; these are things we've heard over the last year from scientists getting started in this project. And there's potential vendor lock-in: if you develop infrastructure that only runs on AWS, it's not very portable. But the advantages are really big. I've taken a slide here from Chelle Gentemann's keynote at the ESIP meeting that happened just last week; there's a link to it here, it's recorded, and I highly recommend taking some time later today to look at some of the recordings from that meeting, which was really great. The ultimate goal of this rethinking of infrastructure is to reallocate the time of scientists.
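To make the "as many computers as you want, on demand" point concrete, here is roughly what requesting and releasing a cluster looks like from a notebook on a Pangeo-style cloud hub. This is a sketch: deployments differ, and some hubs use dask-kubernetes rather than dask-gateway.

```python
from dask_gateway import Gateway

gateway = Gateway()                    # talk to the hub's Dask gateway service
cluster = gateway.new_cluster()        # ask for a brand-new cluster
cluster.adapt(minimum=0, maximum=20)   # scale workers up and down with the workload
client = cluster.get_client()          # point Dask (and therefore xarray) at it

# ... run the analysis ...

cluster.close()                        # release the machines; you stop paying for them
```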
The traditional timeline at the top is the familiar one: 80 percent of your time goes to figuring out where your data lives and organizing it, and very little time at the end of the day goes into actually writing your paper. We really feel that with this cloud-based approach you can flip those time allocations.

The computing architecture, again, we saw from Fernando: the idea is to use Jupyter and to give people curated Python computing environments that facilitate distributed computing. So far the Pangeo project has focused heavily on Python, in particular on the libraries Dask and xarray, for working with large climate model data sets or large cubes of satellite imagery. Foundational to this architecture is storing data in an appropriate format: we've been advocating a lot for the Zarr format and for cloud-optimized GeoTIFF, but the key point is that data needs to be stored in some tiled fashion to facilitate distributed computing. A lot of the work we've done so far has borrowed from Jupyter's guidelines for setting up JupyterHub on a Kubernetes system for cloud providers, which lets us run things on Google Cloud, AWS, Azure, and other systems.

I'm going to skip over this slide; these are just some of the libraries we typically highlight in presentations. For this community I want to draw attention to the fact that the Pangeo "platform" is really a collection of platforms supported through cloud credits from providers like Google and Amazon. These are JupyterHubs that people can simply log into with a GitHub username and immediately have access to the Dask configuration for distributed computing. Again, as Fernando mentioned, this is the services component of what we're doing, and it has been hugely important for getting people up to speed on these software stacks and this style of computing. We have a JupyterHub running on AWS and one running on Google, and we also have Binders set up on both systems, attached to Dask clusters for distributed computing. The best way to get familiar with how this works is to go to gallery.pangeo.io, where you can try interactive examples: there's a tutorial for getting started and then more sophisticated workflows of increasing complexity.

The other thing I want to touch on in this presentation is the role of hack weeks. At the University of Washington eScience Institute we've been running week-long hack weeks for the past several years, and they are really important for community training and for getting people up to speed on these Python tools. Hack weeks are a welcoming environment designed to facilitate building a research community; they're intentional, well-designed projects that get people developing software and contributing to open source while creating connections in their research community. A typical hack week includes community-building activities over the course of the week, hands-on tutorials (run on one of the Jupyter deployments we've put together), and project time to advance some research project. Just recently we hosted the ICESat-2 hack week, focused on the NASA satellite that recently launched, as our first 100% virtual event.
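Tying the format point above to the hubs: once a data set lives as Zarr in object storage next to the computing, opening it is a one-liner, and only metadata is read up front. A minimal sketch with a made-up bucket name, assuming gcsfs/fsspec are installed alongside xarray:

```python
import fsspec
import xarray as xr

# Hypothetical analysis-ready Zarr store sitting in a public cloud bucket.
mapper = fsspec.get_mapper("gs://hypothetical-bucket/landsat-cube.zarr", token="anon")
ds = xr.open_zarr(mapper, consolidated=True)   # reads consolidated metadata only
print(ds)                                      # full data set described; nothing downloaded yet
```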
We had over 80 participants for this event; you'll see some familiar faces in the screenshot in the lower right. It was very successful, and an important part of that success was having a centralized JupyterHub environment that participants from all over the world could log into to share documents and work together. I'll point out that we've put a lot of effort over the last year into making these JupyterHubs deployable, so that other research groups and people putting on hack weeks can do this themselves. There's a schematic here of what the website looked like when people logged into this environment and of how JupyterHub partitions machines among people as they log in.

I included this slide because part of the beauty of this project is that everything is out in the open from the ground up. There is a certain level of complexity in deploying these services on cloud providers and then keeping track of them, but we've been trying to lay it all out so other groups can do it, and I'd love for people who are interested, and I suspect some are on this call, to get involved in deploying these systems yourselves. This is a nice blog post by Sebastian Alvis, one of the co-Is on the project at the University of Washington, about setting up this infrastructure for the ICESat-2 hack week.

I'll end with a couple of questions to spur thoughts and discussion later in the day, reflecting on the last year. We recently had to give a report at a NASA technology infusion workshop and were asked what the biggest challenge of this project has been so far. I put down two things. One is availability of data: this whole push to move things to the cloud hinges on data format and on the availability of data in the cloud. The other is wariness over long-term costs: who supports these services when they're no longer covered by cloud credits? As for overcoming these challenges, we think hack weeks are key to getting the community accustomed to cloud computing, but the long-term funding and support of these systems is still an unsolved problem. So thanks everyone, those are my slides; please check out the links in there in your spare time later today.

Thank you so much, Scott, for that presentation. We have a few questions coming through on Slack, and we encourage all of you to post your questions there; we'll have a brief Q&A session right before the lightning talks. Next we'll continue with Kevin Paul from NCAR. (Scott, thank you. I think I need to be made a co-host. You should be good to go now. Okay, thank you very much. Where am I? There we go.)

All right, thank you everybody, and thanks for the wonderful introductions by Fernando and Scott. I think everybody is now fairly familiar with what Pangeo is and with the fact that Pangeo has had a lot of success being deployed in the cloud. It has also been mentioned that Pangeo can be deployed on high-performance computing systems, such as NERSC and our supercomputing system at the National Center for Atmospheric Research, where I work with my colleague Anderson Banihirwe, who will be speaking in one of the lightning talks a little later.
Scott has already given a bit of an introduction to this: deploying Pangeo in the cloud has a lot of advantages, but you can also deploy it on HPC, and there are obvious differences in what that deployment looks like, stemming mostly from the fact that HPC has a number of limitations that don't necessarily exist in the cloud. Access is limited: it's very difficult to just spin up a server on an HPC system and convince your sysadmins to make it public to the internet. HPC is usually bare metal, so there's no virtualization, and usually not even containerization, although that's changing. HPC also limits resource access; this is connected to the bare-metal point, but it means the standard user simply doesn't have sysadmin access, so you can't use things like Docker, on which a lot of the Pangeo cloud stack is built. As a result, since you're not going to use Docker or Kubernetes or anything like that, HPC relies on job schedulers, which many people are familiar with, such as PBS, LSF, or Slurm, as the common way of launching large jobs and sharing resources.

But the goal of Pangeo on HPC is exactly the same as for Pangeo in the cloud, and it's centered on the same idea of a common user interface, which is Jupyter: an architecture involving JupyterHub, which spawns JupyterLab for you and provides access to a canonical software stack via custom kernels. The goal is to make the user experience the same; ideally, the user wouldn't even know whether they're running on an HPC platform or in the cloud. We're not quite there yet. There are differences, usually having to do with authentication; sometimes there are functionality differences, since some supercomputers don't provide direct internet access from the compute nodes, which can limit what you can do; and there are differences in the software stack by necessity. For example, as one of Scott's slides already pointed out, there's a difference between dask-jobqueue and dask-kubernetes, which is just how you launch the Dask cluster that provides your parallelism in an HPC environment versus a Kubernetes-based cloud environment.

Instead of going through a bunch of bullet points, I think it's useful to try a demo; I say "try" because I hope this works. I'm going to take you to NCAR's JupyterHub. We have a selector page that lets you choose which of our systems you want to run on, the Cheyenne supercomputer or our data analysis and visualization cluster; I'll do this on Cheyenne. There's an authentication page, because we can't just open things up to the internet, and it requires Duo, so there's a little delay. Then, as is usual in HPC, you have to select exactly what resources you want; this actually becomes a job submitted to the PBS scheduler. All I have to do is fill that in; you'll notice the rest of the defaults just request resources to run one process, which is all I need for JupyterLab. So now I've submitted a job to the queue, and it takes a moment to spawn my JupyterLab session; usually it doesn't take long. This is where everything could have fallen apart, but it didn't.
So this is what you land in, and in a lot of ways it's very close to the default JupyterLab, similar to what you'd see if you launched JupyterLab on your laptop. I land in my home directory on the supercomputer, and I've got a notebook here that is a good demo of a fairly large data set: a 100-member ensemble containing precipitation and temperature data. My thanks to Anderson Banihirwe for putting it together. It has the usual boilerplate where you do some imports and a little setup up front, so it takes a moment.

The next step is creating and launching a Dask cluster. Remember, I asked for resources for one node and one process, but now I'm going to use dask-jobqueue to request, from the PBS job scheduler, the resources to launch a bunch of Dask workers; that's what dask-jobqueue does for you. I launch this and you can see I've requested 72 workers. There are 36 cores on each Cheyenne compute node, so I'm basically asking for two compute nodes, and it takes a little while for each of those jobs to make it through the queue, but you can see it happened, and I even get a link to the Dask dashboard. We have the Dask JupyterLab extension installed, so I can see things like the Dask progress window (empty right now, since nothing is happening) and memory use, and I can spread these panels out to give myself a little more real estate to show you what I'm doing. You can even ask dask-jobqueue to show you how it requested the workers from the job scheduler. Then I connect a client to this cluster and start looking at my data.

So it didn't take long to get to the point where I'm looking at real data. This is a Zarr store held on NCAR's storage platform, GLADE; it's about 1.7 terabytes. You'll notice it didn't take long to open, because all it was doing was reading metadata. You'll also notice that it's in Zarr format, which, as Scott mentioned, is ideal for the cloud but also turns out to be really nice in HPC environments with a parallel file system. I read this into an xarray Dataset, and xarray prints information about the data set in an easy-to-read format: you can see it has precipitation and mean-temperature variables, a 100-member ensemble, and a lot of time steps. You can look at information about one of those variables, for example the mean temperature, and you can see that Dask has distributed the data across all of the workers, chunked one chunk per ensemble member and 366 time steps per chunk.

Now I can start using xarray to do some actual computations. Note that each of these variables is about half a terabyte. This first computation sub-selects a particular time step and does some computation on it, so the amount of data involved is actually much smaller, and all the operations are lazy until I actually do something like compute or plot. When I do, you can see the windows up here start working, and I end up with a plot of the standard deviation.
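For reference, the pattern Kevin is walking through looks roughly like the sketch below. The queue, memory request, store path, and variable and dimension names are illustrative, not copied from the actual notebook.

```python
import xarray as xr
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Ask PBS for Dask workers; numbers mirror the demo (36-core Cheyenne nodes, 72 workers).
cluster = PBSCluster(cores=36, processes=36, memory="100GB",
                     queue="regular", walltime="01:00:00")
cluster.scale(72)               # dask-jobqueue submits the batch jobs for us
client = Client(cluster)        # connect this notebook session to the workers

# Hypothetical path; the demo used a ~1.7 TB Zarr store on NCAR's GLADE file system.
ds = xr.open_zarr("/glade/scratch/someone/ensemble.zarr")   # metadata only, near-instant

# Lazy until .compute(): time-mean of temperature, then spread across ensemble members.
spread = ds["t_mean"].mean(dim="time").std(dim="member")
result = spread.compute()       # the 72 workers do the reads and the arithmetic
result.plot()                   # a (lat, lon) map of the ensemble spread
```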
The result looks pretty reasonable, and it didn't take very long, partly because we selected only one of the 13,500 time steps. We can do something more significant and use the entire data set: I'm going to take the mean temperature, average it across time, and then compute the spread across the ensemble members, so I should get back something that depends only on latitude and longitude. Again, this is a lazy computation, so nothing happens until I tell it to, and then you can see, over here on the right, that it moves along fairly quickly: it's doing a lot of Zarr reads and computing the mean in time. I won't dwell on this, but it gives you an idea of how long it takes to process half a terabyte with 72 workers, two nodes of our supercomputing cluster; that's 72 cores out of about 150,000 in the machine. And it's done, and I can immediately get that data and plot it. That's all I'm going to show; I'll go back to my talk now.

We are not the only players in this game; a lot of HPC centers are devoting effort to Jupyter. This is a slide from colleagues at NERSC, Shreyas Cholia and Rollin Thomas, who have done some amazing work with Jupyter on the NERSC systems. The central image shows essentially the chooser page you saw at the very beginning of my demo, but in their case it is customized to each user after authentication and only lets you choose things you actually have access to, which is pretty cool. The top right of the slide shows an nbviewer deployment they run locally against their own disks, so that NERSC users can share a statically rendered view of a notebook with colleagues, who can then download that notebook, along with the environment needed to run it, into their own home space at NERSC. That's just a handful of the things the NERSC folks have done; you can see a laundry list of them down here. I'm quite jealous, and I'd like to get a lot of this implemented at NCAR as well.

What's next? There are a couple of things we can't yet do on HPC that you can do in the cloud, such as Binder; we're working on that. We're also hoping to build up data discovery capabilities using existing data search services at NCAR, linked in through a JupyterLab extension. Those are a couple of things to think about. Thanks to Anderson for help with the demo, thanks to the folks at NERSC, and thanks to the NSF.

Thank you so much, Kevin, for that presentation, and Anderson for the materials. We have a couple of questions, and one is directed to you, Kevin: could you briefly show again, if you still have it open, how the data is loaded with xarray, and walk through that part of the notebook? Sure. I obviously couldn't do a lot here, I had very little time, but this is the notebook where I actually loaded the data. Here, xr is just xarray.
It has a function called open_zarr, which lets me open a Zarr data store; this one is stored on my colleague's scratch space on our GLADE storage system, which is a parallel file system, GPFS. As you can see, it's about 1.7 terabytes, and just as with Zarr in the cloud, when you open the data set you're not actually reading any data except coordinates and metadata, which is why it took less than a second to open. xarray then loads the whole thing into an xarray Dataset and lets you view information and metadata about each variable, so I can get information about t_mean directly from this interface, which is fantastic, or about precipitation. This tells you how it has been chunked by Dask behind the scenes, so that when you do start reading, you're reading in parallel. Everything after this cell was lazy computation or sub-selection with xarray. I hope that answers the question.

Sure. Several people also asked whether this notebook could be made publicly available; is that something you could post on GitHub, or that we can post as part of the materials when we wrap up? I think so. It only works on GLADE, because this store is only available on GLADE, but at the top of the notebook there's a version that is freely available in the cloud via a Binder, and I can post that link for everybody.

Thank you, Kevin. Another question that came up from more than one person, and this might be one for Scott and Joe as well: the type of resources you showed could be cost-prohibitive, so is this something that is only accessible to universities and government entities? I was just typing a quick response in the Slack channel, but I would say no. The cost scales with your number of users, and one good aspect of the cloud deployments now, after a year or so of iterating, is that the baseline infrastructure costs are actually quite minimal; it depends on how many users you have, because everything scales dynamically. Joe, did you want to add anything? No, I think that's basically it. The nice thing about the cloud is that you only pay for what you use, so you can scale a deployment up for a specific task, a workshop, or whatever computation you're running, and then scale it back down. So there's actually a pretty good counter-argument to the claim that the cloud costs more, which is that you don't pay for idle time on your machines.

I would second, or perhaps third, that the whole platform is fairly easy to deploy and that costs do scale in the cloud. The biggest cost for small groups, or at least a significant chunk of it, would be storing data in the cloud. But there are efforts by institutions that actually own much of the data, such as NCAR and NASA, to host the data in the cloud themselves, so getting access to that data and using it in the cloud is usually very low cost, if not free.
So hopefully that aspect of the costs is not something most of our users will have to bear. Ryan says he has a comment: I just want to make a very low-key announcement that we recently received notice that we'll be getting some new EarthCube funding to provide data hosting for the EarthCube community. So in terms of who will pay for the cost of hosting data, I'm happy to say that going forward we have a strategy to provide that to the community; our funding includes the hosting costs, and it also includes working with alternative storage providers like Wasabi and the Open Storage Network, which can provide cloud-style storage at a somewhat lower price point than, say, Amazon. Excellent news, and congratulations on the funding.

That dovetails with another question, probably for Scott and Joe, who are well plugged into Pangeo and NASA data availability: does NASA have any plans to make data available in particular formats as part of these arrangements? So, NASA is in a multi-year transition to hosting some data sets on AWS, and no single storage format has been identified as the go-to format; everything is on the table right now, but HDF, Zarr, and GeoTIFF are definitely formats you'll be seeing NASA data sets in.

Thank you. This one is perhaps back to Kevin and others closer to the HPC side: is chunk size important on HPC when using Zarr, and can you comment on how chunking decisions are made? Anderson, do you want to field that one? Sure. In this particular case the original data set was in netCDF format, and I chose the chunking by looking at what I was going to do with the data, so yes, it matters. But that in itself can cause problems: at another time you may want to do an operation along dimensions that have been chunked in a way you don't want. There's a new package from the Pangeo project called rechunker; Ryan can speak to it better than I can, but it's meant to address exactly this issue, where today you have to spend so much time thinking about how to chunk your data, and even then there's no guarantee things will work smoothly, because at some point you may have to re-chunk, and re-chunking can be quite expensive.

Well, I'll just say that this is fundamental to large multi-dimensional array analysis; there's not really any way around it, as long as there's some correspondence between the physical proximity of data on disk and read performance. We've always had chunked data in one form or another, whether it was spread over many netCDF files or something else; Zarr just makes the chunking totally explicit, which is probably good for thinking about your workflow. But it's definitely the case that some workflows will be optimized for a certain chunk structure while others will fail hard on that same chunk structure. So rather than saying there's one universal chunking scheme, we've moved toward the view that you should be able to rapidly, even temporarily, rechunk your data into a scheme that fits your analysis best, and the rechunker package is a tool that tries to implement that.
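For readers who haven't seen it, the rechunker usage pattern looks roughly like this. The store paths, dimension order, and chunk sizes below are hypothetical; rechunker writes a new copy of the data with the target chunking, using a temporary store and a bounded memory footprint, and runs the plan as a Dask graph.

```python
import zarr
from rechunker import rechunk

# Hypothetical source: a 4-D array chunked one ensemble member at a time.
source = zarr.open("ensemble.zarr")["t_mean"]   # dims assumed (member, time, lat, lon)

plan = rechunk(
    source,
    target_chunks=(1, 13500, 90, 90),   # full time series per chunk, small spatial tiles
    max_mem="2GB",                      # per-worker memory budget for the shuffle
    target_store="t_mean_time_chunked.zarr",
    temp_store="t_mean_tmp.zarr",
)
plan.execute()                          # runs on whatever Dask cluster is attached
```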
Thanks, Ryan. To keep on schedule, Anderson can pick up the mic again and take us into the lightning talk section, where he'll open with a talk on intake. (All right, give me a second. Can you see my screen? Yes, looks good. Should I start, or should I wait? Go for it.)

Hi everyone, and thank you for the opportunity to speak. My name is Anderson Banihirwe, I work as a software engineer at NCAR, and today I'm going to talk about intake and intake-esm. I realize I only have five minutes, so I don't have any slides prepared; I'll show you a live demo of what these two tools do. intake is a Python data cataloging and data discovery tool, and intake-esm is an intake plugin designed specifically for Earth system model outputs.

Let's switch gears and look at the demo. I have a notebook with two use cases. The first is about the sea surface temperature data provided by NOAA. If you look at where this data comes from, at least the way it's hosted on the NOAA website, it's structured as a bunch of directories by year and month; as you can see, that's quite a lot of directories, going all the way back to 1981, and if you want a specific file you have to go into a specific directory and pick it out, and in most cases you may have to download the data rather than just access it. intake lets us bypass a few of these pain points. We can define a catalog, in this case a YAML file where I define a few things: I have these parameters here, for example a year parameter that I could change to 1981.

So I know the time range for the data, and I have this argument here, the path, which as you can see is a pattern describing how the data is laid out on the server. What I'm saying is: take whatever parameters the user provides, fill them into this pattern, and retrieve the data. In this case I'm using a new feature in netCDF that lets you request a netCDF file over HTTP directly, which is what this is doing, because I didn't want to go through the OPeNDAP server. One more thing: I'm also telling intake to cache the data locally on first access, since the request may be expensive. Once that's set up, I can just say "give me the data corresponding to this year, this month, this day," and what I get back is an xarray Dataset.

That by itself isn't very exciting, so we can also tell intake to retrieve a bunch of those files. Here I define a range covering an entire year, and I use Dask to retrieve those files in parallel, caching them as well, and what I get back is again a single xarray Dataset. Once you have that Dataset you can do interesting things with it: interactive visualization, or building dashboards, because the only thing you really have to provide is these parameters, so this is something you can easily turn into a dashboard.

That's enough about intake; now let's talk about intake-esm. intake doesn't force you to keep your catalog as a YAML file: you can define your own catalog format and build a plugin on top of intake, which is what we did for Earth system model outputs, because those tend to be huge, so YAML isn't really the right tool, and the hierarchy of how things are structured is different. The use case here is CMIP, which, as most of you probably know, is a large international effort spanning many countries and institutions, and NCAR hosts a subset of that data. In this case my catalog is a JSON file, and the JSON contains a pointer to a CSV file holding a table in which each row corresponds to a single netCDF file and the metadata associated with it. When I read it, the CSV goes into a DataFrame, and as you can see there are close to 1.7 million entries. With intake-esm I can then query that catalog: here I'm asking for this variable, only for this particular experiment ID, this time frequency, and a few other constraints, and what I get back is the same kind of object but holding only the matching subset. Once I'm satisfied with the query, I tell intake-esm to load the data into xarray objects, and what I get is a dictionary of data sets: intake-esm takes the 78 matching files and groups them into compatible groups, so we end up with only six xarray Datasets even though there were 78 netCDF files.
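Here is a condensed sketch of the two catalog styles Anderson is describing. Everything in it is hypothetical (the URLs, file names, and query values), and it assumes the intake-xarray and intake-esm plugins are installed; it's meant only to show the shape of the API, not the actual catalogs used in the demo.

```python
import intake

# Plain intake: a YAML catalog with a parameterised urlpath (hypothetical server layout).
catalog_yaml = """
sources:
  sst:
    description: Daily sea surface temperature, one file per year (hypothetical layout)
    driver: netcdf
    parameters:
      year:
        description: four-digit year
        type: int
        default: 2019
    args:
      urlpath: "https://data.example.org/sst/{{ year }}/sst.day.mean.{{ year }}.nc"
"""
with open("sst_catalog.yaml", "w") as f:
    f.write(catalog_yaml)

cat = intake.open_catalog("sst_catalog.yaml")
ds = cat.sst(year=2015).to_dask()        # lazy xarray.Dataset; no paths in user code

# intake-esm: a JSON catalog pointing at a CSV table of model output files.
col = intake.open_esm_datastore("cmip6-catalog.json")   # hypothetical catalog file
subset = col.search(variable_id="tas",
                    experiment_id="historical",
                    table_id="Amon")
dsets = subset.to_dataset_dict()         # dict of xarray Datasets, grouped compatibly
```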
Once we have that, we can just do regular science: here I'm computing the mean across time for all the data sets I retrieved, and you can see some Dask activity going on; once the means are computed, I do the plotting. After the step where you tell intake-esm to load the data into xarray, intake-esm just gets out of the way; from around cell 12 on, you just do the regular things you do with xarray and Dask. Let me see if the plots show up... there we go, this is what I got. As you can see, the amount of code I had to write is really small, and throughout this notebook you don't see any paths or URLs, which helps with reproducibility and sharing: you can easily share this, and if someone is working in the cloud they can point intake-esm at a catalog referencing data in the cloud and run exactly the same code. With that, I'll hand it over to the next speaker; hopefully I didn't go over five minutes.

Thank you, Anderson, that was lovely; I really enjoyed watching the Dask cluster activity on the right as the computation ran. Next we have Scott Dale Peckham from CU Boulder, who will be talking about widgets and interactive interfaces. Scott, please go ahead, you have the floor. (Can you see my screen? Yes, we can, and we can hear you as well.)

Okay, great. I'm going to give a short talk about a Jupyter notebook that uses ipywidgets and ipyleaflet to create an interactive GUI and map for selecting data sets. The project this is part of, here is its logo, is an EarthCube project called BALTO, an acronym for Brokered Alignment of Long-Tail Observations. We thought it was cute to use Balto, the name of a famous sled dog of Iditarod fame, since he has a long tail too; but put that aside. These are my co-PIs on the BALTO project, listed at the top, and there's a little table of contents in this notebook, which lives in my GitHub repository (I think there's a place to find the link later).

If I scroll down to where the action is, there's code behind this that I've written for both the GUI and some plotting, and I'm using primarily four packages: ipywidgets for the widgets, ipyleaflet for interactive maps in the notebook, Pydap for accessing a server that supports the OPeNDAP protocol, and Matplotlib for the graphics. If I run this little section, it starts a tabbed GUI, which hopefully you can see; it has five tabs. The first is for browsing data: there's a default OPeNDAP URL in here, and when I hit the Go button it goes out, searches that URL, and returns a list of all the files it finds there. This is a test server that the OPeNDAP people use, with lots of different kinds of data sets, so you can first choose a data set.
I'll choose this one, called sst.mnmean.nc.gz, and based on what it finds inside that file it gives me a list of the available variables in a drop-down list. I'll go down to sst, and everything updates to show me the units (degrees C), the dimensions (time, lat, lon), the shape of the array, which is fairly big in time and not very big in space, and the type, a two-byte signed integer. All the information it finds is shown here, and any attributes associated with the variable, which would normally be part of a netCDF file, are put in a drop-down list for quick reference. If I had chosen a different file up here, different variables and different attributes would populate these lists.

Now that I've chosen the sea surface temperature monthly means dataset, I don't want to download the whole thing, I want to subset it. So I go to the Spatial Extent tab, and because I happen to be in Puerto Rico right now, at my house, I'm going to zoom into that (though I keep activating my dictionary for some reason). This is using ipyleaflet for the interactive map, and at the bottom I've implemented a choice of the different basemaps that come with it, so I can toggle between Esri WorldStreetMap, OpenStreetMap.Mapnik, and many others, but I'll stick with this one. There are also some tools you can optionally include in your ipyleaflet window besides the zoom, such as this full-screen option, which is kind of cool, and you can go back to the smaller view when you're done looking around. Now I'll zoom out just a bit to get more of the ocean around Puerto Rico, and the research question is: how has sea surface temperature been changing over the last hundred years around Puerto Rico, in the waters of the Caribbean?

Next I go to the date range. This dataset goes back to 1854, but I'll use a smaller portion of it, so I'll change the start to 1908 to get a nice 100 years. Then I go to Download Data and hit Download, and it's really fast, because the data has already been subsetted by space and time on the server before downloading; instead of pulling the whole large dataset, I download just the little part I need. Scrolling through these instructions, there are a few things I can print out about the dataset loaded into the BALTO GUI object, such as no-data values, but in the interest of time I'll jump down to the plot, where I'm using balto_plot, another set of routines based on matplotlib, to plot the sea surface temperatures for one of the corner pixels of the region I just selected. Sure enough, there seems to be a trend toward greater temperature over the last hundred years, and these wiggles look like the annual cycle; you might expect an annual cycle because this is monthly data, and to confirm it we can plot a subset. Sure enough, there are about 12 dots, or plus signs, per oscillation, the 12 months of the year, before it goes on to the next cycle. So that's the basic idea; it shows a cool way to blend, or glue together, ipyleaflet, ipywidgets, pydap, and matplotlib to create a tool that can be easily modified.
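Behind the Download button, the server-side subsetting Scott describes is what pydap does when you slice a remote variable: the slice becomes an OPeNDAP constraint expression and only that slab is transferred. A minimal sketch under my own assumptions follows; the dataset URL and index ranges are invented stand-ins for the selected lat/lon box and the 1908 to 2008 date range, and the exact attribute used to pull values out can differ between pydap versions.

```python
import numpy as np
from pydap.client import open_url

# Hypothetical dataset URL on the OPeNDAP test server.
ds = open_url("http://test.opendap.org/dap/data/nc/sst.mnmean.nc.gz")
sst = ds["sst"]                      # lazy handle to the remote grid
print(sst.shape, sst.attributes.get("units"))

# Slicing builds a constraint expression, so the server only sends this
# [time, lat, lon] slab; the index ranges here are made up.
slab = sst.array[648:1848, 34:40, 110:118]
values = np.asarray(slab.data)       # plain NumPy array
series = values[:, 0, 0]             # time series at one corner pixel
```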
If you wanted to modify this to do something else, say to point it at some other type of server, you can look at the code in balto_gui.py, see how the panels are set up and how the events are processed, and just go from there. It's all open source, and that's it.

Excellent, thank you so much, Scott. I really enjoyed your presentation at the previous EarthCube meeting, so I'm glad you were able to join us; much appreciated.
No problem; let's continue being in contact about these things.
Next we have Adam from UC Berkeley, who will be talking about applications in hydrology. Adam, you have the floor.

Okay, let me share this screen. Do you see my screen and hear me?
Perfect.
Okay. Thank you for the opportunity to speak about the hydrologic use case of the Pangeo project. I am Adam, and I work on hydrologic models and their uncertainty. In talking about this use case I'm going to start with three brief motivational statements: the good, the bad, and the ugly in hydrologic data processing, particularly for intensively monitored watersheds.

The good news is that funding for highly instrumented watersheds and long-term hydrologic data collection has been a priority, with support from NSF and other federal agencies, so there has been a concerted effort to monitor watersheds where the data can support hydrologic understanding. Here is a map of two of the major networks of hydrologic observation, particularly intensively monitored watersheds: the red markers represent the Critical Zone Observatory sites and the green ones represent the long-term ecological research watersheds, along with some other data sources from public universities and national laboratories. Within each network there are many watersheds: each CZO or LTER site has multiple watersheds where data has been collected, and within sub-watersheds there are stations whose sensors record numerous hydrologic and meteorological variables, usually at hourly and sub-hourly frequencies. These networks offer a good opportunity for developing generalizable hydrologic principles, and in principle we could develop theories or governing principles based on this data, but the reality is that such principles do not yet exist in hydrology. That is the bad news: a synthetic understanding that would support hydrologic models has not yet emerged from this data. And the ugly side is that the data generated by these watershed networks do not share common timestamps, they have large and varied gaps, and they are genuinely difficult to access. In short, data collected from these watersheds is disorganized and not ready to use for model development or scholarship.

So what we are proposing and working on is a Pangeo use case for performing common hydrologic data downloading and processing tasks. We intend to develop a Jupyter-based data processing scheme that acquires the data from these intensively monitored watersheds and makes it ready to use. The scheme includes primarily four stages of standardization, the first stage being data downloading and acquisition from the different sources and websites.
The second stage is quality control and cleaning, where we remove outliers as well as unrealistic values. The third stage is aggregation, where we aggregate the sub-hourly and hourly records to a daily time step. The fourth stage is gap filling, where we fill in missing values, primarily in three steps. The first is interpolation, where we fill short gaps, primarily less than a week in daily data or less than a day in hourly data. Then we move to regression, of which we use two types: station regression, where we correlate different stations, and the climatic-catalog regression approach, where we borrow data from a climate catalog for the regression. These are the data processing steps we carry out in a very interactive manner in Jupyter.

The outcome of this use case is organized data from thirty watersheds across the US, as I showed you earlier. The data contain discharge, precipitation, snow water equivalent, soil moisture, soil temperature, and isotope data from these watersheds, and the record length ranges from around five years for some variables to twenty years for some of the primary variables. The data format we are releasing is HDF, and it will be hosted in the Pangeo cloud. Going forward we hope to expand to multiple watersheds, across the US and beyond, and our plan is to have an open, interactive platform where any researcher can contribute to this open, reproducible network and use our Jupyter tooling to clean data and produce a standard, organized format. With that, I'll conclude. Thank you.
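As a rough illustration of the four-stage pipeline just described, here is a minimal pandas sketch under my own assumptions: the file name, column name, threshold, and gap limit are invented, and the regression-based filling of longer gaps is only indicated in a comment rather than implemented.

```python
import pandas as pd

# Stage 1: acquire -- here simply reading a hypothetical hourly station file.
hourly = pd.read_csv("station_hourly.csv", index_col="time", parse_dates=True)

# Stage 2: quality control -- mask unrealistic values (e.g. negative
# discharge); the threshold is purely illustrative.
clean = hourly.mask(hourly["discharge"] < 0)

# Stage 3: aggregate sub-hourly/hourly records to a daily time step.
daily = clean.resample("1D").mean()

# Stage 4a: fill short gaps (here, up to a week in daily data) by
# interpolation.
filled = daily.interpolate(limit=7)

# Stage 4b: longer gaps would be filled by regressing against a nearby
# station or against a climate catalog (not implemented in this sketch).
```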
Thank you so much, Adam, for that presentation. We'll move on to Erik Sundell, who is joining us from Sweden and talking about Jupyter Book. Erik, please go ahead.
I don't know that we're hearing you, Erik.
Nope, no audio; we can see your screen but I can't hear any audio.
Fernando, do you want to move to the next one while Erik tests his audio? If that's okay with you, Erik, since we're a little tight on time, Georgiana, would you mind hopping on to the JupyterHub presentation? We'll try to debug with Erik separately.
Sure, let me share my screen.
You can share your screen, and we can hear you, Georgiana.

Hi everyone, thanks for inviting me here. I'm Georgiana Dolocan, and I'm currently working as a JupyterHub and Binder contributor in residence. Today's presentation will cover some basics about JupyterHub and a quick demo. When you first hear the word Jupyter, the first thing that pops into mind is probably the notebook, with all the beautiful visuals, the code, and the text, so when you then hear "JupyterHub" it might be a bit confusing at first to know the difference between the two; it was for me, and this presentation will try to explain the differences a bit. Say we have a user with a Jupyter notebook, and this user has some team members who also want to use the notebook, to work on the same dataset or just share the compute environment. For this we have JupyterHub. JupyterHub is made of three main components: the authenticator, to make sure only the right people access the hub; the spawner, which creates each user's notebook server; and the proxy, which routes each user to their own Jupyter notebook server. All of these components are configurable, and you can choose from options available in the community. For the authenticator you can use the PAM authenticator, the native authenticator, or log in with GitHub, Google, Bitbucket, and others; for the spawner there are likewise several options; and for the proxy there are only two right now, the configurable-http-proxy and the Traefik proxy. All of these are available with the classic notebook interface or with JupyterLab.
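Those three pluggable components are chosen in JupyterHub's Python configuration file. The snippet below is a minimal, illustrative jupyterhub_config.py, not the configuration used in the demo; the c object is injected by JupyterHub when it loads the file, and the GitHub login shortcut assumes the oauthenticator package is installed.

```python
# jupyterhub_config.py -- a minimal sketch, not the demo's actual config.
# `c` is the configuration object JupyterHub provides when loading this file.

# Authenticator: who may log in (PAM, NativeAuthenticator, GitHub OAuth, ...).
# The "github" shortcut assumes the oauthenticator package is installed.
c.JupyterHub.authenticator_class = "github"
c.GitHubOAuthenticator.oauth_callback_url = "https://hub.example.org/hub/oauth_callback"

# Spawner: how each user's single-user notebook server is started.
c.JupyterHub.spawner_class = "jupyterhub.spawner.LocalProcessSpawner"

# Proxy: routes each user to their own server; configurable-http-proxy is
# the default, Traefik is the other community-supported option.
c.JupyterHub.proxy_class = "jupyterhub.proxy.ConfigurableHTTPProxy"

# Serve JupyterLab instead of the classic notebook interface by default.
c.Spawner.default_url = "/lab"
```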
To make JupyterHub deployment easier, there are two distribution projects: The Littlest JupyterHub and Zero to JupyterHub. The first one, The Littlest JupyterHub, as the name says, is mostly suited for small groups of people, because you have just one server, with the users and the hub running on that server, which can be bare metal or in the cloud. Zero to JupyterHub uses multiple servers, again in the cloud, and everything is orchestrated with Kubernetes, so that's super cool.

For the demo I want to show how easy a Littlest JupyterHub deployment is. This is pre-recorded, because it takes more than five minutes to actually install. This is deploying The Littlest JupyterHub on DigitalOcean; everything shown here is in the documentation. Right here I'm choosing the data center, and then in the user data field you just have to copy a small command from the documentation, and everything gets installed. Here you set the first admin, who will be able to access the hub the first time. We are not enabling backups, because this is just a demo. While the droplet builds, you'll get an IP address. When you first access the IP address you will get a 404, which means The Littlest JupyterHub isn't installed yet, but you can always SSH into the server and read the logs if something went wrong, and you can see when it is ready; here I'm showing what the logs look like, so you can see all the steps, and when everything is done there's a message saying so. Afterwards you copy the IP address, paste it into a browser, and you have the JupyterHub login. I'm logging in with the first admin I set. The Littlest JupyterHub uses the FirstUseAuthenticator, which means the password you set when you first log in becomes the password associated with your account; you can always change the authenticator afterwards. As an admin you can add more users from the admin interface, either regular users or other admin users, and admin users have the right to install packages, because all admin users under The Littlest JupyterHub are also sudo users. This is just accessing the terminal from an admin account, and here I'm installing the numpy package; you just have to use the -E option with sudo so that it gets installed into the right environment, which is the user environment. Once it's installed, you open a Python notebook and import the package, and hopefully there will be no errors in my demo... there weren't. So that's basically it: The Littlest JupyterHub is very easy to install. Thank you very much for listening; I hope you liked it.

Thank you so much for that lovely presentation, and I'm very impressed that you were able to narrate over an actively running video; very good timing, much appreciated. Erik, it seems your audio is back on, so we'll move quickly on to you, in the interest of keeping time and leaving at least a little bit for the Q&A.

Thank you. This lightning talk is about Jupyter Book. Jupyter Book is a tool that lets you create beautiful websites from notebooks and markdown files, and since we use those so often, it's a tool that can be very useful. Jupyter Book takes a collection of content, in the form of notebooks and markdown files, and converts it into a book; a book can be a website, a PDF file, or other things, but we'll focus on creating websites. You have a collection of content, and Jupyter Book is the tool that turns it into a website for you. Getting started looks like this. It's a terminal tool, so first you need to install it, using the Python package manager pip: pip install jupyter-book, and you're done. To start a book, you can get a starting point with some boilerplate files by running the create command, which gives you a folder with some files in it. First there's a configuration file, where you set the title of the book and various other settings. Then there's a table of contents, where you define the structure of the book: if you have multiple notebooks, in what order should they appear on the website. Both of these files, the configuration file and the table of contents, are in the .yml format, which is YAML. YAML is very useful to learn; it's like JSON but more human readable, so if you spend time learning it, you won't regret it. In the folder you also get some demonstration content, markdown files and a notebook, which becomes content for the website. So far you have only provided content; you have not yet generated any HTML, and that is what the build command does. By default you get HTML, a website: running the build command produces output in a build directory, which will contain, for example, index.html, a file you can open on your local computer, and it will look something like this in your browser. This is the standard, out-of-the-box Jupyter Book website, but it only lives on your computer, and of course you would like to publish it online. For that it's useful to have some git and GitHub knowledge, although the documentation for Jupyter Book is so good that you can get by without much previous experience. The documentation is available at jupyterbook.org; on the left you can see "get started", and you go through the overview, building your book, and publishing it online. I have had a great experience publishing books online using GitHub Pages and GitHub Actions. This is an example of the web address you get if you publish your book with GitHub Pages, which is free, and GitHub Actions is tooling that, for any change you make in the git repository, automatically rebuilds and updates your book online.
Once you have set this up according to the documentation, you don't have to run the jupyter-book tool by hand anymore, because it runs automatically for you, and that enables one very useful feature: you can point the configuration file at where your book lives on GitHub, and if you do, you get a button on your website that lets any visitor find their way to the GitHub repository where the book is defined and say, "you have a spelling error here," or "I suggest you rephrase this part of the book like this," and if you accept the change, the site is updated.

Now, when writing the book you want more than just a pile of content; you want cross-references and so on, and there are features for that. For example, perhaps you want to hide the code blocks of your notebooks and just show graphs in certain sections; you can use metadata inside the notebook, in this case a cell tag, to hide the code. You write these books in markdown and notebooks, but the markdown is extended with a flavor called MyST, and MyST adds two things to markdown: roles and directives. These are like functions for markdown. On the left is an example of a role, in this case how to cite a reference: if you define a reference you can cite it and get nice formatting. Directives are bigger functions, so you can do things like inserting a figure; as an example, here is how to insert a note, which renders like this on your website. This is the markdown in its MyST flavor, and here is the result of that markup. One of the most important features of Jupyter Book, I think, is this: if you have a set of notebooks, perhaps one very big notebook that you just want to generate a figure from, and you want to use that figure somewhere else, you can use the glue function to save an object from a notebook and then insert it into markdown elsewhere in your book, where it appears via a directive, as we have just seen.
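The glue mechanism Erik mentions comes from the MyST-NB package that Jupyter Book builds on. A minimal sketch, assuming myst-nb is installed; the key name and the plotted numbers are made up for illustration.

```python
import matplotlib.pyplot as plt
from myst_nb import glue

# Build some figure inside a notebook cell; the values here are made up.
fig, ax = plt.subplots()
ax.plot([2000, 2010, 2020], [1.0, 1.4, 1.9])
ax.set_ylabel("some quantity")

# Register the figure under a key. Elsewhere in the book's markdown it can
# be re-inserted with the MyST glue:figure directive using this same key.
glue("my-trend-fig", fig, display=False)
```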
That's about it. Jupyter Book can help you publish a website, and the documentation is so good that I believe everybody can do this; it doesn't take five hours, it takes 30 minutes or an hour. Jupyter Book is created as part of the Executable Books project, and here are the team members of Executable Books; it has a rich community, and we are all welcome to join. I am one of the people happy to be part of this community of contributors. With that said, please go ahead and visit jupyterbook.org to learn more, and if you want these slides you can find them here; I'm consideRatio on GitHub. Chris Holdgraf is in this meeting and is one of the co-founders of Jupyter Book. Thank you.

Thank you, Erik, much appreciated. We will close the lightning talk section with Joe from Pangeo, whom we've already referenced a few times. Joe, the floor is yours.

All right, I'm just pulling up my screen here... I think you should be able to see things now.
Audio and video all good.
Great, thank you. Hi everyone, my name is Joe Hamman, and Lindsey and Fernando asked me to wrap things up with a short lightning talk on how to connect with the Pangeo project, and that's what I'm going to talk about, just briefly. My contact information is here; I'm a scientist at NCAR, I also work at a new nonprofit called CarbonPlan, I've been working on the Pangeo project for about three or four years now, and I contribute to a bunch of open source projects. Just remembering back to Scott's talk an hour or so ago: Pangeo is first and foremost a community of people working collaboratively, and I want to highlight a few of the places where those interactions happen and where you might find Pangeo people doing their thing. The first is online; almost all of the interactions and events that Pangeo coordinates or is part of happen online. GitHub is the primary, central place where you'll find things, at github.com/pangeo-data. We've got a chat room on Gitter for short, quick messages and coordination. We have a Discourse forum where you can find and post questions about how to do things, and where we coordinate regular meetings. And of course there's a Twitter account as well.

Let me highlight a few of the kinds of meetings we have on a regular basis. On Tuesday and Thursday mornings right now we've been doing what we call the COVID-19 coffee breaks; these are open to anyone, there's no agenda, and it's just an opportunity to see another human. We talk about Pangeo, obviously, but also baking or baseball or whatever you want to talk about, so I thought I'd mention that quickly. There's a weekly developer meeting, a mix of scientists and software developers, on Wednesdays; the timing alternates (I didn't put it on the slide) between an early time and a late time for different time zones, and we do a lot of coordinating on ongoing development in the open source scientific Python world. This is just a screen grab from a few weeks ago, and it's a good way to keep up with the present-day activities of the Pangeo project. I spend most of my time on this slide: we have something that on the surface looks a little bureaucratic, but isn't meant to be, called technical or topical working groups, which provide a space where we can have more focused discussions and go a little deeper. Right now we have four of these working groups. There's a data working group that talks about things like data formats, schemas, best practices, and performance, so if debates like netCDF versus Zarr versus TileDB versus cloud-optimized GeoTIFF are in your wheelhouse, this is a good working group...

Did we lose Joe? I think we might have. Let's give him a few seconds, but it's 9:10 in our agenda and actually 9:45 in the real world, so we don't have a huge amount of slack; we have negative 35 minutes of slack right now. Unless Joe's internet returns happily within a few seconds, we may need to wrap up. We will post links to all the slides. Okay, Joe actually dropped out, so his computer may have crashed. Lindsey, we were wrapping up, and I think you were going to run what we have left of the time as the Q&A session, so I might just hand it off to you; unfortunately we may have had to cut Joe's talk short.

No problem. We'll hopefully get Joe's slides, but at least you got a bit of a flavor of some of the places where you can interact with the Pangeo community. As part of the Q&A session we'll also be inviting folks to engage on Discourse, to answer some big-picture questions that we have posed and that we were hoping to speak to, but that we probably won't have time to get into in depth.
So what I'm going to do is share my screen here... can folks still see that? Perfect, excellent. The first thing we're going to do is invite a couple of people who posed questions, either before or during the session, to ask them, so I'm going to start by inviting Phil Austin to jump in with his question and his call for some community input.

Sure, can people hear me?
Yes.
All right, a little bit of background: I'm at the University of British Columbia, where I chair the atmospheric science program in a department that's broadly earth science; we've got about 50 faculty and about 350 graduate students, and I think we're fairly typical for university infrastructure in that we have a ton of pretty capable three-thousand-dollar desk-side Linux machines that get acquired on budget grants where you have to buy something to lock in your budget before it disappears, that kind of thing. I believe there's a missing middle, I'd call it: The Littlest JupyterHub is great, for bringing up a machine to teach it's perfect, but you leave The Littlest JupyterHub and then you hit Zero to JupyterHub, and it's not a learning curve, it's really a learning wall; you need pitons to get up Kubernetes, I would say, just giving you my personal experience. What I'm volunteering to do is share my own journey, which is to provide this intermediate stage where you can use these desktop machines, or bare metal like The Littlest JupyterHub, but with a pathway that involves Docker. And one other thing, on outreach and training: thinking about where graduate students are going, experience with Docker and knowing how to manage the cloud is something a graduate student probably needs, and having graduate students practice on something that's free, as opposed to something you pay for, is just a huge win.

Thanks, Phil. Does anyone else want to chime in or have follow-up thoughts on Phil's comments?
I'll jump in really quickly. Phil, having deployed these things both as single systems and at cloud scale, I agree that there's a middle ground, and you already mentioned Docker; things like Docker Swarm are such nice, easy infrastructure to install on a number of managed machines that if there were a model for installing JupyterHub on that kind of infrastructure, I agree there would be a lot of use for it.
Yeah, just to borrow something I wrote on a Jupyter Book ticket: there's nothing more eloquent than a working example, and being able to just run "docker compose up" and have something actually work, where you can also SSH in and figure out where the JupyterHub config is stored and all that kind of stuff, play around with a reverse proxy for those of us using one for the first time, and then look at these ways of spawning notebooks. For a certain type of person, and a certain type of graduate student, it's just really important to be able to get in and watch the pieces actually move.
To continue the conversation, I know you posted on the Pangeo Discourse; if folks are interested in conversing with you on this, would that be the best place to get started?
Sure, I visit that pretty frequently, and there's also the Pangeo outreach group, which has been pretty quiescent, but I think I'll try to bootstrap something there and start posting my own progress. For anything I learn I'll put an executable book together, and the executable book itself will be a docker-compose GitHub repo, so you'll be able to run my executable books with a JupyterHub and a web server, and we'll just see how that goes.

Excellent, thanks, Phil. Next up we have a question from Lisa. Lisa, are you online?
I think so, can you hear me?
Yes.
Great. So, as I said here, I'm a PhD student in the Energy and Resources Group at Berkeley and a master's student in the computer science department as well, but before this I worked at an environmental consulting firm and did a lot of data analysis and data work with CMIP5.

So CMIP6 has seemed really interesting to me, and I still contract for that company, so I'm basically predicting I'll have to give them a pitch in the next year or two about how we're going to work with CMIP6, because we have nowhere near the resources to do that locally; we were pretty pushed to the limit even with CMIP5. So I'm curious: I think we're mostly academics and government people here, but I'm interested in the infrastructure, permissions, and costs of working with some of this potential CMIP6 infrastructure from a private-company setting, and whether people have done it or thought about what that might look like.

Joe, is this something you would have experience with? Oh, you might not actually be on the call.
He probably would, if he were back.
I think Joe is not back; I don't see him on the participant list at all, so he may have had an internet mishap. Scott did post some helpful links on Slack, though.
Yeah, go for it.
The links are on Slack, replying to Lisa's question, but for sure private companies have been setting up the same cloud infrastructure we've shown today. I think it comes down to whether or not you have personnel to dig into the details, as Phil said; there's a bit of a learning curve. There are also companies now that are starting to provide these services for a fee, catering specifically to companies rather than educational, government, and academic groups, so if you're doing work for a company, I would recommend looking at some of those startups.
Cool, yeah, that makes sense. I think overall, speaking from a graduate student perspective, the comment before this was really relevant, and the CMIP6 material is really relevant too, so it's exciting to see as a student.
Well, thanks, Scott and Lisa. It looks like Joe has just rejoined us. Sorry we lost you there, but we did get a question that is quite relevant to your talk, asking whether the working groups are open to the public.

Yes; somehow Zoom totally hard-crashed on me, and I'm back now. The answer is yes, they're all open to anyone. The website I had up, pangeo.io/meeting-notes, has information about joining any of the five or six regularly scheduled working group meetings.
What I can do, Joe, is stop screen sharing if you'd like to share those slides, if you still have them up, or if there are any other comments you wanted to get in before we lost you.
I don't know if I still have them... oh, I do have them up, but I'm not a host anymore, so I think it's okay; I was on my second-to-last slide, and it's not a big deal.
Okay. Were there any other questions for Joe on engaging with the community or joining the working groups, anything anyone would like to bring up?
I also wonder if Joe has any comments on Lisa's question, because I know he's worked with Google on making the CMIP6 datasets available, at least through Google Cloud, and I don't know if that intersects with Lisa's concerns about access to those data and their usage.
So, Joe, it's the question on the screen right now.
Yeah, there's actually some movement there: there's going to be a mirror of this data on AWS as well, so we're seeing the data proliferate a bit, at least between Google and Amazon, and it sits in public data buckets, so there are no egress charges for using it outside the cloud, though you'll find that the performance is best when you work in the same cloud region as the data. So there's nothing that limits this to government and academic use; anybody can use it, and there are actually a few companies out there using these data resources from the cloud, running what is basically their own private version of Pangeo and accessing these data. It should be no problem, if you can sort out how to pay for it.
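For the cloud-hosted CMIP6 data Joe describes, the usual pattern is to open one of the public catalogs with intake-esm and stream the Zarr stores straight from object storage. The sketch below is a rough illustration under my own assumptions: the catalog URL is a placeholder for the published Google Cloud or AWS CMIP6 catalogs, the query values are illustrative, and the keyword arguments accepted by to_dataset_dict differ between intake-esm versions.

```python
import intake  # requires the intake-esm plugin

# Placeholder URL; substitute the published CMIP6 catalog JSON for the
# Google Cloud or AWS mirror you want to read from.
cat = intake.open_esm_datastore("https://example.org/cmip6-cloud-catalog.json")

subset = cat.search(
    experiment_id="historical",   # illustrative query values
    variable_id="tas",
    table_id="Amon",
)

# Each asset is a Zarr store in a public bucket, so the data is streamed
# over HTTP rather than downloaded as whole files; the anonymous-access
# flag shown here depends on the storage backend and library version.
dsets = subset.to_dataset_dict(storage_options={"anon": True})
```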
Yeah, that makes sense.
Were there any other questions that folks wanted to bring up? I'll pause for a moment.
I'm curious about the size of the data you typically work with, because CMIP6 is so big that you don't work with the whole dataset at once; how big are the chunks you typically work with in your workflows?
Lisa, do you have...
Oh, that's for me, sorry. Yes, I think CMIP6 will be an interesting new challenge for that question. With CMIP5 we were, at the largest, storing maybe six or seven terabytes on disk. We were working with the LOCA downscaled CMIP5 product, so a sixteenth of a degree, and I think we used 14 GCMs, so somewhere around there. As you mentioned, there are really two challenges: there's the storage challenge, which we were facing because we were pushing the envelope on the company's data processing, and there's also the processing side of things, and both have started to become bottlenecks for my work. The things I'm asked to do are not particularly complicated, and I can do them, but the company's resources bottleneck both the processing and the storage. It depends on what the client wants, and I think CMIP6 will be a new challenge because storing everything locally is not really an option anymore.
Thank you. Any other questions?

Okay, well, I know we're coming up on the hour, so what I would like to do is show a couple of questions that we've prepared for the community, which I think would be interesting for collecting community knowledge and seeing what we can build on together. We've posted these on Discourse, so I'll read them out now, and we can think on them and hopefully continue the discussion there. Erik and I assembled these together, so perhaps I will let Erik read through this one, and we can alternate.
What does your interactive computing workflow look like today, and what do you envision it will be in five years? We hope to get input on your visions for improving the workflow and to understand better how your workflow looks today.
The next question relates back to Erik's talk: how would you like to publish and share your computational research, and where can improvements be made? Again, think both about how your current workflow is implemented and what it would ideally look like for you.
And: how do you stay up to date with the evolving open source ecosystem? In other words, how do
you learn about the new tools you may want to use, and how would you like to learn and keep up to date on them? There are a lot of tools out there, so how do you learn about them? And finally: are there other ideas for community projects? I think Phil prompted some nice questions about what we could be doing in between The Littlest JupyterHub and Zero to JupyterHub, but there are certainly other ideas out there, so feel free to post yours, and we can look at getting in touch with the right communities to take action on them. Erik posted the link in Slack; we've created a Discourse post for this session and posed those questions there, so we invite you to contribute your ideas. What we would like to do is write up a short blog post summarizing this meeting, and we would really like to include the ideas that folks have posted, so we'll work on synthesizing what you share with us.

With that, we just want to say thank you for joining us today. It's been an interesting and enlightening session, and I really appreciate all of the speakers who took time to prepare material and invested their time and effort in sharing their knowledge with us, so a big shout-out and thank you to all the speakers, and thanks to everyone for spending time with us today; I hope it has been a useful session. Fernando, I don't know if there's anything else you would like to close with.
As usual, you're muted on Zoom.
I echo your thanks to all the speakers, and to the EarthCube team as well, who hosted us and provided coordination and support on Slack; we very much appreciate it. As soon as the recording of this meeting is ready we will post it online, on Discourse as well, and all of the slides and materials from the presentations are also available, so we will post those links on both Discourse and Slack. Thank you, everyone; we'll try to wrap up now so that people can get to their 10 a.m. meetings, since I'm sure everyone has their next Zoom meeting to jump into. I'll stop the recording now. Thank you, everyone; it was a pleasure, and we hope it was useful for folks. I'll stop recording.
