Vertical Solutions by John LaGue

Vertical Solutions

Below are dashboards / reports for the deliverables asked for as well as a sample of other directions I exlpored.

You'll find dashboards, some statistical analysis, and custom visualizations that I believe offer some interesting content. I created the dashborads in Tableau (similar to Power BI), I uesd R and R studio for statistical analysis, and finally I created some fun visualizations using JavaScript (d3 api/library) and HTML/CSS. The visualizations are admittedly less practical, but still interesting.

Each dashboard has a brief explaination attached and I have included a summary of my findings at the bottom of the page.

Assumptions about data:

Times worked that are negative have signifigance and should be included in calculations

Foremans "Justin" and "Justin Zimmerman" are two seperate people (they likely are the same person; however I treated them as two sepearte individuals for the purposes of the below)

Total lunch time and Report total time variables are in minutes, total work time is in hours

Finish% means wood finishing and does not mean the % of the total project that is complete

The letters of walls (A,B,C, etc) has signifigance as I group them together several times

Blank Task or Project values in the Labor Report are not purposeful and are treated as Null or "job not listed".

Notes: Labor Report came with a row that appears to have a sum in the bottom row, the sum is incorrect and was excluded. There also appears to be a missing employee (Employee #6).

Dashboards

Above are two charts, both showing the Status of each resource over time. The top chart shows the status (in % completed) of each resource (Pads, Finish, Steel, etc) and the bottom chart shows the status in Sqft remaining. You can filter for each project (Woodpine, Cherrtown, etc) over the two month period available. Each resource is a unique color and each project location has it's own row. You can filter the resource statuses on the right side by time and Foreman as well. You can group the information my month (May or June), by 1 or multiple weeks, by the exact date, or you can use the Time slider to create a custom date range.

The first observation I made was about the time it takes for each resource to reach 99-100%. It appears that while Pads are always started last they are usually completed the quickest. Steel also is always first (which makes sense as it is the structure). I'm assuming Rough% means some sort of rough skeleton that will support the Wood along with the steel because it always comes sequentially by steel and is followed by an increase in Wood. Cherrytown had the least amount of Sqft to complete and only 1 wall, understandabley the sections all completed much quicker. No other project was marked as fully complete in the time-frame given. However, Walls C and E saw an interesting trend in which there is more overlap of tasks/resources worked on at a time. This brings me to the question of does working on tasks sequentially rather than concurrently decrease the time to complete a project or would concurrancy (with obvious limitations) improve the speed of completion?

Important: Some days had multiple reports. So, each day is represented by the final report for that day or the "max" status % for that day. Example: if a crew started a day at Finish% = 23% , had Finish% = 50% mid day and ended at Finish% = 75% Then you would see 75% as the value represented for that day. With no filtering it will appear as if some Walls were started much earlier than they were in reality.

This dashboard looks at the hours worked on each task and filters by Project, Employee, and Time. Use the dropdown menus on the right to filter the data. You can see which employees where spending time on which tasks and on which project. You can also see in the bottom chart how the hours worked on a task change over time. You can group the information my month (May or June), by 1 or multiple weeks, by the exact date, or you can use the Time slider to create a custom date range. Someone more familiar with the build process / what the values mean may be able to quickly spot tasks that are using unnecessary amounts of time. Using the information from the previous dashboard we can also see that there is a general spike in hours worked among most tasks when a project begins the Finish, Steel and Wood stages of building.

Important: NULL means the employee did not enter the task name they were working on or it was not recorded

Other Dashboards

This is an additional dashboard to view the connection between the Foreman and various job updates. You can see which foremen have the most delays, noteworthy perfomances, client feedback, and the average (median) hours worked. You can also filter by time, so if you would like to see all those values on a specific date or week you can select those days in the filter on the right side. It appears that delays rarely come from the Owner and are instead Other. Based on how there seem to be delays where there are also Equipment requests then I would say the delays that happen tend to come from needed equipment. Predicting/foreseeing these incidents could help prevent future delays. Speaking with Jeffery who has experienced these "Other" delays the most and Richard who has had no delays but the most equipment requests per report would be useful. Assuming Noteworthy performances are positive, it would also be useful to speak with Richard because he has the highest number and ratio of Noteworthy Performances and likewise, Jeffery has the highest number of Client Feedback (more feedback is generally good, but feedback can be negative as well - I'm not sure what feedback refers to in this case).

Analysis

A more in depth look at the relationships between the different tables and their values.

Before analyzing the data, or even creating the dashboards, I cleaned the data and created several joined tables. Tableau has an easy to use interface for joining tables and in R I used the 'dplyr' library to perform SQL like joins on the data.

The relationship between the Sqft remaining/ Total beginning Sqft and the time to complete a wall

Looking at the dashboards it is not immediately apparent as to whether the total Sqft of resources needed on a project has a strong effect on the time taken to complete a job (which it intuitively would). Intuitively, one would think that it does, but there may be other - more important- factors. Explporing this will help exlpore possible room for improved efficiency.

Below is a table showing the times taken (in days) for each wall to be completed and the total starting Sqft for each resource.

Wall	Wood Sqft	Wood Start	Wood End	Wood Completion	Steel sqft	Steel Start	Steel End	Steel Completion
Wall A Woodpine	1189	06/18/19	Unknown	Unknown	1189	05/23/19	05/30/19	7
Wall B Woodpine	NA	Unknown	Unknown	Unknown	NA	Unknown	05/28/19	Unknown
Wall A Azmar	1360	05/16/19	Unknown	6	292	Unknown	05/16/19	Unknown
Wall B Azmar	Unknown	Unknown	05/31/19	Unknown	Unknown	Unknown	05/28/19	Unknown
Wall P Cherrytown	840	05/07/19	05/12/19	7	840	05/01/19	05/14/19	13

See full table HERE.

I formalize this relationship with a correlation analysis in R to understand the relationship between nubmer of days taken to complete a portion of wall and the starting Sqft.

However, the data set is not compelte enough to make a confident conclusion. It appears that that the Sqft remaining do NOT have a large correlation with the time taken to complete a job. However, it does appear that the number of walls on a given project does increase the time taken for completion (intuitively this makes sense).

Is it possible to predict Employee Hours on a given task?

Using only the tables provided I generated models to predict the time each employee would spend (in hours) on any given task.

I began by exploring the relationship between Task/Job, Division, Department, Project, Foreman, etc and Hours worked by an individual through several linear regression models.

However, linear regression models were not explaining much of the variance in the data (thus not predicting well) likely in part because most of the variables are non-numeric (ex: Project name or task name).

Here is an example of what the linear model looks like with only one variable (Steel Status) used to predict the number of hours. In actuality I used about a dozen dimensions in the model and it wouldn't be practical to visualize. As you can see though the prediction is not matching the data very well.

Below is a very small sample of some of the predictions along side the real values for the linear models.

After getting poor results via linear regression I applied a 'random forest model' to see if there would be an improvement in performance. If you want to learn more about Random Forests please check out this link.

Here is a sample of some of the predictions along side the real values for the random forest model.

The random forest model did explain more of the variance than the liner regression models; however, the prediction was no better than simply guessing the average number of hours for each task. More information would be required to improve the performance further. Other models could be tested as well such as SVM.

Below is a listing of what the model thought were the most important variables (columns) for predicting the hours worked. The most important are at the top. Since the random forest model was no better than guessing the average, the importance ranking should be taken with a large grain of salt.

Is it possible to predict Noteworthy Performances or Materials requested?

Similar to the above analysis I began with by exploring general linear models. This time using logistic regression models since this is a classification problem. But I have not included the results here as I ran into similar limitations with logistic regression as I did with linear regression (see R code for more).

After exlporing logistic regeression I moved on again to Random Forsets which performed incredibly well.

Using almost all possible factors, the random forest model was able to correctly 'predict' whether there would be Noteworthy Performaces or Materials Requested more than 90% of the time!

Below is a screen shot of the important variables of importance for predicting Noteworthy Performances.

Importance chart for random forest noteworthy perfomances

And here are the importance variables for predicting Materials_Requested.

You can notice that both models have the other variable listed as very important (the model predicting Materials_requested lists Noteworthy Performances as very important and vis versa for the other model). These tells us there is a strong link between Noteworthy Performances and when Materials are requested. The other important variables that models share are Steel Status and Employee Count. So the number of employees on a task and the current remaining amount of steel are fairly correlated with material requestes and performances.

To view R scripts used to clean, organize, and analize the data please click here.

Visualization

Here is a visualization that shows some of the same information as above (Hours worked by employees) but in a different manner. This is just a way to explore the data a little bit more and is less practical than a dashboard type solution.

To prevent slow loading I put the visualization on another page, HERE

After submitting the report I started exploring more climbing data and thought I would showcase my scraping ability as well. I scraped all the climbing gyms from www.mountainproject.com and made a searchable map HERE.

Summary of findings

I found it was difficult trying to predict how much time a specific employee would work . Further work could be done on predicting how long each project and wall will take if there is more data available.
There is a weak correlation between the total amount of Sqft required for a project and the time it takes to complete that project. However, there appears to be a strong correlation between the number of walls on a given project and the total Sqft. Would need to get the full project data in order to confirm.
We did get great results for predicting when materials will be requested and if there are noteworthy performaces. The important variables here being:
- Steel Status
- Employee Count
- Materials Requested and vis versa for Noteworthy Performances
Looking into exactly how these variables effect the materials requested and noteworthy performaces is necessairy to try to optimize for fewer material requests and more noteworthy performaces.
Speak with foremen Jeffery and Richard to understand Delays, Noteworthy Performances, and Equipment Request frequencies.
Working on tasks concurrently appears to show a faster overall completion time; however, I would need data from several more projects to confirm that.

Areas I would explore next

Learning the process for wall building to understand what processes are immutable and which are open for optimization. Having a direct understanding of the wall building process would help answer the question of whether tasks should be worked on concurrently or all focused centralized.
Learning what "Other" stands for in Delays. This will help create a model to view where delays can be mitigated or reduced.
Getting the remaining pieces (dates before and after) to these tables would help in creating a predictive model to improve the prediction of how long it will take to complete a project.
I'm also quite curious about customer acquisition. Does a majority of revenue come from new gyms? From existing partners/gyms expanding to new locations? Or maybe from existing gyms/partners looking to update walls or expand within an existing location. If most revenue comes from existing customers, it would be interesting to dive a bit deper in the Customer Feedback data and see what correlates with positive experiences to help build a returning customer base.
Adding Functionality: Allow user to filter all dashboards at the same time. Update Null handling based on information from V.S. Update custom visualization to include sorting by project and seeing project broken down by employee.