Below are dashboards / reports for the deliverables asked for as well as a sample of other directions I exlpored.
You'll find dashboards, some statistical analysis, and custom visualizations that I believe offer some interesting content. I created the dashborads in Tableau (similar to Power BI), I uesd R and R studio for statistical analysis, and finally I created some fun visualizations using JavaScript (d3 api/library) and HTML/CSS. The visualizations are admittedly less practical, but still interesting.
Each dashboard has a brief explaination attached and I have included a summary of my findings at the bottom of the page.
Assumptions about data:Notes: Labor Report came with a row that appears to have a sum in the bottom row, the sum is incorrect and was excluded. There also appears to be a missing employee (Employee #6).
Above are two charts, both showing the Status of each resource over time. The top chart shows the status (in % completed) of each resource (Pads, Finish, Steel, etc) and the bottom chart shows the status in Sqft remaining. You can filter for each project (Woodpine, Cherrtown, etc) over the two month period available. Each resource is a unique color and each project location has it's own row. You can filter the resource statuses on the right side by time and Foreman as well. You can group the information my month (May or June), by 1 or multiple weeks, by the exact date, or you can use the Time slider to create a custom date range.
The first observation I made was about the time it takes for each resource to reach 99-100%. It appears that while Pads are always started last they are usually completed the quickest. Steel also is always first (which makes sense as it is the structure). I'm assuming Rough% means some sort of rough skeleton that will support the Wood along with the steel because it always comes sequentially by steel and is followed by an increase in Wood. Cherrytown had the least amount of Sqft to complete and only 1 wall, understandabley the sections all completed much quicker. No other project was marked as fully complete in the time-frame given. However, Walls C and E saw an interesting trend in which there is more overlap of tasks/resources worked on at a time. This brings me to the question of does working on tasks sequentially rather than concurrently decrease the time to complete a project or would concurrancy (with obvious limitations) improve the speed of completion?Important: Some days had multiple reports. So, each day is represented by the final report for that day or the "max" status % for that day. Example: if a crew started a day at Finish% = 23% , had Finish% = 50% mid day and ended at Finish% = 75% Then you would see 75% as the value represented for that day. With no filtering it will appear as if some Walls were started much earlier than they were in reality.
This dashboard looks at the hours worked on each task and filters by Project, Employee, and Time. Use the dropdown menus on the right to filter the data. You can see which employees where spending time on which tasks and on which project. You can also see in the bottom chart how the hours worked on a task change over time. You can group the information my month (May or June), by 1 or multiple weeks, by the exact date, or you can use the Time slider to create a custom date range. Someone more familiar with the build process / what the values mean may be able to quickly spot tasks that are using unnecessary amounts of time. Using the information from the previous dashboard we can also see that there is a general spike in hours worked among most tasks when a project begins the Finish, Steel and Wood stages of building.
Important: NULL means the employee did not enter the task name they were working on or it was not recorded
This is an additional dashboard to view the connection between the Foreman and various job updates. You can see which foremen have the most delays, noteworthy perfomances, client feedback, and the average (median) hours worked. You can also filter by time, so if you would like to see all those values on a specific date or week you can select those days in the filter on the right side. It appears that delays rarely come from the Owner and are instead Other. Based on how there seem to be delays where there are also Equipment requests then I would say the delays that happen tend to come from needed equipment. Predicting/foreseeing these incidents could help prevent future delays. Speaking with Jeffery who has experienced these "Other" delays the most and Richard who has had no delays but the most equipment requests per report would be useful. Assuming Noteworthy performances are positive, it would also be useful to speak with Richard because he has the highest number and ratio of Noteworthy Performances and likewise, Jeffery has the highest number of Client Feedback (more feedback is generally good, but feedback can be negative as well - I'm not sure what feedback refers to in this case).
A more in depth look at the relationships between the different tables and their values.
Before analyzing the data, or even creating the dashboards, I cleaned the data and created several joined tables. Tableau has an easy to use interface for joining tables and in R I used the 'dplyr' library to perform SQL like joins on the data.
Looking at the dashboards it is not immediately apparent as to whether the total Sqft of resources needed on a project has a strong effect on the time taken to complete a job (which it intuitively would). Intuitively, one would think that it does, but there may be other - more important- factors. Explporing this will help exlpore possible room for improved efficiency.
Below is a table showing the times taken (in days) for each wall to be completed and the total starting Sqft for each resource.
Wall | Wood Sqft | Wood Start | Wood End | Wood Completion | Steel sqft | Steel Start | Steel End | Steel Completion |
---|---|---|---|---|---|---|---|---|
Wall A Woodpine | 1189 | 06/18/19 | Unknown | Unknown | 1189 | 05/23/19 | 05/30/19 | 7 |
Wall B Woodpine | NA | Unknown | Unknown | Unknown | NA | Unknown | 05/28/19 | Unknown |
Wall A Azmar | 1360 | 05/16/19 | Unknown | 6 | 292 | Unknown | 05/16/19 | Unknown |
Wall B Azmar | Unknown | Unknown | 05/31/19 | Unknown | Unknown | Unknown | 05/28/19 | Unknown |
Wall P Cherrytown | 840 | 05/07/19 | 05/12/19 | 7 | 840 | 05/01/19 | 05/14/19 | 13 |
See full table HERE.
I formalize this relationship with a correlation analysis in R to understand the relationship between nubmer of days taken to complete a portion of wall and the starting Sqft.
However, the data set is not compelte enough to make a confident conclusion. It appears that that the Sqft remaining do NOT have a large correlation with the time taken to complete a job. However, it does appear that the number of walls on a given project does increase the time taken for completion (intuitively this makes sense).
Using only the tables provided I generated models to predict the time each employee would spend (in hours) on any given task.
I began by exploring the relationship between Task/Job, Division, Department, Project, Foreman, etc and Hours worked by an individual through several linear regression models.
However, linear regression models were not explaining much of the variance in the data (thus not predicting well) likely in part because most of the variables are non-numeric (ex: Project name or task name).
Here is an example of what the linear model looks like with only one variable (Steel Status) used to predict the number of hours. In actuality I used about a dozen dimensions in the model and it wouldn't be practical to visualize. As you can see though the prediction is not matching the data very well.
Below is a very small sample of some of the predictions along side the real values for the linear models.
After getting poor results via linear regression I applied a 'random forest model' to see if there would be an improvement in performance. If you want to learn more about Random Forests please check out this link.
Here is a sample of some of the predictions along side the real values for the random forest model.
The random forest model did explain more of the variance than the liner regression models; however, the prediction was no better than simply guessing the average number of hours for each task. More information would be required to improve the performance further. Other models could be tested as well such as SVM.
Below is a listing of what the model thought were the most important variables (columns) for predicting the hours worked. The most important are at the top. Since the random forest model was no better than guessing the average, the importance ranking should be taken with a large grain of salt.
Similar to the above analysis I began with by exploring general linear models. This time using logistic regression models since this is a classification problem. But I have not included the results here as I ran into similar limitations with logistic regression as I did with linear regression (see R code for more).
After exlporing logistic regeression I moved on again to Random Forsets which performed incredibly well.
Using almost all possible factors, the random forest model was able to correctly 'predict' whether there would be Noteworthy Performaces or Materials Requested more than 90% of the time!
Below is a screen shot of the important variables of importance for predicting Noteworthy Performances.
And here are the importance variables for predicting Materials_Requested.
You can notice that both models have the other variable listed as very important (the model predicting Materials_requested lists Noteworthy Performances as very important and vis versa for the other model). These tells us there is a strong link between Noteworthy Performances and when Materials are requested. The other important variables that models share are Steel Status and Employee Count. So the number of employees on a task and the current remaining amount of steel are fairly correlated with material requestes and performances.
Here is a visualization that shows some of the same information as above (Hours worked by employees) but in a different manner. This is just a way to explore the data a little bit more and is less practical than a dashboard type solution.
After submitting the report I started exploring more climbing data and thought I would showcase my scraping ability as well. I scraped all the climbing gyms from www.mountainproject.com and made a searchable map HERE.